The R package dplyr, written by Hadley Wickham, is only a few months old but has already become an important part of our data analysis/manipulation workflow, replacing functions that we have used for years. There are several reasons why dplyr is such a valuable tool, but the most important from my perspective are the following:
- Speed. You’ll see below that dplyr is much, much faster than other, more traditional functions.
- Direct connection to, and analysis within, external databases, permitting simpler handling of large data.
- Function chaining that allows us to avoid cluttering our workspace with interim objects.
- Syntax simplicity and ease of use. The code is easy to write and to follow.
To demonstrate these enhancements and convince you that dplyr is worth trying out, I’m going to work through an example using population data from US Census blocks. The table I’ll be using, which is stored in a PostgreSQL database, has more than 11 million records. Each record has a population value, and Census blocks are wholly contained within US counties, so we can show off dplyr’s nice features with a simple task: grouping blocks by county and summing population. The size of the table, and the fact that it’s housed in an external database, will help demonstrate dplyr’s functionality.
Census blocks are relatively small geographic units and, in city settings, tend to encapsulate actual city blocks. Here is what they look like in lower Manhattan:
1. Speed – dplyr functions are much faster than other functions
To demonstrate the speed differences I will first conduct the calculations using our pre-dplyr approach – connecting to the database with the package RPostgreSQL, pulling the entire table into R in one big chunk, and conducting the analysis with the aggregate function in the stats package.
Note that, from a purely speed perspective, others have shown that the package data.table can outperform dplyr in some settings (see, for example, Tal Galili’s post here), but the performance boost is minimal and probably not meaningful in most real-world settings.
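For reference, here is roughly what the basic county tally looks like in data.table. This is just a sketch for comparison; it assumes the block table has already been pulled into memory as fulltable, which we do below.
library(data.table)
dt <- as.data.table(fulltable) # convert the in-memory block table
# sum population by state/county, the same basic tally we compute below
dt.sum <- dt[, list(pop10.sum = sum(pop10)), by = list(statefp10, countyfp10)]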
To make things a little more interesting, let’s get the mean, max and count of blocks within the counties (in addition to the sum). By the way, the unique ID for a county is the combination of the state ID (statefp10) and the county ID (countyfp10), so we need to aggregate by both variables.
How long do the calculations take using aggregate?
I want to keep the elapsed time for importing the data separate from the elapsed time for the calculations so that I can compare them more easily, so I’ll set two timers.
library(RPostgreSQL)
drv<- dbDriver("PostgreSQL") # set up connection to database
con<-dbConnect(drv, user="postgres", password="spatial", dbname="census") # set up connection to database
q<-"SELECT statefp10, countyfp10, pop10 FROM geo_tabblock2010_all_pophu" # our query
start<-proc.time() #timer for data import
fulltable<-fetch(dbSendQuery(con, q), n=-1)# fetch the data en masse
dim(fulltable)# dimensions of the full block table
## [1] 11078297 3
getdata.timer<-proc.time()-start
getdata.timer[3] # elapsed time for data import
## elapsed
## 55.44
Reading in the full table took 55.44 seconds. Now we’re ready to compare the actual calculations using the in-memory table fulltable.
start<-proc.time() #timer for aggregate
# aggregate the traditional way
use.aggregate<-aggregate(pop10~statefp10+countyfp10,data=fulltable,
FUN=function(x) c(sum=sum(x), mean=mean(x), max=max(x), cnt=length(x))) #summarize
use.aggregate<-use.aggregate[order(use.aggregate$statefp10, use.aggregate$countyfp10),] # order
aggregate.timer<-proc.time()-start # stop timer
dim(fulltable)
## [1] 11078297 3
head(use.aggregate)
## statefp10 countyfp10 pop10.sum pop10.mean pop10.max pop10.cnt
## 1 01 001 54571.00 28.92 1322.00 1887.00
## 50 01 003 182265.00 21.12 1117.00 8628.00
## 99 01 005 27457.00 15.09 1626.00 1820.00
## 149 01 007 22915.00 12.90 2093.00 1777.00
## 196 01 009 57322.00 20.84 635.00 2750.00
## 244 01 011 10914.00 13.31 1523.00 820.00
dim(use.aggregate)
## [1] 3143 3
aggregate.timer[3] # ELAPSED TIME USING aggregate (seconds)
## elapsed
## 23.4
We can see that the query using aggregate took 23.4 seconds. Now let’s see how dplyr handles the data. We will start by using dplyr on the full table in R memory, the same table used by aggregate.
How long do the calculations take using dplyr?
library(dplyr)
start<-proc.time() # start dplyr timer
grp<-group_by(fulltable, statefp10, countyfp10) # set up the grouping
use.dplyr<-summarise(grp, pop10.tot=sum(pop10),
pop10.m=mean(pop10), pop10.max=max(pop10), pop10.cnt=n()) #set up aggregation by groups
use.dplyr<-arrange(use.dplyr, statefp10, countyfp10) # order the data
use.dplyr<-collect(use.dplyr) # grab the results
dplyr.timer<-proc.time()-start
dim(use.dplyr)
## [1] 3143 6
head(use.dplyr)
## Source: local data frame [6 x 6]
## Groups: statefp10
##
## statefp10 countyfp10 pop10.tot pop10.m pop10.max pop10.cnt
## 1 01 001 54571 28.92 1322 1887
## 2 01 003 182265 21.12 1117 8628
## 3 01 005 27457 15.09 1626 1820
## 4 01 007 22915 12.90 2093 1777
## 5 01 009 57322 20.84 635 2750
## 6 01 011 10914 13.31 1523 820
dplyr.timer[3]
## elapsed
## 0.91
Based on the timers we see that dplyr is 25.71 times faster, a significant time saving. This is due in part to the fact that ‘key pieces’ of dplyr are written in C++ using Rcpp, a package designed to accelerate computations by integrating R with C++.
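If you want a more careful comparison than a single pair of proc.time() calls, the microbenchmark package (an extra tool, not used in the timings above) repeats each expression several times. A rough sketch:
# repeat each approach a few times for a more stable comparison
# (note: this will take several minutes on the full 11-million-row table)
library(microbenchmark)
microbenchmark(
  aggregate = aggregate(pop10 ~ statefp10 + countyfp10, data = fulltable, FUN = sum),
  dplyr = summarise(group_by(fulltable, statefp10, countyfp10), pop10.tot = sum(pop10)),
  times = 5
)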
2. Direct connection to external databases
As you saw above, the original table we’re working with lives in a PostgreSQL database, and what we did above was read the entire table into memory. For cleanliness and memory reasons, we keep our largest datasets in this office – generally those with tens of millions of records – in external databases, and we can’t (or would prefer not to) bring whole tables into R. R has some nice packages (RMySQL, RPostgreSQL) that we use regularly to access these databases directly, but dplyr has this direct access built in. So instead of bringing the large table into R, let’s leave the table where it is. Now we can show the approach for the same tallies using dplyr on the database.
Note that, as described in the vignette on dplyr and databases here, dplyr ‘tries to be as lazy as possible’: it does not pull data back to R unless the user asks for it (which we do with the collect function), and it delays doing any work until the last minute so that it can send a single request to the database.
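To make this laziness concrete, here is a small sketch. It borrows the query string q defined earlier and the src_postgres connection con created in the next code chunk; nothing is sent to the database until collect() is called.
# building the grouped summary returns almost instantly; no data are requested yet
lazy.sum <- summarise(group_by(tbl(con, sql(q)), statefp10, countyfp10),
  pop10.tot = sum(pop10))
# the query only runs when we explicitly ask for the results
result <- collect(lazy.sum)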
Run the same procedure but keep the table in the external database
One thing to keep in mind here is that, because the table stays in the database, we avoid the 55.44 seconds it took above to bring it into memory.
con<-src_postgres(dbname="census", host="localhost", user="postgres", password="spatial")
start<-proc.time() # start dplyr DB timer
tbl<-tbl(con, sql(q)) #set up table
use.dplyrDB1.grp<-group_by(tbl, statefp10, countyfp10) # set up the grouping
use.dplyrDB1.sum<-summarise(use.dplyrDB1.grp, pop10.tot=sum(pop10),
pop10.m=mean(pop10), pop10.max=max(pop10), pop10.cnt=n()) # summarize by group
use.dplyrDB1.arr<-arrange(use.dplyrDB1.sum, statefp10, countyfp10) # do ordering
use.dplyrDB1<-collect(use.dplyrDB1.arr) # grab the final result
dplyrDB1.timer<-proc.time()-start # ELAPSED TIME USING dplyr on database (seconds)
dim(use.dplyrDB1)
## [1] 3143 6
head(use.dplyrDB1)
## Source: local data frame [6 x 6]
## Groups: statefp10
## statefp10 countyfp10 pop10.tot pop10.m pop10.max pop10.cnt
## 1 01 001 54571 28.91945 1322 1887
## 2 01 003 182265 21.12483 1117 8628
## 3 01 005 27457 15.08626 1626 1820
## 4 01 007 22915 12.89533 2093 1777
## 5 01 009 57322 20.84436 635 2750
## 6 01 011 10914 13.30976 1523 820
dplyrDB1.timer[3]
## elapsed
## 208.42
In total, the database calculations took 208.42 seconds. Frankly, I was a little surprised that the calculations using the external SQL database were so much slower than the in-memory calculations (in fact, the in-memory calculations were 229 times faster), but Hadley reminded me that in-memory calculations will be faster than even a well-indexed database on disk.
3. Chain functions to reduce clutter and coding
An additional aspect of dplyr that I love is the ability to chain functions together. Although you can get a similar effect in base R by nesting function calls, dplyr takes chaining to a new level with its %.% operator (try, for example, sum(c(3,3)) %.% sum(c(12,11))). Below, the code from Section 2 is re-written using %.%, which lets us chain the various dplyr commands together. The results are the same, but the amount of coding is reduced and we get rid of the intermediate objects.
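As a tiny warm-up before the database version, here is %.% on a small made-up data frame (df exists only for this illustration):
# a minimal chained pipeline: each step’s result is passed to the next verb
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 10))
df %.%
  group_by(g) %.%
  summarise(x.tot = sum(x), x.cnt = n())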
Clean up the code using chaining
library(dplyr)
start<-proc.time() # start dplyr DB timer (chain)
use.dplyrDB2<-tbl(con, sql(q))%.%
group_by(statefp10, countyfp10)%.%
summarise(pop10.tot=sum(pop10), pop10.m=mean(pop10), pop10.max=max(pop10), pop10.cnt=n())%.%
arrange(statefp10, countyfp10)%.%
collect()
dplyrDB2.timer<-proc.time()-start # ELAPSED TIME USING dplyr on database (chained) (seconds)
dim(use.dplyrDB2)
## [1] 3143 6
head(use.dplyrDB2)
## Source: local data frame [6 x 6]
## Groups: statefp10
## statefp10 countyfp10 pop10.tot pop10.m pop10.max pop10.cnt
## 1 01 001 54571 28.91945 1322 1887
## 2 01 003 182265 21.12483 1117 8628
## 3 01 005 27457 15.08626 1626 1820
## 4 01 007 22915 12.89533 2093 1777
## 5 01 009 57322 20.84436 635 2750
## 6 01 011 10914 13.30976 1523 820
dplyrDB2.timer[3]
## elapsed
## 209.45
The chaining took 209.45 seconds, essentially the same as the result without chaining (208.42 seconds) but with a lot less coding. (Note that when I first published this tutorial I reported that the chaining approach was slower, and Hadley pointed out that this should not be the case. I re-ran the calculations in the same way as I had done previously and got the same results. I then ran the non-chained and chained code in separate R instances at separate times and found, as Hadley suggested, the nearly identical results reported here. Given that the original calculations were done sequentially, perhaps the original difference in speed was due to reduced memory available to the subsequent calculations (suggestions are welcome).)
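As a quick sanity check (an extra step, not part of the original timings), we can confirm that the chained and unchained pipelines return the same table; all.equal() should report TRUE.
# compare the two database results after stripping the tbl classes
all.equal(as.data.frame(use.dplyrDB1), as.data.frame(use.dplyrDB2))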
Let’s look at the result using another package written by Hadley Wickham, ggplot2.
library(ggplot2)
ggplot(use.dplyrDB2, aes(x=pop10.tot))+geom_histogram(color='white', fill='cadetblue')+
xlab("County Population (2010) Computed by Summing Blocks")
4. Syntax simplicity and ease of use
Finally, I’ve found that, especially when compared to the package data.table, the dplyr syntax is much easier to follow and memorize. I still find myself referring to cheat sheets for data.table, whereas the transition to dplyr has been smoother. In addition to the relative simplicity, there are a few nice flourishes in the code that have simplified my coding. Here are a couple of small examples.
If you want to create, for example, two new columns, the function mutate allows the second column to refer to the first, not-yet-created column. Hard to describe, easier to show:
Easy creation of new columns
library(datasets)
data(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
newcars<-mutate(cars,
col1 = speed*dist,
col2 = col1*10,
col3 = col1+col2)
head(newcars)
## speed dist col1 col2 col3
## 1 4 2 8 80 88
## 2 4 10 40 400 440
## 3 7 4 28 280 308
## 4 7 22 154 1540 1694
## 5 8 16 128 1280 1408
## 6 9 10 90 900 990
I also find that selecting columns by name is significantly easier with dplyr. Here are examples using the same table we just created:
Easy column selection
# select two columns
head(select(newcars, speed, col2))
## speed col2
## 1 4 80
## 2 4 400
## 3 7 280
## 4 7 1540
## 5 8 1280
## 6 9 900
# all but 1 column
head(select(newcars, -col2))
## speed dist col1 col3
## 1 4 2 8 88
## 2 4 10 40 440
## 3 7 4 28 308
## 4 7 22 154 1694
## 5 8 16 128 1408
## 6 9 10 90 990
# sequence of columns
head(select(newcars, speed, col1:col3))
## speed col1 col2 col3
## 1 4 8 80 88
## 2 4 40 400 440
## 3 7 28 280 308
## 4 7 154 1540 1694
## 5 8 128 1280 1408
## 6 9 90 900 990
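select() also comes with helper functions such as starts_with() and contains() for picking columns by name pattern (assuming your version of dplyr includes the select helpers):
# pick columns whose names start with "col"
head(select(newcars, starts_with("col")))
# pick columns whose names contain the letter "d" (speed and dist)
head(select(newcars, contains("d")))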
Conclusions
The dplyr package is an excellent addition to our workflow. It speeds up, cleans up and simplifies data analysis and manipulation. If you would like to learn more about dplyr, be sure to check out the well-done introduction to dplyr on the RStudio blog at this site, and definitely don’t forget to follow the links at the bottom to useful vignettes, including one on working with databases.