Working with the R package data.table

For a recent project we have been working with a relatively large database of all historical air pollution data for New York City, going all the way back to the 1950s. Including both daily and hourly measurements, the database contains about 10 million records and 20 variables. Working with a database of this size using traditional R functions can be cumbersome and time-consuming, so we took advantage of a nice package from Matthew Dowle called data.table.

The functions in data.table can speed up computations substantially; in some cases we’ve found improvements of 100x. Below we show the time savings from using this package to import data, delete fields and aggregate records in a big air pollution database.

Compare Import Speed

Each year of data is a separate comma-separated text file, so we need to cycle through the files, import each one and bind them all together. Instead of the more traditional combination of read.csv and rbind, we can use data.table’s fread and rbindlist to improve import and bind speeds. Below we show the code for each approach.

# traditional approach using read.csv and rbind
ptm <- proc.time()  # set the timer
yrs <- 1957:2012  # each year is a separate CSV file
myDF <- NULL
for (i in yrs) {
    tmppath <- paste(rawpath, i, ".csv", sep = "")  # rawpath is a directory on my computer
    tmpdat <- read.csv(tmppath, as.is = TRUE)
    myDF <- rbind(myDF, tmpdat)
}
(proc.time() - ptm)[3]  # this is elapsed time
## elapsed 
##   472.7
# data.table approach using fread and rbindlist
library(data.table)
ptm <- proc.time()
yrs <- 1957:2012
myDT <- NULL
for (i in yrs) {
    tmppath <- paste(rawpath, i, ".csv", sep = "")  # rawpath is a directory on my computer
    tmpdat <- fread(tmppath)  # fread is a function in data.table
    myDT <- rbindlist(list(myDT, tmpdat))
}
(proc.time() - ptm)[3]
## elapsed 
##   143.6

# how many records are we working with
dim(myDT)
## [1] 6795788      25

The combination of fread and rbindlist from data.table is about 3.3 times faster than the more traditional combination of read.csv and rbind. We now have both a traditional data.frame (myDF) and a data.table (myDT) that we can use to compare speeds on a couple of additional computations.
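
As an aside, if you already have a data.frame in memory you don’t need to re-import it to get a data.table. A minimal sketch (the myDT2 name is just illustrative):

# convert an existing data.frame to a data.table; as.data.table() makes
# a copy, and newer versions of data.table also offer setDT() to convert
# in place without copying
myDT2 <- as.data.table(myDF)
class(myDT2)  # "data.table" "data.frame"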

Compare Speed Deleting Fields

You can see from above that there are 25 fields in the data. Many of these are not useful for the kind of analysis we’re doing, so we can compare the speed of deleting fields. As an example, we will delete the latitude and longitude fields using both the traditional and the data.table approach.

fieldsToDelete <- c("latitude", "longitude")
# delete fields the traditional way
ptm <- proc.time()
myDF <- myDF[, !names(myDF) %in% fieldsToDelete]
(proc.time() - ptm)[3]
## elapsed 
##    2.12

# delete fields using data.table
ptm <- proc.time()
set(myDT, j = which(colnames(myDT) %in% fieldsToDelete), value = NULL)
(proc.time() - ptm)[3]
## elapsed 
##       0

Even on a relatively straightforward operation like removing columns, data.table offers substantial time savings: the elapsed time drops from about two seconds to essentially zero. The difference is that set() removes the columns by reference, while the traditional approach copies the remaining columns into a new data.frame.
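
data.table also offers the := operator for removing columns by reference, which should be equivalent in speed to set(). A minimal sketch of the alternative syntax:

# delete the same columns by reference with :=
# the parentheses around fieldsToDelete tell data.table to use the
# variable's contents as column names rather than the symbol itself
myDT[, (fieldsToDelete) := NULL]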

Compare Speed in Aggregation

Let’s look at a slightly more complex computation. In the example below we aggregate all hourly data into daily averages, first with the commonly used aggregate function and then with data.table syntax. Note that in order to simplify the code we have removed all columns except those used in this computation. We have also simplified the computations by, among other things, removing tests for completeness (you wouldn’t want a daily average based on a single hour of data); a sketch of such a check appears after the timing comparison below.

# compute average by monitor/parameter within site the traditional way
ptm <- proc.time()
dailyDF <- aggregate(myDF$val, list(myDF$date, myDF$state, myDF$county, myDF$site, 
    myDF$param, myDF$monitor), mean, na.rm = TRUE)
(proc.time() - ptm)[3]
## elapsed 
##   33.68
# compute average by monitor/parameter within site using data.table
ptm <- proc.time()
dailyDT <- myDT[, list(dayavg = mean(val, na.rm = TRUE)), by = "date,state,county,site,param,monitor"]
(proc.time() - ptm)[3]
## elapsed 
##    2.14

dailyDT <- dailyDT[order(date, state, county, site, param, monitor)]
head(dailyDT)
##          date state county site param monitor dayavg
## 1: 1957-01-07    36    061 0001 11101       1    163
## 2: 1957-01-20    36    061 0001 11101       1    325
## 3: 1957-02-02    36    061 0001 11101       1    230
## 4: 1957-02-14    36    061 0001 11101       1    187
## 5: 1957-03-11    36    061 0001 11101       1    323
## 6: 1957-03-15    36    061 0001 11101       1    183

You can see a fairly dramatic improvement when comparing the traditional approach with the data.table approach: data.table is 15.74 times faster here.
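
As mentioned above, a real analysis would test for completeness before averaging. The special symbol .N (the number of rows in each group) makes such a check straightforward in data.table. A hedged sketch, where the threshold of 18 hourly values per day is purely illustrative:

# count observations per group alongside the mean, then keep only
# sufficiently complete days
dailyDT <- myDT[, list(dayavg = mean(val, na.rm = TRUE), nobs = .N), 
    by = "date,state,county,site,param,monitor"]
dailyDT <- dailyDT[nobs >= 18]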

Final Point, Adding a Key Can Improve Speeds Even More

One of the powerful features of data.table is the ability to add a key, similar to the keys used in SQL databases. With a key in place on a table, computations that involve the key columns can be sped up significantly. Here we perform the same computation as above, except that we add a key to the table first.

# here we add a key to improve speeds (the key is essentially a
# concatenation of date, state, county, site, param and monitor)
setkey(myDT, date, state, county, site, param, monitor)
ptm <- proc.time()
dailyDT <- myDT[, list(dayavg = mean(val, na.rm = TRUE)), by = key(myDT)]
(proc.time() - ptm)[3]
## elapsed 
##    0.95

dailyDT <- dailyDT[order(date, state, county, site, param, monitor)]
head(dailyDT)
##          date state county site param monitor dayavg
## 1: 1957-01-07    36    061 0001 11101       1    163
## 2: 1957-01-20    36    061 0001 11101       1    325
## 3: 1957-02-02    36    061 0001 11101       1    230
## 4: 1957-02-14    36    061 0001 11101       1    187
## 5: 1957-03-11    36    061 0001 11101       1    323
## 6: 1957-03-15    36    061 0001 11101       1    183

Adding the key speeds up the computation to 35.5 times faster than the traditional approach and 2.3 times faster than the unkeyed data.table approach.
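
The key pays off beyond grouping: it also enables fast binary-search subsetting, much like an indexed lookup in a SQL database. A minimal sketch, assuming date is stored as a character string (the specific values are illustrative):

# with the key set, J() performs a binary search on the leading key
# columns (here date and state) instead of scanning every row
myDT[J("1957-01-07", "36")]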
