For large tables in R, dplyr's inner_join() is much faster than merge()

Using the merge() function in R on big tables can be time-consuming. Luckily, the join functions in the new dplyr package are much faster. The package offers four different joins:

  • inner_join (similar to merge() with all.x = FALSE and all.y = FALSE)
  • left_join (similar to merge() with all.x = TRUE and all.y = FALSE)
  • semi_join (no real merge() equivalent unless y contains only the join fields)
  • anti_join (no merge() equivalent; this returns all rows of x without a match in y)
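To make the mapping concrete, here is a minimal sketch of the four joins on two toy data frames (the column names and values are made up purely for illustration):

library(dplyr)

x <- data.frame(id = c(1, 2, 3), val_x = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), val_y = c("d", "e", "f"))

inner_join(x, y, by = "id") # ids 2 and 3: rows with a match in both tables
left_join(x, y, by = "id")  # all rows of x; val_y is NA where id has no match
semi_join(x, y, by = "id")  # rows of x with a match in y, keeping only x's columns
anti_join(x, y, by = "id")  # rows of x with no match in y (id 1)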

I can’t find a great discussion of the advantages of the dplyr join functions, but I did see a help response here from Hadley Wickham, dplyr’s creator, that briefly lists the advantages:

  • rows are kept in their existing order
  • they're much faster
  • they tell you what keys you’re merging by (if you don’t supply them)
  • they also work with database tables
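The third point is easy to see with the toy tables above: if you leave out the by argument, dplyr joins on all shared column names and prints a message naming the keys it chose (the exact wording of the message varies by dplyr version):

inner_join(x, y)
# Joining by: "id"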

Of course, the advantage that matters most is speed. In this simple example, joining a table with more than 6 million records against a six-row lookup table, inner_join() is almost 43 times faster!

library(dplyr)

nrow(dat) 
# 6131037
nrow(dat2) 
# 6

# Time base R's merge()
ptm <- proc.time()
x <- merge(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
proc.time()[3] - ptm[3]
# elapsed 
#   65.56

# Time dplyr's inner_join()
ptm2 <- proc.time()
y <- inner_join(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
proc.time()[3] - ptm2[3]
# elapsed 
#    1.53 

# 65.56/1.53 = 42.8. Wow! Almost 43x faster using inner_join
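The dat and dat2 tables above aren't shown, so those timings won't reproduce exactly. As a rough, self-contained way to try the same comparison yourself, here is a sketch with synthetic data (the sizes, key columns, and values are all made up; system.time() is used as a more compact alternative to the proc.time() pattern above):

library(dplyr)

set.seed(1)
n <- 6e6  # roughly the size of dat above

# Synthetic stand-ins for dat and dat2; values are arbitrary
dat <- data.frame(
  Parameter_Name     = sample(letters[1:6], n, replace = TRUE),
  Pollutant_Standard = sample(letters[1:6], n, replace = TRUE),
  Sample_Duration    = sample(letters[1:6], n, replace = TRUE),
  value              = runif(n),
  stringsAsFactors   = FALSE
)
dat2 <- unique(dat[sample(n, 6), 1:3])  # a handful of key combinations

keys <- c("Parameter_Name", "Pollutant_Standard", "Sample_Duration")
system.time(x <- merge(dat, dat2, by = keys))       # slow
system.time(y <- inner_join(dat, dat2, by = keys))  # much faster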
