For large tables in R, dplyr's inner_join() is much faster than merge()

Using the merge() function in R on big tables can be time-consuming. Luckily, the join functions in the new dplyr package are much faster. The package offers four different joins:

  • inner_join (similar to merge with all.x=FALSE and all.y=FALSE)
  • left_join (similar to merge with all.x=TRUE and all.y=FALSE)
  • semi_join (no real merge() equivalent unless y includes only the join fields)
  • anti_join (no merge() equivalent; this is all of x without a match in y)
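A minimal sketch with two made-up data frames shows what each of the four joins returns (the data frames x and y here are illustrations, not the tables used later in the post):

```r
library(dplyr)

x <- data.frame(id = c(1, 2, 3), v = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), w = c("q", "r", "s"))

inner_join(x, y, by = "id")  # rows with ids 2 and 3, columns from both tables
left_join(x, y, by = "id")   # all rows of x; w is NA where there is no match
semi_join(x, y, by = "id")   # rows of x with a match in y, columns of x only
anti_join(x, y, by = "id")   # rows of x without a match in y (here, id 1)
```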

I can't find a great discussion of the advantages of the dplyr join functions, but I did see a help response from Hadley Wickham, dplyr's creator, here that briefly lists the advantages:

  • rows are kept in their existing order
  • much faster
  • tells you what keys you're merging by (if you don't supply them)
  • also works with database tables
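The first and third points are easy to see on a toy example. Below, x's rows are deliberately out of key order; inner_join() keeps them in that order, while merge() returns them sorted by the key. Omitting `by` also makes dplyr print a message naming the join column (the exact wording of that message varies by dplyr version):

```r
library(dplyr)

x <- data.frame(id = c(3, 1, 2), v = c("c", "a", "b"))
y <- data.frame(id = c(1, 2, 3), w = c("p", "q", "r"))

inner_join(x, y)  # message reports joining by "id"; rows stay in x's order: 3, 1, 2
merge(x, y)       # rows come back sorted by the key: 1, 2, 3
```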

Of course, the advantage that matters most is the speed. In the small example below, joining a table with more than 6 million records, inner_join() is almost 43 times faster!

dim(dat)
# [1] 6131037       6

system.time(
  x <- merge(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
)
# elapsed: 65.56 seconds

system.time(
  y <- inner_join(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
)
# elapsed: 1.53 seconds

# 65.56 / 1.53 = 42.8 -- Wow! inner_join() is almost 43x faster
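The timings above came from the author's own tables, which aren't included in the post. For readers who want to reproduce the comparison, here is a hedged sketch using synthetic data frames with a single made-up key column; absolute timings will differ by machine and data, but the gap between merge() and inner_join() should still be visible:

```r
library(dplyr)

set.seed(1)
n <- 1e6
dat  <- data.frame(key = sample(1e5, n, replace = TRUE), x = rnorm(n))
dat2 <- data.frame(key = sample(1e5, 1e5), y = rnorm(1e5))  # unique keys

system.time(m <- merge(dat, dat2, by = "key"))       # base R join
system.time(j <- inner_join(dat, dat2, by = "key"))  # dplyr join

# both produce the same rows (merge sorts them, inner_join keeps dat's order)
nrow(m) == nrow(j)
```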

Posted in R