For large tables in R, dplyr's inner_join() is much faster than merge()

Using the merge() function in R on big tables can be time-consuming. Luckily, the join functions in the new dplyr package are much faster. The package offers four different joins:

  • inner_join (similar to merge with all.x=FALSE and all.y=FALSE)
  • left_join (similar to merge with all.x=TRUE and all.y=FALSE)
  • semi_join (no real merge() equivalent unless y includes only the join fields)
  • anti_join (no merge() equivalent; this is all of x without a match in y)
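A minimal sketch with two made-up data frames shows what each of the four joins returns (the data frames x and y here are illustrations, not the tables used later in the post):

```r
library(dplyr)

x <- data.frame(id = c(1, 2, 3), v = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), w = c("q", "r", "s"))

inner_join(x, y, by = "id")  # rows with ids 2 and 3, columns from both tables
left_join(x, y, by = "id")   # all rows of x; w is NA where there is no match
semi_join(x, y, by = "id")   # rows of x with a match in y, columns of x only
anti_join(x, y, by = "id")   # rows of x without a match in y (here, id 1)
```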

I can't find a great discussion of the advantages of the dplyr join functions, but I did see a help response from Hadley Wickham, dplyr's creator, here that briefly lists the advantages:

  • rows are kept in their existing order
  • much faster
  • tells you what keys you're merging by (if you don't supply them)
  • also works with database tables
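The first and third points are easy to see on a toy example. Below, x's rows are deliberately out of key order; inner_join() keeps them in that order, while merge() returns them sorted by the key. Omitting `by` also makes dplyr print a message naming the join column (the exact wording of that message varies by dplyr version):

```r
library(dplyr)

x <- data.frame(id = c(3, 1, 2), v = c("c", "a", "b"))
y <- data.frame(id = c(1, 2, 3), w = c("p", "q", "r"))

inner_join(x, y)  # message reports joining by "id"; rows stay in x's order: 3, 1, 2
merge(x, y)       # rows come back sorted by the key: 1, 2, 3
```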

Of course, the advantage that matters most is the speed. In the small example below, joining a table with more than 6 million records, inner_join() is almost 43 times faster!

dim(dat)
# [1] 6131037       6

system.time(
  x <- merge(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
)
# elapsed: 65.56 seconds

system.time(
  y <- inner_join(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
)
# elapsed: 1.53 seconds

# 65.56 / 1.53 = 42.8 -- Wow! inner_join() is almost 43x faster
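The timings above came from the author's own tables, which aren't included in the post. For readers who want to reproduce the comparison, here is a hedged sketch using synthetic data frames with a single made-up key column; absolute timings will differ by machine and data, but the gap between merge() and inner_join() should still be visible:

```r
library(dplyr)

set.seed(1)
n <- 1e6
dat  <- data.frame(key = sample(1e5, n, replace = TRUE), x = rnorm(n))
dat2 <- data.frame(key = sample(1e5, 1e5), y = rnorm(1e5))  # unique keys

system.time(m <- merge(dat, dat2, by = "key"))       # base R join
system.time(j <- inner_join(dat, dat2, by = "key"))  # dplyr join

# both produce the same rows (merge sorts them, inner_join keeps dat's order)
nrow(m) == nrow(j)
```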

Posted in R