Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.
The fastest function to run through the data.frame benchmark is data.table, which runs twice as fast as dplyr, which in turn runs ten times faster than base R.
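The original data are not shared here, so what follows is a minimal sketch of the setup only, with simulated data standing in for the real 1.3 million rows and with hypothetical column names group and value:

```r
# Sketch only: simulated data in place of the real 1.3M rows / 80,000
# groups; column names 'group' and 'value' are hypothetical.
library(microbenchmark)
library(data.table)
library(dplyr)

set.seed(1)
d  <- data.frame(group = sample(80000, 1.3e6, replace = TRUE),
                 value = rnorm(1.3e6))
dt <- as.data.table(d)

microbenchmark(
  base       = aggregate(value ~ group, data = d, FUN = mean),
  dplyr      = d %>% group_by(group) %>% summarise(avg = mean(value)),
  data.table = dt[, .(avg = mean(value)), by = group],
  times = 10
)
```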
For a benchmark that includes plyr, see this earlier Gist, which covers a computationally more intensive test on half a million rows where dplyr still runs 1.5 times faster than aggregate in base R.
Both tests confirm what W. Andrew Barr blogged about dplyr: its two most important improvements are
- a MASSIVE increase in speed, making dplyr useful on big data sets
- the ability to chain operations together in a natural order
Tony Fischetti has clear examples of the latter, and Erick Gregory shows that easy access to SQL databases should also be added to the list.
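As a minimal illustration of chaining, reusing the simulated d from the sketch above (the pipeline is mine, not taken from either benchmark):

```r
# Chained operations read in the order they are applied:
# filter, then group, then aggregate, then sort.
d %>%
  filter(value > 0) %>%
  group_by(group) %>%
  summarise(avg = mean(value)) %>%
  arrange(desc(avg))
```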
That's because the benchmark is against the sole data.frame class, as was the case in the previous benchmark. I tested data.table per request to see if DT was faster than DF, which it is, but the test goes from data.frame to data.frame, so in order to drop the dual S3 class from the DT object, the genuine test for data.table would be along the lines of the sketch below.
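A minimal reconstruction, assuming the same mean-by-group aggregation and the simulated dt from above:

```r
# The timed expression would have to coerce the data.table result back
# to a plain data.frame, dropping the dual S3 class:
microbenchmark(
  data.table = as.data.frame(dt[, .(avg = mean(value)), by = group]),
  times = 10
)
```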
The code, however, makes little sense from a user perspective. The same would happen if I were to code the first benchmark with data.table, which I did not do when I was actually working on the project from which the test spawned, because I had no idea how to do it.

Addition: since you mentioned on Twitter that dplyr also adds dual classes, I have wrapped the dplyr benchmark in the same as.data.frame coercer, and data.table is now faster in comparison. Adding data.table to the benchmark, though, forces it to reflect code that makes little sense in a user workflow.
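For completeness, a sketch of that coerced comparison, again with the simulated d and dt rather than the original data:

```r
# Both results coerced to a plain data.frame inside the timed call, so
# neither dplyr's tbl nor data.table's dual class survives the test.
microbenchmark(
  dplyr      = as.data.frame(d %>% group_by(group) %>% summarise(avg = mean(value))),
  data.table = as.data.frame(dt[, .(avg = mean(value)), by = group]),
  times = 10
)
```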