Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.
The fastest tool for running through the data.frame in this benchmark is data.table, which is twice as fast as dplyr, which in turn is ten times as fast as base R.
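For reference, here's a minimal sketch of that kind of timing run; the simulated data below (with hypothetical column names `id` and `v`) stands in for the real 1.3-million-row, 80,000-group set:

```r
library(data.table)
library(dplyr)

# Simulated stand-in for the real data: 1.3 million rows, 80,000 groups
n  <- 1.3e6
d  <- data.frame(id = sample(8e4, n, replace = TRUE), v = rnorm(n))
dt <- as.data.table(d)

system.time(aggregate(v ~ id, data = d, FUN = mean))        # base R
system.time(d %>% group_by(id) %>% summarise(m = mean(v)))  # dplyr
system.time(dt[, mean(v), by = id])                         # data.table
```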
For a benchmark that includes plyr, see this earlier Gist for a computationally more intensive test on half a million rows, where dplyr still runs 1.5 times as fast as aggregate in base R.
Both tests confirm what W. Andrew Barr blogged about dplyr:
the two most important improvements in dplyr are
- a MASSIVE increase in speed, making dplyr useful on big data sets
- the ability to chain operations together in a natural order
Tony Fischetti has clear examples of the latter, and Erick Gregory shows that easy access to SQL databases should also be added to the list.
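As a quick illustration of that chaining style, here's a toy example on the built-in `mtcars` data (not the benchmark data):

```r
library(dplyr)

# Each verb feeds its result to the next, reading top to bottom
mtcars %>%
  filter(cyl > 4) %>%                  # keep 6- and 8-cylinder cars
  group_by(gear) %>%                   # one group per gear count
  summarise(mean_mpg = mean(mpg)) %>%  # average mpg per group
  arrange(desc(mean_mpg))              # sort by the new column
```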
I see what you're trying to say now. A couple of points here:

The point of requesting a `data.table` comparison is, if it's faster, for people to switch to `data.table`. This holds especially since `fread` reads files directly into a `data.table`, and the `setDT` function converts an existing `data.frame` in almost zero time (by reference). Therefore, it doesn't make sense to convert back to a `data.frame`; the idea is to stick with the `data.table` object.
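A minimal sketch of that workflow (the file name `flights.csv` is only a placeholder):

```r
library(data.table)

# fread() reads a file straight into a data.table
DT <- fread("flights.csv")

# setDT() converts an existing data.frame by reference -- no copy is made
DF <- data.frame(x = 1:5, y = letters[1:5])
setDT(DF)
class(DF)  # "data.table" "data.frame"
```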
A `data.table` is a `data.frame` as well; it just inherits from `data.frame`, and `is.data.frame(DT)` returns `TRUE`. Having said that, there are some differences: in particular, to subset columns the `data.frame` way, we have to use `with=FALSE`. That is, `DT[, c("x", "y"), with=FALSE]` is the equivalent of `DF[, c("x", "y")]`. But that's a small difference.
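For example, on a small toy table:

```r
library(data.table)

DF <- data.frame(x = 1:3, y = letters[1:3], z = runif(3))
DT <- as.data.table(DF)

DF[, c("x", "y")]                # data.frame-style column subset
DT[, c("x", "y"), with = FALSE]  # the data.table equivalent
```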
`dplyr` inherits from `data.frame` as well: it adds the `tbl_df` class to `data.frame` objects, `tbl_dt` to `data.table` objects, and so on. That's pretty normal, and there's no need to convert them back to a `data.frame`.
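A small illustration (in current dplyr/tibble versions, `as_tibble()` plays the role of the older `tbl_df()` wrapper):

```r
library(dplyr)

tbl <- as_tibble(mtcars)  # adds the tbl_df/tbl classes on top of data.frame
class(tbl)                # "tbl_df" "tbl" "data.frame"
is.data.frame(tbl)        # TRUE -- still a data.frame, no conversion needed
```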