@briatte
Last active August 29, 2015 13:57
aggregation functions, test #2: base, dplyr, data.table

Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.

The fastest package in the data.frame benchmark is data.table, which runs about twice as fast as dplyr, which in turn runs about ten times faster than base R.

For a benchmark that includes plyr, see this earlier Gist, which runs a computationally more intensive test on half a million rows; there, dplyr still runs about 1.5 times faster than aggregate in base R.

Both tests confirm what W. Andrew Barr blogged about dplyr:

the 2 most important improvements in dplyr are

  1. a MASSIVE increase in speed, making dplyr useful on big data sets
  2. the ability to chain operations together in a natural order

Tony Fischetti has clear examples of the latter, and Erick Gregory shows that easy access to SQL databases should also be added to the list.
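The chained style mentioned in the second point can be sketched on toy data. Note: `Functie` and `URL` are the column names from the benchmark below, but the data here is made up for illustration; dplyr 0.1.x provided the `%.%` operator for chaining, which later versions replaced with `%>%` (used here).

```r
library(dplyr)

# toy stand-in for the 1.3M-row data set used in the benchmark
df <- data.frame(Functie = c("mayor", "mayor", "deputy"),
                 URL = c("u1", "u2", "u3"),
                 stringsAsFactors = FALSE)

# nested form, as written in the benchmark:
summarise(group_by(df, Functie), n = length(URL))

# chained form, reading left to right in the order of operations:
df %>%
  group_by(Functie) %>%
  summarise(n = length(URL))
```

Both forms return the same per-group counts; the chained form simply avoids reading the call inside-out.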

> # data
> system.time(load("integritate.rda"))
user system elapsed
14.716 0.273 15.173
> # base
> system.time(aggregate(URL ~ Functie, length, data = data))
user system elapsed
26.118 0.284 26.510
> # dplyr
> system.time(as.data.frame(summarise(group_by(data, Functie), n = length(URL))))
user system elapsed
0.242 0.011 0.254
> system.time(summarise(group_by(data, Functie), n = length(URL)))
user system elapsed
0.249 0.006 0.257
> system.time(tbl <- group_by(data, Functie))
user system elapsed
0.183 0.005 0.187
> system.time(summarise(tbl, n = length(URL)))
user system elapsed
0.050 0.001 0.050
> # data.table
> library(data.table)
> system.time(as.data.frame(as.data.table(data)[, .N, by = Functie]))
user system elapsed
1.173 0.038 1.233
> system.time(as.data.table(data)[, .N, by = Functie])
user system elapsed
0.080 0.048 0.128
> system.time(data.table(data)[, .N, by = Functie])
user system elapsed
3.300 0.171 3.508
> system.time(data <- as.data.table(data))
user system elapsed
0.037 0.032 0.069
> system.time(data <- data.table(data))
user system elapsed
0.258 0.094 0.353
> system.time(data[, .N, by = Functie])
user system elapsed
0.031 0.002 0.034
> # versions
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 dplyr_0.1.2
loaded via a namespace (and not attached):
[1] assertthat_0.1 plyr_1.8.1 Rcpp_0.11.0 reshape2_1.2.2 stringr_0.6.2
[6] tools_3.0.3
setwd("/Users/fr/Documents/Code/R/integritate")
# data
system.time(load("integritate.rda"))
str(data[, c("Functie", "URL")])
# base
length(unique(data$Functie))
system.time(aggregate(URL ~ Functie, length, data = data))
# plyr (far too long)
# library(plyr)
# system.time(ddply(data, .(Functie), summarise, n = length(URL)))
# dplyr
library(dplyr)
system.time(as.data.frame(summarise(group_by(data, Functie), n = length(URL))))
system.time(summarise(group_by(data, Functie), n = length(URL)))
system.time(tbl <- group_by(data, Functie))
system.time(summarise(tbl, n = length(URL)))
# data.table
library(data.table)
system.time(as.data.frame(as.data.table(data)[, .N, by = Functie]))
system.time(as.data.table(data)[, .N, by = Functie])
system.time(data.table(data)[, .N, by = Functie])
system.time(data <- as.data.table(data))
system.time(data <- data.table(data))
system.time(data[, .N, by = Functie])
# versions
sessionInfo()
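The script above depends on a local integritate.rda file, so it cannot be run as-is. A self-contained sketch of the same three aggregations on toy data (illustrative column values, not the real data set) would be:

```r
library(dplyr)
library(data.table)

# toy stand-in for the real 1.3M-row, 80,000-group data set
df <- data.frame(Functie = rep(c("mayor", "deputy", "senator"), times = c(3, 2, 1)),
                 URL = paste0("url", 1:6),
                 stringsAsFactors = FALSE)

# base R
aggregate(URL ~ Functie, length, data = df)

# dplyr
summarise(group_by(df, Functie), n = length(URL))

# data.table: convert once, by reference, then aggregate
setDT(df)
df[, .N, by = Functie]
```

setDT() converts in place, so df stays a data.table afterwards, which matches the "convert once, then stick to data.table" workflow discussed in the comments.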
@arunsrinivasan

Your function is as.data.table(data)[, .N, by=Functie]. This includes the creation of the data.table and the aggregation. And your benchmark results indicate 0.112s, which is 2.3x faster than dplyr. I don't understand how you say dplyr is fastest, and that data.table is fastest only if you first convert to data.table. It seems quite straightforward to me that dplyr is slower here. What am I missing?

@briatte

briatte commented Mar 8, 2014

That's because the benchmark is run against the data.frame class only, as was the case in the previous benchmark. I tested data.table on request, to see whether DT was faster than DF, which it is; but the test goes from data.frame to data.frame, so in order to drop the dual S3 class from the DT object, the genuine test for data.table would be:

system.time(as.data.frame(as.data.table(data)[, .N, by = Functie]))
   user  system elapsed 
  1.173   0.038   1.233 

The code, however, makes little sense from a user perspective. The same would happen if I were to write the first benchmark with data.table, which I did not do when I was actually working on the project this test grew out of, because I had no idea how to do it.

Addition: since you mentioned on Twitter that dplyr also adds dual classes, I have wrapped the dplyr benchmark in the same as.data.frame coercer, and data.table now compares as faster. Adding data.table to the benchmark, though, forces it to reflect code that makes little sense in a user workflow.

@arunsrinivasan

I see what you're trying to say now. A couple of points here:

  1. The point of requesting the data.table comparison is that, if it's faster, people can switch to data.table; especially since fread reads files directly into a data.table very fast, and the setDT function converts a data.frame in almost zero time (by reference). Therefore, it doesn't make sense to convert back to data.frame: the idea is to stick with the data.table object.

  2. A data.table is a data.frame as well: it just inherits from data.frame, and is.data.frame(DT) would return TRUE. Having said that, there are some differences; in particular, to subset the data.frame way, we have to use with=FALSE. That is, DT[, c("x", "y"), with=FALSE] is the equivalent of DF[, c("x", "y")]. But that's a small difference.

  3. dplyr inherits from a data.frame as well and adds tbl_df class to data.frame objects; tbl_dt to data.table objects etc. It's pretty normal and there's no need to convert them back to a data.frame.
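Points 2 and 3 above can be checked directly in a session; a minimal sketch (toy data, not from the benchmark):

```r
library(data.table)

DT <- data.table(x = 1:3, y = letters[1:3])

is.data.frame(DT)   # TRUE: data.table inherits from data.frame
class(DT)           # "data.table" "data.frame"

# subsetting the data.frame way requires with = FALSE
DT[, c("x", "y"), with = FALSE]
```

So no coercion back to data.frame is needed: any code that only relies on data.frame behaviour will accept the DT object as-is.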

@arunsrinivasan

"Adding data.table to the benchmark, though is forcing it to reflect code that makes little sense in a user workflow."

Yes, it wouldn't make sense if you don't want to stick with the data.table object, of course. That was the misunderstanding (on my side): then you'd have to convert back and forth. My reason for requesting the benchmark is, since people are interested in speed, to show that there are much faster options than dplyr.

I've also benchmarked against dplyr (for quite some time now) and will be putting it up on the webpage tonight or tomorrow. It's on half a billion rows. When you see the speed-up there against dplyr, it's quite enticing ;).

Anyhow, thanks a lot for taking the time (esp. out of your weekend) and accepting my request in doing this benchmark.

@briatte

briatte commented Mar 8, 2014

  1. fread will certainly justify a new benchmark when it is ready for use.
  2. Yes, I read the AllS4.r file of the code and saw that, and noticed the same for dplyr after reading your tweet.

Also, I have not explored doBy or ff. Could using ff speed up loading the data?

@briatte

briatte commented Mar 8, 2014

I expected you would have already run lower-level benchmarks, but I wanted to benchmark a user workflow rather than subscripting alone. It was my mistake not to make that clearer in the text, which was pasted from the tweets and poorly kept in sync with updates to the timings.

@arunsrinivasan

@briatte, yes, that would be equally interesting as well. However, in that case there should be one conversion to data.table at first, and then you stick to data.table, meaning no conversion via as.data.frame(.) after that at all.
