Skip to content

Instantly share code, notes, and snippets.

@grantmcdermott
Created February 15, 2022 02:18
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save grantmcdermott/f9af3b7ce3e4aaa6ec6e02443af41a71 to your computer and use it in GitHub Desktop.
Save grantmcdermott/f9af3b7ce3e4aaa6ec6e02443af41a71 to your computer and use it in GitHub Desktop.
Benchmarking collapse_mask
## Context: https://twitter.com/grant_mcdermott/status/1493400952878952448
options(collapse_mask = "all") # NB: see `help('collapse-options')`
library(dplyr)
library(data.table)
library(collapse) # Needs to come after library(dplyr) for collapse_mask to work
flights = fread('https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv')
vars = c('dep_delay', 'arr_delay', 'air_time', 'distance', 'hour')
## Note we explicitly call dplyr::<function> for the 1st line in this benchmark,
## since we've masked the regular dplyr operations with their collapse
## equivalents (i.e. 2nd line).
library(microbenchmark)
microbenchmark(
dplyr = flights |> dplyr::group_by(month, day, origin, dest) |> dplyr::summarise(across(vars, sum)),
collapse = flights |> group_by(month, day, origin, dest) |> summarise(across(vars, sum)),
data.table = flights[, lapply(.SD, sum), by=.(month, day, origin, dest), .SDcols=vars],
times = 2
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> dplyr 1041.243193 1041.243193 1061.813553 1061.813553 1082.383912 1082.383912 2 b
#> collapse 10.350356 10.350356 10.428991 10.428991 10.507626 10.507626 2 a
#> data.table 9.615242 9.615242 9.778382 9.778382 9.941521 9.941521 2 a
@grantmcdermott
Copy link
Author

Aside for anyone reading this after the fact: You don't need to load library(dplyr) above. All of the above code will work with that line omitted or commented out, since group_by etc. have been aliased and masked. But you might not want to do this for longer scripts, since collapse doesn't cover every dplyr operation (e.g. joins), or you might have loaded the whole tidyverse instead of just dplyr.

@PMassicotte
Copy link

I do not know what kind of magic it is, but it is fast! Thank you for the package.

@grantmcdermott
Copy link
Author

We have @SebKrantz to thank for the package, but stoked to hear it's working on your system now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment