SO_25066925

Here's another comparison between dplyr and data.table on relatively large data, with varying numbers of unique groups.

require(dplyr)
require(data.table)
N = 20e6L # 20 million rows, UPDATE: also ran for 50 million rows (see table below)
K = 10L # other values tested are 25L, 50L, 100L
DT <- data.table(Insertion = sample(K, N, TRUE), 
                 Unit      = sample(paste("V", 1:K, sep=""), N, TRUE),
                 Channel   = sample(K, N, TRUE), 
                 Value     = runif(N))
DF <- as.data.frame(DT)

cols = c("MeanValue", "Residuals")
# data.table: add the per-group mean and residuals by reference with :=
system.time(ans1 <- DT[, (cols) := { m = mean(Value); list(m, Value-m)}, by=list(Insertion, Unit, Channel)])
# dplyr: the same computation via group_by() + mutate()
system.time(ans2 <- DF %>% group_by(Insertion, Unit, Channel) %>% mutate(MeanValue = mean(Value), Residuals = Value-MeanValue))

# all.equal(ans1, ans2, check.attributes=FALSE) # [1] TRUE
# timings are in seconds:
#   N     K    ~groups  data.table    dplyr
# 20m    10      1,000        4.31     6.60
# 20m    25     15,625        5.14     8.76
# 20m    50    125,000        6.82    20.11
# 20m   100  1,000,000       12.56    42.45
# 50m    10      1,000       12.01    17.54
# 50m    25     15,625       17.61    29.12
# 50m    50    125,000       19.41    56.00
# 50m   100  1,000,000       26.37    84.05
# ...see TODO...

TODO: Investigate timings on the same set of unique groups, but with increasing data sizes as well: 100m, 500m, 1b rows (m = million, b = billion).
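
A minimal sketch of how that sweep could be scripted, assuming the same schema and computation as above. The run_once() helper and the size vector are illustrative, not part of the original benchmark:

run_once <- function(N, K = 10L) {
  # build the same table as above at the requested size
  DT <- data.table(Insertion = sample(K, N, TRUE),
                   Unit      = sample(paste("V", 1:K, sep=""), N, TRUE),
                   Channel   = sample(K, N, TRUE),
                   Value     = runif(N))
  DF <- as.data.frame(DT)
  t_dt <- system.time(
    DT[, c("MeanValue", "Residuals") := { m = mean(Value); list(m, Value-m) },
       by = list(Insertion, Unit, Channel)]
  )["elapsed"]
  t_dp <- system.time(
    DF %>% group_by(Insertion, Unit, Channel) %>%
      mutate(MeanValue = mean(Value), Residuals = Value - MeanValue)
  )["elapsed"]
  data.table(N = N, K = K, data.table = t_dt, dplyr = t_dp)
}

# e.g. rbindlist(lapply(c(100e6L, 500e6L, 1000e6L), run_once))
# memory permitting: at 1b rows the two copies alone need tens of GB of RAM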

PS: Note that the mean here for data.table is not yet optimised to run on GForce. When that's implemented, this benchmark should be updated.
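For reference, one way to keep the grouped mean() in a plain aggregation (the simple-j form that GForce targets) is to compute it separately and join it back. This is a sketch only, using the on= join syntax of later data.table versions, and not necessarily faster than the single := above:

# grouped mean as a simple j expression (GForce-eligible form)
agg <- DT[, .(MeanValue = mean(Value)), by = .(Insertion, Unit, Channel)]
# update join to attach the mean, then compute residuals by reference
DT[agg, MeanValue := i.MeanValue, on = .(Insertion, Unit, Channel)]
DT[, Residuals := Value - MeanValue]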
