Some old benchmarking repeated on 1.9.2

Setkey on v1.9.2:

Here are the new benchmarks for setkey, updated for v1.9.2. First, let's generate some data.

N <- 2e7 # size of DT
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
ch <- unique(ch)
DT <- data.table(a = as.numeric(sample(c(NA, Inf, -Inf, NaN, rnorm(1e6)*1e6), N, TRUE)), 
                 b = as.numeric(sample(rnorm(1e6), N, TRUE)), 
                 c = sample(c(NA_integer_, 1e5:1e6), N, TRUE), 
                 d = sample(ch, N, TRUE))
print(object.size(DT), units="MB")
# 538.9 Mb

We'll copy DT to another object DT.copy so as to benchmark setkey(DT, .) on different columns (and combinations), and then use DT.copy to restore the unsorted DT.

DT.copy = copy(DT)

## on numeric column 'a'
> system.time(setkey(DT, a))
   user  system elapsed 
  4.861   0.334   5.316 

## reset by DT = copy(DT.copy)

## on integer column 'c'
> system.time(setkey(DT, c))
   user  system elapsed 
  3.432   0.325   3.889 
## reset again

## on numeric, numeric column 'a,b'
> system.time(setkey(DT, a,b))
   user  system elapsed 
  6.321   0.229   6.872 

## reset again

## on character column 'd'
> system.time(setkey(DT, d))
   user  system elapsed 
  3.992   0.182   4.253 
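As a quick sanity check between these runs (my addition, not part of the original benchmark), the effect of setkey can be inspected on a small table:

```r
library(data.table)  # assuming v1.9.2 or later

DT <- data.table(a = rnorm(10), d = sample(letters, 10, TRUE))
setkey(DT, d)

key(DT)            # "d": the column(s) DT is currently keyed by
haskey(DT)         # TRUE
is.unsorted(DT$d)  # FALSE: setkey physically reorders the rows by the key
```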

Cold (or ad-hoc) grouping:

DT = copy(DT.copy)
system.time(ans <- DT[, mean(b), by=c])
#    user  system elapsed 
#   2.943   0.234   3.237 
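For contrast, once DT is keyed on the grouping column the sort cost is paid up front and subsequent groupings can use the key. A sketch (the relative timings here are my assumption, not measured in this gist):

```r
library(data.table)

DT <- data.table(b = rnorm(1e6), c = sample(1:100, 1e6, TRUE))

# cold (ad-hoc) grouping: no key, data.table orders the groups on the fly
system.time(ans1 <- DT[, mean(b), by = c])

# keyed grouping: setkey sorts once, later by=c calls reuse that ordering
setkey(DT, c)
system.time(ans2 <- DT[, mean(b), by = c])
```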

Melt on v1.9.2:

We'll also load reshape2, but since DT is a data.table, melt dispatches to data.table's method.

> system.time(melt(DT, id="d", measure=1:2))
   user  system elapsed 
  1.117   0.534   1.677 

Note that the older version of reshape2 took about 190 seconds to accomplish this. Kevin Ushey recently implemented a C++ version of melt in reshape2, which is now available on CRAN. Here are the timings on that new version.

> system.time(reshape2:::melt(DT, id="d", measure=1:2))
   user  system elapsed 
  3.445   0.587   4.095 

It's much faster than reshape2:::melt's previous version, but still a bit slower than data.table's melt here (I'd guess because of the character column with so many unique strings, though I'm not sure why the difference is this large).

In addition, data.table's implementation of the na.rm argument is quite efficient (it avoids making another copy by checking and removing NAs on the C side). Here's a comparison using na.rm=TRUE on the same data set.

> system.time(melt(DT, id="d", measure=1:2, na.rm=TRUE))
   user  system elapsed 
  2.072   0.587   2.722 

## reshape2:::melt
> system.time(reshape2:::melt(DT, id="d", measure=1:2, na.rm=TRUE))
   user  system elapsed 
 27.316   4.000  39.465 

dcast on v1.9.2:

We'll add one more column e first:

smple <- sample(letters[1:10], 2e7, TRUE)
set(DT, i=NULL, j="e", value=smple)

Let's run now:

## data.table version
system.time(dcast.data.table(DT, d ~ e, value.var="b", fun=sum))
   user  system elapsed 
  6.926   0.517   7.599 

reshape2:::dcast has no new implementation yet (unlike melt, which got Kevin Ushey's C++ rewrite) and is therefore quite slow and doesn't scale that well. For this example, it takes 76.738 seconds.

I'm curious why data.table's melt is faster in the 'base' case, too. If you convert the character column to a factor, the speeds are nearly identical. Perhaps it's data.table's usage of #define USE_RINTERNALS providing direct access to string elements, whereas in reshape2 we're forced to go through function calls (which perhaps cannot be inlined?).
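The factor observation can be reproduced along these lines (a sketch under my own assumptions; the column sizes and names are illustrative, and the timings are left to the reader):

```r
library(data.table)
library(reshape2)

foo <- function() paste(sample(letters, 7, TRUE), collapse = "")
DT  <- data.table(a = rnorm(1e6), b = rnorm(1e6),
                  d = sample(replicate(1e4, foo()), 1e6, TRUE))
DF  <- as.data.frame(DT)

# character id column
system.time(melt(DT, id = "d", measure = 1:2))                  # data.table method
system.time(reshape2:::melt.data.frame(DF, id = "d", measure = 1:2))

# factor id column: the gap reportedly closes
DT[, d := as.factor(d)]
DF$d <- as.factor(DF$d)
system.time(melt(DT, id = "d", measure = 1:2))
system.time(reshape2:::melt.data.frame(DF, id = "d", measure = 1:2))
```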

But as you pointed out, we do nothing to handle the na.rm=TRUE case in a speedy way (we just perform that removal in R after populating the melted output, which is of course slow).
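In other words, the R-level approach amounts to melting first and filtering afterwards, something like this hypothetical helper (not reshape2's actual code):

```r
# Hypothetical post-hoc NA removal: the full molten result is allocated
# first, then NA rows are dropped, so a second copy is unavoidable.
# data.table's melt instead skips NAs while writing the output at C level.
post_hoc_na_rm <- function(molten, value.name = "value") {
  molten[!is.na(molten[[value.name]]), , drop = FALSE]
}
```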

Either way, data.table definitely has more machinery inside to make sure all these things happen as quickly as possible. :)
