Setkey on v1.9.2:
Here are the new benchmarks for setkey, updated for v1.9.2. Let's generate some data.
    require(data.table)
    set.seed(1)
    N <- 2e7 # size of DT
    foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
    ch <- replicate(1e5, foo())
    ch <- unique(ch)
    DT <- data.table(a = as.numeric(sample(c(NA, Inf, -Inf, NaN, rnorm(1e6)*1e6), N, TRUE)),
                     b = as.numeric(sample(rnorm(1e6), N, TRUE)),
                     c = sample(c(NA_integer_, 1e5:1e6), N, TRUE),
                     d = sample(ch, N, TRUE))
    print(object.size(DT), units="MB")
    # 538.9 Mb
Next, copy DT to another object, DT.copy, so as to benchmark setkey(DT, .) on different columns (and combinations), and then use DT.copy to restore the unsorted DT between runs.
    DT.copy = copy(DT)

    ## on numeric column 'a'
    system.time(setkey(DT, a))
    #  user  system elapsed
    # 4.861   0.334   5.316

    ## reset by
    DT = copy(DT.copy)

    ## on integer column 'c'
    system.time(setkey(DT, c))
    #  user  system elapsed
    # 3.432   0.325   3.889

    ## reset again
    DT = copy(DT.copy)

    ## on numeric, numeric column 'a,b'
    system.time(setkey(DT, a,b))
    #  user  system elapsed
    # 6.321   0.229   6.872

    ## reset again
    DT = copy(DT.copy)

    ## on character column 'd'
    system.time(setkey(DT, d))
    #  user  system elapsed
    # 3.992   0.182   4.253
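To make concrete why the table has to be restored from a copy between runs, here is a minimal sketch on a tiny table. The table `small` and its values are made up for illustration and are not part of the benchmark. It shows that setkey sorts the table in place (by reference) and records the key, while a table made with copy() is unaffected.

```r
library(data.table)

# Hypothetical toy table, not the benchmark data
small <- data.table(a = c(3, 1, 2), d = c("x", "z", "y"))
unsorted <- copy(small)   # deep copy, unaffected by later setkey() calls

setkey(small, a)          # sorts 'small' in place and marks 'a' as the key
key_after <- key(small)   # "a"
a_after   <- small$a      # 1 2 3
unsorted$a                # still 3 1 2

small <- copy(unsorted)   # restore the unsorted order for the next run
haskey(small)             # FALSE: the restored table carries no key
```

Because setkey modifies its argument, timing it twice on the same object would measure an already sorted table the second time, which is what the reset steps avoid.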
Cold (or ad-hoc) grouping:
    DT = copy(DT.copy)
    system.time(ans <- DT[, mean(b), by=c])
    #  user  system elapsed
    # 2.943   0.234   3.237
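"Cold" here means grouping on a column without setting a key first. As a small sketch on made-up data (the table `toy` is hypothetical), the ad-hoc result contains the same group means as grouping after setkey; only the order of the groups differs.

```r
library(data.table)

# Hypothetical toy data: 'c' is the grouping column, 'b' the value column
toy <- data.table(c = c(2L, 1L, 2L, 1L, 3L), b = c(10, 20, 30, 40, 50))

# Ad-hoc grouping: no key is set, groups come out in order of first appearance
cold <- toy[, mean(b), by = c]

# Keyed grouping: sort with setkey() first, then group
keyed <- copy(toy)
setkey(keyed, c)
warm <- keyed[, mean(b), by = c]

cold$c              # 2 1 3
cold[order(c)]$V1   # 30 20 50
warm$V1             # 30 20 50
```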
Melt on v1.9.2:
The call below loads reshape2, but because DT is a data.table, melt runs the melt.data.table method instead.
    require(reshape2)
    system.time(melt(DT, id="d", measure=1:2))
    #  user  system elapsed
    # 1.117   0.534   1.677
Note that older versions of reshape2 took about 190 seconds to accomplish this. But Kevin Ushey recently implemented a C++ version of reshape2's melt, which is now also available on CRAN. Here are the timings on that new version.
    system.time(reshape2:::melt.data.frame(DT, id="d", measure=1:2))
    #  user  system elapsed
    # 3.445   0.587   4.095
It's much faster than the previous version of reshape2:::melt, but still a bit slower than melt.data.table (I guess this is because of having a character column with too many unique strings; I'm not sure why the difference is this large). In addition, melt.data.table's implementation of the na.rm argument is quite efficient (it avoids making another copy by checking for and removing NAs on the C side). Here's a comparison using na.rm=TRUE on the same data set.
    ## melt.data.table
    system.time(melt(DT, id="d", measure=1:2, na.rm=TRUE))
    #  user  system elapsed
    # 2.072   0.587   2.722

    ## reshape2:::melt
    system.time(reshape2:::melt.data.frame(DT, id="d", measure=1:2, na.rm=TRUE))
    #   user  system elapsed
    # 27.316   4.000  39.465
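For readers who want to see what these melt calls produce, here is a minimal sketch on a made-up three-row table (`toy` is hypothetical). It assumes a recent data.table release in which melt is exported by data.table itself, so reshape2 does not need to be loaded; in v1.9.2 the generic came from reshape2, as in the calls above.

```r
library(data.table)

# Hypothetical toy table mirroring the benchmark layout:
# two measure columns (a, b) and one id column (d)
toy <- data.table(a = c(1, NA, 3), b = c(4, 5, NA), d = c("x", "y", "z"))

# Wide to long: one row per (d, measure column) pair
long <- melt(toy, id.vars = "d", measure.vars = c("a", "b"))
nrow(long)         # 6 rows: 3 ids x 2 measure columns

# na.rm = TRUE drops the rows whose value is NA
long_clean <- melt(toy, id.vars = "d", measure.vars = c("a", "b"), na.rm = TRUE)
nrow(long_clean)   # 4 rows: the two NA values are gone
```

The full argument names id.vars and measure.vars are used here; the benchmark's id= and measure= rely on R's partial argument matching.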
dcast on v1.9.2:
We'll add one more column, e, to cast on:
    smple <- sample(letters[1:10], 2e7, TRUE)
    set(DT, i=NULL, j="e", value=smple)

    ## data.table version
    system.time(dcast.data.table(DT, d ~ e, value.var="b", fun=sum))
    #  user  system elapsed
    # 6.926   0.517   7.599
reshape2:::dcast has no new implementation yet (like that of melt from Kevin Ushey) and is therefore quite slow and doesn't scale well. For this example, it takes 76.738 seconds.
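As a small illustration of what the cast computes, here is a sketch on made-up data (`toy` is hypothetical). It assumes a recent data.table release, where the generic dcast dispatches to the data.table method, and it spells the aggregation argument out in full as fun.aggregate.

```r
library(data.table)

# Hypothetical toy data: 'd' becomes the rows, 'e' the columns, 'b' the values
toy <- data.table(
  d = c("x", "x", "y", "y", "y"),
  e = c("a", "b", "a", "a", "b"),
  b = c(1, 2, 3, 4, 5)
)

# Long to wide: one row per d, one column per level of e, cells summed
wide <- dcast(toy, d ~ e, value.var = "b", fun.aggregate = sum)

wide$d   # "x" "y"
wide$a   # 1 7  (for d == "y": 3 + 4)
wide$b   # 2 5
```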