@arunsrinivasan
Last active August 29, 2015 14:02
Some old benchmarking repeated on 1.9.2

Setkey on v1.9.2:

Here are the new benchmarks for setkey, updated for v1.9.2. Let's generate some data.

require(data.table)
set.seed(1)
N <- 2e7 # size of DT
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
ch <- unique(ch)
DT <- data.table(a = as.numeric(sample(c(NA, Inf, -Inf, NaN, rnorm(1e6)*1e6), N, TRUE)), 
                 b = as.numeric(sample(rnorm(1e6), N, TRUE)), 
                 c = sample(c(NA_integer_, 1e5:1e6), N, TRUE), 
                 d = sample(ch, N, TRUE))
                 
print(object.size(DT), units="MB")
# 538.9 Mb

We'll copy DT to another object, DT.copy, so that we can benchmark setkey(DT, .) on different columns (and combinations of columns) and then use DT.copy to restore the unsorted DT each time.

DT.copy = copy(DT)

## on numeric column 'a'
> system.time(setkey(DT, a))
   user  system elapsed 
  4.861   0.334   5.316 

## reset to the unsorted data
DT = copy(DT.copy)

## on integer column 'c'
> system.time(setkey(DT, c))
   user  system elapsed 
  3.432   0.325   3.889 
  
## reset again
DT = copy(DT.copy)

## on numeric, numeric columns 'a, b'
> system.time(setkey(DT, a,b))
   user  system elapsed 
  6.321   0.229   6.872 

## reset again
DT = copy(DT.copy)

## on character column 'd'
> system.time(setkey(DT, d))
   user  system elapsed 
  3.992   0.182   4.253 
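
To make the reset step concrete, here is a minimal sketch on a toy table (the names `small` and `backup` and their values are made up for illustration, not part of the benchmark). It shows that setkey sorts the table in place, which is why a real copy() is needed to get the unsorted data back.

```r
library(data.table)

# Toy table, small enough to check by eye (hypothetical values)
small  <- data.table(a = c(3, 1, 2), d = c("x", "z", "y"))
backup <- copy(small)        # a real copy; `backup <- small` would only alias the same table

setkey(small, a)             # sorts `small` by reference and marks `a` as its key

key_after    <- key(small)   # "a"
sorted_a     <- small$a      # 1 2 3
untouched_a  <- backup$a     # 3 1 2, because `backup` is a separate copy

small <- copy(backup)        # restore the unsorted table, as done between the timings
key_restored <- key(small)   # NULL
```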

Cold (or ad-hoc) grouping:

DT = copy(DT.copy)
system.time(ans <- DT[, mean(b), by=c])
#    user  system elapsed 
#   2.943   0.234   3.237 
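
As a small illustration of what "cold" grouping means here, the sketch below groups a toy table that has no key set. The table `toy` and its values are assumptions for illustration only.

```r
library(data.table)

# Toy table with no key set (hypothetical values)
toy <- data.table(b = c(1, 2, 3, 4), c = c(20L, 10L, 20L, 10L))

has_key <- haskey(toy)          # FALSE: no setkey() was called beforehand
ans <- toy[, mean(b), by = c]   # grouping still works without a key

# Groups come back in order of first appearance; the unnamed result column is called V1
group_ids   <- ans$c            # 20 10
group_means <- ans$V1           # 2 3
```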

Melt on v1.9.2:

We also load reshape2, but because DT is a data.table, melt dispatches to the melt.data.table method instead.

require(reshape2)
> system.time(melt(DT, id="d", measure=1:2))
   user  system elapsed 
  1.117   0.534   1.677 

Note that older versions of reshape2 took about 190 seconds to accomplish this. But Kevin Ushey recently implemented a C++ version of melt in reshape2, which is now also available on CRAN. Here are the timings for that new version.

> system.time(reshape2:::melt.data.frame(DT, id="d", measure=1:2))
   user  system elapsed 
  3.445   0.587   4.095 

It's much faster than the previous version of reshape2:::melt, but still somewhat slower than data.table's melt here. I guess this is because of the character column with so many unique strings, but I'm not sure why the difference is this large.

In addition, melt.data.table's implementation of the na.rm argument is quite efficient (it avoids making another copy by checking for and removing NAs on the C side). Here's a comparison using na.rm=TRUE on the same data set.

## melt.data.table
> system.time(melt(DT, id="d", measure=1:2, na.rm=TRUE))
   user  system elapsed 
  2.072   0.587   2.722 

## reshape2:::melt
> system.time(reshape2:::melt.data.frame(DT, id="d", measure=1:2, na.rm=TRUE))
   user  system elapsed 
 27.316   4.000  39.465 
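
For readers unfamiliar with the argument, this minimal sketch shows what na.rm=TRUE changes in the melted output. The table `toy` is made up for illustration, and the sketch assumes a data.table version that exports its own melt; on v1.9.2 reshape2 would also need to be loaded, as above.

```r
library(data.table)

# Toy table with one NA in each measure column (hypothetical values)
toy <- data.table(a = c(1, NA), b = c(NA, 4), d = c("x", "y"))

long_all  <- melt(toy, id.vars = "d", measure.vars = c("a", "b"))
long_nona <- melt(toy, id.vars = "d", measure.vars = c("a", "b"), na.rm = TRUE)

n_all  <- nrow(long_all)    # 4: every (id, variable) pair is kept
n_nona <- nrow(long_nona)   # 2: rows whose value is NA are dropped
kept   <- long_nona$value   # 1 4
```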

dcast on v1.9.2:

We'll add one more column e first:

smple <- sample(letters[1:10], 2e7, TRUE)
set(DT, i=NULL, j="e", value=smple)

Let's run dcast.data.table now:

## data.table version
system.time(dcast.data.table(DT, d ~ e, value.var="b", fun=sum))
   user  system elapsed 
  6.926   0.517   7.599 

reshape2:::dcast has no new implementation yet (unlike melt, which Kevin Ushey rewrote), so it is quite slow and doesn't scale well. For this example, it takes 76.738 seconds.
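
To show the shape of the result being timed, here is a minimal sketch of the same call on a toy table. The table `toy` and its values are assumptions for illustration; the column names d, e and b mirror the ones used above.

```r
library(data.table)

# Toy table in long form (hypothetical values); every (d, e) pair occurs at least once
toy <- data.table(d = c("x", "x", "x", "y", "y"),
                  e = c("a", "a", "b", "a", "b"),
                  b = c(1, 2, 3, 4, 5))

# One row per value of d, one column per value of e, cells aggregated with sum()
wide <- dcast.data.table(toy, d ~ e, value.var = "b", fun.aggregate = sum)

col_names <- names(wide)   # "d" "a" "b"
sum_a     <- wide$a        # 3 4  (x: 1 + 2, y: 4)
sum_b     <- wide$b        # 3 5
```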

@kevinushey

I'm curious why data.table's melt is faster in the 'base' case, too. If you convert the character column to a factor, the speeds are nearly identical. Perhaps it's data.table's use of #define USE_RINTERNALS, which provides direct access to string elements, whereas in reshape2 we're forced to go through function calls (which maybe cannot be inlined?).

But as you pointed out, we do nothing to handle the na.rm = TRUE case in a speedy way (we just perform that removal in R after populating the melted output, which is of course slow).

@kevinushey

Either way, data.table definitely has more machinery inside to make sure all these things happen as quickly as possible. :)
