arunsrinivasan/tweet_reply.md

## tweet_reply.md

      
    Raw
  

              tweet_reply.md
            
          
    Updated June 16 with latest devel

data.table's automatic indexing:

Generating some data first:
# R version 3.3.0
require(data.table) ## 1.9.7, commit 2433, github
require(dplyr)      ## devel, commit 3189, github

set.seed(45L)
DT = data.table(x=sample(1e4, 5e7, TRUE), y=sample(letters, 5e7, TRUE))
DF = as.data.frame(DT)
Vector scans:

When we do:
options(datatable.auto.index = FALSE)

DF[DF$x %in% 4:10, ]                   ## (a) ---- 1.398 seconds
# or
DT[x %in% 4:10, ]                      ## (b) ---- 1.202 seconds
# or
DT[x %between% c(4L, 10L), ]           ## (c) ---- 0.572 seconds
# or
filter(DF, x %in% 4:10)                ## (d) ---- 1.215 seconds
# or
filter(DF, dplyr::between(x, 4L, 10L)) ## (e) ---- 0.556 seconds
we perform a vector scan. That is, the column x is searched for all the values that occur in the closed interval [4,10] by comparing each and every row. And to do that, it also requires a logical vector the size = nrow(DF).
The syntax shown in (b) is a vector scan, but using data.table. It's identical to data.frame, except minus the DF$. dplyr implements the logic x >= . & x <= . in C++ IIUC.

Note that here I show %in% because for integer ranges, it's straightforward (and efficient) to use %in% than x >= . & x <= ..

Can we do better?

Absolutely. We can do better with data.table, using binary search based subset. It is substantially faster, as it's computational complexity is O(log n) as opposed to O(n), where n = nrow(DF).
In versions <= 1.9.2, the only requirement to using this feature of data.table was to setkey() on the column prior to fast-subset. For example (illustrated on a copy here):
DT2 = copy(DT)
setkey(DT2, x)   ## 4.9 seconds
DT2[.(4:10)]  ## 0.002 seconds
This is especially useful in case of repeated subset. Why? Because we've to setkey() just once! But a vector scan goes through all the rows of the data.frame or data.table each time we subset.

Note that most of the time spent in 4.9 seconds is not in finding the order, rather to reorder the data by reference, in a memory efficient manner.

Even better?

In versions 1.9.3+, Matt has taken this one step further and implemented automatic indexing - as a result of implementing a long standing very useful feature set2key. Now you don't need to even setkey(). The first time you subset, an index is automatically generated while performing a vector scan, and therefore subsequent subsets are very fast. Things to note here are:


It all happens under the hood, seamlessly. You can use the normal base syntax you've already used. No change to existing code.


And since we don't rearrange the data even the first time (no need to setkey()) is much faster. We only require an extra time to create the index, and that too, just once. Usually, the index creation is incredibly fast (as shown below).
options(datatable.auto.index = TRUE) # enable auto indexing back to illustrate the benefits

## First run: construct index + binary search based subset implicitly and automatically
DT[ x %in% 4:10, verbose=TRUE ]
# Creating new index 'x'
# Starting bmerge ...done in 0 secs
#  user  system elapsed 
# 1.045   0.132   1.182 

## Second run: reusing existing index
DT[ x %in% 4:10, verbose=TRUE ]
# Using existing index 'x'
# Starting bmerge ...done in 0 secs
#  user  system elapsed 
# 0.007   0.001   0.004 


The data set is not rearranged - which in itself might be required in some cases.


At the moment, only basic subsets are supported (this'll be expanded):
DT[ x == . ] # and
DT[ x %in% . ]
How much would we benefit?

dplyr_between <- function(df, left, right) {
    df %>% filter(dplyr::between(x, left, right))
}

dt_autoidx <- function(dt, left, right) {
    dt[x %in% left:right]
}

require(rbenchmark)
benchmark(dplyr_between(DF, 4L, 10L), dt_autoidx(DT, 4L, 10L))

                      test replications elapsed relative user.self sys.self user.child sys.child
1 dplyr_between(DF, 4, 10)          100  62.026   95.132    41.018   20.156          0         0
2    dt_autoidx(DT, 4, 10)          100   0.652    1.000     1.924    0.231          0         0
So if we run subset this data set a 100 times, we get ~95x speedup (0.65s vs 62s)!!