Instantly share code, notes, and snippets.

# arunsrinivasan/tweet_reply.md

Last active Jul 27, 2018
automatic indexing vs between() on integer ranges

# data.table's automatic indexing:

Generating some data first:

```# R version 3.3.0
require(data.table) ## 1.9.7, commit 2433, github
require(dplyr)      ## devel, commit 3189, github

set.seed(45L)
DT = data.table(x=sample(1e4, 5e7, TRUE), y=sample(letters, 5e7, TRUE))
DF = as.data.frame(DT)```

### Vector scans:

When we do:

```options(datatable.auto.index = FALSE)

DF[DF\$x %in% 4:10, ]                   ## (a) ---- 1.398 seconds
# or
DT[x %in% 4:10, ]                      ## (b) ---- 1.202 seconds
# or
DT[x %between% c(4L, 10L), ]           ## (c) ---- 0.572 seconds
# or
filter(DF, x %in% 4:10)                ## (d) ---- 1.215 seconds
# or
filter(DF, dplyr::between(x, 4L, 10L)) ## (e) ---- 0.556 seconds```

we perform a vector scan. That is, the column `x` is searched for all the values that occur in the closed interval [4,10] by comparing each and every row. And to do that, it also requires a logical vector the size = `nrow(DF)`.

The syntax shown in (b) is a vector scan, but using `data.table`. It's identical to `data.frame`, except minus the `DF\$`. `dplyr` implements the logic `x >= . & x <= .` in C++ IIUC.

Note that here I show `%in%` because for integer ranges, it's straightforward (and efficient) to use `%in%` than `x >= . & x <= .`.

### Can we do better?

Absolutely. We can do better with `data.table`, using binary search based subset. It is substantially faster, as it's computational complexity is `O(log n)` as opposed to `O(n)`, where `n = nrow(DF)`.

In versions <= 1.9.2, the only requirement to using this feature of `data.table` was to `setkey()` on the column prior to fast-subset. For example (illustrated on a copy here):

```DT2 = copy(DT)
setkey(DT2, x)   ## 4.9 seconds
DT2[.(4:10)]  ## 0.002 seconds```

This is especially useful in case of repeated subset. Why? Because we've to `setkey()` just once! But a vector scan goes through all the rows of the `data.frame` or `data.table` each time we subset.

Note that most of the time spent in 4.9 seconds is not in finding the order, rather to reorder the data by reference, in a memory efficient manner.

### Even better?

In versions `1.9.3+`, Matt has taken this one step further and implemented automatic indexing - as a result of implementing a long standing very useful feature `set2key`. Now you don't need to even `setkey()`. The first time you subset, an index is automatically generated while performing a vector scan, and therefore subsequent subsets are very fast. Things to note here are:

• It all happens under the hood, seamlessly. You can use the normal base syntax you've already used. No change to existing code.

• And since we don't rearrange the data even the first time (no need to `setkey()`) is much faster. We only require an extra time to create the index, and that too, just once. Usually, the index creation is incredibly fast (as shown below).

```options(datatable.auto.index = TRUE) # enable auto indexing back to illustrate the benefits

## First run: construct index + binary search based subset implicitly and automatically
DT[ x %in% 4:10, verbose=TRUE ]
# Creating new index 'x'
# Starting bmerge ...done in 0 secs
#  user  system elapsed
# 1.045   0.132   1.182

## Second run: reusing existing index
DT[ x %in% 4:10, verbose=TRUE ]
# Using existing index 'x'
# Starting bmerge ...done in 0 secs
#  user  system elapsed
# 0.007   0.001   0.004 ```
• The data set is not rearranged - which in itself might be required in some cases.

At the moment, only basic subsets are supported (this'll be expanded):

```DT[ x == . ] # and
DT[ x %in% . ]```

### How much would we benefit?

```dplyr_between <- function(df, left, right) {
df %>% filter(dplyr::between(x, left, right))
}

dt_autoidx <- function(dt, left, right) {
dt[x %in% left:right]
}

require(rbenchmark)
benchmark(dplyr_between(DF, 4L, 10L), dt_autoidx(DT, 4L, 10L))

test replications elapsed relative user.self sys.self user.child sys.child
1 dplyr_between(DF, 4, 10)          100  62.026   95.132    41.018   20.156          0         0
2    dt_autoidx(DT, 4, 10)          100   0.652    1.000     1.924    0.231          0         0```

So if we run subset this data set a 100 times, we get ~95x speedup (0.65s vs 62s)!!