kar9222/join_update_setkey.md

## join_update_setkey.md

      
    Raw
  

              join_update_setkey.md
            
          
    data.table join + update-by-reference + setkey

A basic/naive test

Found gem about join + update-by-reference from Left join using data.table
Saw a tweet reply by Michael chirico about using setkey during join
Combine them together

"I want to do fast & memory efficient joins but I don't want to read the performance benchmark...". You can skip to Keyfindings & Examples sections below.
Related Twitter link
# NOTE: Performance and memory efficiency also depend on data/hardware/systems/etc

library(data.table) ; set.seed(1)
x    <- data.table(a = 1:1e8,            b = sample(12:19, 1e7, TRUE))
y    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
x_2  <- copy(x) ; y_2  <- copy(y)
x_3  <- copy(x) ; y_3  <- copy(y)
x_4  <- copy(x) ; y_4  <- copy(y)

benchmark <- bench::mark( # TODO validate with `microbenchmark`
  time_unit = 'ms', check = FALSE, # Turn off `check` for `setkey`

  normal_join          = x <- y[x, on = 'a'],
  update_by_reference  = x_2[y_2, on = 'a', c := c],
  setkey_n_update      = setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ],
  # Wrapper for complex cases (ie many cols). See codes below.
  wrap_setkey_n_update = setjoin(x_4, y_4, on = 'a')
)
benchmark[, c('expression', 'median', 'mem_alloc')]
# # A tibble: 4 x 3
#   expression           median mem_alloc
#   <bch:expr>            <dbl> <bch:byt>
# 1 normal_join          5593.      4.1GB
# 2 update_by_reference   724.    767.6MB
# 3 setkey_n_update        48.5   767.7MB
# 4 wrap_setkey_n_update   56.9   772.2MB

# NOTE: `dplyr::left_join` was also tested and it's the slowest with ~9,000 ms,
# use more memory than both `update_by_reference` and `setkey_n_update`
# but use less memory than normal_join. It consumed about ~2.0GB of memory.
# I didn't include it as I want to focus solely on {data.table}

# Wrapper
setjoin <- function(dt, dt_join, on, cols, drop) {

  # TODO (WORK IN PROGRESS ONLY) Get cols
  # Focus on the main codes about `join + update-by-reference + setkey` below, not these 'aesthetic' elements
  cols <- {
    if ( ! deparse(substitute(on)) %like% '\\.\\(' ) {
      if (is.character(on)) {
          if ( missing(cols) & missing(drop) )
            setdiff( names(dt_join), on )
          else if ( ! missing(drop) )
            setdiff( names(dt_join), c(on, drop) )
          else cols
      }
    }
    # TODO Case: complex NSE. Currently, if `on` is complex, `cols` must be specified.
    else cols
  }
  # NSE
  expr <- quote({
    setkey(dt, on) ; setkey(dt_join, on)
    dt[dt_join, on = on, (cols) := mget( paste0('i.', cols) ) ]
    # dt[dt_join, on = on, (cols) := dt_join[, ..cols] ]
  })
  eval( do.call(
    substitute,
    list(expr, list(on      = substitute(on),
                    cols    = substitute(cols)))
  ))
}
Test again with system time

I thought the performance gains were unrealistic and decided to test again using system time. Two rounds were tested, where the first round was the same as above and the second round was tested for two rounds of similar joins to investigate the performance gains of setkey (ie setkey is only set for the first time and used for subsequent joins). Wrapper of setjoin was not tested here, for brevity. It has similar performance as setkey + update.
# `bench::system_time` ---------------------------

library(data.table) ; set.seed(1)
x    <- data.table(a = 1:1e8,            b = sample(12:19, 1e7, TRUE))
y    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
z    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
x_2  <- copy(x) ; y_2  <- copy(y) ; z_2 <- copy(z)
x_3  <- copy(x) ; y_3  <- copy(y) ; z_3 <- copy(z)

# _ Join once ------------------------------------

normal_join <-         bench::system_time({
  x <- y[x, on = 'a']
})
update_by_reference <- bench::system_time({
  x_2[y_2, on = 'a', c := c]
})
update_n_setkey <-     bench::system_time({
  setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
})
join_once <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']

# _ Join twice -----------------------------------

# Same chunk as above
library(data.table) ; set.seed(1)
x    <- data.table(a = 1:1e8,            b = sample(12:19, 1e7, TRUE))
y    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
z    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
x_2  <- copy(x) ; y_2  <- copy(y) ; z_2 <- copy(z)
x_3  <- copy(x) ; y_3  <- copy(y) ; z_3 <- copy(z)

# Different chunk 
normal_join <-         bench::system_time({
  x <- y[x, on = 'a']
  y <- z[x, on = 'a']
})
update_by_reference <- bench::system_time({
  x_2[y_2, on = 'a', c := c]
  x_2[z_2, on = 'a', c := c]
})
update_n_setkey <-     bench::system_time({
  setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
  x_3[ setkey(z_3, a), on = 'a', c := c ]    # NOTE -------- No `setkey` for x_3 here --------
})
join_twice <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']

# _ Compare --------------------------------------

data <- data.table(n_join    = c( rep('join_once', 3), rep('join_twice', 3) ),
                   type      = names(join_once),
                   seconds   = c(join_once, join_twice)) [,
        seconds := round(seconds, 1)]
data
#        n_join                type seconds
#        <char>              <char>   <num>
# 1:  join_once         normal_join     5.6
# 2:  join_once update_by_reference     0.8
# 3:  join_once     update_n_setkey     0.7
# 4: join_twice         normal_join    10.2
# 5: join_twice update_by_reference     1.5
# 6: join_twice     update_n_setkey     0.9

library(ggplot2)
ggplot(data, aes(n_join, seconds, fill = type)) +
  geom_col(position = 'dodge', width = .5) +
  geom_text(aes(label = seconds), position = position_dodge(width = 0.9), vjust = - .25, color = 'grey')
  # Personal package was used below for theming
Somehow bench::mark didn't capture the overhead of setkey during the initial test, hence the unrealistic of ~100x gains were shown.
Key findings


setkey + update and update are ~11 and ~6.5 times faster than normal join, respectively
on first join, performance of setkey + update is similar to update as overhead of setkey largely offsets its own performance gains
on second and subsequent joins, as setkey is not required, setkey + update is faster than update by ~1.8 times (or faster than normal join by ~11 times)


Examples

For performant & memory efficient joins, use either update or setkey + update, where the latter is faster at the cost of more codes.
Let's see some pseudo codes, for brevity. The logics are the same.
For one or a few columns
a <- data.table(x = ..., y = ..., z = ..., ...)
b <- data.table(x = ..., y = ..., z = ..., ...)

# `update`
a[b, on = .(x), y := y]
a[b, on = .(x),  `:=` (y = y, z = z, ...)]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), y := y ]
setkey(a, x) [ setkey(b, x), on = .(x),  `:=` (y = y, z = z, ...) ]
For many columns
cols <- c('x', 'y', ...)
# `update`
a[b, on = .(x), (cols) := mget( paste0('i.', cols) )]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), (cols) := mget( paste0('i.', cols) ) ]
Wrapper for fast & memory efficient joins...many of them...with similar join-pattern, wrap them like setjoin above

with update
with or without setkey

setjoin(a, b, on = ...)  # join all columns
setjoin(a, b, on = ..., select = c('columns_to_be_included', ...))
setjoin(a, b, on = ..., drop   = c('columns_to_be_excluded', ...))
# With that, you can even use it with `magrittr` pipe
a %>%
  setjoin(...) %>%
  setjoin(...)
With setkey, argument on can be omitted. It can also be included for readability, especially for collaborating with others.
Have a nice day ☺☺☺♥♣♠
Acknowledgement

The above are not my ideas at all. Needed some fast join for some projects, did some benchmark and just sharing with #RStats community ☺. Please give all credits to the people below.

The R Foundation
CRAN
75 contributors of data.table team, more than 4,000 commits, since 2008
Left join using data.table
Tweet reply by Michael chirico about using setkey during join
#RStats
and more! (too tired typing...you know ☺)