Skip to content

Instantly share code, notes, and snippets.

@kar9222
Last active July 4, 2020 23:39
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kar9222/c2bf12eb9b4142dd97c5e151b8977bc1 to your computer and use it in GitHub Desktop.
Save kar9222/c2bf12eb9b4142dd97c5e151b8977bc1 to your computer and use it in GitHub Desktop.
data.table `join + update-by-reference + setkey`

data.table join + update-by-reference + setkey

A basic/naive test

  • Found gem about join + update-by-reference from Left join using data.table
  • Saw a tweet reply by Michael chirico about using setkey during join
  • Combine them together

"I want to do fast & memory efficient joins but I don't want to read the performance benchmark...". You can skip to Keyfindings & Examples sections below.

Related Twitter link

# NOTE: Performance and memory efficiency also depend on data/hardware/systems/etc

library(data.table) ; set.seed(1)
x    <- data.table(a = 1:1e8,            b = sample(12:19, 1e7, TRUE))
y    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
x_2  <- copy(x) ; y_2  <- copy(y)
x_3  <- copy(x) ; y_3  <- copy(y)
x_4  <- copy(x) ; y_4  <- copy(y)

benchmark <- bench::mark( # TODO validate with `microbenchmark`
  time_unit = 'ms', check = FALSE, # Turn off `check` for `setkey`

  normal_join          = x <- y[x, on = 'a'],
  update_by_reference  = x_2[y_2, on = 'a', c := c],
  setkey_n_update      = setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ],
  # Wrapper for complex cases (ie many cols). See codes below.
  wrap_setkey_n_update = setjoin(x_4, y_4, on = 'a')
)
benchmark[, c('expression', 'median', 'mem_alloc')]
# # A tibble: 4 x 3
#   expression           median mem_alloc
#   <bch:expr>            <dbl> <bch:byt>
# 1 normal_join          5593.      4.1GB
# 2 update_by_reference   724.    767.6MB
# 3 setkey_n_update        48.5   767.7MB
# 4 wrap_setkey_n_update   56.9   772.2MB

# NOTE: `dplyr::left_join` was also tested and it's the slowest with ~9,000 ms,
# use more memory than both `update_by_reference` and `setkey_n_update`
# but use less memory than normal_join. It consumed about ~2.0GB of memory.
# I didn't include it as I want to focus solely on {data.table}

# Wrapper
setjoin <- function(dt, dt_join, on, cols, drop) {

  # TODO (WORK IN PROGRESS ONLY) Get cols
  # Focus on the main codes about `join + update-by-reference + setkey` below, not these 'aesthetic' elements
  cols <- {
    if ( ! deparse(substitute(on)) %like% '\\.\\(' ) {
      if (is.character(on)) {
          if ( missing(cols) & missing(drop) )
            setdiff( names(dt_join), on )
          else if ( ! missing(drop) )
            setdiff( names(dt_join), c(on, drop) )
          else cols
      }
    }
    # TODO Case: complex NSE. Currently, if `on` is complex, `cols` must be specified.
    else cols
  }
  # NSE
  expr <- quote({
    setkey(dt, on) ; setkey(dt_join, on)
    dt[dt_join, on = on, (cols) := mget( paste0('i.', cols) ) ]
    # dt[dt_join, on = on, (cols) := dt_join[, ..cols] ]
  })
  eval( do.call(
    substitute,
    list(expr, list(on      = substitute(on),
                    cols    = substitute(cols)))
  ))
}

Test again with system time

I thought the performance gains were unrealistic and decided to test again using system time. Two rounds were tested, where the first round was the same as above and the second round was tested for two rounds of similar joins to investigate the performance gains of setkey (ie setkey is only set for the first time and used for subsequent joins). Wrapper of setjoin was not tested here, for brevity. It has similar performance as setkey + update.

# `bench::system_time` ---------------------------

library(data.table) ; set.seed(1)
x    <- data.table(a = 1:1e8,            b = sample(12:19, 1e7, TRUE))
y    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
z    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
x_2  <- copy(x) ; y_2  <- copy(y) ; z_2 <- copy(z)
x_3  <- copy(x) ; y_3  <- copy(y) ; z_3 <- copy(z)

# _ Join once ------------------------------------

normal_join <-         bench::system_time({
  x <- y[x, on = 'a']
})
update_by_reference <- bench::system_time({
  x_2[y_2, on = 'a', c := c]
})
update_n_setkey <-     bench::system_time({
  setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
})
join_once <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']

# _ Join twice -----------------------------------

# Same chunk as above
library(data.table) ; set.seed(1)
x    <- data.table(a = 1:1e8,            b = sample(12:19, 1e7, TRUE))
y    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
z    <- data.table(a = sample(x$a, 2e5), c = sample(2:8,   2e5, TRUE))
x_2  <- copy(x) ; y_2  <- copy(y) ; z_2 <- copy(z)
x_3  <- copy(x) ; y_3  <- copy(y) ; z_3 <- copy(z)

# Different chunk 
normal_join <-         bench::system_time({
  x <- y[x, on = 'a']
  y <- z[x, on = 'a']
})
update_by_reference <- bench::system_time({
  x_2[y_2, on = 'a', c := c]
  x_2[z_2, on = 'a', c := c]
})
update_n_setkey <-     bench::system_time({
  setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
  x_3[ setkey(z_3, a), on = 'a', c := c ]    # NOTE -------- No `setkey` for x_3 here --------
})
join_twice <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']

# _ Compare --------------------------------------

data <- data.table(n_join    = c( rep('join_once', 3), rep('join_twice', 3) ),
                   type      = names(join_once),
                   seconds   = c(join_once, join_twice)) [,
        seconds := round(seconds, 1)]
data
#        n_join                type seconds
#        <char>              <char>   <num>
# 1:  join_once         normal_join     5.6
# 2:  join_once update_by_reference     0.8
# 3:  join_once     update_n_setkey     0.7
# 4: join_twice         normal_join    10.2
# 5: join_twice update_by_reference     1.5
# 6: join_twice     update_n_setkey     0.9

library(ggplot2)
ggplot(data, aes(n_join, seconds, fill = type)) +
  geom_col(position = 'dodge', width = .5) +
  geom_text(aes(label = seconds), position = position_dodge(width = 0.9), vjust = - .25, color = 'grey')
  # Personal package was used below for theming

Somehow bench::mark didn't capture the overhead of setkey during the initial test, hence the unrealistic of ~100x gains were shown.

Key findings

  • setkey + update and update are ~11 and ~6.5 times faster than normal join, respectively
  • on first join, performance of setkey + update is similar to update as overhead of setkey largely offsets its own performance gains
  • on second and subsequent joins, as setkey is not required, setkey + update is faster than update by ~1.8 times (or faster than normal join by ~11 times)

Image

Examples

For performant & memory efficient joins, use either update or setkey + update, where the latter is faster at the cost of more codes.

Let's see some pseudo codes, for brevity. The logics are the same.

For one or a few columns

a <- data.table(x = ..., y = ..., z = ..., ...)
b <- data.table(x = ..., y = ..., z = ..., ...)

# `update`
a[b, on = .(x), y := y]
a[b, on = .(x),  `:=` (y = y, z = z, ...)]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), y := y ]
setkey(a, x) [ setkey(b, x), on = .(x),  `:=` (y = y, z = z, ...) ]

For many columns

cols <- c('x', 'y', ...)
# `update`
a[b, on = .(x), (cols) := mget( paste0('i.', cols) )]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), (cols) := mget( paste0('i.', cols) ) ]

Wrapper for fast & memory efficient joins...many of them...with similar join-pattern, wrap them like setjoin above

  • with update
  • with or without setkey
setjoin(a, b, on = ...)  # join all columns
setjoin(a, b, on = ..., select = c('columns_to_be_included', ...))
setjoin(a, b, on = ..., drop   = c('columns_to_be_excluded', ...))
# With that, you can even use it with `magrittr` pipe
a %>%
  setjoin(...) %>%
  setjoin(...)

With setkey, argument on can be omitted. It can also be included for readability, especially for collaborating with others.

Have a nice day ☺☺☺♥♣♠

Acknowledgement

The above are not my ideas at all. Needed some fast join for some projects, did some benchmark and just sharing with #RStats community ☺. Please give all credits to the people below.

  • The R Foundation
  • CRAN
  • 75 contributors of data.table team, more than 4,000 commits, since 2008
  • Left join using data.table
  • Tweet reply by Michael chirico about using setkey during join
  • #RStats
  • and more! (too tired typing...you know ☺)
@jangorecki
Copy link

jangorecki commented Feb 24, 2020

using setkey and repeated runs seems pointless because it may escape sorting evaluation when it detects object is already sorted sorry, just realized it is not repeated evaluation, but you just use extra package for measuring time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment