Skip to content

Instantly share code, notes, and snippets.

@arunsrinivasan
Last active April 12, 2016 13:40
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save arunsrinivasan/ee2d9ef43bdc02c32958 to your computer and use it in GitHub Desktop.
Save arunsrinivasan/ee2d9ef43bdc02c32958 to your computer and use it in GitHub Desktop.
SO_17844143
require(qdapTools)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
#### this line takes a long time... Tyler any ideas why?
x$market = lookup(x$zip, y[, 2:1])
####
# I think the actual answer is:
system.time({
idx1 = match(x$zip, y$zip)
idx2 = which(!is.na(idx1))
x$market[idx2] <- y$market[idx1[idx2]]
})
# takes 0.085 seconds
## here's a data.table solution using joins - although this is not necessary here:
require(data.table)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
system.time({
setDT(x)
setDT(y)
setkey(x, zip)
setkey(y, zip)
x[y, market := i.market]
})
## takes 0.094 seconds
## here's using match and `:=` in data.table
require(data.table)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
system.time({
idx = match(x$zip, y$zip, nomatch=0L)
setDT(x)[idx != 0L, market := y$market[idx]]
})
## takes 0.58 seconds.
@bryangoodrich
Copy link

I did a quick benchmark of the 3 approaches

benchmark(
INDEX = {
  set.seed(1L)
  x <- data.frame(zip=sample(1e6), market=0L)
  y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
  idx1 <- match(x$zip, y$zip)
  idx2 <- which(!is.na(idx1))
  x$market[idx2] <- y$market[idx1[idx2]]
},
DATATABLE = {
  set.seed(1L)
  x <- data.frame(zip=sample(1e6), market=0L)
  y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
  setDT(x)
  setDT(y)

  setkey(x, zip)
  setkey(y, zip)
  x[y, market := i.market]
},
DPLYR = {
  set.seed(1L)
  x <- data.frame(zip=sample(1e6)) 
  y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
  x <- left_join(x,y, by = 'zip')  # You have NA instead of 0 for non-matches
})

The winner is! Indexing

       test replications elapsed relative user.self sys.self user.child
2 DATATABLE          100   7.631    1.075     7.624        0          0
3     DPLYR          100   7.228    1.018     7.222        0          0
1     INDEX          100   7.097    1.000     7.090        0          0

@arunsrinivasan
Copy link
Author

Note to self: cleaned up noise, and updated benchmarks. Benchmark on relatively large data - 200 million rows and 50k unique groups, which illustrates the impact of the copy to replace NAs with 0's in dplyr after left_join:

On big data benchmark:

With data.table v1.9.6+

require(data.table)
N = 200e6L
K = 50e3L
DT = function() {
  setDT(x)[, market := 0L][setDT(y), market := i.market, on="zip"]
}

set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()

system.time(DT()) # all heap & anonymous VM, total bytes = 1.81GB, persistent = 1.01GB
#10.862   0.545  11.520

With dplyr v0.4.3.9001:

require(dplyr)
N = 200e6L
K = 50e3L
DPLYR = function() {
  x <- left_join(x,y, by = 'zip') %>% mutate(market = replace(market, which(is.na(market)), 0L))
}

set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()

system.time(DPLYR()) # All heap & anonymous VM, total bytes = 9.22GB, persistent = 5.22GB
#24.249  3.341  28.946

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment