Skip to content

Instantly share code, notes, and snippets.

@arunsrinivasan
Last active Apr 12, 2016
Embed
What would you like to do?
SO_17844143
require(qdapTools)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
#### this line takes a long time... Tyler any ideas why?
x$market = lookup(x$zip, y[, 2:1])
####
# I think the actual answer is:
system.time({
idx1 = match(x$zip, y$zip)
idx2 = which(!is.na(idx1))
x$market[idx2] <- y$market[idx1[idx2]]
})
# takes 0.085 seconds
## here's a data.table solution using joins - although this is not necessary here:
require(data.table)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
system.time({
setDT(x)
setDT(y)
setkey(x, zip)
setkey(y, zip)
x[y, market := i.market]
})
## takes 0.094 seconds
## here's using match and `:=` in data.table
require(data.table)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
system.time({
idx = match(x$zip, y$zip, nomatch=0L)
setDT(x)[idx != 0L, market := y$market[idx]]
})
## takes 0.58 seconds.
@arunsrinivasan
Copy link
Author

arunsrinivasan commented May 28, 2014

Note to self: cleaned up noise, and updated benchmarks. Benchmark on relatively large data - 200 million rows and 50k unique groups, which illustrates the impact of the copy to replace NAs with 0's in dplyr after left_join:

On big data benchmark:

With data.table v1.9.6+

require(data.table)
N = 200e6L
K = 50e3L
DT = function() {
  setDT(x)[, market := 0L][setDT(y), market := i.market, on="zip"]
}

set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()

system.time(DT()) # all heap & anonymous VM, total bytes = 1.81GB, persistent = 1.01GB
#10.862   0.545  11.520

With dplyr v0.4.3.9001:

require(dplyr)
N = 200e6L
K = 50e3L
DPLYR = function() {
  x <- left_join(x,y, by = 'zip') %>% mutate(market = replace(market, which(is.na(market)), 0L))
}

set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()

system.time(DPLYR()) # All heap & anonymous VM, total bytes = 9.22GB, persistent = 5.22GB
#24.249  3.341  28.946

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment