SO_17844143
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require(qdapTools) | |
set.seed(1L) | |
x = data.frame(zip=sample(1e6), market=0L) | |
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE)) | |
#### this line takes a long time... Tyler any ideas why? | |
x$market = lookup(x$zip, y[, 2:1]) | |
#### | |
# I think the actual answer is: | |
system.time({ | |
idx1 = match(x$zip, y$zip) | |
idx2 = which(!is.na(idx1)) | |
x$market[idx2] <- y$market[idx1[idx2]] | |
}) | |
# takes 0.085 seconds | |
## here's a data.table solution using joins - although this is not necessary here: | |
require(data.table) | |
set.seed(1L) | |
x = data.frame(zip=sample(1e6), market=0L) | |
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE)) | |
system.time({ | |
setDT(x) | |
setDT(y) | |
setkey(x, zip) | |
setkey(y, zip) | |
x[y, market := i.market] | |
}) | |
## takes 0.094 seconds | |
## here's using match and `:=` in data.table | |
require(data.table) | |
set.seed(1L) | |
x = data.frame(zip=sample(1e6), market=0L) | |
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE)) | |
system.time({ | |
idx = match(x$zip, y$zip, nomatch=0L) | |
setDT(x)[idx != 0L, market := y$market[idx]] | |
}) | |
## takes 0.58 seconds. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Note to self: cleaned up noise, and updated benchmarks. Benchmark on relatively large data - 200 million rows and 50k unique groups, which illustrates the impact of the copy to replace NAs with 0's in
dplyr
afterleft_join
:On big data benchmark:
With data.table v1.9.6+
With dplyr v0.4.3.9001: