-
-
Save arunsrinivasan/ee2d9ef43bdc02c32958 to your computer and use it in GitHub Desktop.
SO_17844143
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require(qdapTools) | |
set.seed(1L) | |
x = data.frame(zip=sample(1e6), market=0L) | |
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE)) | |
#### this line takes a long time... Tyler any ideas why? | |
x$market = lookup(x$zip, y[, 2:1]) | |
#### | |
# I think the actual answer is: | |
system.time({ | |
idx1 = match(x$zip, y$zip) | |
idx2 = which(!is.na(idx1)) | |
x$market[idx2] <- y$market[idx1[idx2]] | |
}) | |
# takes 0.085 seconds | |
## here's a data.table solution using joins - although this is not necessary here: | |
require(data.table) | |
set.seed(1L) | |
x = data.frame(zip=sample(1e6), market=0L) | |
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE)) | |
system.time({ | |
setDT(x) | |
setDT(y) | |
setkey(x, zip) | |
setkey(y, zip) | |
x[y, market := i.market] | |
}) | |
## takes 0.094 seconds | |
## here's using match and `:=` in data.table | |
require(data.table) | |
set.seed(1L) | |
x = data.frame(zip=sample(1e6), market=0L) | |
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE)) | |
system.time({ | |
idx = match(x$zip, y$zip, nomatch=0L) | |
setDT(x)[idx != 0L, market := y$market[idx]] | |
}) | |
## takes 0.58 seconds. |
I did a quick benchmark of the 3 approaches
benchmark(
INDEX = {
set.seed(1L)
x <- data.frame(zip=sample(1e6), market=0L)
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
idx1 <- match(x$zip, y$zip)
idx2 <- which(!is.na(idx1))
x$market[idx2] <- y$market[idx1[idx2]]
},
DATATABLE = {
set.seed(1L)
x <- data.frame(zip=sample(1e6), market=0L)
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
setDT(x)
setDT(y)
setkey(x, zip)
setkey(y, zip)
x[y, market := i.market]
},
DPLYR = {
set.seed(1L)
x <- data.frame(zip=sample(1e6))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
x <- left_join(x,y, by = 'zip') # You have NA instead of 0 for non-matches
})
The winner is! Indexing
test replications elapsed relative user.self sys.self user.child
2 DATATABLE 100 7.631 1.075 7.624 0 0
3 DPLYR 100 7.228 1.018 7.222 0 0
1 INDEX 100 7.097 1.000 7.090 0 0
Note to self: cleaned up noise, and updated benchmarks. Benchmark on relatively large data - 200 million rows and 50k unique groups, which illustrates the impact of the copy to replace NAs with 0's in dplyr
after left_join
:
On big data benchmark:
With data.table v1.9.6+
require(data.table)
N = 200e6L
K = 50e3L
DT = function() {
setDT(x)[, market := 0L][setDT(y), market := i.market, on="zip"]
}
set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()
system.time(DT()) # all heap & anonymous VM, total bytes = 1.81GB, persistent = 1.01GB
#10.862 0.545 11.520
With dplyr v0.4.3.9001:
require(dplyr)
N = 200e6L
K = 50e3L
DPLYR = function() {
x <- left_join(x,y, by = 'zip') %>% mutate(market = replace(market, which(is.na(market)), 0L))
}
set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()
system.time(DPLYR()) # All heap & anonymous VM, total bytes = 9.22GB, persistent = 5.22GB
#24.249 3.341 28.946
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
If I understand correctly, x contains the keys of interest and you want to return the market values stored in the lookup table y. The data.table approach was to join the two together and populate the default 0 set values in x with those in y. From a database (SQL) perspective, that's actually kind of odd, but it makes sense here.
I think dplyr keeps more true to the database approach and is a straight-forward left-join that leaves non-matches as NA here. The speeds were virtually the same and a simple table(x$market) for both of these show they get the same results. If you actually want to replace the NAs with 0 in this example, it'll cost a bit more.