
arunsrinivasan/17844143.R Secret

Last active Apr 12, 2016
SO_17844143
require(qdapTools)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
#### this line takes a long time... Tyler any ideas why?
x$market = lookup(x$zip, y[, 2:1])
####
# I think the actual answer is:
system.time({
idx1 = match(x$zip, y$zip)
idx2 = which(!is.na(idx1))
x$market[idx2] <- y$market[idx1[idx2]]
})
# takes 0.085 seconds
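To make the indexing trick concrete, here is a tiny toy example (hypothetical data, not the benchmark above) showing how match() and the NA filter line up:

```r
# Toy lookup: 4 zips in x, 2 of them present in y
x <- data.frame(zip = c(10, 20, 30, 40), market = 0L)
y <- data.frame(market = c(7L, 9L), zip = c(30, 10))
idx1 <- match(x$zip, y$zip)   # c(2, NA, 1, NA): position of each x zip in y
idx2 <- which(!is.na(idx1))   # c(1, 3): rows of x that have a match
x$market[idx2] <- y$market[idx1[idx2]]
x$market                      # 9 0 7 0
```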
## here's a data.table solution using joins - although this is not necessary here:
require(data.table)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
system.time({
setDT(x)
setDT(y)
setkey(x, zip)
setkey(y, zip)
x[y, market := i.market]
})
## takes 0.094 seconds
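As an aside, with data.table v1.9.6+ the same update-join can be written without setkey, using the `on` argument; a sketch on the same simulated data:

```r
require(data.table)
set.seed(1L)
x = data.table(zip=sample(1e6), market=0L)
y = data.table(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
# ad hoc join on "zip"; x's row order is preserved, no keys needed
x[y, market := i.market, on = "zip"]
```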
## here's using match and `:=` in data.table
require(data.table)
set.seed(1L)
x = data.frame(zip=sample(1e6), market=0L)
y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
system.time({
idx = match(x$zip, y$zip, nomatch=0L)
setDT(x)[idx != 0L, market := y$market[idx]]
})
## takes 0.58 seconds.
@trinker trinker commented May 9, 2014

Not sure. Some folks at talkstats helped me design lookup many moons ago. It may well be that R's optimization has caught up to the point where this approach is faster than the hash environment. At the moment I don't have much time to look at this closely. When I do, this will certainly inform qdapTools optimization. Thanks for the gist.

@trinker trinker commented May 9, 2014

The optimization seen here comes from pulling out the NA values separately, which in itself is informative. When I do something similar with lookup I get:

> set.seed(1L)
> x = data.frame(zip=sample(1e6), market=0L)
> y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
> 
> system.time({
+ idx1 = match(x$zip, y$zip)
+ idx2 = which(!is.na(idx1))
+ x$market[idx2] <- y$market[idx1[idx2]]
+ })
   user  system elapsed 
   0.05    0.00    0.05 
> 
> require(qdapTools)
> set.seed(1L)
> x = data.frame(zip=sample(1e6), market=0L)
> y = data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
>  
> 
> system.time({
+ idx1 = which(!is.na(match(x$zip, y$zip)))
+ x$market[idx1] <- x$zip[idx1] %l% y[, 2:1]
+ })
   user  system elapsed 
   0.17    0.00    0.17 

Still slower. From some tests I did 6 months back, when someone posted on R-bloggers about vectorized lookups, I knew that lookup fared slightly (though negligibly) worse on smaller vectors. Tinkering around here, that may not be the case. Informative; I've opened an issue at qdapTools to deal with this later when I have time: trinker/qdapTools#3
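For reference, the hash-environment approach mentioned above can be sketched in a few lines (toy keys, purely illustrative of the technique; not qdapTools' actual internals):

```r
# R environments with hash = TRUE give amortized O(1) key lookup
ht <- new.env(hash = TRUE)
ht[["90210"]] <- "LA"
ht[["10001"]] <- "NYC"
# retrieve several keys at once; mget returns a named list
unlist(mget(c("10001", "90210"), envir = ht))  # "NYC" "LA"
```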

@trinker trinker commented May 9, 2014

I went ahead and made the changes based on your findings (couldn't help myself). Thanks a ton. I didn't even realize there was a problem. This still may be slower than a well-thought-out vectorized approach, but that's hard to avoid because lookup has a bit of syntactic sugar.

@bryangoodrich bryangoodrich commented May 17, 2014

If I understand correctly, x contains the keys of interest and you want to return the market values stored in the lookup table y. The data.table approach was to join the two together and populate the default 0 set values in x with those in y. From a database (SQL) perspective, that's actually kind of odd, but it makes sense here.

I think dplyr keeps more true to the database approach and is a straightforward left join that leaves non-matches as NA here. The speeds were virtually the same, and a simple table(x$market) for both of these shows they get the same results. If you actually want to replace the NAs with 0 in this example, it'll cost a bit more.

require(dplyr)

set.seed(1L)
x <- data.frame(zip=sample(1e6)) 
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
x <- left_join(x,y, by = 'zip')  # You have NA instead of 0 for non-matches
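If the 0 default is wanted afterwards, the NAs can be filled in a separate step; a sketch using base R replacement (dplyr's coalesce() would also work), which is the extra cost mentioned above:

```r
require(dplyr)
set.seed(1L)
x <- data.frame(zip=sample(1e6))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
x <- left_join(x, y, by = 'zip')    # non-matches get NA
x$market[is.na(x$market)] <- 0L     # restore the 0 default: one extra pass over the column
```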
@bryangoodrich bryangoodrich commented May 17, 2014

I did a quick benchmark of the three approaches:

require(rbenchmark)  # provides benchmark()
require(data.table)
require(dplyr)

benchmark(
INDEX = {
  set.seed(1L)
  x <- data.frame(zip=sample(1e6), market=0L)
  y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
  idx1 <- match(x$zip, y$zip)
  idx2 <- which(!is.na(idx1))
  x$market[idx2] <- y$market[idx1[idx2]]
},
DATATABLE = {
  set.seed(1L)
  x <- data.frame(zip=sample(1e6), market=0L)
  y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
  setDT(x)
  setDT(y)

  setkey(x, zip)
  setkey(y, zip)
  x[y, market := i.market]
},
DPLYR = {
  set.seed(1L)
  x <- data.frame(zip=sample(1e6)) 
  y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(1e6, 2000, FALSE))
  x <- left_join(x,y, by = 'zip')  # You have NA instead of 0 for non-matches
})

And the winner is: indexing!

       test replications elapsed relative user.self sys.self user.child
2 DATATABLE          100   7.631    1.075     7.624        0          0
3     DPLYR          100   7.228    1.018     7.222        0          0
1     INDEX          100   7.097    1.000     7.090        0          0
@arunsrinivasan arunsrinivasan commented May 28, 2014

Note to self: cleaned up noise and updated benchmarks. The benchmark below uses relatively large data (200 million rows, 50k unique groups), which illustrates the impact of the extra copy needed to replace NAs with 0s in dplyr after left_join.

With data.table v1.9.6+

require(data.table)
N = 200e6L
K = 50e3L
DT = function() {
  setDT(x)[, market := 0L][setDT(y), market := i.market, on="zip"]
}

set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()

system.time(DT()) # all heap & anonymous VM, total bytes = 1.81GB, persistent = 1.01GB
#10.862   0.545  11.520

With dplyr v0.4.3.9001:

require(dplyr)
N = 200e6L
K = 50e3L
DPLYR = function() {
  x <- left_join(x,y, by = 'zip') %>% mutate(market = replace(market, which(is.na(market)), 0L))
}

set.seed(1L)
x <- data.frame(zip=sample(N))
y <- data.frame(market=sample(20, 2000, TRUE), zip=sample(N, K, FALSE))
gc()

system.time(DPLYR()) # All heap & anonymous VM, total bytes = 9.22GB, persistent = 5.22GB
#24.249  3.341  28.946