Illustrating the impact of number of groups on joins

Update: The timings have been refreshed with runs from R v3.2.2, and now also include data.table's new 'on=' syntax.

A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:

The number of rows, while important, is only one of the factors that influence the time taken to perform a join. From my benchmarking experience, the two features that influence join speed much more, especially for hash-table based approaches (e.g. dplyr), are:

  • The number of unique groups.
  • The number of columns to join on - note that this is related to the previous point, since in most cases more join columns means more unique group combinations.

That is, these features influence join speed even when the number of rows stays the same.

I'll consider a self-join on 10 million rows to illustrate the idea, comparing the self-join on:

  • 1,000 and 100,000 groups, joining on one column (cases A and B)
  • 100 and 100,000 groups, joining on three columns (cases C and D).

I'll compare both dplyr and data.table, as this is a very nice case for looking at the performance tradeoffs of the hash-table approach in dplyr vs. the radix sort + binary search approach in data.table.
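To make the contrast concrete, here's a toy sketch of the two strategies on a single integer key column. This is my illustration, not either package's actual implementation - both use heavily tuned C internals:

## toy hash-based lookup, in the spirit of dplyr's approach:
## build a lookup table on x's keys once, then probe it for each key in i
hash_join <- function(xkeys, ikeys) {
  tab <- split(seq_along(xkeys), xkeys)    # key -> matching row indices
  lapply(as.character(ikeys), function(k) tab[[k]])
}

## toy sort-based lookup, in the spirit of data.table's approach:
## radix-sort x's keys once, then binary-search each key of i
sort_join <- function(xkeys, ikeys) {
  o  <- sort.list(xkeys, method = "radix")
  xs <- xkeys[o]
  lapply(ikeys, function(k) {
    lo <- findInterval(k - 1L, xs) + 1L    # first position with key == k
    hi <- findInterval(k, xs)              # last position with key == k
    if (hi >= lo) o[lo:hi] else integer(0)
  })
}

For example, sort_join(c(5L, 2L, 5L, 9L), c(5L, 7L)) returns list(c(1L, 3L), integer(0)) - the row indices of x matching each key of i. The benchmarks below are about how these two strategies behave as the number of unique keys grows.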

## prepare functions
require(dplyr)
require(data.table)
set.seed(1L)

N = 1e7L
getDT <- function(N) {
  ## v1, v3: 10 unique values each; v2: constant (1 value);
  ## v4: 1,000 unique values; v5: 100,000 unique values
  data.table(v1 = sample(10L, N, TRUE), v2 = 2L, 
             v3 = sample(10L, N, TRUE), v4 = sample(1e3L, N, TRUE), 
             v5 = sample(1e5L, N, TRUE), val = runif(N))
}
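
As a quick check on the cardinalities driving each case (my addition, not in the original gist; uniqueN's 'by' argument needs data.table v1.9.6 or later):

## number of distinct join-key combinations each case will see
x <- getDT(N)
uniqueN(x, by = "v4")                  # case A: ~1,000 groups
uniqueN(x, by = "v5")                  # case B: ~100,000 groups
uniqueN(x, by = c("v1", "v2", "v3"))   # case C: 100 groups (10 x 1 x 10)
uniqueN(x, by = c("v1", "v3", "v4"))   # case D: ~100,000 groups (10 x 10 x 1,000)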

case A - 1,000 groups + join on one column

## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))

## second data to perform self-join
i.DT = unique(x.DT, by="v4")[, list(v4)]
i.DF = tbl_df(as.data.frame(i.DT))

## data.table with 'on=' syntax
system.time({
    x.DT[i.DT, on="v4"]
})
#    user  system elapsed 
#   0.888   0.118   1.011

## data.table with keys
system.time({
    setkey(x.DT, v4)
    ans1 = x.DT[i.DT]
})
#    user  system elapsed 
#   0.921   0.039   0.966

## dplyr
system.time({
    ans2 = left_join(i.DF, x.DF, by="v4")
})
#    user  system elapsed 
#   0.928   0.152   1.087
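
As a sanity check that the two tools agree (my addition, not part of the timings), the keyed data.table result ans1 and the dplyr result ans2 should contain the same rows once row and column order are normalised - the following should return TRUE:

## normalise column order and row order before comparing
ans2.DT <- as.data.table(ans2)
setcolorder(ans2.DT, names(ans1))
all.equal(setorder(copy(ans1), v4, val),
          setorder(ans2.DT, v4, val),
          check.attributes = FALSE)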

case B - 100,000 groups + join on one column

## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))

## second data to perform self-join
i.DT = unique(x.DT, by="v5")[, list(v5)]
i.DF = tbl_df(as.data.frame(i.DT))

## data.table with 'on=' syntax
system.time({
    x.DT[i.DT, on="v5"]
})
#    user  system elapsed 
#   1.099   0.108   1.211

## data.table with keys
system.time({
    setkey(x.DT, v5)
    ans1 = x.DT[i.DT]
})
#    user  system elapsed 
#   1.231   0.038   1.273

## dplyr
system.time({
    ans2 = left_join(i.DF, x.DF, by="v5")
})
#    user  system elapsed 
#   2.770   0.209   2.989

Note that the data.table join time barely changes between 1,000 and 100,000 groups (0.97s vs 1.27s elapsed with keys; 1.01s vs 1.21s with 'on='), whereas dplyr's run time nearly triples (1.09s vs 2.99s). In other words, the number of groups doesn't seem to influence the run time much for the data.table approach, but it clearly does for dplyr.

data.table scales much better.


Now on to cases C and D.

case C - 100 groups + join on three columns

## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))

## second data to perform self-join
i.DT = unique(x.DT, by=c("v1", "v2", "v3"))[, list(v1, v2, v3)]
i.DF = tbl_df(as.data.frame(i.DT))

## data.table with 'on=' syntax
system.time({
    x.DT[i.DT, on=c("v1", "v2", "v3")]
})
#    user  system elapsed 
#   0.647   0.049   0.698

## data.table with keys
system.time({
    setkey(x.DT, v1, v2, v3)
    ans1 = x.DT[i.DT]
})
#    user  system elapsed 
#   0.921   0.041   0.965

## dplyr
system.time({
    ans2 = left_join(i.DF, x.DF, by=c("v1", "v2", "v3"))
})
#    user  system elapsed 
#   0.826   0.201   1.033 

case D - 100,000 groups + join on three columns

## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))

## second data to perform self-join
i.DT = unique(x.DT, by=c("v1", "v3", "v4"))[, list(v1, v3, v4)]
i.DF = tbl_df(as.data.frame(i.DT))

## data.table with 'on=' syntax
system.time({
    x.DT[i.DT, on=c("v1", "v3", "v4")]
})
#    user  system elapsed 
#   0.992   0.111   1.105 

## data.table with keys
system.time({
    setkey(x.DT, v1, v3, v4)
    ans1 = x.DT[i.DT]
})
#    user  system elapsed 
#   1.303   0.072   1.380 

## dplyr
system.time({
    ans2 = left_join(i.DF, x.DF, by=c("v1", "v3", "v4"))
})
#    user  system elapsed 
#   5.228   0.219   5.478 

Summary table of the timings (elapsed seconds):

Case     Groups    Join columns    dplyr    DT_key    DT_on
   A      1,000               1     1.09      0.97     1.01
   B    100,000               1     2.99      1.27     1.21
   C        100               3     1.03      0.97     0.70
   D    100,000               3     5.48      1.38     1.11
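
One caveat when reading the DT_key column (my aside, not part of the original write-up): those timings include the cost of setkey itself. When the key is reused across many joins, the sort is a one-off cost, so it can be fairer to time the two steps separately:

## separate the one-off sort from the joins it enables
x.DT <- getDT(N)
i.DT <- unique(x.DT, by = "v4")[, list(v4)]

system.time(setkey(x.DT, v4))   # one-time radix sort of x.DT
system.time(x.DT[i.DT])         # subsequent joins just binary-search the key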