@szilard
Last active August 29, 2015 14:10
R data.table vs pandas aggregate/join
## Minimal example of R's data.table vs pandas aggregation and join benchmark
## (a more detailed but still basic benchmark is here:
## http://datascience.la/dplyr-and-a-very-basic-benchmark/ )
## Just copy and paste each part into R and IPython, respectively
## Timings on a decent server with data.table 1.9.4 & pandas 0.15.1 (Nov 2014)
#### R:
library(data.table)
n <- 10e6
m <- 1e6
d <- data.table(x = sample(m, n, replace=TRUE), y = runif(n))
## aggregation
system.time(
d[, mean(y), by=x]
)
# ~ 1 sec
d2 <- data.table(x = sample(m))
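setkey(d)  # no columns given, so d is keyed (sorted) by all columns; the join below matches on x, the first key column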
## join
system.time(
d[d2, nomatch=0]
)
# ~ 0.5 sec
#### IPython:
import pandas as pd
import numpy as np
n = int(10e6)  # 10e6 is a float; current NumPy requires integer sizes and bounds
m = int(1e6)
d = pd.DataFrame({"x": np.random.randint(0,m,n), "y": np.random.random(n)})
## aggregation
%time dd = d.groupby("x")["y"].mean()
## ~ 1.5 sec
## join
d2 = pd.DataFrame({"x": np.random.permutation(np.arange(m))})
d = d.sort_values("x")  # pandas >= 0.17; on pandas 0.15.1 this was d.sort_index(by = "x")
%time dd = pd.merge(d, d2)
## ~ 2.5 sec
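## The pandas half above relies on the IPython %time magic. As a rough sketch
## (not part of the original benchmark), the same two steps can be timed in
## plain Python with time.perf_counter. The bench() helper, the seed, and the
## smaller default sizes are assumptions made here for illustration; the
## timings it prints are not comparable to the numbers quoted above.

```python
import time

import numpy as np
import pandas as pd


def bench(n=100000, m=10000, seed=0):
    """Time a groupby-mean and a merge on synthetic data (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    # n rows whose key x is drawn uniformly from 0..m-1, plus a random value y
    d = pd.DataFrame({"x": rng.randint(0, m, n), "y": rng.random_sample(n)})
    # lookup table containing each key 0..m-1 exactly once, in random order
    d2 = pd.DataFrame({"x": rng.permutation(np.arange(m))})

    ## aggregation
    t0 = time.perf_counter()
    agg = d.groupby("x")["y"].mean()
    t_agg = time.perf_counter() - t0

    ## join (inner join on x, the default behaviour of pd.merge)
    t0 = time.perf_counter()
    joined = pd.merge(d, d2, on="x")
    t_join = time.perf_counter() - t0

    return agg, joined, t_agg, t_join


agg, joined, t_agg, t_join = bench()
print("aggregation: %.3f s, join: %.3f s" % (t_agg, t_join))
```

## Because d2 holds every key exactly once, the inner join should return one
## row per row of d, which gives a cheap sanity check on the result. Pass
## n=int(10e6), m=int(1e6) to bench() to match the sizes used above.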