Last active
August 29, 2015 14:10
R data.table vs pandas aggregate/join
## Minimal example of R's data.table vs pandas aggregation and join benchmark
## (a more detailed but still basic benchmark is here:
## http://datascience.la/dplyr-and-a-very-basic-benchmark/ )
## Just copy and paste into R and IPython, respectively
## Timings on a decent server with data.table 1.9.4 & pandas 0.15.1 (Nov 2014)
#### R:
library(data.table)
n <- 10e6
m <- 1e6
d <- data.table(x = sample(m, n, replace=TRUE), y = runif(n))
## aggregation
system.time(
  d[, mean(y), by=x]
)
# ~ 1 sec
d2 <- data.table(x = sample(m))
setkey(d)
## join
system.time(
  d[d2, nomatch=0]
)
# ~ 0.5 sec
#### IPython:
import pandas as pd
import numpy as np
n = int(10e6)  ## integer sizes: NumPy expects an int for the sample count
m = int(1e6)
d = pd.DataFrame({"x": np.random.randint(0,m,n), "y": np.random.random(n)})
## aggregation
%time dd = d.groupby("x")["y"].mean()
## ~ 1.5 sec
## join
d2 = pd.DataFrame({"x": np.random.permutation(np.arange(m))})
d = d.sort_index(by = "x")  ## pandas 0.15.1 API; on pandas >= 0.17 use d.sort_values(by = "x")
%time dd = pd.merge(d, d2)
## ~ 2.5 sec
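## Small-scale sanity check of the pandas aggregation and join above.
## This is only a sketch and was not part of the timings: the sizes, the seed
## and the *_small names are assumptions chosen for illustration. It uses plain
## Python (no %time), so it also runs outside IPython.

```python
import numpy as np
import pandas as pd

# Hypothetical small sizes and a fixed seed, so the check is quick and repeatable
rng = np.random.RandomState(0)
n_small, m_small = 1000, 100

d_small = pd.DataFrame({"x": rng.randint(0, m_small, n_small),
                        "y": rng.random_sample(n_small)})
d2_small = pd.DataFrame({"x": rng.permutation(np.arange(m_small))})

# aggregation: one mean of y per distinct value of x
agg = d_small.groupby("x")["y"].mean()

# join: pd.merge defaults to an inner join on the shared column "x"
joined = pd.merge(d_small, d2_small)

# d2_small holds every value 0..m_small-1 exactly once, so no row of d_small is dropped
print(len(agg), len(joined))
```

## Because d2 contains each key exactly once, the inner join keeps every row of d;
## the data.table call d[d2, nomatch=0] is likewise an inner join.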