@arunsrinivasan
arunsrinivasan / rbind_fill_benchmarking
Last active October 17, 2016 22:14
Benchmarking a data.table version of rbind.fill against plyr's rbind.fill
# The post with the benchmarking results is linked below:
# http://stackoverflow.com/questions/18003717/is-there-any-efficient-way-than-rbind-filllist/18004698#18004698
# This script generated the benchmarks and plots; it is provided so that anyone can replicate them.
# Note: the benchmarking takes about 2-3 hours to finish.
require(plyr)
require(data.table)
require(ggplot2)
require(microbenchmark)
arunsrinivasan / DT_1.8.10_benchmark.R
Last active December 30, 2015 12:59
1.8.10 : Benchmark: comparison between data.table 1.8.10 and 1.8.11 commit 1048
# version 1.8.10
require(data.table)
# Loading required package: data.table
# data.table 1.8.10 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / DT_1.8.11_1048_benchmark.R
Last active December 30, 2015 13:38
1.8.11 (commit 1048) : Benchmark: comparison between data.table 1.8.10 and 1.8.11 commit 1048
# version 1.8.11 (commit 1048)
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / dplyr_vs_data.table_1.8.11.R
Last active December 30, 2015 13:59
Benchmarking dplyr and data.table 1.8.11 commit 1048
# version 1.8.11 (commit 1048)
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / CologneR.R
Last active December 30, 2015 14:09
Script used to generate results for CologneR user group meet (for reproducibility)
require(reshape2)
# data.table commit (1048)
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / dplyr_vs_data.table_1.8.11_less_groupings.R
Created December 7, 2013 17:44
Benchmarking dplyr and data.table 1.8.11 commit 1048 (with fewer groups)
# version 1.8.11 (commit 1048)
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / pandas_data.table.R
Last active December 31, 2015 00:49
Comparison of pandas with data.table (along with base::merge and plyr::join)
require(data.table)
# tested on 1.8.10 AND 1.8.11, results don't differ much at all.
require(plyr)
# 1_8 version
set.seed(1000) # for reproducibility
N <- 1e4
foo <- function() paste(sample(letters, 10), collapse="")
indices <- replicate(N, foo())
arunsrinivasan / pandas_data.table.py
Created December 11, 2013 13:27
Comparison of pandas with data.table joins (along with base::merge and plyr::join)
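from pandas import *
from pandas.util.testing import rands
import numpy as np  # imported explicitly rather than relying on the star import
import random

N = 10000
ngroups = 10

def get_test_data(ngroups=100, n=N):
    unique_groups = list(range(ngroups))
    # integer division: np.tile needs an integer repeat count
    arr = np.asarray(np.tile(unique_groups, n // ngroups), dtype=object)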
arunsrinivasan / DT_1.8.10vs1.8.11.R
Created December 16, 2013 23:49
Comparing 1.8.11 to 1.8.10
# version 1.8.11 (commit 1048)
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / dplyr_data.table_mini_benchmark.R
Created December 17, 2013 00:03
A small comparison between 'dplyr' and 'data.table'
# version 1.8.11
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT