@arunsrinivasan
arunsrinivasan / SO_23388893.md
Last active August 29, 2015 14:00
data.table's sub-assignment by reference feature vs R v3.0.3 and R v3.1 benchmarks

Here's the code for comparing base R (v3.0.3 and v3.1.0) vs data.table's sub-assignment by reference feature:

require(data.table)
set.seed(20140430)
N <- as.integer(10^(3:7)*2L)
ans = vector("list", length(N))
for (i in seq_along(N)) {
    print(i)
    nreg = N[i]
arunsrinivasan / Ramnath_twitter_question.R
Created March 18, 2014 20:32
Clarifying twitter question regarding usage of `.I`
require(data.table)
DT <- as.data.table(mtcars)
# directly using .SD
set.seed(45L)
system.time(ans1 <- DT[, .SD[sample(.N, 5L)], by=gear])
# user system elapsed
# 0.009 0.000 0.010
arunsrinivasan / SO_21308436.R
Last active January 4, 2016 12:49
min_rank vs min - Hadley's "premature optimisation" point
require(dplyr)
require(data.table)
foo <- function(N) {
group_sizes = 10^(1:(log10(N)-1L))
uniqval <- unique(runif(2*N))
fans <- vector("list", length(group_sizes))
for (i in seq_along(group_sizes)) {
arunsrinivasan / DT_comp_set.R
Created January 13, 2014 21:08
`:=` vs 'set' in data.table
require(data.table)
set.seed(1L)
DT1 <- data.table(x=sample(1e7), y=as.numeric(sample(1e7)), z=sample(letters, 1e7, TRUE))
DT2 <- copy(DT1)
val <- runif(1e7)
# 'set' seems faster when adding a single column
# ==============================================
# here's some sample data to test it out
require(data.table)
require(dplyr)
set.seed(45)
DF <- data.frame(x=sample(3, 25, TRUE), y=1:25, z=26:50)
DP <- tbl_df(DF) # for DPLYR data.frame object
DT <- data.table(DF)
# 1) row-wise subset (usually based on conditions):
arunsrinivasan / FR_5241.R
Last active January 1, 2016 10:39
FR #5241
require(data.table)
# let's create a huge data.table
set.seed(1)
N <- 2e7 # size of DT
# generate a character vector of length about 1e5
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
ch <- unique(ch)
arunsrinivasan / dplyr_data.table_mini_benchmark.R
Created December 17, 2013 00:03
A small comparison between 'dplyr' and 'data.table'
# version 1.8.11
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / DT_1.8.10vs1.8.11.R
Created December 16, 2013 23:49
Comparing 1.8.11 to 1.8.10
# version 1.8.11 (commit 1048)
require(data.table)
# Loading required package: data.table
# data.table 1.8.11 For help type: help("data.table")
## create a huge data.table:
## -------------------------
set.seed(1)
N <- 2e7 # size of DT
arunsrinivasan / pandas_data.table.py
Created December 11, 2013 13:27
Comparison of pandas with data.table joins (along with base:::merge and plyr:::join)
import random
import numpy as np
from pandas import *
from pandas.util.testing import rands
N = 10000
ngroups = 10
def get_test_data(ngroups=100, n=N):
    unique_groups = range(ngroups)
    # integer division so np.tile receives an integer repeat count
    arr = np.asarray(np.tile(unique_groups, n // ngroups), dtype=object)
arunsrinivasan / pandas_data.table.R
Last active December 31, 2015 00:49
Comparison of pandas with data.table (along with base:::merge and plyr:::join)
require(data.table)
# tested on 1.8.10 AND 1.8.11, results don't differ much at all.
require(plyr)
# 1_8 version
set.seed(1000) # for reproducibility
N <- 1e4
foo <- function() paste(sample(letters, 10), collapse="")
indices <- replicate(N, foo())