Szilard Pafka szilard

@szilard
szilard / sqlite_vs_datatable.txt
Last active May 2, 2016 22:39
SQLite vs R data.table
sqlite vs R's data.table
TL;DR: sqlite (:memory:) 250 sec, data.table 7 sec
data: 100 million rows, 1 million groups
generated by: https://github.com/szilard/benchm-databases/blob/master/0-gendata.txt
szilard / h2o-group_by.txt
Created November 8, 2015 04:57
h2o group_by simple speed test
########### R
library(h2o)
h2oServer <- h2o.init(max_mem_size = "50g", nthreads = -1)
d <- h2o.importFile(h2oServer, path = "d.csv")
system.time({
szilard / nnet_outliers.R
Created January 12, 2016 13:01
outliers impact on neural net classifier
library(nnet)
library(h2o)
h2o.init()
set.seed(123)
n <- 1000
x1 <- runif(n)
szilard / data_table_materialized_join_vs_not.R
Created February 11, 2016 21:50
data.table materialized join vs not
## count
benchmark(
nrow(d[dm, nomatch=0L, on="x"]),
d[dm, .N, nomatch=0L, on="x"],
replications = 5, columns = c("test", "replications", "elapsed", "relative"))
#                                 test replications elapsed relative
#2   d[dm, .N, nomatch = 0, on = "x"]            5  28.535    1.000
#1 nrow(d[dm, nomatch = 0, on = "x"])            5  38.562    1.351
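A minimal sketch of the two forms compared above, on toy data. The tables `d` and `dm` here are small made-up stand-ins (the gist's real inputs are not shown in this preview), and the sketch only checks that both expressions return the same count. It does not try to reproduce the timings.

```r
library(data.table)

# Assumption: illustrative toy tables, far smaller than whatever the gist used
set.seed(1)
d  <- data.table(x = sample(1e4, 1e5, replace = TRUE), y = runif(1e5))
dm <- data.table(x = sample(1e4, 1e3))   # 1000 distinct keys

# Materialized: build the full joined table first, then count its rows
n_materialized <- nrow(d[dm, nomatch = 0L, on = "x"])

# Not materialized: request only the row count (.N) from the join
n_counted <- d[dm, .N, nomatch = 0L, on = "x"]

print(c(materialized = n_materialized, counted = n_counted))
```

Both forms answer the same question; the second avoids allocating the joined rows, which is consistent with it being the faster row in the benchmark output.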
szilard / R_df_copy_3.0vs3.1.R
Last active June 12, 2016 18:38
R dataframes copying 3.0 vs 3.1
system.time(z <- 1:1e9)
system.time(d <- data.frame(x = 1:1e9))
system.time(d$y <- 1:1e9)
system.time(d$z <- z)
system.time(d$x[1] <- 0L)
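The calls above can be tried at a smaller scale. This sketch assumes `n = 1e6` instead of the gist's `1e9` so that it fits in memory on a typical machine; the absolute timings will therefore be tiny and are not comparable with the gist's. The R 3.1.0 release notes describe reduced copying in assignments like these, which is presumably the difference the title refers to.

```r
n <- 1e6   # assumption: scaled down from the gist's 1e9

t_seq <- system.time(z <- 1:n)                  # plain integer sequence
t_df  <- system.time(d <- data.frame(x = 1:n))  # one-column data frame
t_new <- system.time(d$y <- 1:n)                # add a freshly created column
t_old <- system.time(d$z <- z)                  # add an existing vector as a column
t_mod <- system.time(d$x[1] <- 0L)              # modify a single element

# Elapsed seconds for each step, in the same order as the calls above
timings <- c(seq     = t_seq[["elapsed"]],
             df      = t_df[["elapsed"]],
             new_col = t_new[["elapsed"]],
             old_col = t_old[["elapsed"]],
             modify  = t_mod[["elapsed"]])
print(timings)
```

Running the same script under two R versions and comparing the `modify` and `new_col` entries is one way to see whether whole-object copies are being made.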
szilard / ML_with_H2O.R
Last active June 25, 2016 16:19
ML with H2O.ai
library(h2o)
h2o.init(max_mem_size = "20g", nthreads = -1)
# R is connected to the H2O cluster:
# H2O cluster uptime: 1 seconds 704 milliseconds
# H2O cluster version: 3.8.2.8
# H2O cluster name: H2O_started_from_R_szilard_lcr105
# H2O cluster total nodes: 1
# H2O cluster total memory: 17.78 GB
szilard / ec2_x1_2TB.R
Created June 24, 2016 03:15
R on EC2 x1 2TB RAM 128 cores
> system.time(x <- 1:1e11)
   user  system elapsed
221.491 210.466 432.030
> object.size(x)/1e9
800.00000004 bytes
> system.time(sum(x))
   user  system elapsed
145.913  78.183 230.063
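The 800 figure printed by `object.size(x)/1e9` can be checked by hand. The sketch below is only the arithmetic, under the assumption that each element takes 8 bytes: `1e11` is larger than R's maximum integer, so `1:1e11` is stored as doubles. Despite the `bytes` label in the output, the value after dividing by `1e9` is in GB.

```r
n <- 1e11

# 1e11 exceeds the largest R integer, so 1:n has to be stored as doubles
exceeds_int <- n > .Machine$integer.max
print(exceeds_int)

# 8 bytes per double, expressed in GB (1e9 bytes)
size_gb <- n * 8 / 1e9
print(size_gb)

# Same rule on a vector small enough to allocate anywhere:
# the reported size is the 8-byte payload plus a small fixed header
small_bytes <- as.numeric(object.size(numeric(1e6)))
print(small_bytes)
```

The trailing `0.00000004` in the gist's output corresponds to 40 bytes on top of the 8e11-byte payload.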
szilard / R_1TB_bug.R
Last active June 24, 2016 07:11
R 1TB bug
## allocate <1TB first, stuff works
> x <- 1:1.2e11
> gc()
                used     (Mb)  gc trigger      (Mb)     max used     (Mb)
Ncells        214403     11.5 4.60000e+05      24.6       350000     18.7
Vcells  120000397293 915530.4 1.72801e+11 1318366.9 120000398080 915530.4
> system("echo 1")
1
szilard / h2o_steam.R
Created October 14, 2016 04:44
H2O Steam deploy GBM
library(h2o)
h2o.init(nthreads = -1)
dx_train <- h2o.importFile("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv")
system.time({
md_10 <- h2o.gbm(x = 1:(ncol(dx_train)-1), y = ncol(dx_train), training_frame = dx_train,
model_id = "airline_depth10",
szilard / GBM_vs_SVDKL.R
Last active August 1, 2023 00:59
GBM vs SV-DKL (Stochastic Variational Deep Kernel Learning) on the airline dataset
## Stochastic Variational Deep Kernel Learning
## paper: https://arxiv.org/abs/1611.00336
## code+data from the authors (thanks!!!): https://people.orie.cornell.edu/andrew/code/#SVDKL
## get data + prepare sample authors used for evaluation
wget https://people.orie.cornell.edu/andrew/code/svdklcode.zip
unzip svdklcode.zip