Szilard Pafka (szilard)
szilard / data_table_materialized_join_vs_not.R
Created February 11, 2016 21:50
data.table materialized join vs not
## count rows of an inner join: materialize the joined table vs aggregate in place
library(data.table)
library(rbenchmark)
benchmark(
  nrow(d[dm, nomatch=0L, on="x"]),   # materializes the full join result, then counts
  d[dm, .N, nomatch=0L, on="x"],     # counts inside the join, no materialization
  replications = 5, columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
#2 d[dm, .N, nomatch = 0, on = "x"] 5 28.535 1.000
#1 nrow(d[dm, nomatch = 0, on = "x"]) 5 38.562 1.351
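The preview does not show how d and dm were built; a minimal setup sketch that reproduces the shape of the benchmark (table sizes and key cardinality are assumptions):

## assumed setup: a large table d joined against a smaller lookup table dm on "x"
library(data.table)
set.seed(123)
d  <- data.table(x = sample(1e6, 1e8, replace = TRUE))
dm <- data.table(x = sample(1e6, 1e5))

The .N form never allocates the joined table, consistent with it coming out ~1.35x faster in the timings above.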
szilard / nnet_outliers.R
Created January 12, 2016 13:01
outliers impact on neural net classifier
library(nnet)
library(h2o)
h2o.init()
set.seed(123)
## simulate a small dataset to probe outlier sensitivity
n <- 1000
x1 <- runif(n)
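The preview truncates here; a hedged sketch of how the experiment might continue (the outcome model, the outlier injection, and the nnet settings are assumptions; the gist presumably compares against h2o.deeplearning as well):

## assumed continuation: fit nnet with and without a single extreme outlier
y <- as.factor(ifelse(x1 + rnorm(n, sd = 0.3) > 0.5, 1, 0))
d <- data.frame(x1, y)
md <- nnet(y ~ x1, data = d, size = 5, decay = 0.01, maxit = 200, trace = FALSE)

d_out <- d
d_out$x1[1] <- 100   # one wildly out-of-range feature value
md_out <- nnet(y ~ x1, data = d_out, size = 5, decay = 0.01, maxit = 200, trace = FALSE)

## compare the two decision curves on a grid
xg <- data.frame(x1 = seq(0, 1, by = 0.01))
plot(xg$x1, predict(md, xg), type = "l", ylim = c(0, 1))
lines(xg$x1, predict(md_out, xg), col = "red")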
szilard / sparse-linreg.R
Last active January 1, 2016 10:09
Sparse linear regression
library(Matrix)
rm(list=ls())
set.seed(123)
## parameters
n <- 1e6
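The preview stops at the parameters; a sketch of how the sparse regression might be set up with Matrix (feature count, density, and the solve step are assumptions):

## assumed continuation: sparse design matrix and a normal-equations fit
p <- 1000
nnz <- 5 * n   # ~5 nonzero entries per row
X <- sparseMatrix(i = sample(n, nnz, replace = TRUE),
                  j = sample(p, nnz, replace = TRUE),
                  x = rnorm(nnz), dims = c(n, p))
beta <- rnorm(p)
y <- as.numeric(X %*% beta + rnorm(n))
## sparse Cholesky via Matrix; a dense model matrix here would need ~8 GB before lm() even starts
fit <- solve(crossprod(X), crossprod(X, y))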
szilard / meetup_raffle.R
Last active December 25, 2015 20:39
LA R meetup raffle
library(yaml)
library(RJSONIO)
library(httr)
event_id <- 132296372
n_max <- 20
## get your api key from http://www.meetup.com/meetup_api/key/ while logged in
api_key <- yaml.load_file("meetup_api_key.yml")$api_key
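The preview ends after the setup; a sketch of the raffle itself (the /2/rsvps endpoint is the old Meetup REST API, and the response handling is an assumption, not the gist's code):

## assumed continuation: fetch RSVPs for the event and draw a winner
resp <- GET("https://api.meetup.com/2/rsvps",
            query = list(event_id = event_id, key = api_key))
rsvps <- fromJSON(content(resp, as = "text"))
attendees <- sapply(rsvps$results, function(r) r$member$name)
cat("Winner:", sample(attendees, 1), "\n")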
szilard / h2o-group_by.txt
Created November 8, 2015 04:57
h2o group_by simple speed test
########### R
library(h2o)
h2oServer <- h2o.init(max_mem_size = "50g", nthreads = -1)
d <- h2o.importFile(h2oServer, path = "d.csv")
system.time({
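The preview cuts off inside system.time, and the "########### R" header suggests the gist has matching sections for other tools that are not shown. A plausible completion, assuming a grouping column x, an aggregate over y, and the h2o R package's h2o.group_by:

## assumed continuation: mean of y within each group of x
system.time({
  dg <- h2o.group_by(d, by = "x", mean("y"))
})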
szilard / h2o_sum_1bn.R
Created November 4, 2015 19:00
H2O sum 1 bn numbers
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-slater/9/R")
library(h2o)
h2oServer <- h2o.init(nthreads = -1)
system.time({
d <- h2o.createFrame(h2oServer, rows = 1e9, cols = 1, missing_fraction = 0,
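The createFrame call is truncated mid-argument-list; a hedged completion (remaining arguments per the h2o R API, with the sum presumably being what gets timed next):

## assumed completion: generate 1 bn random reals, then sum them on the cluster
system.time({
  d <- h2o.createFrame(h2oServer, rows = 1e9, cols = 1, missing_fraction = 0,
                       categorical_fraction = 0, integer_fraction = 0,
                       binary_fraction = 0)
})
system.time(sum(d))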
szilard / overfitting.R
Created November 1, 2015 16:22
Illustration for overfitting
library(ggplot2)
## y is pure noise, so any structure a smoother finds here is overfitting
n <- 30
d <- data.frame(x = 1:n, y = runif(n))
ggplot(d, aes(x = x, y = y)) + geom_point() +
  geom_smooth(se = FALSE, span = 0.1)
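For contrast (not in the gist), overlaying a wider-span smoother on the same plot makes the point: the span = 0.1 loess chases every point, while the wide span stays near the flat truth:

ggplot(d, aes(x = x, y = y)) + geom_point() +
  geom_smooth(se = FALSE, span = 0.1, colour = "red") +   # overfit: follows the noise
  geom_smooth(se = FALSE, span = 1)                       # smoother: close to the flat signal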
szilard / psum.c
Last active October 29, 2015 02:56
Parallel sum 1 bn numbers pthreads
/*
Adapted from:
https://computing.llnl.gov/tutorials/pthreads/samples/arrayloops.c
http://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
Run as:
gcc -Ofast -pthread psum.c -lm && ./a.out
*/
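Only the header comment survives in this preview. For comparison with the other number-summing gists on this page, a rough R analogue of the same idea, not part of the gist (forked workers each sum a contiguous chunk; the worker count is an assumption, and forking makes this Unix-only):

## parallel sum of 1 bn numbers with forked workers
library(parallel)
x <- as.numeric(1:1e9)
system.time({
  chunks <- splitIndices(length(x), 8)                       # 8 contiguous index ranges
  parts  <- mclapply(chunks, function(idx) sum(x[idx]), mc.cores = 8)
  total  <- sum(unlist(parts))
})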
szilard / adding_numbers.R
Last active October 24, 2015 00:12
Timing sum of 1 billion numbers
x <- as.numeric(1:1e9)   # ~8 GB of doubles
system.time(sum(x))
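Why the as.numeric(): R integers are 32-bit, so summing the raw integer vector overflows. A quick demonstration:

sum(1:1e9)              # NA, with an integer-overflow warning
sum(as.numeric(1:1e9))  # 5.000000005e+17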
szilard / benchm-ml-spark
Last active September 9, 2015 16:29 — forked from jkbradley/benchm-ml-spark
Running benchm-ml benchmark for random forest on Spark, using soft predictions to get better AUC
Here are two code snippets:
(1) Compute one-hot encoded data for Spark, using the data generated by https://github.com/szilard/benchm-ml/blob/master/0-init/2-gendata.txt
(2) Run MLlib, computing soft predictions by hand.
I ran these with Spark 1.4, and they should work for 1.5 as well.
Note: There's no real need to switch to DataFrames yet for benchmarking. Both the RDD and DataFrame APIs use the same underlying implementation. (I hope to improve on that in Spark 1.6 if there is time.)
Ran on an EC2 cluster with 4 workers (9.6 GB memory each) and 8 partitions for the training RDD.
For the 1M dataset, training the forest took 2080.8 sec and achieved a test-set AUC of 0.7130.
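The Spark snippets themselves are not shown in this preview. To illustrate the soft-vs-hard prediction point in R (randomForest and ROCR standing in for MLlib; the data and code below are assumptions, not the gist's): averaging per-tree votes gives a continuous score and hence a real ROC curve, while majority-vote labels collapse everything to a single operating point.

library(randomForest)
library(ROCR)
set.seed(123)
n <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- as.factor(ifelse(x1 + x2 + rnorm(n) > 0, 1, 0))
d_train <- data.frame(x1, x2, y)[1:1000, ]
d_test  <- data.frame(x1, x2, y)[1001:2000, ]
md <- randomForest(y ~ ., data = d_train, ntree = 100)

p_soft <- predict(md, d_test, type = "prob")[, "1"]   # fraction of trees voting "1"
p_hard <- predict(md, d_test, type = "response")      # majority-vote class labels

auc <- function(p) performance(prediction(p, d_test$y), "auc")@y.values[[1]]
auc(p_soft)                            # soft scores: proper AUC
auc(as.numeric(as.character(p_hard)))  # hard labels: noticeably lower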