Szilard Pafka (szilard)

szilard / mlbenchm-spark-gendata.txt
Last active August 29, 2015 14:20
Generate data for machine learning benchmark for Spark
## get the data
for yr in 2005 2006 2007; do
  wget http://stat-computing.org/dataexpo/2009/$yr.csv.bz2
  bunzip2 $yr.csv.bz2
done
## install R and data.table
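The preview cuts off at the install step. A sketch of what the continuation could look like - installing data.table and writing out a train/test split whose file names match the later gists; the apt command, sampling sizes and split are assumptions, not the gist's actual code:

sudo apt-get install -y r-base-dev
R -e 'install.packages("data.table", repos = "https://cran.r-project.org")'

## read the downloaded years and write a sampled train set plus a test set (sketch)
R --vanilla --quiet << EOF
library(data.table)
d <- rbindlist(lapply(c(2005, 2006, 2007), function(yr) fread(paste0(yr, ".csv"))))
set.seed(123)
write.csv(d[sample(.N, 1e6)], "train-1m.csv", row.names = FALSE)
write.csv(d[sample(.N, 1e5)], "test.csv", row.names = FALSE)
EOF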
szilard / mlbenchm-spark-RF.txt
Last active August 29, 2015 14:20
Training random forest in Spark / MLlib
spark-1.3.0-bin-hadoop2.4/bin/spark-shell --driver-memory 100G --executor-memory 100G
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
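The preview shows only the shell invocation and the imports. A minimal sketch of how these classes fit together for training and AUC evaluation; the file name, column layout and hyperparameters are assumptions, not the gist's actual values:

// parse a numeric CSV with the label in the last column (sketch)
val train = sc.textFile("spark-train-1m.csv").map { line =>
  val x = line.split(",").map(_.toDouble)
  LabeledPoint(x.last, Vectors.dense(x.dropRight(1)))
}.cache()

// train the forest (hyperparameters are illustrative)
val model = RandomForest.trainClassifier(train, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 500,
  featureSubsetStrategy = "sqrt", impurity = "gini", maxDepth = 20, maxBins = 50)

// hard 0/1 predictions give only a coarse AUC; the forked benchm-ml-spark gist
// further down computes soft predictions instead
// (scored on the training RDD here only to keep the sketch short)
val scoreAndLabel = train.map(p => (model.predict(p.features), p.label))
val auc = new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()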
szilard / mlbenchm-py-int-encoded.R
Last active August 29, 2015 14:20
Generate integer-encoded categoricals
## generate integer-encoded categoricals
for SIZE in 1; do
time R --vanilla --quiet << EOF
library(data.table)
d1 <- as.data.frame(fread("train-${SIZE}m.csv"))
d2 <- as.data.frame(fread("test.csv"))
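The preview ends right after the files are read. A sketch of the likely encoding step - mapping each character column to integer codes fitted on the combined train+test values; the loop and the output file names (chosen to match what the scikit-learn gist below reads) are assumptions, not the gist's exact code:

## replace each character column by integer codes shared across train and test (sketch)
for (col in names(d1)) {
  if (is.character(d1[[col]])) {
    levs <- unique(c(d1[[col]], d2[[col]]))
    d1[[col]] <- as.integer(factor(d1[[col]], levels = levs))
    d2[[col]] <- as.integer(factor(d2[[col]], levels = levs))
  }
}
write.table(d1, "train-intcateg-${SIZE}m.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(d2, "test-intcateg-${SIZE}m.csv", sep = ",", row.names = FALSE, col.names = FALSE)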
szilard / mlbenchm-py-RF-int-enc.py
Created May 6, 2015 21:01
Scikit-learn RF with integer-encoded categoricals
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
d_train = pd.read_csv("train-intcateg-1m.csv", header=None)
d_test = pd.read_csv("test-intcateg-1m.csv", header=None)
X_train = d_train.iloc[:, 0:8]   # columns 0-7: features
y_train = d_train.iloc[:, 8]     # column 8: label
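The preview stops before the model is fit. A sketch of the likely remainder; the test split, the tree count and the assumption that the label column is already 0/1-encoded are mine, not the gist's:

X_test = d_test.iloc[:, 0:8]
y_test = d_test.iloc[:, 8]

md = RandomForestClassifier(n_estimators=500, n_jobs=-1)   # tree count is an assumption
md.fit(X_train, y_train)

phat = md.predict_proba(X_test)[:, 1]   # P(y = 1), assuming a 0/1-encoded label
print(metrics.roc_auc_score(y_test, phat))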
szilard / SparkR-datatable-aggr100M.txt
Last active April 3, 2019 20:58
SparkR vs data.table - aggregate 100M records
data.table vs SparkR
group-by aggregate on 100M records (1M groups)
data.table: 6.5 sec (without key) / 1.3 sec (with key), both on 1 core
SparkR (cached): 200 sec on 8 cores
SparkR is ~30x / ~150x slower wall-clock (~240x / ~1200x slower per core)
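Only the timing summary survives in the preview. A minimal sketch of the data.table side of such a benchmark; the data shape and column names are illustrative assumptions:

library(data.table)
n <- 1e8; m <- 1e6
d <- data.table(x = sample(m, n, replace = TRUE), y = runif(n))
system.time(d[, sum(y), by = x])   # without key
setkey(d, x)
system.time(d[, sum(y), by = x])   # with key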
szilard / benchm-ml-spark
Last active September 9, 2015 16:29 — forked from jkbradley/benchm-ml-spark
Running benchm-ml benchmark for random forest on Spark, using soft predictions to get better AUC
Here are 2 code snippets:
(1) Compute one-hot encoded data for Spark, using the data generated by https://github.com/szilard/benchm-ml/blob/master/0-init/2-gendata.txt
(2) Run MLlib, computing soft predictions by hand.
I ran these with Spark 1.4, and they should work for 1.5 as well.
Note: There's no real need to switch to DataFrames yet for benchmarking. Both the RDD and DataFrame APIs use the same underlying implementation. (I hope to improve on that in Spark 1.6 if there is time.)
Ran on an EC2 cluster with 4 workers (9.6 GB memory each) and 8 partitions for the training RDD.
For the 1M dataset, training the forest took ~2081 sec and achieved AUC ~0.713 on the test set.
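The "soft predictions by hand" part works around RandomForestModel.predict returning only the hard majority vote, which makes for a coarse AUC: average the individual trees' 0/1 predictions to get a class-1 score and feed that into BinaryClassificationMetrics. A minimal sketch, assuming a trained model and a test RDD of LabeledPoint already exist (names are illustrative):

// soft RF score = fraction of trees voting for class 1 (sketch)
val scoreAndLabel = testData.map { p =>
  val votes1 = model.trees.map(t => t.predict(p.features)).sum
  (votes1 / model.numTrees, p.label)
}
val auc = new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()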
szilard / adding_numbers.R
Last active October 24, 2015 00:12
Timing sum of 1 billion numbers
x <- as.numeric(1:1e9)
system.time(sum(x))
szilard / psum.c
Last active October 29, 2015 02:56
Parallel sum of 1 bn numbers with pthreads
/*
Adapted from:
https://computing.llnl.gov/tutorials/pthreads/samples/arrayloops.c
http://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
Run as:
gcc -Ofast -pthread psum.c -lm && ./a.out
*/
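The preview stops at the header comment. A compact sketch of the kind of pthreads partial-sum loop such a program uses; the thread count, array size and variable names are illustrative, not the gist's:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define N 1000000000UL

static double *x;
static double partial[NTHREADS];

/* each thread sums its contiguous slice of the array */
static void *sum_slice(void *arg) {
    long id = (long) arg;
    size_t lo = id * (N / NTHREADS), hi = (id + 1) * (N / NTHREADS);
    double s = 0.0;
    for (size_t i = lo; i < hi; i++) s += x[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    x = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; i++) x[i] = (double)(i + 1);

    pthread_t th[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&th[id], NULL, sum_slice, (void *) id);

    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(th[id], NULL);
        total += partial[id];
    }
    printf("sum = %f\n", total);
    free(x);
    return 0;
}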
szilard / overfitting.R
Created November 1, 2015 16:22
Illustration of overfitting
library(ggplot2)
n <- 30
d <- data.frame(x = 1:n, y = runif(n))
ggplot(d, aes(x = x, y = y)) + geom_point() +
geom_smooth(se = FALSE, span = 0.1)
szilard / h2o_sum_1bn.R
Created November 4, 2015 19:00
H2O sum 1 bn numbers
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-slater/9/R")))
library(h2o)
h2oServer <- h2o.init(nthreads = -1)
system.time({
d <- h2o.createFrame(h2oServer, rows = 1e9, cols = 1, missing_fraction = 0,
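                     ## (the preview cuts off inside the createFrame call; the remaining
                     ##  arguments below and the timed sum are an assumed completion,
                     ##  not the gist's exact code)
                     categorical_fraction = 0, integer_fraction = 0, binary_fraction = 0)
})
## sum the single numeric column; assumes sum() dispatches on the H2OFrame
system.time(sum(d))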