Szilard Pafka (szilard)

@szilard
szilard / GBM_vs_SVDKL.R
Last active August 1, 2023 00:59
GBM vs SV-DKL (Stochastic Variational Deep Kernel Learning) on the airline dataset
## Stochastic Variational Deep Kernel Learning
## paper: https://arxiv.org/abs/1611.00336
## code+data from the authors (thanks!!!): https://people.orie.cornell.edu/andrew/code/#SVDKL
## get data + prepare the sample the authors used for evaluation
wget https://people.orie.cornell.edu/andrew/code/svdklcode.zip
unzip svdklcode.zip
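The preview stops after fetching the data; a minimal sketch of the GBM side of the comparison is below, assuming h2o's GBM on the benchm-ml airline sample (the model and settings actually used in the gist are not shown in the preview).

## sketch (assumption): GBM baseline on the airline data via h2o
library(h2o)
h2o.init(nthreads = -1)
dx_train <- h2o.importFile("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
dx_test <- h2o.importFile("https://s3.amazonaws.com/benchm-ml--main/test.csv")
md <- h2o.gbm(x = 1:(ncol(dx_train)-1), y = ncol(dx_train), training_frame = dx_train,
              ntrees = 100, max_depth = 10, learn_rate = 0.1)
h2o.auc(h2o.performance(md, dx_test))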
@szilard
szilard / ML_pkgs.R
Last active October 1, 2020 19:18
ML R packages (my focus: supervised learning) by CRAN downloads
##install.packages("cranlogs")
library(data.table)
library(cranlogs)
##caret/models/file
## grep "library =" * | sed 's/.*=//' | sed 's/c(//' | sed 's/),/,/' | grep -v NULL | sed 's/,.*$/,/' | sort | uniq | tr -d '\n'
caret_pkgs <- c("rpart", "C50", "CHAID", "Cubist", "FCNN4R", "HDclassif", "HiDimDA", "KRLS", "LiblineaR",
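The preview cuts off inside the package vector; a minimal sketch of the download-count step follows, assuming cranlogs' cran_downloads over the last month (the time window used in the gist is not shown).

## sketch (assumption): rank a few of the packages above by last-month CRAN downloads
pkgs <- c("rpart", "C50", "Cubist")
dl <- as.data.table(cran_downloads(packages = pkgs, when = "last-month"))
dl[, .(downloads = sum(count)), by = package][order(-downloads)]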
@szilard
szilard / rpart_pruning.R
Created February 8, 2020 10:17
rpart pruning
library(data.table)
library(rpart)
d_train <- fread("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
md <- rpart(ifelse(dep_delayed_15min=="Y",1,0) ~ ., d_train,
control = rpart.control(cp = 0.001))
plotcp(md)
printcp(md)
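plotcp/printcp only display the complexity table; the pruning step itself could look like the sketch below, picking the cp that minimizes cross-validated error (one common rule - the choice made in the gist is not shown in the preview).

## sketch: prune at the cp minimizing cross-validated error ("xerror")
cp_best <- md$cptable[which.min(md$cptable[, "xerror"]), "CP"]
md_pruned <- prune(md, cp = cp_best)
printcp(md_pruned)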
@szilard
szilard / SparkR-datatable-aggr100M.txt
Last active April 3, 2019 20:58
SparkR vs data.table - aggregate 100M records
data.table vs SparkR
group-by aggregate on 100M records (1M groups)
data.table: 6.5 sec (without key) / 1.3 sec (with key) - all on 1 core
SparkR (cached): 200 sec - on 8 cores
speedup: ~30x / ~150x (~240x / ~1200x per core)
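A minimal sketch of the data.table side of such a benchmark, assuming 100M rows with 1M integer groups (the exact data generation in the gist is not shown):

## sketch (assumption): group-by mean on 100M rows, 1M groups
library(data.table)
n <- 1e8; m <- 1e6
d <- data.table(x = sample(m, n, replace = TRUE), y = runif(n))
system.time(d[, .(ym = mean(y)), by = x])   ## without key
setkey(d, x)
system.time(d[, .(ym = mean(y)), by = x])   ## with key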
@szilard
szilard / h2o_scoring.R
Last active March 12, 2018 08:54
ML Scoring (REST API) - h2o.ai
## training a model
library(h2o)
h2o.init(nthreads = -1)
dx_train <- h2o.importFile("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
md_rf <- h2o.randomForest(x = 1:(ncol(dx_train)-1), y = ncol(dx_train), training_frame = dx_train,
                          model_id = "h2o_RF",
                          ntrees = 100, max_depth = 10, nbins = 100)
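The preview shows only the training step; a minimal sketch of scoring with the trained model from R is below (the gist's actual REST API calls are not shown in the preview).

## sketch: score new data with the trained model
dx_test <- h2o.importFile("https://s3.amazonaws.com/benchm-ml--main/test.csv")
preds <- h2o.predict(md_rf, dx_test)
head(preds)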
@szilard
szilard / lightgbm_example.R
Created August 27, 2017 03:48
Minimal lightgbm example
library(data.table)
library(ROCR)
library(lightgbm)
set.seed(123)
d_train <- fread("/var/data/bm-ml/train-0.1m.csv")
d_test <- fread("/var/data/bm-ml/test.csv")
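The preview ends after loading the data; a minimal sketch of the remaining steps, assuming one-hot encoding via model.matrix and lightgbm's binary objective (the exact parameters in the gist are not shown):

## sketch (assumption): one-hot encode, train, evaluate AUC with ROCR
## (factor levels must match between train and test for the matrices to align)
X_train <- model.matrix(dep_delayed_15min ~ . - 1, data = d_train)
X_test  <- model.matrix(dep_delayed_15min ~ . - 1, data = d_test)
y_train <- as.numeric(d_train$dep_delayed_15min == "Y")
y_test  <- as.numeric(d_test$dep_delayed_15min == "Y")
dlgb_train <- lgb.Dataset(data = X_train, label = y_train)
md <- lgb.train(params = list(objective = "binary"), data = dlgb_train, nrounds = 100)
phat <- predict(md, X_test)
performance(prediction(phat, y_test), "auc")@y.values[[1]]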
@szilard
szilard / simul_unbal_methods.R
Last active August 25, 2017 18:39
A small framework for experimenting with the impact of various methods for handling unbalanced classes in machine learning
## partial credit :) to @earino for the idea
library(lightgbm)
library(data.table)
library(ROCR)
d0_train <- fread("/var/data/bm-ml/train-10m.csv")
d0_test <- fread("/var/data/bm-ml/test.csv")
d0 <- rbind(d0_train, d0_test)
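The preview ends after stacking train and test; a sketch of one experiment in this spirit is below, assuming imbalance is induced by downsampling the positive class and corrected with case weights (the methods actually compared in the gist are not shown).

## sketch (assumption): induce imbalance by keeping 1% of the positives
d_pos <- d0[dep_delayed_15min == "Y"]
d_neg <- d0[dep_delayed_15min == "N"]
d_unbal <- rbind(d_neg, d_pos[sample(.N, round(0.01 * .N))])
## one candidate fix: upweight the rare class when training
w <- ifelse(d_unbal$dep_delayed_15min == "Y", 100, 1)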
@szilard
szilard / dataset_size_openML.R
Created August 21, 2017 00:13
Dataset sizes in OpenML
# OpenML Benchmarking Suites and the OpenML100
# https://arxiv.org/abs/1708.03731
# https://www.openml.org/s/14/data
library(OpenML)
ids <- getOMLStudy('OpenML100')$data$data.id
dsall <- listOMLDataSets()
sum(dsall$data.id %in% ids) ## 96???
ds <- dsall[dsall$data.id %in% ids,]
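A sketch of the size summary the script presumably builds next; the number.of.instances column name is an assumption about listOMLDataSets' output.

## sketch: distribution of dataset sizes (rows) across the OpenML100 suite
## (column name number.of.instances assumed from the OpenML R package)
summary(ds$number.of.instances)
hist(log10(ds$number.of.instances), xlab = "log10(number of rows)",
     main = "OpenML100 dataset sizes")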
@szilard
szilard / dataset_sizes_pmlb.py
Last active August 20, 2017 04:25
Size distribution of datasets in the Penn Machine Learning Benchmarks
## https://github.com/EpistasisLab/penn-ml-benchmarks
## pip install pmlb
import numpy as np
from pmlb import fetch_data
from pmlb import dataset_names
x = np.zeros(len(dataset_names))
for i, dn in enumerate(dataset_names):
    x[i] = fetch_data(dn).shape[0]  # number of rows; loop body assumed, the preview is cut here

#include <stdio.h>
#include <stdlib.h>
#define N 128
#define B0 100
#define R 1000000
#define M 1000
/* qsort comparator; completion assumed - the preview ends mid-function
   and the element type is not shown (doubles assumed here) */
int cmpfunc (const void * a, const void * b)
{
    double x = *(const double *) a, y = *(const double *) b;
    return (x > y) - (x < y);
}