George Vyshnya gvyshnya

@gvyshnya
gvyshnya / pre-processing.R
Created August 5, 2017 19:28
Pre-processing Routine in Wine Prediction Project Pipeline
# This file manages pre-processing of the raw training and testing data sets for the
# Kaggle competition at https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
# It is intended to run from the command line in batch mode, using the Rscript command below:
# Rscript --vanilla code/pre-processing.R data/wine.csv data/wine_test.csv data/train_imputed.csv data/test_imputed.csv
# 4 arguments are required
# - input file name for raw training data csv,
# - input file name for raw testing data csv,
# - output file name for imputed training data csv,
# - output file name for imputed testing data csv
@gvyshnya
gvyshnya / LR.R
Created August 5, 2017 19:39
Predicting Wine Sales: Forecasting with linear regression model
# Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
# This is a file to perform
# - Linear Regression (LR) model training
# - prediction on the imputed testing set, using the fitted LR model
# - preparation of a Kaggle submission file
# It is intended to run from a command line in a batch mode, using the Rscript command below:
# Rscript --vanilla code/LR.R data/train_imputed.csv data/test_imputed.csv 0.7 826 data/submission.csv config.R
# 6 arguments are required
# - input file name for imputed training data csv,
# - input file name for imputed testing data csv
@gvyshnya
gvyshnya / GBM.R
Created August 5, 2017 19:48
Forecasting wine sales with GBM model
# Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
# This is a file to perform
# - GBM model training
# - prediction on the imputed testing set, using the fitted GBM model (for this regression problem,
# the gaussian distribution is used in GBM)
# - preparation of a Kaggle submission file
# It is intended to run from the command line in batch mode, using an Rscript command like the one below:
# Rscript --vanilla code/GBM.R data/train_imputed.csv data/test_imputed.csv 5000 5 4 25 output/submission.csv code/config.R
#
# 8 arguments are required
@gvyshnya
gvyshnya / xgboost.R
Created August 5, 2017 19:53
Forecasting Wine Sales with XGBOOST algorithm
# Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
# This is a file to perform
# - xgboost model training (linear booster used)
# - prediction on the imputed testing set, using the fitted xgboost model
# - preparation of a Kaggle submission file
# It is intended to run from a command line in a batch mode, using the Rscript command below:
# Rscript --vanilla code/xgboost.R data/train_imputed.csv data/test_imputed.csv 10 2 0.0001 1 data/xgboost_submission.csv code/config.R
#
# 8 arguments are required
# - input file name for imputed training data csv,
@gvyshnya
gvyshnya / ensemble.R
Created August 6, 2017 15:24
Ensemble Submission Script
# Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
# This is a file to perform
# - ensemble prediction based on 3 models fitted (LR, GBM, and xgboost)
# - preparation of a Kaggle submission file for the ensemble prediction
# It is intended to run from a command line in a batch mode, using the Rscript command below:
# Rscript --vanilla code/ensemble.R data/ensemble_submission.csv code/config.R
#
# 2 arguments are required
# - output file name for the result submission csv file (in a ready-for-Kaggle-upload format)
# - configuration file in R (setup of the ensemble implemented as R code module), which has to have
@gvyshnya
gvyshnya / dvc.bat
Created August 6, 2017 16:05
Wine sales prediction: pipeline job batch file using DVC capabilities
# This is a DVC-based script to manage the machine-learning pipeline for the project at
# https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
mkdir R_DVC_GITHUB_CODE
cd R_DVC_GITHUB_CODE
# clone the github repo with the code
git clone https://github.com/gvyshnya/DVC_R_Ensemble
# initialize DVC
@gvyshnya
gvyshnya / mad_with_holidays.py
Created August 8, 2017 23:44
Simple benchmark prediction of Wikipedia traffic with median (median by page, weekdays, and holidays) and consistent Holidays management
# Project/Competition: https://www.kaggle.com/c/web-traffic-time-series-forecasting/
# Simple benchmark prediction with median (median by page, weekdays, and holidays)
#
# - You should install Workalendar directly from its GitHub repo
# >>> pip install git+https://github.com/novafloss/workalendar.git
import pandas as pd
import pandas.tseries.holiday as hol
import re
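The benchmark described in the header comments above (median by page, weekdays, and holidays) can be sketched roughly as follows. This is only a minimal illustration that uses the Python standard library rather than pandas or Workalendar. The function names (`day_type`, `fit_medians`, `predict`), the three-way weekday/weekend/holiday split, the sample visit counts, and the holiday set are all assumptions made for the example, not the gist's actual code.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def day_type(day, holidays):
    """Classify a date as 'holiday', 'weekend', or 'weekday' (hypothetical scheme)."""
    if day in holidays:
        return "holiday"
    return "weekend" if day.weekday() >= 5 else "weekday"


def fit_medians(history, holidays):
    """Group (page, date, visits) records by page and day type, then take medians."""
    groups = defaultdict(list)
    for page, day, visits in history:
        groups[(page, day_type(day, holidays))].append(visits)
    return {key: median(values) for key, values in groups.items()}


def predict(medians, page, day, holidays, default=0.0):
    """Predict visits for a page on a date as the matching group median."""
    return medians.get((page, day_type(day, holidays)), default)


# Hypothetical sample data: one page, one holiday.
holidays = {date(2017, 1, 1)}
history = [
    ("Main_Page", date(2017, 1, 1), 50),   # holiday
    ("Main_Page", date(2017, 1, 2), 100),  # Monday
    ("Main_Page", date(2017, 1, 3), 120),  # Tuesday
    ("Main_Page", date(2017, 1, 4), 110),  # Wednesday
    ("Main_Page", date(2017, 1, 7), 70),   # Saturday
    ("Main_Page", date(2017, 1, 8), 90),   # Sunday
]
medians = fit_medians(history, holidays)

# Forecast for a future Monday uses the weekday median of that page.
forecast = predict(medians, "Main_Page", date(2017, 1, 9), holidays)
```

Treating holidays as their own group keeps unusual traffic on those days from shifting the ordinary weekday and weekend medians.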
@gvyshnya
gvyshnya / dvc repro code
Created August 20, 2017 19:22
DVC repro command power
# Improve ensemble configuration
$ vi code/config.R
# Commit all the changes.
$ git commit -am "Updated weights of the models in the ensemble"
# Reproduce the ensemble prediction
$ dvc repro data/submission_ensemble.csv
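The commit message above mentions model weights in the ensemble. One common way to apply such weights is a normalized weighted average of the per-model predictions, sketched below in Python. This is an assumption about the blending rule, not the project's confirmed method (the project's own scripts are written in R); the function name `weighted_ensemble`, the weights, and the prediction values are hypothetical.

```python
def weighted_ensemble(predictions, weights):
    """Blend per-model prediction lists into one list using normalized weights.

    predictions: dict of model name -> list of predicted values (equal lengths)
    weights:     dict of model name -> non-negative weight
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n = len(next(iter(predictions.values())))
    blended = [0.0] * n
    for name, preds in predictions.items():
        if len(preds) != n:
            raise ValueError("all models must predict the same number of rows")
        share = weights[name] / total  # normalized weight of this model
        for i, value in enumerate(preds):
            blended[i] += share * value
    return blended


# Hypothetical predictions from the three models and hypothetical weights.
predictions = {"LR": [2.0, 4.0], "GBM": [3.0, 5.0], "xgboost": [4.0, 6.0]}
weights = {"LR": 1.0, "GBM": 2.0, "xgboost": 1.0}
blended = weighted_ensemble(predictions, weights)
```

Normalizing by the sum of the weights means that only their ratios matter, so changing one weight in the configuration does not require rescaling the others.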
@gvyshnya
gvyshnya / config.R
Last active August 20, 2017 20:23
Wine sales prediction: an R configuration file
# Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/
# This is a configuration file for the entire solution
# LR.R specific settings
cfg_run_LR <- 1 # if set to 0, the LR model will not be fitted, and its prediction will not be calculated in batch mode
# GBM.R specific settings
cfg_run_GBM <- 1 # if set to 0, the GBM model will not be fitted, and its prediction will not be calculated in batch mode
# xgboost.R specific settings
@gvyshnya
gvyshnya / gist:bc69fe987fa49f34de98af67d99ee684
Created November 21, 2017 06:37
xgboost running with tree_method = 'hist'
import xgboost as xgb
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
rng = np.random.RandomState(1994)
digits = load_digits(n_class=2)
X = digits['data']
y = digits['target']