@codinguncut
Last active October 26, 2017 15:29
kaggle collection
  1. feature engineering (most important by far)!!!!!
  2. simple models
  3. overfitting the leaderboard
  4. ensembling
  • predict the right thing!
  • build pipeline and put something on the leaderboard
  • allocate time to play with data, explore
  • make heavy use of forums
  • understand subtleties of algos, know what tool to use when
  • skewed data (check the hist, consider a log transform)
  • scaling (mean 0, stddev 1), centering, normalizing
  • factors where sensible
  • ideally work with model.matrix (but didn’t get it to work)
  • use correct outcome metric (AUC, etc.)
  • look at differences, patterns between train/test
  • feature selection (“importance”, “varImp”)
  • research comps have less competition than the ones for money
  • good team!
  • very, very good
  • use model.matrix
  • eliminate collinear predictors, scale, center
  • imputation of missing data
  • transformation (Box-Cox, PCA/ICA)
  • "To get honest estimates of performance, all data transformations should be included within the cross–validation loop."
  • caret tuneLength
  • author likes repeated cv (see the caret sketch after this list)
  • try “C5.0” method
  • author prefers “kernlab” for svm
  • run each algo on same data, by setting seed before train
  • caret “resamples” for comparing models (see the model-comparison sketch after this list)
  • visualize test/train param distributions
  • bagging, boosting, stacking, ensembling
  • complexity of model should reflect complexity of data
  • test for normality visually (skew, kurtosis, outliers, z-scores)
  • keep tabs on your submissions’ cv values, etc.; tag git code for submissions; use unique IDs
  • train a classifier and look at feature weights, importance. visualize a tree
  • tau n-grams
  • compute differences/ratios of features
  • discard features that are “too good”
  • understand data distribution, collinearity, peculiarities cont/discrete, etc.
  • understand differences between training and test data
  • think more, try less
  • when in doubt, use gbm
  • target 1000 trees, tune learning rate
  • don’t be afraid to use 10+ interaction depth (tuning grid in the model-comparison sketch after this list)
  • convert high-cardinality categoricals to numeric (out-of-fold average of the target; sketch after this list)
  • glmnet - “opposite of gbm” (needs much more work)
  • tau for text mining (n-grams, num chars, num words)
  • many “text-mining” comps. are dominated by structured fields
  • when in doubt, use average blender
  • don’t use validation set until the very end
  • convert ordinal vars to numeric
  • convert categorical vars to dummy/one-hot vectors
  • kernel pca
  • look out for imbalanced training/test set
  • excellent resource all around
  • kaggle best practices
  • standardize before regularize
  • multi-collinearity
  • anecdotal
  • awesome
  • very good feature selection
  • good
  • start a package via package.skeleton()
  • RUnit, testthat
  • rbenchmark, microbenchmark, pbdPROF, etc.
  • never use .RData, code must run in batch mode
  • prefer “attach()” over “load()”
  • floating point is not exact
  • use log1p rather than log(1+x) for x<<1
  • use x[ind, , drop=FALSE] rather than x[ind, ] (see the numeric-hygiene sketch after this list)
  • is.na()
  • always go column-by-column, not row-by-row
  • plot training error vs. test error
  • good overview
  • how to visualize data
  • visualization tab in weka
  • cluster and look at clusters
  • great post by “Martin”
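
The caret notes above (all transformations inside the cross-validation loop, repeated cv, tuneLength) roughly translate to the sketch below; train_df, the outcome column y, and the choice of glmnet are placeholders, not part of the original notes:

    library(caret)   # the ROC metric below also needs the pROC package installed

    set.seed(42)
    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    # preProcess is re-estimated inside every resample, so centering/scaling
    # (and Box-Cox, PCA or imputation, if added) never leak held-out information
    fit <- train(y ~ ., data = train_df,        # y assumed to be a two-level factor
                 method = "glmnet",
                 metric = "ROC",
                 preProcess = c("nzv", "center", "scale"),
                 tuneLength = 5,                # caret tries 5 values per tuning parameter
                 trControl = ctrl)
    fit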
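
The “set the seed before each train, compare with resamples” tip and the gbm defaults (target 1000 trees, tune the learning rate, don’t fear 10+ interaction depth) combined in one rough sketch, with the same placeholder train_df/y:

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    gbm_grid <- expand.grid(n.trees = 1000,
                            interaction.depth = c(6, 10, 14),
                            shrinkage = c(0.01, 0.05, 0.1),   # learning rate
                            n.minobsinnode = 10)

    set.seed(42)                                 # same seed before each train() ...
    fit_gbm <- train(y ~ ., data = train_df, method = "gbm", metric = "ROC",
                     trControl = ctrl, tuneGrid = gbm_grid, verbose = FALSE)
    set.seed(42)                                 # ... so both models see identical folds
    fit_glmnet <- train(y ~ ., data = train_df, method = "glmnet", metric = "ROC",
                        trControl = ctrl, tuneLength = 5)

    resamps <- resamples(list(GBM = fit_gbm, GLMNET = fit_glmnet))
    summary(resamps)
    bwplot(resamps)                              # compare resampled ROC distributions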
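
One way to read “convert high-cardinality categoricals to numeric (out-of-fold average)”: replace each level with the mean of the target computed on the other folds, and encode the test set with means from all training rows. A base-R sketch; train_df/test_df, the column city and the 0/1 target y are made-up names:

    set.seed(42)
    k     <- 5
    folds <- sample(rep(1:k, length.out = nrow(train_df)))
    global_mean <- mean(train_df$y)

    train_df$city_enc <- NA_real_
    for (i in 1:k) {
      in_fold <- folds == i
      means   <- tapply(train_df$y[!in_fold], train_df$city[!in_fold], mean)
      enc     <- means[as.character(train_df$city[in_fold])]
      enc[is.na(enc)] <- global_mean       # levels unseen outside the fold fall back
      train_df$city_enc[in_fold] <- enc
    }

    # test set: use means computed on all of the training data
    test_means <- tapply(train_df$y, train_df$city, mean)
    test_df$city_enc <- unname(test_means[as.character(test_df$city)])
    test_df$city_enc[is.na(test_df$city_enc)] <- global_mean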
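
The floating-point and subsetting notes in one small numeric-hygiene sketch:

    x <- 1e-12
    log(1 + x)             # not exactly 1e-12: precision is lost when 1 + x is rounded
    log1p(x)               # accurate for x << 1

    m <- matrix(1:6, nrow = 2)
    m[1, ]                 # silently drops to a plain vector
    m[1, , drop = FALSE]   # stays a 1-row matrix, so downstream matrix code keeps working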

Other Sources:

no source:

  • parameter tuning (visualize parameters)
  • talk about issues with factor vars, as well as different factor levels between train and test (a level-alignment sketch follows this list)
  • in many cases the actual run shouldn’t take too long; if it does, make sure you haven’t overcomplicated things
  • find best way of visualizing data for exploratory purposes
  • start with the simplest approach that could possibly work, and refine/iterate from there
  • read relevant literature, but don’t get carried away, and keep things simple
  • be careful about your intuitions
  • understand the target function and eval function (AUC, etc.)
  • choose the right tool for the right job (postgres, excel, R, etc.)
  • doing cross-validation on the test set is a serious methodological error (sic!)
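
For the “different factor levels between train and test” issue, one option is to union the levels before building model matrices, so model.matrix() produces the same dummy columns for both sets. A sketch; train_df, test_df and the column names are placeholders:

    for (col in c("city", "device")) {
      all_levels      <- union(levels(factor(train_df[[col]])),
                               levels(factor(test_df[[col]])))
      train_df[[col]] <- factor(train_df[[col]], levels = all_levels)
      test_df[[col]]  <- factor(test_df[[col]],  levels = all_levels)
    }

    x_train <- model.matrix(~ . - 1, data = train_df[, c("city", "device")])
    x_test  <- model.matrix(~ . - 1, data = test_df[,  c("city", "device")])
    stopifnot(identical(colnames(x_train), colnames(x_test)))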

Postgres
