@codinguncut
Last active October 26, 2017 15:29
kaggle collection
  1. feature engineering (most important by far)!!!!!
  2. simple models
  3. overfitting the leaderboard
  4. ensembling
  • predict the right thing!
  • build pipeline and put something on the leaderboard
  • allocate time to play with data, explore
  • make heavy use of forums
  • understand subtleties of algos, know what tool to use when
  • skewed data (check the hist, consider a log transform)
  • scaling (mean 0, stddev 1), centering, normalizing
  • factors where sensible
  • ideally work with model.matrix (but didn’t get it to work)
  • use correct outcome metric (AUC, etc.)
  • look at differences, patterns between train/test
  • feature selection (“importance”, “varImp”)
  • research comps have less competition than the ones for money
  • good team!
  • very, very good
  • use model.matrix
  • eliminate collinear predictors, scale, center
  • imputation of missing data
  • transformation (Box-Cox, PCA/ICA)
  • "To get honest estimates of performance, all data transformations should be included within the cross–validation loop."
  • caret tuneLength
  • author likes repeated cv (see the caret sketch after this list)
  • try “C5.0” method
  • author prefers “kernlab” for svm
  • run each algo on same data, by setting seed before train
  • caret “resamples” for comparing models (see the model-comparison sketch after this list)
  • visualize test/train param distributions
  • bagging, boosting, stacking, ensembling
  • complexity of model should reflect complexity of data
  • test for normality visually (skew, kurtosis, outliers, z-scores)
  • keep tabs on your submissions’ cv values, etc.; tag git code for submissions; use unique IDs
  • train a classifier and look at feature weights, importance. visualize a tree
  • tau n-grams
  • compute differences/ratios of features
  • discard features that are “too good”
  • understand data distribution, collinearity, peculiarities cont/discrete, etc.
  • understand differences between training and test data
  • think more, try less
  • when in doubt, use gbm
  • target 1000 trees, tune learning rate
  • don’t be afraid to use 10+ interaction depth (tuning grid in the model-comparison sketch after this list)
  • convert high-cardinality categoricals to numeric (out-of-fold average of the target; sketch after this list)
  • glmnet - “opposite of gbm” (needs much more work)
  • tau for text mining (n-grams, num chars, num words)
  • many “text-mining” comps. are dominated by structured fields
  • when in doubt, use average blender
  • don’t use validation set until the very end
  • convert ordinal vars to numeric
  • convert categorical vars to dummy/one-hot vectors
  • kernel pca
  • look out for imbalanced training/test set
  • excellent resource all around
  • kaggle best practices
  • standardize before regularize
  • multi-collinearity
  • anecdotal
  • awesome
  • very good feature selection
  • good
  • start a package via package.skeleton()
  • RUnit, testthat
  • rbenchmark, microbenchmark, pbdPROF, etc.
  • never use .RData, code must run in batch mode
  • prefer “attach()” over “load()”
  • floating point is not exact
  • use log1p rather than log(1+x) for x<<1
  • use x[ind, , drop=FALSE] rather than x[ind, ] (see the numeric-hygiene sketch after this list)
  • is.na()
  • always go column-by-column, not row-by-row
  • plot training error vs. test error
  • good overview
  • how to visualize data
  • visualization tab in weka
  • cluster and look at clusters
  • great post by “Martin”
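
The caret notes above (all transformations inside the cross-validation loop, repeated cv, tuneLength) roughly translate to the sketch below; train_df, the outcome column y, and the choice of glmnet are placeholders, not part of the original notes:

    library(caret)   # the ROC metric below also needs the pROC package installed

    set.seed(42)
    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    # preProcess is re-estimated inside every resample, so centering/scaling
    # (and Box-Cox, PCA or imputation, if added) never leak held-out information
    fit <- train(y ~ ., data = train_df,        # y assumed to be a two-level factor
                 method = "glmnet",
                 metric = "ROC",
                 preProcess = c("nzv", "center", "scale"),
                 tuneLength = 5,                # caret tries 5 values per tuning parameter
                 trControl = ctrl)
    fit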
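
The “set the seed before each train, compare with resamples” tip and the gbm defaults (target 1000 trees, tune the learning rate, don’t fear 10+ interaction depth) combined in one rough sketch, with the same placeholder train_df/y:

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    gbm_grid <- expand.grid(n.trees = 1000,
                            interaction.depth = c(6, 10, 14),
                            shrinkage = c(0.01, 0.05, 0.1),   # learning rate
                            n.minobsinnode = 10)

    set.seed(42)                                 # same seed before each train() ...
    fit_gbm <- train(y ~ ., data = train_df, method = "gbm", metric = "ROC",
                     trControl = ctrl, tuneGrid = gbm_grid, verbose = FALSE)
    set.seed(42)                                 # ... so both models see identical folds
    fit_glmnet <- train(y ~ ., data = train_df, method = "glmnet", metric = "ROC",
                        trControl = ctrl, tuneLength = 5)

    resamps <- resamples(list(GBM = fit_gbm, GLMNET = fit_glmnet))
    summary(resamps)
    bwplot(resamps)                              # compare resampled ROC distributions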
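
One way to read “convert high-cardinality categoricals to numeric (out-of-fold average)”: replace each level with the mean of the target computed on the other folds, and encode the test set with means from all training rows. A base-R sketch; train_df/test_df, the column city and the 0/1 target y are made-up names:

    set.seed(42)
    k     <- 5
    folds <- sample(rep(1:k, length.out = nrow(train_df)))
    global_mean <- mean(train_df$y)

    train_df$city_enc <- NA_real_
    for (i in 1:k) {
      in_fold <- folds == i
      means   <- tapply(train_df$y[!in_fold], train_df$city[!in_fold], mean)
      enc     <- means[as.character(train_df$city[in_fold])]
      enc[is.na(enc)] <- global_mean       # levels unseen outside the fold fall back
      train_df$city_enc[in_fold] <- enc
    }

    # test set: use means computed on all of the training data
    test_means <- tapply(train_df$y, train_df$city, mean)
    test_df$city_enc <- unname(test_means[as.character(test_df$city)])
    test_df$city_enc[is.na(test_df$city_enc)] <- global_mean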
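
The floating-point and subsetting notes in one small numeric-hygiene sketch:

    x <- 1e-12
    log(1 + x)             # not exactly 1e-12: precision is lost when 1 + x is rounded
    log1p(x)               # accurate for x << 1

    m <- matrix(1:6, nrow = 2)
    m[1, ]                 # silently drops to a plain vector
    m[1, , drop = FALSE]   # stays a 1-row matrix, so downstream matrix code keeps working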

Other Sources:

no source:

  • parameter tuning (visualize parameters)
  • talk about issues with factor vars, as well as different factor levels between train and test (a level-alignment sketch follows this list)
  • in many cases the actual run shouldn’t take too long; if it does, make sure you haven’t overcomplicated things
  • find best way of visualizing data for exploratory purposes
  • start with the simplest approach that could possibly work, and refine/iterate from there
  • read relevant literature, but don’t get carried away, and keep things simple
  • be careful about your intuitions
  • understand the target function and eval function (AUC, etc.)
  • choose the right tool for the right job (postgres, excel, R, etc.)
  • doing cross-validation on the test set is a serious methodological error (sic!)
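
For the “different factor levels between train and test” issue, one option is to union the levels before building model matrices, so model.matrix() produces the same dummy columns for both sets. A sketch; train_df, test_df and the column names are placeholders:

    for (col in c("city", "device")) {
      all_levels      <- union(levels(factor(train_df[[col]])),
                               levels(factor(test_df[[col]])))
      train_df[[col]] <- factor(train_df[[col]], levels = all_levels)
      test_df[[col]]  <- factor(test_df[[col]],  levels = all_levels)
    }

    x_train <- model.matrix(~ . - 1, data = train_df[, c("city", "device")])
    x_test  <- model.matrix(~ . - 1, data = test_df[,  c("city", "device")])
    stopifnot(identical(colnames(x_train), colnames(x_test)))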

Postgres
