Ensembling guide

Guide:

https://mlwave.com/kaggle-ensembling-guide/

Code:

https://github.com/MLWave/Kaggle-Ensemble-Guide

Voting ensembles:

Average predictions from multiple already-trained models (easiest to set up). See error-correcting codes (like repetition codes). A minimal sketch of the variants follows the list below.

  • Correlation: calculate the Pearson correlation and average the models that are least correlated (uncorrelated models make their errors on different inputs, so they “autocorrect” each other)
  • Weighting: give a bigger weight to the best model. Reasoning: the only way for the inferior models to overrule the best model (the expert) is to collectively (and confidently) agree on an alternative
  • Bagging: average the submissions of multiple individual models. Reduces overfitting. A geometric mean can be used instead of the arithmetic mean.
  • Ranking: turn predictions into ranks, then average ranks. See also historical ranks (?)
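A minimal sketch of these voting-ensemble variants, assuming each model's test-set probabilities are already available as NumPy arrays (the model names, predictions and weights below are made up for illustration):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical probability predictions from three already-trained models.
preds = {
    "model_a": np.array([0.9, 0.2, 0.7, 0.4]),
    "model_b": np.array([0.8, 0.3, 0.6, 0.5]),
    "model_c": np.array([0.7, 0.1, 0.9, 0.3]),
}
P = np.vstack(list(preds.values()))

# Correlation: inspect pairwise Pearson correlations and prefer to combine
# the least correlated models.
print(np.corrcoef(P))

# Plain average, weighted average (bigger weight to the best model),
# and geometric mean.
avg = P.mean(axis=0)
weighted = np.average(P, axis=0, weights=[0.5, 0.3, 0.2])
geo = np.exp(np.log(np.clip(P, 1e-15, 1)).mean(axis=0))

# Rank averaging: replace predictions by their ranks, average the ranks,
# then rescale to [0, 1].
ranks = np.vstack([rankdata(p) for p in P])
rank_avg = ranks.mean(axis=0) / P.shape[1]
```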

Stacking ensembles:

The basic idea behind stacked generalization is to use a pool of base classifiers and then another classifier to combine their predictions, with the aim of reducing the generalization error.

  • Stacking: stacked generalization is a means of non-linearly combining generalizers to make a new generalizer, to try to optimally integrate what each of the original generalizers has to say about the learning set.
  • Blending: instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set, and the stacker model trains on this holdout set only. Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET. See Vowpal Wabbit for feature-weighted linear stacking. Stacking classifiers with regressors and vice versa is also possible, as is stacking with unsupervised learning techniques. A stacking sketch follows this list.
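A minimal stacking sketch with scikit-learn; the synthetic dataset and the particular base/meta models are placeholders, not the guide's own setup. Out-of-fold predictions from the base models become the training features for the stacker:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Level-1 features: out-of-fold probabilities on the train set,
# plus full-fit probabilities on the test set.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# The stacker (here a simple logistic regression) combines the base predictions.
stacker = LogisticRegression()
stacker.fit(train_meta, y_train)
print(stacker.score(test_meta, y_test))
```

For blending, the `cross_val_predict` step would be replaced by predictions on a small holdout split, and the stacker would train on that holdout only.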

Everything is a hyper-parameter

When doing stacking/blending/meta-modeling it is healthy to think of every action as a hyper-parameter for the stacker model. So, for instance:

  • not scaling the data
  • standard-scaling the data
  • min-max scaling the data

are simply extra parameters to be tuned to improve the ensemble performance (see the sketch below).
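One way to make this concrete is a scikit-learn pipeline where the scaler itself is a grid-search parameter; a sketch, assuming `X, y` from the stacking example above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scale", "passthrough"), ("model", LogisticRegression())])
param_grid = {
    # "passthrough" = no scaling; the scaler is just another parameter to tune.
    "scale": ["passthrough", StandardScaler(), MinMaxScaler()],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```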

Boosting:

During each step of learning, increase the weights of the examples learned incorrectly and decrease the weights of the examples learned correctly.
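A sketch of one AdaBoost-style round to make the weight update concrete, assuming `X` is a feature matrix and `y` a label vector in {-1, +1} (this is the standard AdaBoost update, not anything specific to the guide):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

w = np.full(len(y), 1 / len(y))               # start with uniform weights
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=w)
pred = stump.predict(X)

err = np.sum(w * (pred != y)) / np.sum(w)     # weighted error of this round
alpha = 0.5 * np.log((1 - err) / err)         # weight of this model in the ensemble
w = w * np.exp(-alpha * y * pred)             # up-weight mistakes, down-weight hits
w /= w.sum()                                  # renormalise for the next round
```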

Bootstrapping:

Randomly sample with replacement from the n known observations.
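A short NumPy sketch, assuming `X, y` as before:

```python
import numpy as np

rng = np.random.default_rng(0)
n = len(X)
idx = rng.integers(0, n, size=n)     # draw n indices with replacement
X_boot, y_boot = X[idx], y[idx]      # one bootstrap sample
```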

Bagging:

(Bootstrap AGGregatING) Sample several training sets of size n, build a model for each training set, and combine the models’ predictions. Each model receives equal weight. (Wagging: weighted aggregating.)
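A bagging sketch building directly on the bootstrap sample above (`X`, `y`, `X_test` assumed); scikit-learn's `BaggingClassifier` packages the same idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_bags, n = 10, len(X)
bag_preds = []
for _ in range(n_bags):
    idx = rng.integers(0, n, size=n)                        # bootstrap sample
    model = DecisionTreeClassifier().fit(X[idx], y[idx])    # one model per sample
    bag_preds.append(model.predict_proba(X_test)[:, 1])
avg_pred = np.mean(bag_preds, axis=0)                       # equal-weight average
```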

Cascading:

Stacking, but the next-level model is only used when the preceding ones are not confident.
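A hypothetical cascade sketch: fall back to the second-stage model only for the rows where the first stage's top probability is below a confidence threshold (assumes two fitted scikit-learn-style classifiers and integer class labels):

```python
import numpy as np

def cascade_predict(stage1, stage2, X_test, threshold=0.9):
    proba1 = stage1.predict_proba(X_test)
    confident = proba1.max(axis=1) >= threshold   # rows stage 1 is sure about
    preds = proba1.argmax(axis=1)
    if (~confident).any():
        # Only the uncertain rows are passed on to the next stage.
        proba2 = stage2.predict_proba(X_test[~confident])
        preds[~confident] = proba2.argmax(axis=1)
    return preds
```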
