Instantly share code, notes, and snippets.

# jasdumas/Notes from xgboost: GBM.md

Last active October 4, 2017 16:55
title layout
Notes on xgboost: Extreme Gradient Boosting Machine
default

### Background

• Computation in C++
• tree-based models
• easy to use, install, R/python interface
• Automatic parallel computation on a single Machine
• Tunable parameters
• requires a spare matrix (dgcMatrix) which is more memory efficient
• need to provide input features, target, objective, number of iterations
• can score or predict model based on new data
• cross validation measures a model's predictive power

### Supervised Machine Learning

• to train a model we need to optimize a loss function.

• Root Mean Square Error (RMSE) is typically used for regression

${RMSD}({\hat {\theta }})={\sqrt {{MSE}({\hat {\theta }})}}={\sqrt { {E} (({\hat {\theta }}-\theta )^{2})}}$

• Logloss is for binary/logistic classification

$$log loss = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^My_{ij}\log(p_{ij})$$

$$log loss = -\frac{1}{N}\sum_{i=1}^N {(y_i\log(p_i) + (1 - y_i)\log(1 - p_i))}$$

• mLogloss is for multi classification for multinomial logistic analysis

$$F = -\frac{1}{N}\sum_{i}^{N}\sum_{j}^{M}y_{ij} \cdot Ln(p_{ij}))=\sum_{j}^{M} \left (-\frac{1}{N}\sum_{i}^{N}y_{ij} \cdot Ln(p_{ij})) \right ) = \sum_{j}^{M}F_i$$

• Regularization is another important part of the model. A good regularization term controls the complexity of the model which prevents over-fitting.

• The training objective is: objective = loss function + regularization
obj = ${L} + \Omega$
where the loss function controls the predictive power and regularization controls simplicity (or rather complexity)

• in xgboost: gradient decent is used to optimize the objective which is a first order optimization algorithm to take steps to find the local minimum point or peak. The performance can be improved by considering both the first and second order. ### Tree Building algorithm

• how can we define a tree that improves the prediction along the gradient/
• decision tree (binary splits), each data point flows to one of the leaves following the direction on each node
• internal node: each internal node splits the flow of the data points by one the features (variables)
• the condition on the edge of specifies what data can flow through
• leaves: data points reach a leaf and assigned by weight, which is the prediction
• how to find a good structure?
• how to choose features to split?
• when to split?
• how to assign a prediction score?
• In each split we want to greedily find the best point that can optimize the objective by:
1. sorting the number
2. scan the best splitting point
3. choose the best feature
• Calculate gain (gini index or entropy)

Nice note! Thx.