
jasdumas/Notes from xgboost: GBM.md

Last active May 25, 2024 20:56
---
title: "Notes on xgboost: Extreme Gradient Boosting Machine"
layout: default
---

## Background

• computation is in C++
• tree-based models
• easy to use and install, with R/Python interfaces
• automatic parallel computation on a single machine
• tunable parameters
• requires a sparse matrix (dgCMatrix), which is more memory efficient
• need to provide input features, target, objective, and number of iterations
• can score/predict with the model on new data
• cross-validation measures a model's predictive power

## Supervised Machine Learning

• To train a model we need to optimize a loss function.

• Root Mean Square Error (RMSE) is typically used for regression:

$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})} = \sqrt{E\left((\hat{\theta} - \theta)^2\right)}$$
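The RMSE formula above can be sketched in a few lines of pure Python (no xgboost dependency; the function name is illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt of the mean squared residual."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)
```

For example, `rmse([0, 0], [3, 4])` averages the squared errors 9 and 16 and returns `sqrt(12.5)`.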

• Log loss is used for binary/logistic classification. The general (multiclass) form is

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$$

which for the binary case reduces to

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\right)$$
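A minimal pure-Python sketch of the binary form (the `eps` clipping is a common safeguard, assumed here, to avoid `log(0)`):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss over labels y in {0, 1} and predicted probabilities p."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip probabilities away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

A maximally uncertain classifier (`p = 0.5` everywhere) scores `log(2) ≈ 0.693` regardless of the labels.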

• mlogloss is used for multiclass classification (multinomial logistic regression); it decomposes into a sum of per-class terms:

$$F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\ln(p_{ij}) = \sum_{j=1}^{M}\left(-\frac{1}{N}\sum_{i=1}^{N} y_{ij}\ln(p_{ij})\right) = \sum_{j=1}^{M} F_j$$
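The per-class decomposition above can be checked numerically with a small pure-Python sketch (the one-hot labels and probabilities below are made up for illustration):

```python
import math

def per_class_loss(Y, P):
    """F_j = -(1/N) * sum_i y_ij * ln(p_ij), one term per class j."""
    n, m = len(Y), len(Y[0])
    return [-sum(Y[i][j] * math.log(P[i][j]) for i in range(n)) / n
            for j in range(m)]

# One-hot true labels and predicted class probabilities for 3 samples, 3 classes.
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
P = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]

F_j = per_class_loss(Y, P)
F = sum(F_j)  # total mlogloss is the sum of the per-class terms
```

Here `F` equals `-(ln 0.7 + ln 0.8 + ln 0.6) / 3`, the same value the double-sum form gives directly.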

• Regularization is another important part of the model. A good regularization term controls the complexity of the model which prevents over-fitting.

• The training objective is: objective = loss function + regularization,

$$\text{obj} = L + \Omega$$

where the loss function controls predictive power and the regularization term controls simplicity (i.e., limits complexity).
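As a toy illustration of obj = L + Ω (assuming, purely for the sketch, a squared-error loss and an L2 penalty on the weights):

```python
def objective(y_true, y_pred, weights, lam=1.0):
    """Toy objective: squared-error loss plus an L2 regularization term."""
    loss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # L: fit to data
    omega = lam * sum(w ** 2 for w in weights)                  # Omega: complexity
    return loss + omega
```

With a perfect fit the loss term vanishes and only the complexity penalty remains, which is exactly the trade-off the formula expresses.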

• In xgboost, gradient descent is used to optimize the objective. Gradient descent is a first-order optimization algorithm that takes steps toward a local minimum. Performance can be improved by using both first- and second-order information (xgboost uses a second-order Taylor expansion of the loss).
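A small sketch of the first- vs. second-order idea on the toy function f(x) = (x − 3)², chosen here only for illustration: plain gradient descent needs many small steps, while a step that also uses the second derivative (a Newton step, the kind of curvature information xgboost exploits) reaches the minimum of a quadratic in one move.

```python
def grad(x):
    """f'(x) for f(x) = (x - 3)^2."""
    return 2 * (x - 3)

def hess(x):
    """f''(x) for f(x) = (x - 3)^2 (constant curvature)."""
    return 2.0

# First-order: x <- x - lr * f'(x), repeated many times.
x = 0.0
for _ in range(100):
    x -= 0.1 * grad(x)

# Second-order: x <- x - f'(x) / f''(x), a single Newton step.
x2 = 0.0
x2 -= grad(x2) / hess(x2)
```

After the loop `x` has crept close to 3; `x2` lands on 3 exactly, since the function is quadratic.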

## Tree Building Algorithm

• how can we define a tree that improves the prediction along the gradient?
• a decision tree makes binary splits; each data point flows to one of the leaves following the direction at each node
• internal node: each internal node splits the flow of data points by one of the features (variables)
• the condition on each edge specifies which data points can flow through
• leaves: data points that reach a leaf are assigned a weight, which is the prediction
• how to find a good structure?
• how to choose features to split?
• when to split?
• how to assign a prediction score?
• In each split we want to greedily find the best point that optimizes the objective by:
1. sorting the feature values
2. scanning for the best splitting point
3. choosing the best feature
• calculate the gain (e.g., Gini index or entropy)
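The three steps above can be sketched for a single numeric feature with binary labels, using Gini impurity reduction as the gain (a classic decision-tree criterion; xgboost itself derives its gain from gradient statistics):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Greedy split search: sort, scan thresholds, return (threshold, gain)."""
    pairs = sorted(zip(xs, ys))                  # step 1: sort feature values
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):                        # step 2: scan split points
        if pairs[i][0] == pairs[i - 1][0]:
            continue                             # can't split between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (parent
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
        if gain > best_gain:                     # step 3: keep the best split
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain
```

On `xs = [1, 2, 3, 4]`, `ys = [0, 0, 1, 1]` the best threshold is 2.5, and the gain equals the full parent impurity (0.5) because both children become pure.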

Nice note! Thx.

Jason2Brownlee commented May 25, 2024

Excellent notes on the algorithm!

I note that even knowing the algorithm internals does not help to configure the algorithm in practice. We still end up using a grid search to configure xgboost in practice.