| title | layout |
| --- | --- |
| Notes on xgboost: Extreme Gradient Boosting Machine | default |
Background
- Computation in C++
- tree-based models
- easy to use, install, R/python interface
- Automatic parallel computation on a single machine
- Tunable parameters
- requires a sparse matrix (dgCMatrix), which is more memory efficient
- need to provide input features, target, objective, and number of iterations
- can score/predict the model on new data
- cross-validation measures a model's predictive power (see the sketch after this list)
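As a concrete illustration of this workflow, here is a minimal sketch using the Python interface; the toy data, parameter values, and number of rounds are assumptions for the example, not part of these notes.

```python
# Minimal sketch: sparse input, train, predict on new data, cross-validate.
import numpy as np
import xgboost as xgb
from scipy import sparse

# Toy data: a sparse feature matrix (CSR, analogous to R's dgCMatrix) and a binary target.
rng = np.random.default_rng(0)
X = sparse.random(1000, 20, density=0.1, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)

# Input features, target, objective, and number of iterations are the required pieces.
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=50)

# Score/predict on new data.
X_new = sparse.random(5, 20, density=0.1, format="csr", random_state=1)
preds = bst.predict(xgb.DMatrix(X_new))

# Cross-validation to measure predictive power.
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics="logloss")
print(cv_results.tail())
```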
Supervised Machine Learning
- To train a model we need to optimize a loss function.
- Root Mean Square Error (RMSE) is typically used for regression:
$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})} = \sqrt{E\big((\hat{\theta} - \theta)^2\big)}$$
- Logloss is used for binary/logistic classification:
$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\big(y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\big)$$
- mLogloss is used for multi-class classification (multinomial logistic analysis):
$$\text{mlogloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\log(p_{ij})$$
- The multi-class logloss decomposes into a sum of per-class terms:
$$F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\ln(p_{ij}) = \sum_{j=1}^{M}\left(-\frac{1}{N}\sum_{i=1}^{N}y_{ij}\ln(p_{ij})\right) = \sum_{j=1}^{M}F_j$$
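A small numeric sketch of these metrics (the function names and toy values are assumptions for illustration only):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error for regression.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def logloss(y_true, p, eps=1e-15):
    # Binary logloss; probabilities are clipped to avoid log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def mlogloss(Y_onehot, P, eps=1e-15):
    # Multi-class logloss over N samples and M classes.
    P = np.clip(P, eps, 1 - eps)
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))

print(rmse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))   # 0.5
print(logloss(np.array([1, 0]), np.array([0.9, 0.2])))    # ~0.164
```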
- Regularization is another important part of the model. A good regularization term controls the complexity of the model, which prevents over-fitting.
- The training objective is: objective = loss function + regularization
$$\mathrm{obj} = L + \Omega$$
where the loss function $L$ controls the predictive power and the regularization term $\Omega$ controls simplicity (or rather complexity).
- In xgboost, gradient descent is used to optimize the objective; it is a first-order optimization algorithm that takes steps toward a local minimum (or peak). The performance can be improved by considering both the first- and second-order gradient information, as sketched below.
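As a sketch of how this looks in xgboost (the symbols below follow the xgboost paper and are not defined elsewhere in these notes): for a tree $f_t$ with $T$ leaves and leaf weights $w_j$, the regularization term is

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$

and the objective at iteration $t$ is approximated with a second-order Taylor expansion of the loss,

$$\mathrm{obj}^{(t)} \approx \sum_{i=1}^{N}\left[g_i f_t(x_i) + \frac{1}{2}h_i f_t(x_i)^2\right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$.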
Tree Building algorithm
- how can we define a tree that improves the prediction along the gradient?
- decision tree (binary splits), each data point flows to one of the leaves following the direction on each node
- internal node: each internal node splits the flow of the data points by one of the features (variables)
- the condition on the edge specifies which data can flow through
- leaves: data points reach a leaf and are assigned a weight, which is the prediction
- how to find a good structure?
- how to choose features to split?
- when to split?
- how to assign a prediction score?
- In each split we want to greedily find the best split point that optimizes the objective by:
- sorting the feature values
- scanning to find the best splitting point
- choosing the best feature
- calculating the gain (gini index or entropy); see the sketch after this list
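The sort-and-scan step can be sketched as follows. This version uses the structure-score gain from the xgboost paper, computed from per-sample gradient/hessian statistics, rather than gini/entropy; the function name, toy data, and regularization values are assumptions for illustration.

```python
# Greedy split finding on a single feature: sort the values, scan every candidate
# split point, and keep the one with the largest gain. g/h are the per-sample
# first/second derivatives of the loss; lam and gamma are regularization terms.
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)                  # 1. sort the feature values
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_value = 0.0, None

    def score(G, H):
        return G * G / (H + lam)

    for i in range(len(x) - 1):            # 2. scan each candidate split point
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        if x[i] == x[i + 1]:
            continue                        # identical values cannot be separated
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:                # 3. keep the best split for this feature
            best_gain, best_value = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_value

# Toy usage: gradients flip sign between the 2nd and 3rd points, so the split lands there.
x = np.array([1.0, 2.0, 3.0, 10.0])
g = np.array([-1.0, -1.0, 1.0, 1.0])
h = np.ones(4)
print(best_split(x, g, h))                  # -> (~1.17, 2.5)
```

Repeating this over all features and keeping the feature with the largest gain gives the greedy choice for one node.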