| title | layout |
| --- | --- |
| Notes on xgboost: Extreme Gradient Boosting Machine | default |
Background
- Computation in C++
- tree-based models
- easy to use, install, R/python interface
- Automatic parallel computation on a single machine
- Tunable parameters
- requires a sparse matrix (dgCMatrix), which is more memory efficient
- need to provide input features, target, objective, and number of iterations
- can score/predict the model on new data
- cross-validation measures a model's predictive power (see the sketch after this list)
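As a concrete illustration of this workflow, here is a minimal sketch using the Python interface; the toy data, parameter values, and number of rounds are assumptions for the example, not part of these notes.

```python
# Minimal sketch: sparse input, train, predict on new data, cross-validate.
import numpy as np
import xgboost as xgb
from scipy import sparse

# Toy data: a sparse feature matrix (CSR, analogous to R's dgCMatrix) and a binary target.
rng = np.random.default_rng(0)
X = sparse.random(1000, 20, density=0.1, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)

# Input features, target, objective, and number of iterations are the required pieces.
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=50)

# Score/predict on new data.
X_new = sparse.random(5, 20, density=0.1, format="csr", random_state=1)
preds = bst.predict(xgb.DMatrix(X_new))

# Cross-validation to measure predictive power.
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics="logloss")
print(cv_results.tail())
```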
Supervised Machine Learning
- To train a model we need to optimize a loss function.
- Root Mean Square Error (RMSE) is typically used for regression:
$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})} = \sqrt{E\big((\hat{\theta} - \theta)^2\big)}$$
- Logloss is used for binary/logistic classification:
$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\big(y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\big)$$
- mLogloss is used for multi-class classification (multinomial logistic analysis):
$$\text{mlogloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\log(p_{ij})$$
- The multi-class logloss decomposes into a sum of per-class terms:
$$F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\ln(p_{ij}) = \sum_{j=1}^{M}\left(-\frac{1}{N}\sum_{i=1}^{N}y_{ij}\ln(p_{ij})\right) = \sum_{j=1}^{M}F_j$$
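A small numeric sketch of these metrics (the function names and toy values are assumptions for illustration only):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error for regression.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def logloss(y_true, p, eps=1e-15):
    # Binary logloss; probabilities are clipped to avoid log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def mlogloss(Y_onehot, P, eps=1e-15):
    # Multi-class logloss over N samples and M classes.
    P = np.clip(P, eps, 1 - eps)
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))

print(rmse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))   # 0.5
print(logloss(np.array([1, 0]), np.array([0.9, 0.2])))    # ~0.164
```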
- Regularization is another important part of the model. A good regularization term controls the complexity of the model, which prevents over-fitting.
- The training objective is: objective = loss function + regularization
$$\mathrm{obj} = L + \Omega$$
where the loss function $L$ controls the predictive power and the regularization term $\Omega$ controls simplicity (or rather complexity).
- In xgboost, gradient descent is used to optimize the objective; it is a first-order optimization algorithm that takes steps toward a local minimum (or peak). The performance can be improved by considering both the first- and second-order gradient information, as sketched below.
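As a sketch of how this looks in xgboost (the symbols below follow the xgboost paper and are not defined elsewhere in these notes): for a tree $f_t$ with $T$ leaves and leaf weights $w_j$, the regularization term is

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$

and the objective at iteration $t$ is approximated with a second-order Taylor expansion of the loss,

$$\mathrm{obj}^{(t)} \approx \sum_{i=1}^{N}\left[g_i f_t(x_i) + \frac{1}{2}h_i f_t(x_i)^2\right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$.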
Tree Building algorithm
- how can we define a tree that improves the prediction along the gradient?
- decision tree (binary splits), each data point flows to one of the leaves following the direction on each node
- internal node: each internal node splits the flow of the data points by one of the features (variables)
- the condition on the edge specifies which data can flow through
- leaves: data points reach a leaf and are assigned a weight, which is the prediction
- how to find a good structure?
- how to choose features to split?
- when to split?
- how to assign a prediction score?
- In each split we want to greedily find the best split point that optimizes the objective by:
- sorting the feature values
- scanning to find the best splitting point
- choosing the best feature
- calculating the gain (gini index or entropy); see the sketch after this list
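The sort-and-scan step can be sketched as follows. This version uses the structure-score gain from the xgboost paper, computed from per-sample gradient/hessian statistics, rather than gini/entropy; the function name, toy data, and regularization values are assumptions for illustration.

```python
# Greedy split finding on a single feature: sort the values, scan every candidate
# split point, and keep the one with the largest gain. g/h are the per-sample
# first/second derivatives of the loss; lam and gamma are regularization terms.
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)                  # 1. sort the feature values
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_value = 0.0, None

    def score(G, H):
        return G * G / (H + lam)

    for i in range(len(x) - 1):            # 2. scan each candidate split point
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        if x[i] == x[i + 1]:
            continue                        # identical values cannot be separated
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:                # 3. keep the best split for this feature
            best_gain, best_value = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_value

# Toy usage: gradients flip sign between the 2nd and 3rd points, so the split lands there.
x = np.array([1.0, 2.0, 3.0, 10.0])
g = np.array([-1.0, -1.0, 1.0, 1.0])
h = np.ones(4)
print(best_split(x, g, h))                  # -> (~1.17, 2.5)
```

Repeating this over all features and keeping the feature with the largest gain gives the greedy choice for one node.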