This is an experiment to check whether multicollinearity degrades the performance of a tree-based machine learning model. This is purely out of curiosity: I wondered whether multicollinearity affects a tree-based model and hurts its performance. If it does, correlated features would need to be removed before building models.
I artificially generated three datasets, obtained scores with LightGBM and k-fold cross-validation, and compared the final scores (a data-generation sketch follows the list):
- dataset 1: 10 informative features (useful features for prediction)
- dataset 2: 10 informative features + 10 features with random values
- dataset 3: 10 informative features + 10 redundant features (random linear combinations of the informative features)
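A minimal sketch of how such datasets could be generated with scikit-learn's `make_classification`, which builds redundant features as random linear combinations of the informative ones and fills any remaining columns with noise. The sample size, seed, and use of this particular helper are my assumptions, not details confirmed by the write-up.

```python
import numpy as np
from sklearn.datasets import make_classification

def make_datasets(weights, n_samples=10_000, seed=42):
    """Generate the three datasets; sample size and seed are assumptions."""
    # dataset 1: 10 informative features only
    X1, y1 = make_classification(
        n_samples=n_samples, n_features=10, n_informative=10,
        n_redundant=0, n_repeated=0, weights=weights, random_state=seed,
    )
    # dataset 2: 10 informative + 10 noise features
    # (columns beyond informative + redundant are filled with random values)
    X2, y2 = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=0, n_repeated=0, weights=weights, random_state=seed,
    )
    # dataset 3: 10 informative + 10 redundant features
    # (redundant features are random linear combinations of the informative ones)
    X3, y3 = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=10, n_repeated=0, weights=weights, random_state=seed,
    )
    return [(X1, y1), (X2, y2), (X3, y3)]
```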
The experiment above was repeated with two class proportions: (0, 1) = (0.5, 0.5) (balanced) and (0, 1) = (0.99, 0.01) (imbalanced).
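A sketch of the scoring step, reusing `make_datasets` from the sketch above. The model parameters, the number of folds, and ROC AUC as the metric are assumptions; the write-up does not specify them.

```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

def score_datasets(datasets, n_splits=5, seed=42):
    """Return per-fold scores for each dataset (ROC AUC is an assumed metric)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for X, y in datasets:
        model = lgb.LGBMClassifier(random_state=seed)
        scores.append(cross_val_score(model, X, y, cv=cv, scoring="roc_auc"))
    return scores

# the two class proportions from the experiment: balanced and imbalanced
for weights in ([0.5, 0.5], [0.99, 0.01]):
    fold_scores = score_datasets(make_datasets(weights))
    # fold_scores[i] holds the per-fold scores for dataset i+1; these
    # per-fold distributions are what the boxplots compare
```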
The boxplots of the cross-validation scores were almost identical across the three datasets for both class proportions. Under the conditions used in this experiment, multicollinearity did not degrade the performance of a tree-based model.
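For completeness, a minimal way to draw the boxplot comparison, assuming matplotlib; the labels and axis names are hypothetical, and `fold_scores` comes from the scoring sketch above.

```python
import matplotlib.pyplot as plt

# one array of per-fold CV scores per dataset
plt.boxplot(fold_scores)
plt.xticks([1, 2, 3], ["informative only", "+ random noise", "+ redundant"])
plt.ylabel("ROC AUC (assumed metric)")
plt.title("Per-fold CV scores across the three datasets")
plt.show()
```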