@Dixhom
Created October 10, 2022 02:29
tree-model-multicollinearity

What is this?

This is an experiment to check whether multicollinearity degrades the performance of a tree-based machine learning model. It is purely out of curiosity: I wondered whether multicollinearity matters in a tree-based model and hurts its performance. If it does, correlated features would need to be removed before building models.

Method

I artificially generated three datasets, scored each with LightGBM under k-fold cross-validation, and compared the final scores.

  1. dataset 1: 10 informative features (useful features for prediction)
  2. dataset 2: 10 informative features + 10 features with random values
  3. dataset 3: 10 informative features + 10 redundant features (random linear combination of the informative features)

The experiment above was repeated with two class-imbalance proportions: (0, 1) = (0.5, 0.5) and (0, 1) = (0.99, 0.01).
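The three datasets can be built with scikit-learn's `make_classification`, whose `n_informative` and `n_redundant` parameters map directly onto the descriptions above (redundant features are random linear combinations of the informative ones, and leftover columns are filled with noise). This is a sketch under assumed settings; the sample size and function names here are my own choices, not taken from the original experiment.

```python
from sklearn.datasets import make_classification

def make_three_datasets(n_samples=2000, weights=(0.5, 0.5), seed=0):
    """Sketch of the three datasets; n_samples is an assumption."""
    common = dict(n_samples=n_samples, n_informative=10, n_repeated=0,
                  weights=list(weights), random_state=seed)
    # dataset 1: 10 informative features only
    d1 = make_classification(n_features=10, n_redundant=0, **common)
    # dataset 2: 10 informative + 10 noise features
    # (make_classification fills the remaining columns with random values)
    d2 = make_classification(n_features=20, n_redundant=0, **common)
    # dataset 3: 10 informative + 10 redundant features
    # (redundant = random linear combinations of the informative ones)
    d3 = make_classification(n_features=20, n_redundant=10, **common)
    return d1, d2, d3
```

Passing `weights=(0.99, 0.01)` reproduces the imbalanced variant of the experiment.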

Result

The boxplots of the scores for the three datasets were almost identical under both class-imbalance proportions. Under the conditions of this experiment, multicollinearity does not degrade the performance of a tree-based machine learning model.
