This is an experiment to check whether multicollinearity degrades the performance of a tree-based machine learning model. This is purely out of curiosity: I wondered whether multicollinearity affects a tree-based model and hurts its performance. If it does, correlated features would need to be removed before building models.
I artificially generated three datasets, obtained scores with LightGBM and k-fold cross-validation, and compared the final scores (a data-generation sketch follows the list):
- dataset 1: 10 informative features (useful features for prediction)
- dataset 2: 10 informative features + 10 features with random values
- dataset 3: 10 informative features + 10 redundant features (random linear combinations of the informative features)
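A minimal sketch of how such datasets could be generated with scikit-learn's `make_classification`, which builds redundant features as random linear combinations of the informative ones and fills any remaining columns with noise. The sample size, seed, and use of this particular helper are my assumptions, not details confirmed by the write-up.

```python
import numpy as np
from sklearn.datasets import make_classification

def make_datasets(weights, n_samples=10_000, seed=42):
    """Generate the three datasets; sample size and seed are assumptions."""
    # dataset 1: 10 informative features only
    X1, y1 = make_classification(
        n_samples=n_samples, n_features=10, n_informative=10,
        n_redundant=0, n_repeated=0, weights=weights, random_state=seed,
    )
    # dataset 2: 10 informative + 10 noise features
    # (columns beyond informative + redundant are filled with random values)
    X2, y2 = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=0, n_repeated=0, weights=weights, random_state=seed,
    )
    # dataset 3: 10 informative + 10 redundant features
    # (redundant features are random linear combinations of the informative ones)
    X3, y3 = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=10, n_repeated=0, weights=weights, random_state=seed,
    )
    return [(X1, y1), (X2, y2), (X3, y3)]
```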
The experiment above was repeated with two class proportions: (0, 1) = (0.5, 0.5) (balanced) and (0, 1) = (0.99, 0.01) (imbalanced).
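A sketch of the scoring step, reusing `make_datasets` from the sketch above. The model parameters, the number of folds, and ROC AUC as the metric are assumptions; the write-up does not specify them.

```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

def score_datasets(datasets, n_splits=5, seed=42):
    """Return per-fold scores for each dataset (ROC AUC is an assumed metric)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for X, y in datasets:
        model = lgb.LGBMClassifier(random_state=seed)
        scores.append(cross_val_score(model, X, y, cv=cv, scoring="roc_auc"))
    return scores

# the two class proportions from the experiment: balanced and imbalanced
for weights in ([0.5, 0.5], [0.99, 0.01]):
    fold_scores = score_datasets(make_datasets(weights))
    # fold_scores[i] holds the per-fold scores for dataset i+1; these
    # per-fold distributions are what the boxplots compare
```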
The boxplots of the cross-validation scores were almost identical across the three datasets for both class proportions. Under the conditions used in this experiment, multicollinearity did not degrade the performance of a tree-based model.
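For completeness, a minimal way to draw the boxplot comparison, assuming matplotlib; the labels and axis names are hypothetical, and `fold_scores` comes from the scoring sketch above.

```python
import matplotlib.pyplot as plt

# one array of per-fold CV scores per dataset
plt.boxplot(fold_scores)
plt.xticks([1, 2, 3], ["informative only", "+ random noise", "+ redundant"])
plt.ylabel("ROC AUC (assumed metric)")
plt.title("Per-fold CV scores across the three datasets")
plt.show()
```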