This is an experiment to check whether multicollinearity hurts tree-based machine learning models. It is purely out of curiosity: I wondered whether multicollinearity matters for a tree-based model and degrades its performance. If it does, one would need to remove correlated features before building such models.
I artificially generated three datasets, obtained scores with LightGBM and k-fold cross-validation, and compared the final scores.
- dataset 1: 10 informative features (useful features for prediction)
- dataset 2: 10 informative features + 10 features with random values
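A setup along these lines can be sketched as follows. This is a hypothetical reconstruction, not the author's actual code: the sample size, noise level, and seed are assumptions, and scikit-learn's `GradientBoostingRegressor` stands in for LightGBM so the snippet runs without extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# dataset 1: 10 informative features that actually drive the target
X_inf = rng.normal(size=(n, 10))
y = X_inf @ rng.normal(size=10) + rng.normal(scale=0.1, size=n)

# dataset 2: the same 10 informative features plus 10 pure-noise features
X_noise = np.hstack([X_inf, rng.normal(size=(n, 10))])

# score each dataset with 5-fold cross-validation
model = GradientBoostingRegressor(random_state=0)
for name, X in [("informative only", X_inf), ("with noise features", X_noise)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

Comparing the mean cross-validation scores across such datasets shows whether the extra (noise or correlated) columns change the model's performance.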