@DnanaDev
Last active September 18, 2020 12:25
[ML - Tree-based Models and Categorical Data]

One-hot encoded categorical data doesn't work well with implementations like sklearn's Random Forest and XGBoost.

There seem to be differing opinions about using one-hot encoded categorical features with implementations that don't natively support them. Try CatBoost or H2O's Random Forest, which support categorical data by design. Also, investigate why one-hot encoding is not recommended for features with high cardinality: it creates very sparse features, and tree-based models rarely pick an individual sparse dummy column to split on.
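
The sparsity point can be seen with a toy example (hypothetical locality names, not the gist's actual data): with K categories, each one-hot row is non-zero in exactly one of K dummy columns, so the encoded matrix is almost all zeros as K grows.

```python
# Toy illustration of why one-hot encoding a high-cardinality feature
# produces a very sparse matrix (locality names are made up).
localities = ["South Delhi", "Dwarka", "Rohini", "Saket", "Dwarka",
              "South Delhi", "Vasant Kunj", "Rohini"]

categories = sorted(set(localities))          # K distinct categories
encoded = [[1 if loc == cat else 0 for cat in categories]
           for loc in localities]

n_cells = len(encoded) * len(categories)
n_zeros = sum(row.count(0) for row in encoded)
sparsity = n_zeros / n_cells                  # fraction of zero entries

print(f"{len(categories)} dummy columns, sparsity = {sparsity:.2f}")
# With K categories the sparsity is (K-1)/K, so it approaches 1 as K grows.
```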

For reference:
https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/
https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
https://www.kaggle.com/c/avito-demand-prediction/discussion/57094
https://www.kaggle.com/c/zillow-prize-1/discussion/38793
https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/discussion/19851

Resolution: comparing the feature importances of the models.

  1. Random Forest - puts nearly all importance on the size feature; the categorical features are drowned out.

(Screenshot: Random Forest feature importances)
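
A minimal sketch of the effect, on synthetic data (the feature names and coefficients are made up, not the gist's dataset): impurity-based importance tends to favour a continuous feature like size, which offers many candidate split points, over sparse one-hot dummies, which each offer only one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one continuous "size" feature plus four hypothetical
# one-hot locality dummies, with price depending on both.
rng = np.random.default_rng(0)
n = 500
size = rng.uniform(300, 3000, n)
locality = rng.integers(0, 4, n)
dummies = np.eye(4)[locality]                 # one-hot encode locality
price = 50 * size + 2e4 * locality + rng.normal(0, 1e4, n)

X = np.column_stack([size, dummies])
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)

names = ["size", "loc_0", "loc_1", "loc_2", "loc_3"]
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
# The continuous size feature typically tops the list by a wide margin.
```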

  2. XGBoost Regressor - puts more importance on the categorical features like South Delhi, which is good, but has worse validation-set performance than RF.

(Screenshot: XGBoost feature importances)

  3. CatBoost - best-performing individual model. Its feature importances can't be compared directly with the other models', but it puts significant importance on the categorical features. A good option.

(Screenshot: CatBoost feature importances)
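
Since the built-in importances of RF, XGBoost, and CatBoost are computed differently, one way to compare the three models on a common scale is permutation importance on the validation set: shuffle one column, measure the drop in score. A sketch with sklearn's `permutation_importance` (synthetic data, hypothetical feature names):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Model-agnostic importance: any fitted regressor with a .score method
# can be passed in, so the same numbers are comparable across libraries.
rng = np.random.default_rng(0)
n = 400
size = rng.uniform(300, 3000, n)
locality = rng.integers(0, 3, n)
X = np.column_stack([size, np.eye(3)[locality]])
y = 50 * size + 1e5 * (locality == 2) + rng.normal(0, 1e4, n)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_va, y_va, n_repeats=10,
                                random_state=0)
for name, imp in zip(["size", "loc_0", "loc_1", "loc_2"],
                     result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Swapping `model` for a fitted XGBoost or CatBoost regressor gives directly comparable numbers.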
