@DnanaDev
Last active September 18, 2020 12:25
[ML - Tree-based Models and Categorical Data]

One-hot encoded categorical data doesn't work well with implementations like sklearn's Random Forest and XGBoost.

There seem to be differing opinions about using one-hot encoded categorical features with implementations that don't natively support them. Try CatBoost or H2O's Random Forest, which support categorical data by design. Also, investigate why one-hot encoding is not recommended for features with high cardinality: it creates very sparse features, and tree-based models rarely pick an individual sparse dummy column to split on.
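
The sparsity point can be seen with a toy example (hypothetical locality names, not the gist's actual data): with K categories, each one-hot row is non-zero in exactly one of K dummy columns, so the encoded matrix is almost all zeros as K grows.

```python
# Toy illustration of why one-hot encoding a high-cardinality feature
# produces a very sparse matrix (locality names are made up).
localities = ["South Delhi", "Dwarka", "Rohini", "Saket", "Dwarka",
              "South Delhi", "Vasant Kunj", "Rohini"]

categories = sorted(set(localities))          # K distinct categories
encoded = [[1 if loc == cat else 0 for cat in categories]
           for loc in localities]

n_cells = len(encoded) * len(categories)
n_zeros = sum(row.count(0) for row in encoded)
sparsity = n_zeros / n_cells                  # fraction of zero entries

print(f"{len(categories)} dummy columns, sparsity = {sparsity:.2f}")
# With K categories the sparsity is (K-1)/K, so it approaches 1 as K grows.
```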

For reference:
https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/
https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
https://www.kaggle.com/c/avito-demand-prediction/discussion/57094
https://www.kaggle.com/c/zillow-prize-1/discussion/38793
https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/discussion/19851

Resolution: comparing the feature importances of the models.

  1. Random Forest - puts nearly all importance on the size feature; the categorical features are drowned out.

(Screenshot: Random Forest feature importances)
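
A minimal sketch of the effect, on synthetic data (the feature names and coefficients are made up, not the gist's dataset): impurity-based importance tends to favour a continuous feature like size, which offers many candidate split points, over sparse one-hot dummies, which each offer only one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one continuous "size" feature plus four hypothetical
# one-hot locality dummies, with price depending on both.
rng = np.random.default_rng(0)
n = 500
size = rng.uniform(300, 3000, n)
locality = rng.integers(0, 4, n)
dummies = np.eye(4)[locality]                 # one-hot encode locality
price = 50 * size + 2e4 * locality + rng.normal(0, 1e4, n)

X = np.column_stack([size, dummies])
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)

names = ["size", "loc_0", "loc_1", "loc_2", "loc_3"]
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
# The continuous size feature typically tops the list by a wide margin.
```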

  2. XGBoost Regressor - puts more importance on the categorical features like South Delhi, which is good, but has worse validation-set performance than RF.

(Screenshot: XGBoost feature importances)

  3. CatBoost - best-performing individual model. Its feature importances can't be compared directly with the other models', but it puts significant importance on the categorical features. A good option.

(Screenshot: CatBoost feature importances)
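
Since the built-in importances of RF, XGBoost, and CatBoost are computed differently, one way to compare the three models on a common scale is permutation importance on the validation set: shuffle one column, measure the drop in score. A sketch with sklearn's `permutation_importance` (synthetic data, hypothetical feature names):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Model-agnostic importance: any fitted regressor with a .score method
# can be passed in, so the same numbers are comparable across libraries.
rng = np.random.default_rng(0)
n = 400
size = rng.uniform(300, 3000, n)
locality = rng.integers(0, 3, n)
X = np.column_stack([size, np.eye(3)[locality]])
y = 50 * size + 1e5 * (locality == 2) + rng.normal(0, 1e4, n)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_va, y_va, n_repeats=10,
                                random_state=0)
for name, imp in zip(["size", "loc_0", "loc_1", "loc_2"],
                     result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Swapping `model` for a fitted XGBoost or CatBoost regressor gives directly comparable numbers.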
