@DnanaDev
DnanaDev / Tree_Categorical_data.md
Last active September 18, 2020 12:25
[ML- Tree based Models and Categorical data]

sklearn's RF and XGBoost can perform poorly on one-hot encoded categorical data.

There seem to be differing opinions about using one-hot encoded categorical features with implementations that don't natively support categorical data. Try CatBoost or H2O Random Forest, which support categorical data by design. Also to investigate: one-hot encoding is reportedly not recommended for features with high cardinality, apparently because it creates very sparse features.
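
To make the sparsity point concrete, here is a minimal, dependency-free sketch (not taken from any of the libraries above). The feature, its 200 made-up `city_*` codes, and the helper names `one_hot` and `sparsity` are all hypothetical. It only illustrates that a one-hot matrix with k columns has exactly one non-zero cell per row, so the fraction of zeros is 1 - 1/k and grows with cardinality.

```python
import random

def one_hot(values):
    """Return (categories, rows); each row is a 0/1 list with one column per category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one active column per row
        rows.append(row)
    return categories, rows

def sparsity(rows):
    """Fraction of zero cells in the encoded matrix."""
    total = sum(len(r) for r in rows)
    zeros = sum(r.count(0) for r in rows)
    return zeros / total

random.seed(0)
# Hypothetical high-cardinality feature: 1,000 rows drawn from 200 made-up city codes.
cities = [f"city_{random.randrange(200)}" for _ in range(1000)]
categories, encoded = one_hot(cities)
n_columns = len(categories)
frac_zero = sparsity(encoded)
print(f"{n_columns} one-hot columns, {frac_zero:.2%} of cells are zero")
```

Each resulting column is non-zero for only a small share of rows, so a tree that splits on one such column separates off very few samples at a time. That is one plausible reading of the "very sparse features" concern; the linked articles discuss it in more detail.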

For reference:
- https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/
- https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
- https://www.kaggle.com/c/avito-demand-prediction/discussion/57094
- https://www.kaggle.com/c/zillow-prize-1/discussion/38793
- https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/discussion/19851

@DnanaDev
DnanaDev / Covid_ingest_DB.py
Last active July 26, 2020 09:48
SQLite Data Ingestion script
""" Data Ingestion for Covid19 Data Pipeline
Uses a SQLAlchemy engine to interface with a PostgreSQL database.
Functions to create the DB according to the schema and to ingest data.
The use case is to run the script to automatically update the CSVs in Data/Raw and to
store the cleaned data in the database. A backup of the database is stored in Data/cleaned.
# Data Ingestion Functions
1. add_data_table(engine, tablename, df)
Appends a pandas DataFrame from Covid19_india_org_api to the table using SQLAlchemy and DF.to_sql()