Created
February 2, 2019 18:53
Get dummy variables stratifying by frequency threshold
# source: https://stackoverflow.com/questions/18016495/get-subset-of-most-frequent-dummy-variables-in-pandas
import pandas as pd


# Returns a dummified DataFrame containing only the significant (frequent) values of a column;
# values at or below the frequency threshold are grouped into a single "others" dummy.
def dum_sign(dummy_col, threshold=0.1):
    # work on a copy so the caller's Series is not modified
    dummy_col = dummy_col.copy()
    # share of each distinct value relative to the whole column
    count = dummy_col.value_counts() / len(dummy_col)
    # True where the value's share is higher than the threshold
    mask = dummy_col.isin(count[count > threshold].index)
    # replace the values whose share is not above the threshold with a special name
    dummy_col[~mask] = "others"
    return pd.get_dummies(dummy_col, prefix=dummy_col.name)
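A minimal usage sketch, assuming a reasonably recent pandas is installed. The function is restated so the snippet runs on its own. The `colors` Series and the `threshold=0.2` value are hypothetical and chosen only for illustration.

```python
import pandas as pd


def dum_sign(dummy_col, threshold=0.1):
    # Restated from the gist so this snippet is self-contained.
    dummy_col = dummy_col.copy()
    count = dummy_col.value_counts() / len(dummy_col)
    mask = dummy_col.isin(count[count > threshold].index)
    dummy_col[~mask] = "others"
    return pd.get_dummies(dummy_col, prefix=dummy_col.name)


# Hypothetical sample data: "red" (50%) and "blue" (30%) are frequent,
# "green" and "pink" (10% each) fall below the 20% threshold.
colors = pd.Series(
    ["red"] * 5 + ["blue"] * 3 + ["green", "pink"],
    name="color",
)

dummies = dum_sign(colors, threshold=0.2)
columns = sorted(dummies.columns)
print(columns)
```

Here the two rare values should collapse into a single `color_others` column, while `color_red` and `color_blue` keep their own dummies. Note that the comparison is strict (`count > threshold`), so a value whose share equals the threshold exactly is also grouped into "others".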