@GermanCM
Created February 2, 2019 18:53
Get dummy variables stratifying by frequency threshold
# source: https://stackoverflow.com/questions/18016495/get-subset-of-most-frequent-dummy-variables-in-pandas
import pandas as pd


# Returns a DataFrame of dummy variables for the frequent categories of a column;
# categories whose share of rows is not above the threshold are grouped as "others".
def dum_sign(dummy_col, threshold=0.1):
    # work on a copy so the caller's Series is not modified
    dummy_col = dummy_col.copy()
    # share of rows taken by each category in the whole column
    count = dummy_col.value_counts() / len(dummy_col)
    # True for rows whose category has a share above the threshold
    mask = dummy_col.isin(count[count > threshold].index)
    # replace the categories whose share is not above the threshold with a catch-all label
    dummy_col[~mask] = "others"
    return pd.get_dummies(dummy_col, prefix=dummy_col.name)
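
A minimal usage sketch, assuming a small made-up `color` Series (the sample data and the `0.2` threshold are illustrative, not from the original gist). The function is repeated so the example runs on its own; it shows that rare categories collapse into a single `others` column while the input Series stays untouched.

```python
import pandas as pd


def dum_sign(dummy_col, threshold=0.1):
    # Same logic as the gist's function, repeated so this sketch is self-contained.
    dummy_col = dummy_col.copy()
    count = dummy_col.value_counts() / len(dummy_col)
    mask = dummy_col.isin(count[count > threshold].index)
    dummy_col[~mask] = "others"
    return pd.get_dummies(dummy_col, prefix=dummy_col.name)


# Hypothetical sample data: "red" (50%) and "blue" (30%) are frequent,
# "green" and "pink" (10% each) are rare.
colors = pd.Series(["red"] * 5 + ["blue"] * 3 + ["green", "pink"], name="color")

# Only categories covering more than 20% of the rows keep their own column.
dummies = dum_sign(colors, threshold=0.2)
print(sorted(dummies.columns))
# ['color_blue', 'color_others', 'color_red']
```

Note that the comparison is strict (`count > threshold`), so a category whose share is exactly equal to the threshold also ends up in `others`.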