@mzaradzki
Last active July 4, 2017 11:09
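import pandas as pd

# Search for pairs of categorical variables that are very similar.
# Assumes `dfX` (a DataFrame holding these columns) and
# `cramers_corrected_stat` are defined elsewhere.
def show_similars(cols, threshold=0.90):
    for i1, col1 in enumerate(cols):
        for i2, col2 in enumerate(cols):
            if i1 < i2:  # visit each unordered pair once
                cm12 = pd.crosstab(dfX[col1], dfX[col2]).values  # contingency table
                cv12 = cramers_corrected_stat(cm12)  # Cramer V statistic
                if cv12 > threshold:
                    print((col1, col2), int(cv12 * 100))  # score as a truncated percentage

show_similars(['basin', 'region', 'region_code', 'district_code', 'lga'], 0.95)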
# Output :
# ('region', 'region_code') 99
# ('region', 'lga') 99
# ('region_code', 'lga') 97
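
The snippet above calls `cramers_corrected_stat`, whose definition is not part of this gist. As a rough sketch only, and assuming the helper computes the commonly used bias-corrected form of Cramér's V from a table of counts, it might look like the following. The body is an assumption, not the author's code; it uses plain Python so it accepts either nested lists or the NumPy array returned by `.values`, and it applies no continuity correction.

```python
import math

def cramers_corrected_stat(confusion_matrix):
    """Bias-corrected Cramer's V for a 2-D table of counts (assumed definition)."""
    rows = [[float(x) for x in row] for row in confusion_matrix]
    r, k = len(rows), len(rows[0])
    n = sum(sum(row) for row in rows)
    if n <= 1 or r < 2 or k < 2:
        return 0.0  # association is undefined for degenerate tables

    row_tot = [sum(row) for row in rows]
    col_tot = [sum(row[j] for row in rows) for j in range(k)]

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (rows[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the effective table dimensions
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    denom = min(kcorr - 1, rcorr - 1)
    if denom <= 0:
        return 0.0
    return math.sqrt(phi2corr / denom)
```

Under this definition, a table in which one variable fully determines the other scores close to 1, and a table of independent variables scores 0, which is consistent with reading the 0.95 threshold above as "nearly redundant".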