@Swarchal
Last active June 23, 2022 04:17
findCorrelation in python
import pandas as pd
import numpy as np


def find_correlation(df, thresh=0.9):
    """
    Given a numeric pd.DataFrame, this will find highly correlated features,
    and return a list of features to remove
    params:
    - df : pd.DataFrame
    - thresh : correlation threshold, will flag one feature of each pair with
               a correlation greater than this value
    """
    corrMatrix = df.corr()
    # keep only the strict lower triangle so each pair is considered once
    corrMatrix.loc[:, :] = np.tril(corrMatrix, k=-1)

    already_in = set()
    result = []

    for col in corrMatrix:
        # features whose correlation with `col` exceeds the threshold
        perfect_corr = corrMatrix[col][corrMatrix[col] > thresh].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)

    # keep the first feature of each correlated group and flag the rest
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat
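
The function above compares signed correlations against thresh, so a strongly negative pair (for example -0.99) is not flagged. Below is a minimal sketch of a variant that compares absolute correlations instead, assuming that behaviour is wanted. The name find_correlation_abs and the sample data are hypothetical and not part of the gist. It also uses the upper triangle, so of each correlated pair the later column is the one flagged.

```python
import numpy as np
import pandas as pd


def find_correlation_abs(df, thresh=0.9):
    """Hypothetical variant: flag columns whose absolute correlation with
    an earlier column exceeds thresh."""
    corr = df.corr().abs()
    # Boolean mask for the strict upper triangle: each pair appears once,
    # in the column of the later feature.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > thresh).any()]


# Made-up sample data: b tracks a closely, d is close to the negative of a.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 3, 4, 1, 2],
    "d": [-1, -2, -3, -4, -5.5],
})

to_drop = find_correlation_abs(df, thresh=0.9)
reduced = df.drop(columns=to_drop)
print(to_drop)                 # ['b', 'd']
print(list(reduced.columns))   # ['a', 'c']
```

Either version returns a list of column names, so the result can be passed straight to df.drop(columns=...).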
@rathdebi

rathdebi commented Jan 9, 2019

Hi,

Nice to see this; it was useful.
This only works for numerical columns.
How would you do it for categorical columns?

@vincenzo-scotto001

@rathdebi

Here is a website I often refer to for categorical columns and ways to encode them:
https://pbpython.com/categorical-encoding.html

Hope that helps.
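
As a rough sketch of one option, assuming one-hot encoding suits the data: pd.get_dummies turns a categorical column into numeric indicator columns, after which df.corr() works as usual. The column names and values below are made up for illustration.

```python
import pandas as pd

# Made-up data with one categorical and one numeric column.
df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "price": [10.0, 15.0, 20.0, 15.5],
})

# One-hot encode the categorical column; dtype=float keeps the
# indicator columns numeric so they take part in the correlation.
encoded = pd.get_dummies(df, columns=["size"], dtype=float)

corr = encoded.corr()
print(list(encoded.columns))  # ['price', 'size_L', 'size_M', 'size_S']
```

Indicator columns from the same original column are negatively correlated with each other by construction, so check flagged columns by hand before dropping them.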

@eloieloieloi

Very specific and useful, thanks!
