findCorrelation in Python
import pandas as pd
import numpy as np


def find_correlation(df, thresh=0.9):
    """
    Given a numeric pd.DataFrame, find highly correlated features
    and return a list of features to remove.

    params:
    - df : pd.DataFrame
    - thresh : correlation threshold; one feature from each pair with
      a correlation greater than this value is flagged for removal
    """
    corrMatrix = df.corr()
    # Keep only the strict lower triangle so each pair is considered once
    corrMatrix.loc[:, :] = np.tril(corrMatrix, k=-1)

    already_in = set()
    result = []
    for col in corrMatrix:
        # Features whose signed correlation with col exceeds the threshold;
        # strongly negative correlations are not flagged
        perfect_corr = corrMatrix[col][corrMatrix[col] > thresh].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)

    # Keep the first feature of each group and flag the rest for removal
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat
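
To see what the lower-triangle step in the function above does, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration: `b` is a scaled copy of `a`, and `c` is only weakly related to either. It is not part of the gist itself, just a way to inspect which pairs exceed the threshold.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): b = 2 * a, c is only weakly related to a and b
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr()

# Zero out the diagonal and everything above it, so each pair appears once
lower = pd.DataFrame(np.tril(corr, k=-1), index=corr.index, columns=corr.columns)

# Collect (column, row) pairs whose signed correlation exceeds the threshold
thresh = 0.9
pairs = [
    (col, row)
    for col in lower.columns
    for row in lower.index
    if lower.loc[row, col] > thresh
]
print(pairs)
```

Only the `a`/`b` pair survives the mask and the threshold. Because the comparison uses the signed correlation, a strongly negative pair would not show up here either.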

@rathdebi commented Jan 9, 2019

Hi,

Nice to see this; it was useful.
This works only for numerical columns.
How can I do it for categorical columns?

@vincenzo-scotto001 commented Aug 22, 2019

@rathdebi

Here is a website that I often refer to for categorical columns and ways to convert them:
https://pbpython.com/categorical-encoding.html

Hope that helps.
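
As a rough sketch of one such conversion, one-hot encoding with `pd.get_dummies` turns a categorical column into numeric 0/1 columns that `df.corr()` can handle. The DataFrame and its column names (`size`, `price`) are invented for illustration, and Pearson correlation on dummy columns is only a crude measure of association, so treat the output as a starting point rather than a definitive answer.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "price": [10.0, 20.0, 30.0, 21.0, 9.0, 31.0],
})

# One-hot encode the categorical column into float 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["size"], dtype=float)

# Every column is now numeric, so a correlation matrix can be computed
corr = encoded.corr()
print(corr.round(2))
```

The encoded frame could then be passed to a threshold-based filter like the one in the gist. Note that dummy columns derived from the same original column are negatively correlated with each other by construction.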
