@lonly197
Created January 2, 2018 16:15
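Correlation threshold: removes features that are highly correlated with other features (that is, features whose values vary in nearly the same way as another feature's). Such features only provide redundant information.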
import numpy as np
import pandas as pd


def find_correlation(df, thresh=0.9):
    """
    Given a numeric pd.DataFrame, find highly correlated features
    and return a list of features to remove.

    params:
    - df : pd.DataFrame
    - thresh : correlation threshold; one feature of each pair with a
      correlation greater than this value will be removed
    """
    corr_matrix = df.corr()
    # Keep only the strict lower triangle so each pair is considered once
    # and the diagonal of self-correlations (always 1.0) is zeroed out.
    corr_matrix.loc[:, :] = np.tril(corr_matrix, k=-1)

    already_in = set()
    result = []
    for col in corr_matrix:
        # Later columns whose correlation with `col` exceeds the threshold.
        # Only positive correlation is tested; strongly negative pairs are ignored.
        correlated = corr_matrix[col][corr_matrix[col] > thresh].index.tolist()
        if correlated and col not in already_in:
            already_in.update(correlated)
            correlated.append(col)
            result.append(correlated)

    # Keep the first feature of each group and mark the rest for removal.
    select_nested = [group[1:] for group in result]
    select_flat = [name for group in select_nested for name in group]
    return select_flat
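The function above tests `corr > thresh`, so a pair with a strong negative correlation is never flagged. As a minimal sketch of one way to cover that case, the hypothetical variant below (the name `drop_candidates` and the toy data are assumptions, not part of the gist) compares absolute correlations and flags the later column of each pair. It is not a drop-in replacement: unlike `find_correlation`, it always keeps the earlier column of a correlated pair.

```python
import numpy as np
import pandas as pd


def drop_candidates(df, thresh=0.9):
    """Hypothetical variant: flag the later column of every pair with |corr| > thresh."""
    corr = df.corr().abs()
    # Boolean mask for the strict upper triangle, so each pair is seen once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)  # entries outside the mask become NaN
    # A column is flagged if any earlier column correlates with it above thresh.
    return [col for col in upper.columns if (upper[col] > thresh).any()]


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],       # nearly a linear function of "a"
    "c": [5, 1, 4, 2, 6, 3],         # only weakly related to the others
    "d": [-1, -2, -3, -4, -5, -7],   # strongly negatively correlated with "a"
})

to_drop = drop_candidates(df, thresh=0.9)
reduced = df.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Either function returns a plain list of column names, so the result can be passed straight to `df.drop(columns=...)` as shown.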