Created January 2, 2018 16:15
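Correlation threshold: removes features that are highly correlated with another feature (that is, features whose values vary in nearly the same way as another feature's). Such features carry redundant information.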
import pandas as pd
import numpy as np


def find_correlation(df, thresh=0.9):
    """
    Given a numeric pd.DataFrame, this will find highly correlated features,
    and return a list of features to remove
    params:
    - df : pd.DataFrame
    - thresh : correlation threshold, will remove one of pairs of features with
               a correlation greater than this value
    """
    corrMatrix = df.corr()
    # Keep only the strict lower triangle so each pair is counted once
    # and the diagonal (self-correlation of 1.0) is zeroed out.
    corrMatrix.loc[:, :] = np.tril(corrMatrix, k=-1)

    already_in = set()
    result = []

    for col in corrMatrix:
        # Later columns whose correlation with `col` exceeds the threshold.
        # Note: only positive correlations are compared, not absolute values.
        perfect_corr = corrMatrix[col][corrMatrix[col] > thresh].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)

    # Keep the first feature of each correlated group and drop the rest.
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat
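
A minimal usage sketch, assuming the returned list is meant to be passed to `DataFrame.drop`. The toy data and the names `to_drop` and `reduced` are hypothetical and not part of the gist. The function body is repeated in condensed form only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def find_correlation(df, thresh=0.9):
    # Condensed copy of the gist's logic, repeated so this snippet is standalone.
    corr = df.corr()
    corr.loc[:, :] = np.tril(corr, k=-1)  # strict lower triangle only
    already_in, result = set(), []
    for col in corr:
        partners = corr[col][corr[col] > thresh].index.tolist()
        if partners and col not in already_in:
            already_in.update(partners)
            partners.append(col)
            result.append(partners)
    # Keep the first name of each group and flag the rest for removal.
    return [name for group in result for name in group[1:]]


# Hypothetical toy data: "b" is an exact multiple of "a",
# while "c" is only weakly (and negatively) correlated with both.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

to_drop = find_correlation(df, thresh=0.9)  # features judged redundant
reduced = df.drop(columns=to_drop)          # frame without those features
print(to_drop, list(reduced.columns))
```

With this data the pair (`a`, `b`) exceeds the threshold, so one of the two is flagged and the other is kept alongside `c`. Because the comparison is `> thresh` on the raw coefficient, a strongly negative correlation would not be flagged.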