@erykml
Created February 11, 2019 22:11
from sklearn.base import clone

def drop_col_feat_imp(model, X_train, y_train, random_state = 42):
    # clone the model to have the exact same specification as the one initially trained
    model_clone = clone(model)
    # set random_state for comparability
    model_clone.random_state = random_state
    # training and scoring the benchmark model
    model_clone.fit(X_train, y_train)
    benchmark_score = model_clone.score(X_train, y_train)
    # list for storing feature importances
    importances = []
    # iterating over all columns and storing feature importance (difference between benchmark and new model)
    for col in X_train.columns:
        model_clone = clone(model)
        model_clone.random_state = random_state
        model_clone.fit(X_train.drop(col, axis = 1), y_train)
        drop_col_score = model_clone.score(X_train.drop(col, axis = 1), y_train)
        importances.append(benchmark_score - drop_col_score)
    importances_df = imp_df(X_train.columns, importances)
    return importances_df

@dvrcapture

imp_df(X_train.columns, importances)?
imp_df is not defined!

@crystalhaohua0408

cannot find imp_df

@erykml
Author

erykml commented Jun 1, 2019

This gist was not meant to be a standalone function; it was created to show the logic described in the article. imp_df can be found in the notebook hosting the entire code used for the article: https://github.com/erykml/medium_articles/blob/master/feature_importance.ipynb

# function for creating a feature importance dataframe

import pandas as pd

def imp_df(column_names, importances):
    df = (pd.DataFrame({'feature': column_names,
                        'feature_importance': importances})
          .sort_values('feature_importance', ascending = False)
          .reset_index(drop = True))
    return df
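
For anyone who wants to run the two pieces together, here is a minimal self-contained sketch. The synthetic dataset, the column names f0 to f3, and the RandomForestRegressor settings are assumptions made purely for illustration; they do not come from the article or the notebook.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def imp_df(column_names, importances):
    # build a feature importance dataframe, most important feature first
    return (pd.DataFrame({'feature': column_names,
                          'feature_importance': importances})
            .sort_values('feature_importance', ascending=False)
            .reset_index(drop=True))


def drop_col_feat_imp(model, X_train, y_train, random_state=42):
    # benchmark: clone the model, fix the seed, fit and score on all columns
    model_clone = clone(model)
    model_clone.random_state = random_state
    model_clone.fit(X_train, y_train)
    benchmark_score = model_clone.score(X_train, y_train)

    importances = []
    for col in X_train.columns:
        # refit from scratch without one column and record the drop in score
        X_dropped = X_train.drop(col, axis=1)
        model_clone = clone(model)
        model_clone.random_state = random_state
        model_clone.fit(X_dropped, y_train)
        importances.append(benchmark_score - model_clone.score(X_dropped, y_train))

    return imp_df(X_train.columns, importances)


# synthetic regression data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)
X_train = pd.DataFrame(X, columns=['f0', 'f1', 'f2', 'f3'])

rf = RandomForestRegressor(n_estimators=20, random_state=42)
importances_df = drop_col_feat_imp(rf, X_train, y)
print(importances_df)
```

The model is refit once per column, so the cost grows linearly with the number of features.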

@erussell92

I was wondering why you are fitting and scoring on X_train, y_train. Will this not give a score of 1 for each column?
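
One possible way to address this concern is to fit on the training set but score on held-out data, so that a model which memorises the training set does not mask differences between columns. The sketch below is a variation and not what the gist does: the function name drop_col_feat_imp_holdout, the X_valid and y_valid parameters, and the synthetic data are all hypothetical.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def drop_col_feat_imp_holdout(model, X_train, y_train, X_valid, y_valid,
                              random_state=42):
    # hypothetical variant: fit on the training set, score on the validation set
    model_clone = clone(model)
    model_clone.random_state = random_state
    model_clone.fit(X_train, y_train)
    benchmark_score = model_clone.score(X_valid, y_valid)

    importances = []
    for col in X_train.columns:
        # drop the same column from both splits before refitting and scoring
        model_clone = clone(model)
        model_clone.random_state = random_state
        model_clone.fit(X_train.drop(col, axis=1), y_train)
        drop_col_score = model_clone.score(X_valid.drop(col, axis=1), y_valid)
        importances.append(benchmark_score - drop_col_score)

    return (pd.DataFrame({'feature': X_train.columns,
                          'feature_importance': importances})
            .sort_values('feature_importance', ascending=False)
            .reset_index(drop=True))


# synthetic regression data (assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
X = pd.DataFrame(X, columns=['f0', 'f1', 'f2', 'f3'])
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=20, random_state=42)
holdout_importances = drop_col_feat_imp_holdout(rf, X_tr, y_tr, X_va, y_va)
print(holdout_importances)
```

With a validation score, an importance can come out negative, which would suggest that dropping the column did not hurt, or even helped, on unseen data.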
