@DeborahBarbedo
Last active March 10, 2024 13:21
Python function designed to compute the Weight of Evidence (WoE) and Information Value (IV) for discrete variables in a given dataset.
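import numpy as np
import pandas as pd


def Woe_IV_Dis(df, features, target):
    # Keep only the columns needed for the calculation
    aux = features + [target]
    df = df[aux].copy()
    # Empty dataframe that accumulates one table per feature
    df_woe_iv = pd.DataFrame({}, index=[])
    for feature in features:
        # Columns 0 and 1 hold the share of each target class per level
        df_woe_iv_aux = pd.crosstab(df[feature], df[target], normalize='columns') \
            .assign(Distr=lambda dfx: dfx[0] / dfx[1]) \
            .assign(WoE=lambda i: np.log(i[0] / i[1])) \
            .assign(IV=lambda i: (i['WoE'] * (i[0] - i[1]))) \
            .assign(IV_total=lambda i: np.sum(i['IV']))
        # Record which feature each row belongs to
        df_woe_iv_aux.insert(0, 'variable', feature)
        df_woe_iv = pd.concat([df_woe_iv, df_woe_iv_aux])
    return df_woe_iv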
@DeborahBarbedo

Function Description:

This Python function, Woe_IV_Dis, is designed to compute the Weight of Evidence (WoE) and Information Value (IV) for discrete variables in a given dataset. WoE and IV are widely used in credit scoring and predictive modeling to assess the predictive power of features. The function takes as input a pandas DataFrame (df), a list of discrete features (features), and a target variable (target). It returns a DataFrame containing the calculated WoE and IV values for each feature.

Parameters:

  • df: pandas DataFrame containing the dataset.
  • features: List of discrete features for which WoE and IV are to be calculated.
  • target: Name of the binary target column, with its two classes coded as 0 and 1.
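Because the function reads the crosstab columns by the labels 0 and 1, a target stored as text has to be recoded first. The snippet below is a minimal sketch of that preparation step. The column names `region` and `status` and the label mapping are hypothetical, and treating "good" as 0 is an assumption based on the "good to bad" description of Distr.

```python
import pandas as pd

# Hypothetical raw data: the outcome is stored as text labels
raw = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "status": ["good", "bad", "bad", "good"],
})

# Map the labels to 0/1 so the crosstab columns are named 0 and 1
raw["target"] = raw["status"].map({"good": 0, "bad": 1})

features = ["region"]   # list of discrete feature names
target = "target"       # name of the 0/1 target column

# Same crosstab call the function uses internally
counts = pd.crosstab(raw["region"], raw[target], normalize="columns")
print(list(counts.columns))
```

With the target recoded this way, `features` and `target` can be passed to Woe_IV_Dis together with `raw`.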

Returns:

A pandas DataFrame containing the following columns:

  • variable: Name of the discrete variable.
  • 0: Share of all target = 0 observations that fall in each level of the variable.
  • 1: Share of all target = 1 observations that fall in each level of the variable.
  • Distr: Ratio of column 0 to column 1 for each level of the variable (good to bad).
  • WoE: Weight of Evidence for each level of the variable, the natural log of Distr.
  • IV: Information Value contribution of each level of the variable, WoE multiplied by the difference between column 0 and column 1.
  • IV_total: Total Information Value for the variable, summing IV across all levels.
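The columns above can be summarized as formulas. The notation is an assumption introduced here for illustration: n_{0,i} and n_{1,i} are the counts of target = 0 and target = 1 observations in level i, and N_0 and N_1 are the class totals.

```latex
p_{0,i} = \frac{n_{0,i}}{N_0}, \qquad
p_{1,i} = \frac{n_{1,i}}{N_1}

\mathrm{Distr}_i = \frac{p_{0,i}}{p_{1,i}}, \qquad
\mathrm{WoE}_i = \ln\!\left(\frac{p_{0,i}}{p_{1,i}}\right)

\mathrm{IV}_i = \left(p_{0,i} - p_{1,i}\right)\mathrm{WoE}_i, \qquad
\mathrm{IV\_total} = \sum_i \mathrm{IV}_i
```

A level with no observations in one of the classes has p_{0,i} = 0 or p_{1,i} = 0, so its WoE is infinite or undefined under these formulas.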

The function computes the WoE and IV for each level of the discrete variables by using cross-tabulation. It then aggregates the IV values to provide the total IV for each variable.
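As a rough illustration of the cross-tabulation step, the sketch below reproduces the calculation by hand on a toy dataset. The data and the column names `grade` and `target` are hypothetical, chosen so that the numbers are easy to check.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one discrete feature and a 0/1 target
df = pd.DataFrame({
    "grade":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "target": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of each target class that falls into each level (each column sums to 1)
dist = pd.crosstab(df["grade"], df["target"], normalize="columns")

# WoE per level: log of the class-0 share over the class-1 share
woe = np.log(dist[0] / dist[1])

# IV contribution per level, then the total for the feature
iv = woe * (dist[0] - dist[1])
iv_total = iv.sum()

print(dist)
print(woe)
print(iv_total)
```

Here level A holds 3 of the 4 target = 0 rows and 1 of the 4 target = 1 rows, so its WoE is ln(0.75 / 0.25) = ln 3, and level B mirrors it with WoE of -ln 3. Each level contributes 0.5 × ln 3 to the IV, giving a total of ln 3, roughly 1.10.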

This function is particularly useful for assessing the predictive power of discrete variables and identifying important features for classification models.

For more information and examples, please visit my blog post and GitHub repository.
