Last active March 10, 2024 13:21
Python function designed to compute the Weight of Evidence (WoE) and Information Value (IV) for discrete variables in a given dataset.
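import numpy as np
import pandas as pd

def Woe_IV_Dis(df, features, target):
    # Keep only the selected features and the target (expected to be coded 0/1)
    df = df[features + [target]].copy()
    results = []
    for feature in features:
        # Column-normalized crosstab: share of each target class falling in each level
        df_woe_iv_aux = (
            pd.crosstab(df[feature], df[target], normalize='columns')
            .assign(Distr=lambda dfx: dfx[0] / dfx[1])
            .assign(WoE=lambda i: np.log(i[0] / i[1]))
            .assign(IV=lambda i: i['WoE'] * (i[0] - i[1]))
            .assign(IV_total=lambda i: np.sum(i['IV']))
        )
        # Record which feature each row (level) belongs to
        df_woe_iv_aux.insert(0, 'variable', feature)
        results.append(df_woe_iv_aux)
    # One row per feature level; empty DataFrame if no features were given
    return pd.concat(results) if results else pd.DataFrame()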
Function Description:
This Python function, Woe_IV_Dis, computes the Weight of Evidence (WoE) and Information Value (IV) for discrete variables in a dataset. WoE and IV are widely used in credit scoring and predictive modeling to assess the predictive power of features. The function takes as input a pandas DataFrame (df), a list of discrete features (features), and the name of a binary target variable coded as 0 and 1 (target). It returns a DataFrame containing the calculated WoE and IV values for each level of each feature.

Parameters:
df: pandas DataFrame containing the dataset.
features: List of discrete features for which WoE and IV are to be calculated.
target: Name of the target variable.

Returns:
A pandas DataFrame containing the following columns:
variable: Name of the discrete variable.
0: Share of the observations with target = 0 that fall in each level of the variable.
1: Share of the observations with target = 1 that fall in each level of the variable.
Distr: Ratio of column 0 to column 1 for each level of the variable (proportion of good to bad).
WoE: Weight of Evidence for each level of the variable, the natural logarithm of Distr.
IV: Information Value contribution of each level, WoE multiplied by the difference between columns 0 and 1.
IV_total: Total Information Value for the variable, summing IV across all levels.

The function computes the WoE and IV for each level of the discrete variables by using cross-tabulation. It then aggregates the IV values to provide the total IV for each variable.
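The cross-tabulation steps described above can be traced by hand on a small example. The sketch below is only an illustration: the toy dataset and the column names housing and default are hypothetical, and the target is assumed to be coded 0 (good) and 1 (bad).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one discrete feature and a binary target (0 = good, 1 = bad)
toy = pd.DataFrame({
    "housing": ["own", "own", "own", "own", "rent", "rent", "rent", "rent"],
    "default": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of each target class that falls into each level (each column sums to 1)
dist = pd.crosstab(toy["housing"], toy["default"], normalize="columns")

# WoE per level: log of the class-0 share over the class-1 share
woe = np.log(dist[0] / dist[1])

# IV contribution per level, then the total for the feature
iv = woe * (dist[0] - dist[1])
iv_total = iv.sum()

print(dist.assign(WoE=woe, IV=iv))
print("IV_total:", iv_total)
```

Here "own" holds 75% of the goods and 25% of the bads, so its WoE is ln(3); "rent" is the mirror image with WoE of -ln(3). Each level contributes 0.5 * ln(3) to the IV, so the total is ln(3), about 1.10. Note that a level with no observations in one of the classes would produce an infinite WoE, which this sketch does not handle.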
This function is particularly useful for assessing the predictive power of discrete variables and identifying important features for classification models.
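One way to use the total IV for feature screening is to rank the candidate variables by it. The following is a minimal sketch under stated assumptions: total_iv is a hypothetical helper written for this example (it is not part of the gist), the data is made up, and the target is again assumed to be coded 0/1.

```python
import numpy as np
import pandas as pd

def total_iv(df, feature, target):
    """Hypothetical helper: total Information Value of one discrete feature."""
    # Share of each target class falling in each level of the feature
    dist = pd.crosstab(df[feature], df[target], normalize="columns")
    woe = np.log(dist[0] / dist[1])
    return float((woe * (dist[0] - dist[1])).sum())

# Made-up data: "housing" separates the classes, "region" does not
data = pd.DataFrame({
    "housing": ["own", "own", "own", "own", "rent", "rent", "rent", "rent"],
    "region":  ["north", "north", "south", "north", "south", "north", "south", "south"],
    "default": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Rank the features from most to least informative
ranking = pd.Series(
    {f: total_iv(data, f, "default") for f in ["housing", "region"]}
).sort_values(ascending=False)

print(ranking)
```

In this toy data, region has identical class shares in both levels, so its IV is zero, while housing ranks first.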
For more information and examples, please visit my blog post and GitHub repository.