import numpy as np
import pandas as pd


# Calculate information value (IV) of a feature against a binary target,
# where target == 1 marks a "bad" observation.
def calc_iv(df, feature, target, pr=0):
    lst = []
    # One row per distinct (non-null) value: total count and count of bads.
    for val in df[feature].dropna().unique():
        lst.append([feature, val,
                    df[df[feature] == val].count()[feature],
                    df[(df[feature] == val) & (df[target] == 1)].count()[feature]])
    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Bad'])
    # Keep only values with at least one bad, so 'Distribution Bad' is never zero.
    data = data[data['Bad'] > 0].copy()
    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    # Weight of evidence per value; IV is the sum over all values.
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    data['IV'] = (data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])).sum()
    data = data.sort_values(by=['Variable', 'Value'], ascending=True)
    if pr == 1:
        print(data)
    return data['IV'].values[0]
How do you call the function, or otherwise apply the code above?
An example data frame:
   | Age | Performance | Work experience in years | Promotion
 1 |  52 |           3 |                        9 |         1
 2 |  32 |           9 |                        6 |         1
 3 |  51 |           9 |                       10 |         0
 4 |  18 |           2 |                       20 |         0
 5 |  60 |           5 |                        5 |         1
 6 |  59 |           4 |                       17 |         0
 7 |  55 |           8 |                        8 |         1
 8 |  56 |          10 |                        1 |         0
 9 |  59 |           2 |                       17 |         1
10 |  59 |           5 |                       11 |         0
In this example the target (dependent) variable is Promotion, and the data frame is named 'df'.
To find the information value for Age, call the function like this:
calc_iv(df, 'Age', 'Promotion', pr=0)
This returns the information value for the feature Age. Pass pr=1 to also print the per-value table.
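For anyone who wants to try the call end to end, here is a minimal sketch built on the ten-row example above. The function body is a condensed copy of the gist's logic so the snippet runs on its own. The `Age group` column and the cut at 55 are my own assumptions for illustration, not part of the gist. The reason for grouping: on this tiny sample most raw Age values occur once and contain no "good" rows, so the log term diverges and the raw IV comes out as infinity.

```python
import numpy as np
import pandas as pd


def calc_iv(df, feature, target, pr=0):
    # Condensed copy of the gist's logic, repeated so this snippet is self-contained.
    rows = []
    for val in df[feature].dropna().unique():
        in_bin = df[feature] == val
        rows.append([feature, val, in_bin.sum(), (in_bin & (df[target] == 1)).sum()])
    data = pd.DataFrame(rows, columns=['Variable', 'Value', 'All', 'Bad'])
    data = data[data['Bad'] > 0].copy()
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    data['IV'] = (data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])).sum()
    if pr == 1:
        print(data)
    return data['IV'].values[0]


# The example data frame from the comment above.
df = pd.DataFrame({
    'Age': [52, 32, 51, 18, 60, 59, 55, 56, 59, 59],
    'Performance': [3, 9, 9, 2, 5, 4, 8, 10, 2, 5],
    'Work experience in years': [9, 6, 10, 20, 5, 17, 8, 1, 17, 11],
    'Promotion': [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Raw Age: several values have no rows with Promotion == 0, so log(0) appears.
with np.errstate(divide='ignore'):
    iv_raw = calc_iv(df, 'Age', 'Promotion')

# Illustrative assumption: split Age into two groups at 55 before computing IV.
df['Age group'] = np.where(df['Age'] <= 55, '<=55', '>55')
iv_grouped = calc_iv(df, 'Age group', 'Promotion', pr=1)

print(iv_raw, iv_grouped)
```

With the two-group split each group contains both outcomes, and the IV is finite (about 0.16 on this sample). Where to cut is a modelling choice; the split at 55 is only there to make the example well behaved.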
You are welcome to check my revision.
I think it is a bit clearer and closer to the textbook explanations.
If I have around 100 parameters (features), how can I run this function for all of them at once? Is that possible?
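One possible approach, sketched under a few assumptions: loop over the column names and collect the results in a pandas Series. The function body is again a condensed copy of the gist's logic so the snippet is self-contained. The median split into 'low'/'high' and the names `features`, `binned`, and `iv_by_feature` are hypothetical choices of mine, used only so that every group in the ten-row sample contains both outcomes. This is a plain sequential loop rather than parallel execution, which should be fine for around 100 columns.

```python
import numpy as np
import pandas as pd


def calc_iv(df, feature, target, pr=0):
    # Condensed copy of the gist's logic, repeated so this snippet is self-contained.
    rows = []
    for val in df[feature].dropna().unique():
        in_bin = df[feature] == val
        rows.append([feature, val, in_bin.sum(), (in_bin & (df[target] == 1)).sum()])
    data = pd.DataFrame(rows, columns=['Variable', 'Value', 'All', 'Bad'])
    data = data[data['Bad'] > 0].copy()
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    data['IV'] = (data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])).sum()
    if pr == 1:
        print(data)
    return data['IV'].values[0]


df = pd.DataFrame({
    'Age': [52, 32, 51, 18, 60, 59, 55, 56, 59, 59],
    'Performance': [3, 9, 9, 2, 5, 4, 8, 10, 2, 5],
    'Work experience in years': [9, 6, 10, 20, 5, 17, 8, 1, 17, 11],
    'Promotion': [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
target = 'Promotion'

# Every column except the target is treated as a feature.
features = [c for c in df.columns if c != target]

# Illustrative assumption: split each numeric feature at its median.
binned = pd.DataFrame({c: np.where(df[c] <= df[c].median(), 'low', 'high') for c in features})
binned[target] = df[target]

# One IV per feature, sorted from most to least informative.
iv_by_feature = pd.Series(
    {c: calc_iv(binned, c, target) for c in features}
).sort_values(ascending=False)

print(iv_by_feature)
```

On the example data this ranks 'Work experience in years' first, then 'Age', then 'Performance'. If your features are already categorical or binned, you can skip the median split and pass the original data frame to the loop directly.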