This practical project is based on the Lending Club dataset (dataset URL: https://github.com/H-Freax/lendingclub_analyse).

**This project is conducted in a Colab environment.**
# Introduction
This practical data analysis project is divided into two parts. The first part introduces a baseline built on LightGBM and three methods of adding derived variables, identifying four sets of derived variables that improve the model's performance. The second part focuses on data analysis with machine learning and deep learning methods, practicing the ensembling of machine learning models and the fusion of deep networks with machine learning methods.
# Environment Preparation
The project uses LightGBM as the baseline.
First, import the necessary packages.
```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
```
# Loading Data
```python
seed = 42  # fixed seed so every run uses the same data division
kf = KFold(n_splits=5, random_state=seed, shuffle=True)
df_train = pd.read_csv('train_final.csv')
df_test = pd.read_csv('test_final.csv')
```
# Basic Data Inspection/Analysis
Inspect the basic information of df_train.
```python
df_train.describe()
```
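Beyond the summary statistics, it can also help to check column dtypes and missing-value counts; these are standard pandas calls, added here as a suggestion rather than part of the original notebook:
```python
df_train.info()  # dtypes, non-null counts, memory usage
df_train.isnull().sum().sort_values(ascending=False).head(10)  # columns with the most missing values
```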
To set aside the one-hot encoded columns during the initial analysis, first list all column names with the following attribute:
```python
df_train.columns.values
```
Exclude the one-hot encoded variables for visualization and observe patterns:
```python
import matplotlib.pyplot as plt

onehotlabels = [...]  # list of one-hot encoded column names (elided in the original)
showdf_train = df_train.drop(columns=onehotlabels)
showdf_train.hist(bins=50, figsize=(20, 15))
plt.show()
```
![Image Description](https://img-blog.csdnimg.cn/20210429112733398.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU1NTQx,size_16,color_FFFFFF,t_70)
Since 'continuous_fico_range' and 'continuous_last_fico_range' each come as bounded, highly correlated 'low'/'high' pairs, we drop the 'high' columns before the next round of visualization.
```python
from pandas.plotting import scatter_matrix

scatter_matrix(showdf_train.drop(columns=['continuous_fico_range_high', 'continuous_last_fico_range_high']), figsize=(40, 35))
```
![Image Description](https://img-blog.csdnimg.cn/2021042911285280.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU1NTQx,size_16,color_FFFFFF,t_70)
# Baseline
## Data Preprocessing
```python
X_train = df_train.drop(columns=['loan_status']).values
Y_train = df_train['loan_status'].values.astype(int)
X_test = df_test.drop(columns=['loan_status']).values
Y_test = df_test['loan_status'].values.astype(int)
# Split the data into five folds
five_fold_data = []
for train_index, eval_index in kf.split(X_train):
    x_train, x_eval = X_train[train_index], X_train[eval_index]
    y_train, y_eval = Y_train[train_index], Y_train[eval_index]
    five_fold_data.append([(x_train, y_train), (x_eval, y_eval)])
```
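A quick sanity check (added here, not in the original notebook) confirms that each fold splits the training data roughly 4:1:
```python
for i, [(x_tr, y_tr), (x_ev, y_ev)] in enumerate(five_fold_data):
    print('fold {}: train {}, eval {}'.format(i, x_tr.shape[0], x_ev.shape[0]))
```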
```python
X_train.shape, Y_train.shape
```
## Algorithm
```python
def get_model(param):
    # Train one LightGBM booster per fold and return all five
    model_list = []
    for idx, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        print('{}-th model is training:'.format(idx))
        train_data = lgb.Dataset(x_train, label=y_train)
        validation_data = lgb.Dataset(x_eval, label=y_eval)
        bst = lgb.train(param, train_data, valid_sets=[validation_data])
        model_list.append(bst)
    return model_list
```
## Train
```python
param_base = {'num_leaves': 31, 'objective': 'binary', 'metric': 'binary', 'num_round': 1000}
param_fine_tuning = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 1000,
                     'learning_rate': 3e-3, 'feature_fraction': 0.6, 'bagging_fraction': 0.8}
```
```python
# base param train
param_base_model = get_model(param_base)
# param fine tuning
param_fine_tuning_model = get_model(param_fine_tuning)
```
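Both configurations train for a fixed number of rounds. A variant worth trying, though not part of the original setup, is early stopping on the validation fold; recent LightGBM versions expose this as a callback that could replace the `lgb.train` call inside `get_model`:
```python
# Hypothetical variant of the training call inside get_model:
# stop if the validation metric has not improved for 50 rounds
bst = lgb.train(param, train_data,
                valid_sets=[validation_data],
                callbacks=[lgb.early_stopping(stopping_rounds=50)])
```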
## Test
```python
def test_model(model_list):
    data = X_test
    five_fold_pred = np.zeros((5, len(X_test)))
    for i, bst in enumerate(model_list):
        ypred = bst.predict(data, num_iteration=bst.best_iteration)
        five_fold_pred[i] = ypred
    # Average the five folds' probabilities, then threshold at 0.5
    ypred_mean = (five_fold_pred.mean(axis=0) > 0.5).astype(int)
    return accuracy_score(Y_test, ypred_mean)
```
```python
base_score = test_model(param_base_model)
fine_tuning_score = test_model(param_fine_tuning_model)
print('base: {}, fine tuning: {}'.format(base_score, fine_tuning_score))
```
# Adding Derived Variables
## CatBoostEncoder
Install and import the required package.
```python
!pip install category_encoders
import category_encoders as ce  # package providing CatBoostEncoder
```
```python
# Create the encoder
target_enc = ce.CatBoostEncoder(cols='continuous_open_acc')
target_enc.fit(df_train['continuous_open_acc'], df_train['loan_status'])
# Transform the feature, rename it with a _cb suffix, and join it to the dataframe
train_CBE = df_train.join(target_enc.transform(df_train['continuous_open_acc']).add_suffix('_cb'))
test_CBE = df_test.join(target_enc.transform(df_test['continuous_open_acc']).add_suffix('_cb'))
```
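CatBoost encoding replaces a feature value with an ordered running mean of the target: each row is encoded using only the target values of rows seen before it, which limits target leakage compared with a plain target-mean encoding. A minimal toy sketch with made-up data:
```python
import pandas as pd
import category_encoders as ce

toy_X = pd.DataFrame({'grade': ['A', 'B', 'A', 'B', 'A']})
toy_y = pd.Series([1, 0, 1, 1, 0])
enc = ce.CatBoostEncoder(cols=['grade'])
print(enc.fit_transform(toy_X, toy_y))  # encoded values drift toward each category's target mean
```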
### Data Preprocessing
```python
X_train = train_CBE.drop(columns=['loan_status']).values
Y_train = train_CBE['loan_status'].values.astype(int)
X_test = test_CBE.drop(columns=['loan_status']).values
Y_test = test_CBE['loan_status'].values.astype(int)
# Split the data into five folds
five_fold_data = []
for train_index, eval_index in kf.split(X_train):
    x_train, x_eval = X_train[train_index], X_train[eval_index]
    y_train, y_eval = Y_train[train_index], Y_train[eval_index]
    five_fold_data.append([(x_train, y_train), (x_eval, y_eval)])
```
### Algorithm
```python
def get_model(param):
    model_list = []
    for idx, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        print('{}-th model is training:'.format(idx))
        train_data = lgb.Dataset(x_train, label=y_train)
        validation_data = lgb.Dataset(x_eval, label=y_eval)
        bst = lgb.train(param, train_data, valid_sets=[validation_data])
        model_list.append(bst)
    return model_list
```
### Train
```python
param_base = {'num_leaves': 31, 'objective': 'binary', 'metric': 'binary', 'num_round': 1000}
param_fine_tuning = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 1000,
                     'learning_rate': 3e-3, 'feature_fraction': 0.6, 'bagging_fraction': 0.8}
param_fine_tuningfinal = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 1200,
                          'learning_rate': 3e-3, 'feature_fraction': 0.6, 'bagging_fraction': 0.8}
```
```python
# base param train
param_base_model = get_model(param_base)
# param fine tuning
param_fine_tuning_model = get_model(param_fine_tuning)
param_fine_tuningfinal_model = get_model(param_fine_tuningfinal)
```
### Test
```python
def test_model(model_list):
    data = X_test
    five_fold_pred = np.zeros((5, len(X_test)))
    for i, bst in enumerate(model_list):
        ypred = bst.predict(data, num_iteration=bst.best_iteration)
        five_fold_pred[i] = ypred
    ypred_mean = (five_fold_pred.mean(axis=0) > 0.5).astype(int)
    return accuracy_score(Y_test, ypred_mean)
```
```python
base_score = test_model(param_base_model)
fine_tuning_score = test_model(param_fine_tuning_model)
fine_tuningfinal_score = test_model(param_fine_tuningfinal_model)
print('base: {}, fine tuning: {}, fine tuning final: {}'.format(base_score, fine_tuning_score, fine_tuningfinal_score))
```
base: 0.91568, fine tuning: 0.91774, fine tuning final: 0.91796
## Discretization
### Based on Clustering for 'continuous_open_acc'
First inspect the distinct values of the feature:
```python
df_train.groupby('continuous_open_acc')['continuous_open_acc'].unique()
```
KMeans ships with scikit-learn, so no separate install is needed:
```python
from sklearn.cluster import KMeans
```
```python
ddtrain = df_train['continuous_open_acc']
ddtest = df_test['continuous_open_acc']
# Cluster the training values into five groups
data_reshape1 = ddtrain.values.reshape((ddtrain.shape[0], 1))
model_kmeans = KMeans(n_clusters=5, random_state=0)
traina = model_kmeans.fit_predict(data_reshape1)
# Cluster the test values with a separately fitted KMeans
data_reshape2 = ddtest.values.reshape((ddtest.shape[0], 1))
model_kmeans = KMeans(n_clusters=5, random_state=0)
testa = model_kmeans.fit_predict(data_reshape2)
```
```python
train_KM = df_train.copy()
test_KM = df_test.copy()
train_KM['continuous_open_acc_km'] = traina
test_KM['continuous_open_acc_km'] = testa
```
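Note that the two KMeans models above are fitted independently, so cluster label k on the test set need not correspond to label k on the training set. A variant that keeps the labels consistent (a suggestion, not the notebook's method) fits once on the training values and only predicts on the test values:
```python
# Hypothetical consistent-labeling variant: fit on train, predict on test
km = KMeans(n_clusters=5, random_state=0)
train_KM['continuous_open_acc_km'] = km.fit_predict(ddtrain.values.reshape(-1, 1))
test_KM['continuous_open_acc_km'] = km.predict(ddtest.values.reshape(-1, 1))
```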
#### Data Preprocessing
```python
X_train = train_KM.drop(columns=['loan_status']).values
Y_train = train_KM['loan_status'].values.astype(int)
X_test = test_KM.drop(columns=['loan_status']).values
Y_test = test_KM['loan_status'].values.astype(int)
# Split the data into five folds
five_fold_data = []
for train_index, eval_index in kf.split(X_train):
    x_train, x_eval = X_train[train_index], X_train[eval_index]
    y_train, y_eval = Y_train[train_index], Y_train[eval_index]
    five_fold_data.append([(x_train, y_train), (x_eval, y_eval)])
```
#### Algorithm
```python
def get_model(param):
    model_list = []
    for idx, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        print('{}-th model is training:'.format(idx))
        train_data = lgb.Dataset(x_train, label=y_train)
        validation_data = lgb.Dataset(x_eval, label=y_eval)
        bst = lgb.train(param, train_data, valid_sets=[validation_data])
        model_list.append(bst)
    return model_list
```
#### Train
```python
param_base = {'num_leaves': 31, 'objective': 'binary', 'metric': 'binary', 'num_round': 1000}
param_fine_tuning = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 1000,
                     'learning_rate': 3e-3, 'feature_fraction': 0.6, 'bagging_fraction': 0.8}
param_fine_tuningfinal = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 800,
                          'learning_rate': 6e-3, 'feature_fraction': 0.8, 'bagging_fraction': 0.6, 'boosting': 'goss',
                          'tree_learner': 'feature', 'max_depth': 20, 'min_sum_hessian_in_leaf': 100}
```
```python
# base param train
param_base_model = get_model(param_base)
# param fine tuning
param_fine_tuning_model = get_model(param_fine_tuning)
param_fine_tuningfinal_model = get_model(param_fine_tuningfinal)
```
#### Test
```python
def test_model(model_list):
    data = X_test
    five_fold_pred = np.zeros((5, len(X_test)))
    for i, bst in enumerate(model_list):
        ypred = bst.predict(data, num_iteration=bst.best_iteration)
        five_fold_pred[i] = ypred
    ypred_mean = (five_fold_pred.mean(axis=0) > 0.5).astype(int)
    return accuracy_score(Y_test, ypred_mean)
```
```python
base_score = test_model(param_base_model)
fine_tuning_score = test_model(param_fine_tuning_model)
fine_tuningfinal_score = test_model(param_fine_tuningfinal_model)
print('base: {}, fine tuning: {}, fine tuning final: {}'.format(base_score, fine_tuning_score, fine_tuningfinal_score))
```
base: 0.91598, fine tuning: 0.91776, fine tuning final: 0.91874
### Using Exponential Interval Division for 'continuous_loan_amnt'
```python
train_ZQ = df_train.copy()
test_ZQ = df_test.copy()
trainbins = np.floor(np.log10(train_ZQ['continuous_loan_amnt']))  # take log10, then floor
testbins = np.floor(np.log10(test_ZQ['continuous_loan_amnt']))
train_ZQ['continuous_loan_amnt_km'] = trainbins
test_ZQ['continuous_loan_amnt_km'] = testbins
```
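Each amount is thus mapped to the order of magnitude of its value, so the bin widths grow exponentially. A quick illustration with made-up loan amounts:
```python
import numpy as np

amounts = np.array([500, 5_000, 15_000, 35_000])  # illustrative values only
print(np.floor(np.log10(amounts)))  # [2. 3. 4. 4.]
```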
#### Data Preprocessing
```python
X_train = train_ZQ.drop(columns=['loan_status']).values
Y_train = train_ZQ['loan_status'].values.astype(int)
X_test = test_ZQ.drop(columns=['loan_status']).values
Y_test = test_ZQ['loan_status'].values.astype(int)
# Split the data into five folds
five_fold_data = []
for train_index, eval_index in kf.split(X_train):
    x_train, x_eval = X_train[train_index], X_train[eval_index]
    y_train, y_eval = Y_train[train_index], Y_train[eval_index]
    five_fold_data.append([(x_train, y_train), (x_eval, y_eval)])
```
#### Algorithm
```python
def get_model(param):
    model_list = []
    for idx, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        print('{}-th model is training:'.format(idx))
        train_data = lgb.Dataset(x_train, label=y_train)
        validation_data = lgb.Dataset(x_eval, label=y_eval)
        bst = lgb.train(param, train_data, valid_sets=[validation_data])
        model_list.append(bst)
    return model_list
```
#### Train
```python
param_base = {'num_leaves': 31, 'objective': 'binary', 'metric': 'binary', 'num_round': 1000}
param_fine_tuning = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 1000,
                     'learning_rate': 3e-3, 'feature_fraction': 0.6, 'bagging_fraction': 0.8}
param_fine_tuningfinal = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 900,
                          'learning_rate': 7e-3, 'feature_fraction': 0.8, 'bagging_fraction': 0.6, 'max_depth': 20, 'min_sum_hessian_in_leaf': 100}
```
```python
# base param train
param_base_model = get_model(param_base)
# param fine tuning
param_fine_tuning_model = get_model(param_fine_tuning)
param_fine_tuningfinal_model = get_model(param_fine_tuningfinal)
```
#### Test
```python
def test_model(model_list):
    data = X_test
    five_fold_pred = np.zeros((5, len(X_test)))
    for i, bst in enumerate(model_list):
        ypred = bst.predict(data, num_iteration=bst.best_iteration)
        five_fold_pred[i] = ypred
    ypred_mean = (five_fold_pred.mean(axis=0) > 0.5).astype(int)
    return accuracy_score(Y_test, ypred_mean)
```
```python
base_score = test_model(param_base_model)
fine_tuning_score = test_model(param_fine_tuning_model)
fine_tuningfinal_score = test_model(param_fine_tuningfinal_model)
print('base: {}, fine tuning: {}, fine tuning final: {}'.format(base_score, fine_tuning_score, fine_tuningfinal_score))
```
base: 0.91586, fine tuning: 0.91764, fine tuning final: 0.91842
## Derived Variables Based on Business Logic Analysis
```python
train_YW = df_train.copy()
test_YW = df_test.copy()
# Ratio of the monthly installment to monthly income (+1 guards against zero income)
train_YW['installment_feat'] = train_YW['continuous_installment'] / ((train_YW['continuous_annual_inc'] + 1) / 12)
test_YW['installment_feat'] = test_YW['continuous_installment'] / ((test_YW['continuous_annual_inc'] + 1) / 12)
```
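For intuition: a borrower with an annual income of 60,000 has a monthly income of about 5,000, so a 500 installment yields installment_feat ≈ 0.1; the higher the ratio, the larger the share of income consumed by loan repayments. Illustrative numbers only:
```python
installment, annual_inc = 500.0, 60_000.0  # made-up example values
print(installment / ((annual_inc + 1) / 12))  # ~0.1
```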
### Data Preprocessing
```python
X_train = train_YW.drop(columns=['loan_status']).values
Y_train = train_YW['loan_status'].values.astype(int)
X_test = test_YW.drop(columns=['loan_status']).values
Y_test = test_YW['loan_status'].values.astype(int)
# Split the data into five folds
five_fold_data = []
for train_index, eval_index in kf.split(X_train):
    x_train, x_eval = X_train[train_index], X_train[eval_index]
    y_train, y_eval = Y_train[train_index], Y_train[eval_index]
    five_fold_data.append([(x_train, y_train), (x_eval, y_eval)])
```
### Algorithm
```python
def get_model(param):
    model_list = []
    for idx, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        print('{}-th model is training:'.format(idx))
        train_data = lgb.Dataset(x_train, label=y_train)
        validation_data = lgb.Dataset(x_eval, label=y_eval)
        bst = lgb.train(param, train_data, valid_sets=[validation_data])
        model_list.append(bst)
    return model_list
```
### Train
```python
param_base = {'num_leaves': 31, 'objective': 'binary', 'metric': 'binary', 'num_round': 1000}
param_fine_tuning = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 1000,
                     'learning_rate': 3e-3, 'feature_fraction': 0.6, 'bagging_fraction': 0.8}
param_fine_tuningfinal = {'num_thread': 8, 'num_leaves': 128, 'metric': 'binary', 'objective': 'binary', 'num_round': 900,
                          'learning_rate': 7e-3, 'feature_fraction': 0.8, 'bagging_fraction': 0.6, 'max_depth': 20, 'min_sum_hessian_in_leaf': 100}
```
```python
# base param train
param_base_model = get_model(param_base)
# param fine tuning
param_fine_tuning_model = get_model(param_fine_tuning)
param_fine_tuningfinal_model = get_model(param_fine_tuningfinal)
```
### Test
```python
def test_model(model_list):
    data = X_test
    five_fold_pred = np.zeros((5, len(X_test)))
    for i, bst in enumerate(model_list):
        ypred = bst.predict(data, num_iteration=bst.best_iteration)
        five_fold_pred[i] = ypred
    ypred_mean = (five_fold_pred.mean(axis=0) > 0.5).astype(int)
    return accuracy_score(Y_test, ypred_mean)
```
```python
base_score = test_model(param_base_model)
fine_tuning_score = test_model(param_fine_tuning_model)
fine_tuningfinal_score = test_model(param_fine_tuningfinal_model)
print('base: {}, fine tuning: {}, fine tuning final: {}'.format(base_score, fine_tuning_score, fine_tuningfinal_score))
```
base: 0.9162, fine tuning: 0.91758, fine tuning final: 0.91844