@htnminh
Last active July 31, 2023 15:37
Project 2 topic: Loan demand prediction

Project 2 Topic: Machine learning approaches for loan demand prediction

DATA CONFIDENTIALITY AGREEMENT

DUE TO VIETTEL MILITARY INDUSTRY AND TELECOMS GROUP REGULATIONS, THE DATA COLLECTED FOR THIS PROJECT MUST BE KEPT STRICTLY PRIVATE AND CONFIDENTIAL. HOWEVER, PARTICIPANTS OF THIS PROJECT, INCLUDING BUT NOT LIMITED TO THE STUDENTS WORKING ON THE PROJECT, MENTORS, AND INSTRUCTORS, ARE ALLOWED TO USE THE DATA FOR THE PURPOSES OF THIS PROJECT ONLY AND MUST REFRAIN FROM SHARING ANY OF THE DATA WITH THIRD PARTIES WITHOUT PRIOR CONSENT FROM THE COMPANY.

Abstract

This project addresses a topical problem in finance: forecasting a user's loan demand from their personal information and historical financial data. The dataset is preprocessed with both common and more advanced techniques. The machine learning models applied are linear regression, support vector machines, neural networks, random forests, and a random forest variant.

Quick introduction

This project uses the historical data of Viettel Money customers to predict loan demand, framed as a binary classification problem. It applies techniques for categorical data encoding, missing data handling, data preprocessing, data exploration, dimensionality reduction, and balanced sampling. The machine learning approaches include linear regression, support vector machines, random forests, neural networks, and a new random-forest-based model designed to learn effectively from imbalanced data.
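The balanced sampling step mentioned above can be illustrated with a minimal sketch. The project does not say which sampling method it used, so random undersampling of the majority class is an assumption chosen for simplicity, and the function name `undersample_majority` is hypothetical.

```python
import random
from collections import defaultdict


def undersample_majority(samples, labels, seed=0):
    """Randomly drop samples so that every class has as many samples as the rarest one.

    Assumption: plain random undersampling, used here only for illustration.
    The sampling method actually used in the project is not stated.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    # Size of the smallest class decides how many samples each class keeps.
    smallest = min(len(indices) for indices in indices_by_class.values())

    kept = []
    for indices in indices_by_class.values():
        kept.extend(rng.sample(indices, smallest))
    kept.sort()  # preserve the original ordering of the kept samples

    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 8 customers without loan demand (0), 2 with loan demand (1).
toy_samples = list(range(10))
toy_labels = [0] * 8 + [1] * 2
balanced_samples, balanced_labels = undersample_majority(toy_samples, toy_labels)
```

Undersampling discards majority-class data, which is one reason a model that handles imbalance internally, such as the random-forest-based model described above, can be preferable.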

Dataset overview

The dataset has 40,000 samples with 146 features each (excluding the customer identifier "id"). It covers two time frames of equal size: two months prior (named "n_2") and one month prior (named "n_1"), with 20,000 samples each. Both time frames contain the same 20,000 distinct customers. Because the task is prediction, all of the data in n_2 is used to predict the labels of the corresponding customers in n_1.
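As a rough sketch of this setup, the snippet below pairs each customer's n_2 features with the same customer's n_1 label. The data itself is confidential, so the row layout is an assumption: the column names `period`, `label`, and `balance` are hypothetical, and only `id`, `n_2`, and `n_1` come from the description above.

```python
def build_supervised_pairs(rows, id_key="id", period_key="period", label_key="label"):
    """Return (features from n_2, label from n_1) pairs, one per customer.

    Assumption: each row is a dict holding the customer id, a time-frame
    marker ("n_2" or "n_1"), a label, and the feature columns.
    """
    features_n2 = {}
    labels_n1 = {}
    non_feature_keys = (id_key, period_key, label_key)

    for row in rows:
        customer = row[id_key]
        if row[period_key] == "n_2":
            # Keep only feature columns from the earlier time frame.
            features_n2[customer] = {
                key: value for key, value in row.items() if key not in non_feature_keys
            }
        elif row[period_key] == "n_1":
            # Keep only the label from the later time frame.
            labels_n1[customer] = row[label_key]

    # Use only customers present in both time frames.
    shared = sorted(features_n2.keys() & labels_n1.keys())
    return [(features_n2[customer], labels_n1[customer]) for customer in shared]


# Toy rows with made-up values, two customers in both time frames.
toy_rows = [
    {"id": 1, "period": "n_2", "balance": 10.0, "label": 0},
    {"id": 1, "period": "n_1", "balance": 12.0, "label": 1},
    {"id": 2, "period": "n_2", "balance": 5.0, "label": 0},
    {"id": 2, "period": "n_1", "balance": 4.0, "label": 0},
]
pairs = build_supervised_pairs(toy_rows)
```

Taking features only from n_2 and labels only from n_1 keeps information from the target month out of the model inputs.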
