htnminh/project2topic.md

## project2topic.md

      
Project 2 Topic: Machine learning approaches for loan demand prediction

DATA CONFIDENTIALITY AGREEMENT

DUE TO VIETTEL MILITARY INDUSTRY AND TELECOMS GROUP REGULATIONS, THE
DATA COLLECTED FOR THIS PROJECT MUST BE KEPT STRICTLY PRIVATE AND
CONFIDENTIAL. HOWEVER, PARTICIPANTS OF THIS PROJECT, INCLUDING BUT NOT
LIMITED TO THE STUDENTS WORKING ON THE PROJECT, MENTORS, AND INSTRUCTORS,
ARE ALLOWED TO USE THE DATA FOR THE PURPOSES OF THIS PROJECT ONLY AND
MUST REFRAIN FROM SHARING ANY OF THE DATA WITH THIRD PARTIES WITHOUT
PRIOR CONSENT FROM THE COMPANY.

Abstract

This project addresses a topical problem in finance: forecasting a user’s
loan demand from their personal information and historical financial
data. The dataset is preprocessed with both common and more advanced
techniques. The machine learning models applied to it are linear
regression, support vector machines, neural networks, random forests, and
a random-forest variant.

Quick introduction

This project uses historical data on Viettel Money customers to predict
loan demand, framed as a binary classification problem. It applies
techniques for categorical data encoding, missing data handling, data
preprocessing, data exploration, dimensionality reduction, and balanced
sampling, together with several machine learning approaches: linear
regression, support vector machines, random forests, neural networks,
and a new random-forest-based model that can learn effectively from
imbalanced data.
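
As a rough illustration of how such steps can be chained, here is a minimal scikit-learn sketch. It is not the project's actual pipeline: the column names and toy data are hypothetical, `class_weight="balanced"` is swapped in as a simple stand-in for balanced sampling, and a plain `RandomForestClassifier` is used because the new random-forest-based model is not described in this section.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the real dataset is confidential.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "balance": rng.normal(100.0, 30.0, n),
    "n_transactions": rng.integers(0, 50, n).astype(float),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
X.loc[X.index % 10 == 0, "balance"] = np.nan  # simulate missing values

# Imbalanced binary labels: 10% positives.
y = np.zeros(n, dtype=int)
y[:20] = 1

# Missing data handling and scaling for numeric columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical encoding; sparse_threshold=0.0 forces a dense matrix for PCA.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["balance", "n_transactions"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ],
    sparse_threshold=0.0,
)

model = Pipeline([
    ("preprocess", preprocess),
    ("reduce", PCA(n_components=3)),  # dimensionality reduction
    ("forest", RandomForestClassifier(
        n_estimators=50,
        class_weight="balanced",  # stand-in for balanced sampling
        random_state=0,
    )),
])

model.fit(X, y)
proba = model.predict_proba(X)[:, 1]  # estimated probability of loan demand
```

Keeping every step inside one `Pipeline` means the imputation, encoding, and PCA are fitted only on the data passed to `fit`, which avoids leaking information from held-out data during evaluation.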

Dataset overview

The dataset has 40,000 samples with 146 features each (the customer
identifier "id" excluded). It is split into two equal time frames: the
month two months back (named "n_2") and the previous month (named "n_1").
Both time frames contain the same 20,000 distinct customers. Because the
task is prediction, all of the data in n_2 is used to predict the labels
of the corresponding customers in n_1.
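
The pairing of n_2 features with n_1 labels can be sketched with pandas as follows. This is only an illustration under stated assumptions: apart from "id", the column names (`balance`, `n_transactions`, `label`) and the tiny example frames are hypothetical, since the real columns are confidential.

```python
import pandas as pd

# Hypothetical stand-ins for the two time frames.
# n_2: data from two months back, used as model input.
n_2 = pd.DataFrame({
    "id": [3, 1, 2],
    "balance": [75.0, 120.0, 40.0],
    "n_transactions": [4, 12, 7],
})

# n_1: data from the previous month; only its label is used as the target.
n_1 = pd.DataFrame({
    "id": [1, 2, 3],
    "label": [0, 1, 1],
})

# Join on the customer identifier so each n_2 row gets that customer's
# n_1 label. validate="one_to_one" raises if an id is duplicated.
paired = n_2.merge(n_1[["id", "label"]], on="id", how="inner",
                   validate="one_to_one")

X = paired.drop(columns=["id", "label"])  # features from n_2 only
y = paired["label"]                       # labels from n_1
```

Joining on "id" rather than relying on row order guards against the two time frames being sorted differently.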