LOAN REPAYMENT PREDICTION
USING MACHINE LEARNING
IDS 400 - Programming for Data Science
University of Illinois at Chicago
Spring 2021 Group Project Report
Date: May 6, 2021
Table of Contents
1. Introduction
2. Project Description and Objectives
3. Business Value
4. Project Analysis
5. Packages and Functions Used
6. Data Preparation
7. Exploratory Data Analysis
8. Data Transformation & Processing for Machine Learning
9. Machine Learning Models
   Chosen Models
   LogisticRegression from sklearn.linear_model
   DecisionTreeClassifier from sklearn.tree
   KNeighborsClassifier from sklearn.neighbors
   XGBClassifier from xgboost
   Tuning the Parameters for our xGBoost Model
   scale_pos_weight
   max_depth
   learning_rate
10. Conclusion
1. Introduction
According to marutitech.com, until recently, hedge funds were the primary users of Artificial Intelligence and Machine Learning in finance, but the last few years have seen the applications of Machine Learning spread to various other areas, including banks, fintech companies, regulators, and insurance firms, to name a few. Loans are the core business of many financial institutions and financial service providers. In the past, companies had to rely on a limited amount of data and a set of policies and processes to assess a customer's financial position and their intention to repay before issuing a loan, which was very time consuming. However, tremendous improvements in computational power and increased research and development of machine learning algorithms have helped make predictions quicker and more accurate, and it is not surprising that financial institutions and financial service providers are currently among the top spenders in Big Data and Data Analytics. According to Soma Metrics, between 2014 and 2017, mortgage industry spending on big data increased from $2.6 billion to $3.2 billion. With the growing interest in using Machine Learning to predict the outcome and return from loans in the financial industry, we thought it would be interesting to see whether we could use machine learning algorithms to accurately predict whether a loan would default.
2. Project Description and Objectives
The dataset we have decided to work with comes from Lending Club, a digital marketplace for peer-to-peer lending that connects borrowers looking for loans with lenders interested in making an investment. Lending Club replaces the high cost and complexity of bank lending with a faster, smarter way to borrow and invest, offering borrowers lower interest rates, better terms, and larger amounts by utilizing data, analytics, and technology-enabled models. The data in this dataset was originally scraped from the thousands of loans made through the Lending Club platform between 9/1/14 and 1/1/15. The raw dataset includes 81023 rows and 146 columns of information, pertaining either to the loan itself or to the financial history of the borrower, captured at the time of the loan application.
The main goal of the project is to use Python to help visualize the data, to get a better
understanding of the dataset, and to create models using Machine Learning algorithms in order
to predict whether a loan will be Fully Paid or Charged Off.
3. Business Value
Financial service providers can use machine learning to reduce risk by predicting whether a loan will be paid or defaulted on. This will also enhance revenues and lower operational costs, saving lenders both time and money by automating the end-to-end loan application process, from loan approval to loan monitoring. Productivity will increase, since a manual review will only be needed after a loan application passes through the initial screening process. Thus, financial institutions and financial service providers will be able to process more loans more quickly while also reducing their risk, allowing them to invest their money and time in other areas of their business.
It will also improve the user experience when applying for loans, giving applicants a smoother and quicker application process in which they can be denied or approved a loan in a matter of minutes, rather than having to go through a lengthy manual process.
4. Project Analysis
To complete this project, we did the following:
1. Prepared and cleaned the data by removing any unnecessary variables and dropping any variables that could potentially cause data leakage.
2. Performed Exploratory Data Analysis on the data.
3. Pre-processed the data again to prepare it for our Machine Learning algorithms.
4. Built models to help predict which loans would default and which would not.
5. Packages and Functions Used:
In this project, we used a variety of packages to perform data manipulation, data visualization, and data modelling using machine learning.
For data manipulation and the initial data exploration we used the following:
● Pandas - to manipulate the data, e.g. creating dummy variable columns
● Numpy
For data visualization, to create the graphs and tables used to better understand the dataset, we used the following:
● Seaborn
● Matplotlib.pyplot
To create models using machine learning and to evaluate their performance we used the following packages:
● Sklearn - for machine learning algorithms such as LogisticRegression, KNeighborsClassifier, and DecisionTreeClassifier to create and train our models, train_test_split to create training and test datasets, and metrics such as roc_curve, auc, confusion_matrix, classification_report, and accuracy_score to help determine which models performed best.
● XGBoost - to create a model using Extreme Gradient Boosting.
● Matplotlib.pyplot - to plot ROC/AUC curves to help compare and evaluate the models' performance.
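For reference, a minimal set of imports covering the packages listed above might look like the following sketch (the aliases and grouping are conventional choices, not taken verbatim from our code):

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (roc_curve, auc, confusion_matrix,
                             classification_report, accuracy_score)
from xgboost import XGBClassifier
```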
6. Data Preparation
First, to prepare the data for analysis and modeling, we make sure that the “loan_status” column only contains the values “Fully Paid” or “Charged Off”, since in this analysis we are only looking at these two types of loans; this removes 1 row of data.
We also decided to keep only selected columns/variables (shown below) from the dataset, as there are 146 columns in total and we want to minimize the time it takes to run our code and models. The variables we decided to keep are ones that are also used in assessing a person's credit score, and the ones we think are most likely to affect whether or not a loan will be defaulted on. We also removed any variables that might lead to data leakage, which occurs when information from outside the training dataset is used to create the model. Leakage can allow the model to learn or know something that it otherwise would not know, which invalidates the estimated performance of the model being constructed and may also lead to overfitting.
Variable - Description
loan_amnt - The listed amount of the loan applied for by the borrower. If at some point in time the credit department reduces the loan amount, it will be reflected in this value.
int_rate - Interest rate on the loan.
grade - LC assigned loan grade.
sub_grade - LC assigned loan subgrade.
emp_length - Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
home_ownership - The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
annual_inc - The self-reported annual income provided by the borrower during registration.
verification_status - Indicates whether the borrower's income was verified by LC, not verified, or whether the income source was verified.
loan_status - Current status of the loan.
purpose - A category provided by the borrower for the loan request.
dti - A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
total_rev_hi_lim - Total revolving high credit/credit limit.
acc_open_past_24mths - Number of trades opened in the past 24 months.
bc_util - Ratio of total current balance to high credit/credit limit for all bankcard accounts.
chargeoff_within_12_mths - Number of charge-offs within 12 months.
mort_acc - Number of mortgage accounts.
mths_since_recent_inq - Months since most recent inquiry.
mths_since_last_delinq - The number of months since the borrower's last delinquency.
percent_bc_gt_75 - Percentage of all bankcard accounts > 75% of limit.
pub_rec_bankruptcies - Number of public record bankruptcies.
tot_hi_cred_lim - Total high credit/credit limit.
total_bc_limit - Total bankcard high credit/credit limit.
Once the unwanted columns were filtered out, we checked the dataset for any missing values, as those values need to be replaced.
We found that there were 5 columns with missing values. For emp_length (employment length) we replaced missing/NA values with 0. For bc_util (bankcard utilization rate) and percent_bc_gt_75 (percentage of bankcard accounts > 75% of limit), we replaced missing/NA values with the column medians so they would not skew the columns' average values. For mths_since_recent_inq (months since most recent inquiry) and mths_since_last_delinq (months since last delinquency), we replaced missing/NA values with a very large number, since filling them with the median or with 0 would wrongly imply that an inquiry or delinquency had recently occurred.
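A sketch of these preparation steps is shown below. The file name and the sentinel value 999 are illustrative choices rather than values taken from our code, and emp_length is assumed to already be numeric.

```python
import pandas as pd

# Illustrative file name; the actual source file is not named in this report.
df = pd.read_csv("lending_club_loans.csv", low_memory=False)

# Keep only Fully Paid and Charged Off loans (this removes 1 row in our data).
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])]

# Keep only the selected columns listed in the table above.
keep_cols = [
    "loan_amnt", "int_rate", "grade", "sub_grade", "emp_length", "home_ownership",
    "annual_inc", "verification_status", "loan_status", "purpose", "dti",
    "total_rev_hi_lim", "acc_open_past_24mths", "bc_util", "chargeoff_within_12_mths",
    "mort_acc", "mths_since_recent_inq", "mths_since_last_delinq",
    "percent_bc_gt_75", "pub_rec_bankruptcies", "tot_hi_cred_lim", "total_bc_limit",
]
df = df[keep_cols]

# Impute missing values as described above: 0 for employment length, the column
# median for the two bankcard utilization columns, and a large sentinel value for
# the "months since" columns (999 is an illustrative choice).
df["emp_length"] = df["emp_length"].fillna(0)
for col in ["bc_util", "percent_bc_gt_75"]:
    df[col] = df[col].fillna(df[col].median())
for col in ["mths_since_recent_inq", "mths_since_last_delinq"]:
    df[col] = df[col].fillna(999)
```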
7. Exploratory Data Analysis
1. What Is the Proportion of Loan Defaults (‘Charged Off’ vs ‘Fully Paid’ Loans)?
There are 11827 Charged Off loans and 69195 Fully Paid loans. Percentage-wise, 85.4% of the loans in this dataset have been fully paid, while 14.6% have been charged off.
2. How Many Loans Are There in Each Grade?
There are 20402 loans in grade A, 23399 loans in grade B, 22577 loans in grade C, 10802 loans in grade D, 3191 loans in grade F, and 91 loans in grade G. From the bar chart plotted above, we can see that, in general, the higher the loan grade, the higher the number of loans issued, with the highest number of loans issued being in grade B and the lowest in grade G. This is what we would expect: the better a borrower's credit history, the better their loan grade, and those with better credit histories are generally more likely to qualify for loans than those with poorer credit histories, who can only qualify for loans of lower grades. This does not mean that lower grades do not receive loans at all, but rather that they appear to receive them less often.
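The counts and percentages above can be reproduced along the following lines, assuming the prepared DataFrame is named df:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Counts and percentages of Fully Paid vs Charged Off loans.
print(df["loan_status"].value_counts())
print(df["loan_status"].value_counts(normalize=True) * 100)

# Bar chart of the number of loans per grade.
sns.countplot(x="grade", data=df, order=sorted(df["grade"].unique()))
plt.title("Number of Loans per Grade")
plt.show()
```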
3. How Does the Default Rate Vary with Loan Grade?
The default rate is lowest for loans graded “A”: the higher the grade of the loan, the higher the rate of the loan being “Fully Paid”. Conversely, the “Charged Off” rate is highest for loans graded “G”; as the grade of the loan decreases, so does the rate of it being “Fully Paid”. Charged-off loans make up almost half of the grade G loans. This is what we would expect, as the lower the loan grade, the riskier the loan and the more likely it is to be defaulted on.
4. What Purpose Are People Borrowing Money For?
Most borrowers are taking out loans primarily for debt consolidation, followed by credit card, other purposes, home improvements, and major purchases.
5. Do Default Rates Vary by Purpose?
If we exclude the three least common purposes (house, renewable_energy, and wedding), small_business appears to be the purpose with the highest percentage of defaulted loans. This does not come as a surprise, as many small businesses do not succeed and go bankrupt, which is probably why the default rate is higher than for other purposes.
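A sketch of how the default rate per purpose can be computed, again assuming the prepared DataFrame is named df:

```python
# Fraction of Charged Off loans within each purpose category, highest first.
charged_off = (df["loan_status"] == "Charged Off")
default_rate_by_purpose = charged_off.groupby(df["purpose"]).mean().sort_values(ascending=False)
print(default_rate_by_purpose)
```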
6. Heatmap of Correlations Between All Variables
We plot a heatmap of the correlations between all the variables to check whether any variables are highly correlated with each other, as too many highly correlated variables can result in multicollinearity issues. Multicollinearity makes it hard to interpret the coefficients, reduces the power of a model to identify independent variables that are statistically significant, and can also create redundant information, skewing the results in a regression model. However, since most of the variables appear to have a low correlation with each other, we leave the dataset as it is.
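The heatmap itself is a short call to seaborn; the figure size and colour map below are illustrative choices:

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # older pandas versions drop non-numeric columns automatically
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Between Variables")
plt.show()
```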
8. Data Transformation & Processing for Machine Learning
Converting the Categorical Variables into Dummy Variables
We start by converting all the columns with categorical variables into dummy/indicator variables using the get_dummies function from the pandas library, as many machine learning algorithms cannot process categorical variables directly. This adds a column for each level of each categorical variable and assigns 0s and 1s to indicate whether that level applies to a given loan.
We then move the columns “loan_status_Fully Paid” and “loan_status_Charged Off” to the front of the dataset before we split it, to make these columns easier to select when we create the training and test datasets.
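A sketch of this step, assuming the prepared DataFrame is named df and the dummified copy is stored as df_ml:

```python
import pandas as pd

# get_dummies turns every categorical (object-dtype) column into 0/1 indicator columns.
df_ml = pd.get_dummies(df)

# Move the two target columns to the front so they are easy to select later.
target_cols = ["loan_status_Fully Paid", "loan_status_Charged Off"]
df_ml = df_ml[target_cols + [c for c in df_ml.columns if c not in target_cols]]
```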
Splitting the Data into Training and Testing Sets
We create an array from the values in the dataset, select all columns excluding “loan_status_Fully Paid” and “loan_status_Charged Off”, and store the result in the variable “X”. We select the values in the column “loan_status_Fully Paid” and store them in the variable “Y”, so that a value of 1 represents “Fully Paid” cases and a value of 0 represents “Charged Off” cases in our models.
We use the train_test_split function from sklearn to split the dataset into 70% training data and 30% testing/validation data, as we feel this split ratio helps keep our models from becoming overfitted or underfitted on the training data.
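A sketch of the split described above; the random_state is an illustrative choice for reproducibility, not a value taken from our code:

```python
from sklearn.model_selection import train_test_split

# X: every column except the two loan_status dummy columns.
X = df_ml.drop(columns=["loan_status_Fully Paid", "loan_status_Charged Off"]).values
# Y: 1 = Fully Paid, 0 = Charged Off.
Y = df_ml["loan_status_Fully Paid"].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
```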
9. Machine Learning Models
Chosen Models:
- LogisticRegression from sklearn.linear_model
- DecisionTreeClassifier from sklearn.tree
- KNeighborsClassifier from sklearn.neighbors
- XGBClassifier from xgboost
Since our goal is to correctly predict both Fully Paid and Charged Off loans, we use Accuracy and Area Under the Curve (AUC) as our main evaluation metrics to determine which model is the best. Accuracy can be computed using the following formula, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
We chose AUC as our second evaluation metric because it is well suited to classification with two classes. AUC measures the entire two-dimensional area underneath the ROC curve (receiver operating characteristic curve). An ROC curve plots the True Positive Rate against the False Positive Rate at different classification thresholds, and AUC provides an aggregate measure of performance across all possible thresholds. It represents the degree of separability: the AUC value tells us how well our models can distinguish between classes, where the higher the AUC, the better the model is at classifying loans that will be Fully Paid and those that will be Charged Off. Hence, the best model will have a higher AUC (at least above the 0.5 threshold) and a higher Accuracy.
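For illustration, the four candidate models can be fitted and scored along the following lines. The hyperparameters are library defaults (plus max_iter for convergence), and computing AUC from predicted probabilities is one common approach rather than a transcript of our exact code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc, classification_report

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "XGBClassifier": XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    Y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive (Fully Paid) class
    fpr, tpr, _ = roc_curve(Y_test, Y_prob)
    print(name)
    print("  Accuracy:", round(accuracy_score(Y_test, Y_pred), 3))
    print("  AUC:", round(auc(fpr, tpr), 5))
    print(classification_report(Y_test, Y_pred))
```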
1. LogisticRegression from sklearn.linear_model
We chose LogisticRegression as one of our models because it is a commonly used statistical method for predicting binary classes.
Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.85
AUC: 0.57932
2. DecisionTreeClassifier from sklearn.tree
Decision Trees are a non-parametric supervised learning method used for classification and regression. A tree structure is constructed that breaks the dataset down into smaller subsets and predicts the value of a target variable by learning simple decision rules inferred from the data features: decision nodes partition the data, leaf nodes give the prediction, and a prediction is made by traversing simple IF-AND-THEN logic down the nodes of the tree.
Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.75
AUC: 0.52590
3. KNeighborsClassifier from sklearn.neighbors
Nearest neighbors is also a non-parametric supervised learning method. We chose it because it performs well in classification problems where the decision boundary is very irregular.
Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.83
AUC: 0.53547
4. XGBClassifier from xgboost
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance, and it is a very popular algorithm choice in machine learning. It makes use of regularization parameters that help prevent overfitting. According to towardsdatascience.com, since its introduction this algorithm has been credited not only with winning numerous Kaggle competitions but also with being the driving force under the hood of several cutting-edge industry applications.
Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.85
AUC: 0.66676
Overall, we can see that our xGBoost model performs the best, with not only the best Accuracy value but also the best AUC value. Hence, we decide to continue with this model and experiment with different parameter values to help improve and fine-tune it.
Tuning the Parameters for our xGBoost Model
1. scale_pos_weight
Adjusting scale_pos_weight helps us balance the dataset, as our dataset is very unevenly balanced between “Fully Paid” and “Charged Off” loans. The scale_pos_weight parameter represents the ratio of the number of negative-class examples to the number of positive-class examples. To get this ratio, we count the number of negative-class examples (“Charged Off” loans) and the number of positive-class examples (“Fully Paid” loans), and then divide the former by the latter:
scale_pos_weight = count(negative examples) / count(positive examples)
which gives us a value of 0.1709.
Since our dataset is extremely unbalanced, we also try a more conservative number by taking the square root of the weight (using sqrt from the math package), which gives us a value of 0.4134.
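A sketch of how these weights can be computed and passed to the model:

```python
import math
from xgboost import XGBClassifier

# Ratio of negative-class ("Charged Off", Y == 0) to positive-class ("Fully Paid", Y == 1) examples.
weight = (Y == 0).sum() / (Y == 1).sum()   # roughly 0.1709 on this dataset
conservative_weight = math.sqrt(weight)    # roughly 0.4134

xgb_weighted = XGBClassifier(scale_pos_weight=conservative_weight)
xgb_weighted.fit(X_train, Y_train)
```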
Outputted results:
Looking at the results, it seems that using the default value of scale_pos_weight = 1 actually gives us better accuracy, and accuracy decreases as the dataset becomes more balanced. However, the AUC value increases as the dataset becomes more balanced.
The recall of the minority class is very low, at 0.03, for the model with scale_pos_weight = 1. This shows that the model is biased towards the majority class (“Fully Paid” loans), so it is not the best model. We decide to use the square root of the weight (0.4134) for scale_pos_weight, as we feel this gives us a better balance with a higher overall accuracy and AUC value.
2. max_depth
For the max_depth parameter, we experiment with the values 3, 5, 8, and 10 (6 is the default). We chose to experiment with this parameter to reduce the chance of our model overfitting, and to see whether reducing the depth of the trees would improve our model's accuracy and AUC.
Outputted results:
As you can see, the AUC decreases as the max_depth value increases. This is likely due to the model becoming overfitted on the training data, leading to decreased performance on the testing data. The accuracy, however, decreases from 3 to 5 but then starts to increase from 5 to 10. Although the model with max_depth = 10 has slightly better accuracy, we feel the increase is too small compared to the larger decrease in AUC from 0.68985 to 0.64964. Therefore, the optimal max_depth value is 3.
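The sweep over max_depth described above might look like the following sketch, with scale_pos_weight held at the value chosen in the previous step:

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc

for depth in [3, 5, 8, 10]:
    model = XGBClassifier(scale_pos_weight=0.4134, max_depth=depth)
    model.fit(X_train, Y_train)
    fpr, tpr, _ = roc_curve(Y_test, model.predict_proba(X_test)[:, 1])
    print(f"max_depth={depth}: accuracy={accuracy_score(Y_test, model.predict(X_test)):.4f}, "
          f"AUC={auc(fpr, tpr):.5f}")
```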
3. learning_rate
One problem with gradient boosted decision trees is that they are often quick to learn and can overfit the training data. One way to address this is to adjust the learning_rate parameter. We experiment with the values 0.3, 0.2, and 0.1, as these are commonly used values, with 0.3 being the default.
Outputted results:
The accuracy of the model seems to increase as learning_rate decreases. The model using learning_rate = 0.1 gives us the best accuracy of 0.834188; however, the best AUC value is achieved using learning_rate = 0.2. Since the increase in AUC from 0.69085 to 0.69105 is only marginal, we decide to choose learning_rate = 0.1 as the optimal parameter value.
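Putting the tuned parameters together, the final model might be fitted as in the following sketch:

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc

final_model = XGBClassifier(scale_pos_weight=0.4134,  # square root of the class ratio
                            max_depth=3,
                            learning_rate=0.1)
final_model.fit(X_train, Y_train)

fpr, tpr, _ = roc_curve(Y_test, final_model.predict_proba(X_test)[:, 1])
print("Accuracy:", accuracy_score(Y_test, final_model.predict(X_test)))
print("AUC:", auc(fpr, tpr))
```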
10. Conclusion
Being able to predict whether loans will be Charged Off or Fully Paid is paramount for financial institutions and financial service providers in order to maximize their potential return on the loans they provide, minimize losses on Charged Off loans, make the application and underwriting process more efficient, and improve the user experience when applying for a loan. Newer machine learning algorithms, such as the regularizing gradient boosting framework provided by the xGBoost library, have been shown to perform significantly better than many other machine learning algorithms.
Original Model:
Final Optimized Model:
When comparing our optimal model to the original model, although accuracy has decreased slightly, the AUC has improved and the recall for the negative class has increased, which shows that the improved model is less biased towards the majority (positive) class. Since our goal is to create a model that is optimized to predict both “Fully Paid” and “Charged Off” loans correctly, we feel that the new model is better tuned to predict the outcomes of future loans than the original model.
Looking back at the project, one improvement we could have made would be to try SMOTE (Synthetic Minority Oversampling Technique) rather than relying on scale_pos_weight or naive resampling: naive oversampling can lead to overfitting on the training data because it simply duplicates minority-class rows, and undersampling by removing rows can lose information about the “Charged Off” minority class in our dataset. With a technique like SMOTE, new minority-class rows are synthesized using the K-Nearest Neighbours algorithm rather than duplicated, which might lead to a better model, especially considering that datasets like ours are usually highly imbalanced due to the nature of the data (defaulting on a loan does not happen very often).
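For illustration, the SMOTE idea could be tried with the imbalanced-learn package, which we did not use in this project; the sketch below resamples only the training split:

```python
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

smote = SMOTE(random_state=42)
X_train_res, Y_train_res = smote.fit_resample(X_train, Y_train)

# Fit the tuned xGBoost model on the resampled (balanced) training data;
# scale_pos_weight is no longer needed once the classes are balanced.
model_smote = XGBClassifier(max_depth=3, learning_rate=0.1)
model_smote.fit(X_train_res, Y_train_res)
```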
Another area we could improve on is experimenting with different feature selection techniques. In this project we removed any variables that we thought were either irrelevant or would cause data leakage, and kept only the variables that we thought would be best for our models, based in part on previous experience experimenting with this dataset in R. While we did plot a heatmap of the correlations between the variables to ensure they were not too highly correlated, we could also try using the f_regression() function from the scikit-learn library in our feature selection, selecting the top k most relevant features with the SelectKBest class.
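A sketch of that feature-selection idea is shown below; k = 20 is an arbitrary example value, and f_classif is the more usual scorer for a binary target, though f_regression also works with a 0/1 outcome:

```python
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=20)  # keep the 20 highest-scoring features
X_train_sel = selector.fit_transform(X_train, Y_train)
X_test_sel = selector.transform(X_test)
```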