LOAN REPAYMENT PREDICTION USING MACHINE LEARNING
IDS 400 - Programming for Data Science
University of Illinois at Chicago
Spring 2021 Group Project Report
Date: May 6, 2021
Table of Contents
1. Introduction
2. Project Description and Objectives
3. Business Value
4. Project Analysis
5. Packages and Functions Used
6. Data Preparation
7. Exploratory Data Analysis
8. Data Transformation & Processing for Machine Learning
9. Machine Learning Models
   - Chosen Models
   - LogisticRegression from sklearn.linear_model
   - DecisionTreeClassifier from sklearn.tree
   - KNeighborsClassifier from sklearn.neighbors
   - XGBClassifier from xgboost
   - Tuning the Parameters for our xGBoost Model
     - scale_pos_weight
     - max_depth
     - learning_rate
10. Conclusion
1. Introduction

According to marutitech.com, until recently hedge funds were the primary users of Artificial Intelligence and Machine Learning in finance, but in the last few years the applications of Machine Learning have spread to many other areas, including banks, fintech companies, regulators, and insurance firms, to name a few. Loans are the core business of many financial institutions and financial service providers. In the past, companies had to rely on a limited amount of data and a set of policies and processes to assess a customer's financial position and their intention to repay before issuing a loan, which was very time consuming. However, tremendous improvements in computational power, together with increased research and development of machine learning algorithms, have made predictions quicker and more accurate, and it is not surprising that financial institutions and financial service providers are currently among the top spenders in Big Data and Data Analytics. According to Soma Metrics, between 2014 and 2017, mortgage industry spending on big data increased from $2.6 billion to $3.2 billion. With the growing interest in using Machine Learning to predict the outcome and return of loans in the financial industry, we thought it would be interesting to see whether we could use machine learning algorithms to accurately predict whether a loan will default.

2. Project Description and Objectives

The dataset we decided to work with comes from Lending Club, a digital marketplace for peer-to-peer lending that connects borrowers looking for loans with lenders interested in making an investment. Lending Club replaces the high cost and complexity of bank lending with a faster, smarter way to borrow and invest, offering borrowers lower interest rates and better terms and amounts by using data, analytics, and technology-enabled models. The data in this dataset was originally scraped from the thousands of loans made through the Lending Club platform between 9/1/14 and 1/1/15. The raw dataset includes 81,023 rows and 146 columns of information, either pertaining to the loan itself or to the financial history of the borrower captured at the time of the loan application.
The main goal of the project is to use Python to visualize the data, to get a better understanding of the dataset, and to build models using Machine Learning algorithms to predict whether a loan will be Fully Paid or Charged Off.

3. Business Value

Financial service providers can use machine learning predictions of whether a loan will be repaid or defaulted on to reduce risk. It can also enhance revenue and lower operational costs, saving lenders both time and money by automating the end-to-end loan application process, from loan approval to loan monitoring. This increases productivity, since a manual review is only needed after a loan application passes the initial screening. Financial institutions and financial service providers can therefore process more loans more quickly while also reducing their risk, freeing up money and time to invest in other areas of their business. It also improves the user experience when applying for loans, giving applicants a smoother and quicker application process in which they can be approved or denied in a matter of minutes rather than going through a lengthy manual process.

4. Project Analysis

To complete this project, we did the following:
1. Prepared and cleaned the data by removing unnecessary variables and dropping any variables that could cause data leakage.
2. Performed Exploratory Data Analysis on the data.
3. Pre-processed the data again to prepare it for our Machine Learning algorithms.
4. Built models to predict which loans would default and which would not.
5. Packages and Functions Used

In this project, we used a variety of packages to perform data manipulation, data visualization, and modelling using machine learning.

For data manipulation and the initial data exploration we used:
- pandas: to manipulate the data, e.g. creating dummy variable columns.
- numpy

For data visualization, to create the graphs and tables used to better understand the dataset, we used:
- seaborn
- matplotlib.pyplot

To create models using Machine Learning and to evaluate their performance we used:
- sklearn: machine learning algorithms such as LogisticRegression, KNeighborsClassifier, and DecisionTreeClassifier to create and train our models; train_test_split to create training and test datasets; and metrics such as roc_curve, auc, confusion_matrix, classification_report, and accuracy_score to determine which models performed the best.
- xgboost: to create a model using Extreme Gradient Boosting (XGBClassifier).
- matplotlib.pyplot: to plot ROC/AUC curves to compare and evaluate model performance.
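The listing below is a minimal sketch of how these packages might be imported; the aliases and the exact set of submodules are assumptions for illustration rather than a copy of our original code.

    # Illustrative import block for the packages listed above
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import (roc_curve, auc, confusion_matrix,
                                 classification_report, accuracy_score)
    from xgboost import XGBClassifier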
6. Data Preparation

Firstly, to prepare the data for analysis and modeling, we make sure that the "loan_status" column only contains the values "Fully Paid" or "Charged Off", since this analysis only looks at these two types of loans; this removes 1 row of data. We also decided to keep only selected columns/variables (shown below), since there are 146 columns in the dataset and we want to minimize the time it takes to run our code and models. The variables we kept are ones that are also used in assessing a credit score, and the ones we think are most likely to affect whether or not a loan will be defaulted on. We also removed any variables that could lead to data leakage, which occurs when information from outside the training dataset is used to create the model. Data leakage can allow the model to learn something it otherwise would not know, invalidating the estimated performance of the model and potentially leading to overfitting. A sketch of this filtering step appears after the table.

Variable: Description
loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time the credit department reduces the loan amount, it will be reflected in this value.
int_rate: Interest rate on the loan.
grade: LC assigned loan grade.
sub_grade: LC assigned loan subgrade.
emp_length: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
home_ownership: The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
annual_inc: The self-reported annual income provided by the borrower during registration.
verification_status: Indicates whether the borrower's income was verified by LC.
loan_status: Current status of the loan.
purpose: A category provided by the borrower for the loan request.
dti: A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
total_rev_hi_lim: Total revolving high credit/credit limit.
acc_open_past_24mths: Number of trades opened in the past 24 months.
bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.
chargeoff_within_12_mths: Number of charge-offs within 12 months.
mort_acc: Number of mortgage accounts.
mths_since_recent_inq: Months since most recent inquiry.
mths_since_last_delinq: The number of months since the borrower's last delinquency.
percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.
pub_rec_bankruptcies: Number of public record bankruptcies.
tot_hi_cred_lim: Total high credit/credit limit.
total_bc_limit: Total bankcard high credit/credit limit.
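A minimal sketch of this filtering step, assuming the raw data has been loaded into a DataFrame named df (the variable names here are illustrative):

    # Keep only loans that are either Fully Paid or Charged Off
    df = df[df['loan_status'].isin(['Fully Paid', 'Charged Off'])]

    # Keep only the selected columns from the table above
    keep_cols = ['loan_amnt', 'int_rate', 'grade', 'sub_grade', 'emp_length',
                 'home_ownership', 'annual_inc', 'verification_status',
                 'loan_status', 'purpose', 'dti', 'total_rev_hi_lim',
                 'acc_open_past_24mths', 'bc_util', 'chargeoff_within_12_mths',
                 'mort_acc', 'mths_since_recent_inq', 'mths_since_last_delinq',
                 'percent_bc_gt_75', 'pub_rec_bankruptcies',
                 'tot_hi_cred_lim', 'total_bc_limit']
    df = df[keep_cols]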
Once the unwanted columns were filtered out, we checked the dataset for missing values, as those values need to be replaced. We found five columns with missing values. For emp_length (employment length) we replaced missing/NA values with 0. For bc_util (bankcard utilization rate) and percent_bc_gt_75 (percentage of bankcard accounts over 75% of their limit), we replaced missing/NA values with the column medians so they would not skew the columns' average values. For mths_since_recent_inq (months since most recent inquiry) and mths_since_last_delinq (months since last delinquency) we replaced missing/NA values with a very large number, since replacing them with the median or with 0 would wrongly imply that a recent inquiry or delinquency had occurred.
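A sketch of the imputation described above; the "very large number" (999) is an illustrative placeholder, not necessarily the value we used.

    # Missing employment length treated as 0 years
    df['emp_length'] = df['emp_length'].fillna(0)

    # Replace missing utilization figures with the column median
    for col in ['bc_util', 'percent_bc_gt_75']:
        df[col] = df[col].fillna(df[col].median())

    # "Months since" fields: a large value means the event is far in the past
    # (or was never observed); 999 is an illustrative placeholder
    for col in ['mths_since_recent_inq', 'mths_since_last_delinq']:
        df[col] = df[col].fillna(999)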
7. Exploratory Data Analysis

1. What is The Proportion of Loan Defaults ("Charged Off" vs "Fully Paid" Loans)?

There are 11827 Charged Off loans and 69195 Fully Paid loans. Percentage-wise, 85.4% of the loans in this dataset have been fully paid while 14.6% have been charged off.

2. How Many Loans Are There in Each Grade?

There are 20402 loans in grade A, 23399 loans in grade B, 22577 loans in grade C, 10802 loans in grade D, 3191 loans in grade F, and 91 loans in grade G. From the bar chart plotted above, we can see that, in general, the higher the loan grade, the higher the number of loans issued, with the most loans issued in grade B and the fewest in grade G. This is what we would expect: the better a borrower's credit history, the better their loan grade, and borrowers with better credit histories are more likely to qualify for loans than those with weaker credit histories, who can only qualify for loans of lower grades. This does not mean that lower grades receive no loans at all, only that they receive them less often.
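The counts and percentages above can be reproduced with simple value_counts calls; the sketch below assumes the prepared DataFrame df from the previous section.

    # Proportion of Fully Paid vs Charged Off loans
    print(df['loan_status'].value_counts())
    print(df['loan_status'].value_counts(normalize=True) * 100)

    # Number of loans in each grade, plotted as a bar chart
    print(df['grade'].value_counts().sort_index())
    sns.countplot(x='grade', data=df, order=sorted(df['grade'].unique()))
    plt.show()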
3. How does default rate vary with loan grade?
The default rate is lowest for loans graded "A": the higher the grade of the loan, the higher the rate of loans being "Fully Paid". Conversely, the "Charged Off" rate is highest for loans graded "G"; as the grade decreases, so does the rate of loans being "Fully Paid". For grade G, charged-off loans make up almost half of all loans. This is what we would expect, as the lower the loan grade, the riskier the loan and the more likely it is to be defaulted on.

4. What Purpose Are People Borrowing Money For?
Most borrowers are taking out loans primarily for debt consolidation, followed by credit card, other purposes, home improvements, and major purchases.

5. Do Default Rates Vary Per Purpose?

If we exclude the three least common purposes (house, renewable_energy, and wedding), small_business appears to be the purpose with the highest percentage of defaulted loans. This does not come as a surprise, as many small businesses do not succeed and go bankrupt, which is probably why their default rate is higher than for other purposes.
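The default rates by grade and by purpose discussed in questions 3 and 5 can be computed with a grouped mean on a charged-off indicator; the helper column below is illustrative.

    # 1 if the loan was charged off, 0 if fully paid
    df['charged_off'] = (df['loan_status'] == 'Charged Off').astype(int)

    # Default rate per grade and per purpose
    print(df.groupby('grade')['charged_off'].mean().sort_index())
    print(df.groupby('purpose')['charged_off'].mean().sort_values(ascending=False))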
6. Heatmap of Correlations Between All Variables

We plotted a heatmap of the correlations between all the variables to check whether any variables are highly correlated with each other, as too many highly correlated variables can cause multicollinearity issues. Multicollinearity makes the coefficients hard to interpret, reduces the power of a model to identify independent variables that are statistically significant, and can create redundant information that skews the results of a regression model. However, since most of the variables have low correlations with each other, we leave the dataset as it is.

8. Data Transformation & Processing for Machine Learning

Converting the Categorical Variables into Dummy Variables

We start by converting all the columns with categorical variables into dummy/indicator variables using the get_dummies function from the pandas library, as many machine learning algorithms cannot process categorical variables directly.
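A minimal sketch of the correlation heatmap from Section 7 and of the dummy-variable conversion just described; the colormap choice and the default get_dummies arguments are assumptions.

    # Correlation heatmap over the numeric columns (Section 7)
    sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm')
    plt.show()

    # One 0/1 dummy column per category level (Section 8)
    df_dummies = pd.get_dummies(df)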
get_dummies adds a column for each level of a categorical variable and fills it with 0s and 1s to indicate whether or not that level applies to the row. We then move the "loan_status_Fully Paid" and "loan_status_Charged Off" columns to the front of the DataFrame before splitting the dataset, to make these columns easier to select when creating the training and test sets.

Splitting the Data into Training and Testing Sets

We create an array of the values in each column, select all the columns excluding "loan_status_Fully Paid" and "loan_status_Charged Off", and store them in the variable "X". We select the values in the "loan_status_Fully Paid" column and store them in the variable "Y", so that a value of 1 represents "Fully Paid" cases and a value of 0 represents "Charged Off" cases in our models.
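A sketch of the feature/target selection and of the 70/30 split described in the next paragraph; the lowercase names and the random_state value are illustrative assumptions.

    # Target: 1 = Fully Paid, 0 = Charged Off
    y = df_dummies['loan_status_Fully Paid'].values
    X = df_dummies.drop(columns=['loan_status_Fully Paid',
                                 'loan_status_Charged Off']).values

    # 70% training data, 30% testing/validation data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)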
We use the train_test_split function from sklearn to split the dataset into 70% training data and 30% testing/validation data, as we feel this is a good split ratio to keep our models from becoming overfitted or underfitted on the training data.

9. Machine Learning Models

Chosen Models:
- LogisticRegression from sklearn.linear_model
- DecisionTreeClassifier from sklearn.tree
- KNeighborsClassifier from sklearn.neighbors
- XGBClassifier from xgboost

Since our goal is to correctly predict both Fully Paid and Charged Off loans, we use Accuracy and Area Under the Curve (AUC) as our main evaluation metrics to determine which model is best. Accuracy can be computed using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives. We chose AUC as our second evaluation metric because it is well suited to classification with two classes. AUC measures the entire two-dimensional area underneath the ROC curve (receiver operating characteristic curve). An ROC curve plots the True Positive Rate against the False Positive Rate at different classification thresholds, and the AUC provides an aggregate measure of performance across all possible thresholds. It represents the degree of separability: the AUC value tells us how well our models can distinguish between classes, so the higher the AUC value, the better our model is at classifying loans that will be Fully Paid and those that will be Charged Off. Hence, the best model will have a higher AUC (at least above the 0.5 threshold) and a higher Accuracy.
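Each of the four models was trained and evaluated in the same way. The helper below is a sketch of that shared pattern rather than our exact code; the function name and the max_iter setting are assumptions.

    def fit_and_evaluate(model, X_train, X_test, y_train, y_test):
        # Fit a classifier and report the metrics used in this section
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))
        print('Accuracy:', accuracy_score(y_test, y_pred))

        # ROC curve and AUC from the predicted probability of class 1 (Fully Paid)
        y_prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        print('AUC:', auc(fpr, tpr))
        plt.plot(fpr, tpr)
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.show()

    # Example: logistic regression (max_iter raised so the solver converges)
    fit_and_evaluate(LogisticRegression(max_iter=1000),
                     X_train, X_test, y_train, y_test)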
1. LogisticRegression from sklearn.linear_model

We chose LogisticRegression as one of our models because it is a commonly used statistical method for predicting binary classes.

Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.85
AUC: 0.57932

2. DecisionTreeClassifier from sklearn.tree

Decision Trees are a non-parametric supervised learning method used for classification and regression. A tree structure is constructed that breaks the dataset down into smaller subsets and predicts the value of a target variable by learning simple decision rules inferred from the data features: decision nodes partition the data, leaf nodes give the prediction, and a prediction is made by traversing simple IF, AND, THEN logic down the nodes of the tree.

Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.75
AUC: 0.52590

3. KNeighborsClassifier from sklearn.neighbors

Nearest neighbors is also a non-parametric supervised learning method; we chose it because it performs well in classification problems where the decision boundary is very irregular.

Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.83
AUC: 0.53547

4. XGBClassifier from xgboost

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance, and it is a very popular algorithm choice in machine learning. It uses regularization parameters that help prevent overfitting. According to towardsdatascience.com, since its introduction this algorithm has not only been credited with
winning numerous Kaggle competitions but also with being the driving force under the hood of several cutting-edge industry applications.

Confusion Table and Metrics:
ROC/AUC Curve:
Accuracy: 0.85
AUC: 0.66676

Overall, we can see that our xGBoost model performs the best, with both the best Accuracy and the best AUC value. We therefore decided to continue with this model and experiment with different parameter values to improve and fine-tune it.

Tuning the Parameters for our xGBoost Model

1. scale_pos_weight

Adjusting scale_pos_weight helps balance the dataset, as ours is very uneven in the number of "Fully Paid" and "Charged Off" loans. The scale_pos_weight parameter represents the ratio of the number of examples in the negative class to the number in the positive class. To get the ratio, we count the number of negative examples ("Charged Off" loans) and the number of positive examples ("Fully Paid" loans), and divide the former by the latter:

scale_pos_weight = count(negative examples) / count(positive examples)
This gives us a value of 0.1709. Since our dataset is extremely unbalanced, we also try a more conservative value, using the square root of this weight (computed with sqrt from the math package), which gives 0.4134.

Outputted results:

Looking at the results, it seems that using the default value of scale_pos_weight = 1 actually gives us a better accuracy, and accuracy decreases as the dataset becomes more balanced. However, the AUC value increases as the dataset becomes more balanced.
The recall of the minority class is very low, at 0.03, for the model with scale_pos_weight = 1, which shows that this model is biased towards the majority class ("Fully Paid" loans) and is therefore not the best model. We decided to use the square root of the weight (0.4134) for scale_pos_weight instead, as we feel this gives us a better balance, with a higher overall accuracy and AUC value.
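A sketch of how the weights above could be computed and passed to XGBClassifier, reusing the evaluation helper sketched earlier; other parameters are left at their defaults here.

    from math import sqrt

    neg = (y_train == 0).sum()          # "Charged Off" loans
    pos = (y_train == 1).sum()          # "Fully Paid" loans
    weight = neg / pos                  # roughly 0.17 on this dataset
    conservative_weight = sqrt(weight)  # roughly 0.41

    # Fit XGBoost with the more conservative weight and evaluate as before
    model = XGBClassifier(scale_pos_weight=conservative_weight)
    fit_and_evaluate(model, X_train, X_test, y_train, y_test)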
2. max_depth

For the max_depth parameter, we experimented with the values 3, 5, 8, and 10 (the default value is 6). We experimented with this parameter to reduce the chance of our model being overfitted, and to see whether reducing the depth of the trees would improve the model's accuracy and AUC.

Outputted results:

As you can see, the AUC decreases as the max_depth value increases. This is likely because the model becomes overfitted on the training data, leading to decreased performance on the testing data. The accuracy, however, decreases from max_depth 3 to 5 and then increases from 5 to 10. Although the model with max_depth = 10 has slightly better accuracy, we feel the increase is too small compared to the larger drop in AUC from 0.68985 to 0.64964. We therefore consider max_depth = 3 the optimal value.

3. learning_rate

One problem with gradient boosted decision trees is that they are often quick to learn and overfit the training data. One way to address this is to adjust the learning_rate parameter. We experimented with the values 0.3, 0.2, and 0.1, as these are commonly used values, with 0.3 being the default.

Outputted results:
The accuracy of the model increases as learning_rate decreases: the model with learning_rate = 0.1 gives the best accuracy of 0.834188, while the best AUC is achieved with learning_rate = 0.2. However, the AUC improvement from 0.69085 to 0.69105 is only marginal, so we chose learning_rate = 0.1 as the optimal value.

10. Conclusion

Being able to predict whether loans will be Charged Off or Fully Paid is paramount for all financial institutions and financial service providers in order to maximize the potential return on the loans they provide, minimize losses on Charged Off loans, make the application and underwriting process more efficient, and improve the user experience when applying for a loan. Newer machine learning algorithms, such as the regularized gradient
boosting framework implemented in the xGBoost library, have been shown to perform significantly better than many other machine learning algorithms.

Original Model:
Final Optimized Model:

When comparing our optimized model to the original model, although the accuracy decreased slightly, the AUC improved and the recall for the negative class increased, which shows that the improved model is less biased towards the majority (positive) class. Since our goal is to create a model that is optimized to predict both "Fully Paid" and "Charged Off" loans correctly, we feel the new model is better tuned to predict the outcomes of future loans than the original model.

Looking back at the project, one improvement we could have made would be to try SMOTE (Synthetic Minority Oversampling Technique) rather than scale_pos_weight, since simple oversampling approaches that duplicate rows can lead to overfitting on the training data, and undersampling by removing rows leads to a loss of information for the "Charged Off" minority class in our dataset. With a technique like SMOTE, new minority-class rows are synthesized using the K-Nearest Neighbours algorithm rather than duplicated, which might lead to a better model, especially since datasets like ours are usually highly imbalanced by nature (defaulting on a loan does not happen very often).

Another area we could improve on is experimenting with different feature selection techniques. In this project we removed any variables that we thought were either irrelevant or would cause data leakage, and only kept the variables that we thought would be the best for our
models, based partly on some previous experience experimenting with this dataset in R. While we did plot a heatmap of the correlations between the variables to ensure they were not too highly correlated, we could also try using a scoring function such as f_regression() from the scikit-learn library together with the SelectKBest class to select the top k most relevant features.
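Sketches of the two suggested improvements, assuming the imbalanced-learn package is available; the SMOTE settings and the value of k are illustrative, and f_classif is used here as the classification counterpart of f_regression.

    from imblearn.over_sampling import SMOTE
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthesize new minority ("Charged Off") rows on the training data only
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # Keep the k features with the highest univariate F-scores
    selector = SelectKBest(score_func=f_classif, k=20)
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)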