Week 3: HW3 for R Programming
Chong Ren
Trine University
Course: IS 5213.7D2
Instructor: Fred Lumpkin
Due Date: Nov 9, 2023
Step 1: Read in the Data
Read the data into R
List the structure of the data (str)
Execute a summary of the data
Print the first six records
> PATH = "/Users/chongren/Downloads/HMEQ_Scrubbed"
> FILE_NAME = "HMEQ_Scrubbed.csv"
> OUT_NAME = "HMEQ_Scrubbed_HW3.csv"
> INFILE = paste(PATH, FILE_NAME, sep="/")
> OUTFILE = paste(PATH, OUT_NAME, sep="/")
> df = read.csv( INFILE )
> head(df)
> str(df)
> summary(df)
Step 2: Classification Decision Tree
Use the rpart library to predict the variable TARGET_BAD_FLAG
Develop two decision trees, one using Gini and the other using Entropy
All other parameters such as tree depth are up to you.
Plot both decision trees
List the important variables for both trees
Create a ROC curve for both trees
Write a brief summary of the decision trees discussing whether or not they
make sense. Which tree would you recommend using? What type of
person will default on a loan?
> library(rpart)
> # Build decision trees
> # Adjust tree depth (cp parameter) and other hyperparameters as needed
> gini_tree <- rpart(TARGET_BAD_FLAG ~ ., data = df, method = "class", parms = list(split = "gini"), cp = 0.01)
> entropy_tree <- rpart(TARGET_BAD_FLAG ~ ., data = df, method = "class", parms = list(split = "information"), cp = 0.01)
> library(rpart.plot)
> rpart.plot(gini_tree, main = "Chong Gini Tree")
> rpart.plot(entropy_tree, main = "Chong Entropy Tree")
> # List important variables for both trees
> var_import_gini <- gini_tree$variable.importance
> var_import_entropy <- entropy_tree$variable.importance
> var_import_gini
TARGET_LOSS_AMT   1903.5970
M_DEBTINC          488.3070
IMP_DELINQ         164.9037
M_VALUE            156.8987
IMP_DEBTINC        121.6765
IMP_DEROG          108.8685
> var_import_entropy
TARGET_LOSS_AMT   1903.5970
M_DEBTINC          488.3070
IMP_DELINQ         164.9037
M_VALUE            156.8987
IMP_DEBTINC        121.6765
IMP_DEROG          108.8685
> #Create ROC curve and AUC for both trees
> pred_gini <- predict(gini_tree, df, type="prob")[, 2]
> pred_entropy <- predict(entropy_tree, df, type="prob")[, 2]
> observed <- df$TARGET_BAD_FLAG
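The preview cuts off before the ROC curves themselves are drawn. A minimal sketch of that step, assuming the pROC package (the original library call is not shown), might look like this:
> # Sketch: ROC curves and AUC for both trees using pROC (an assumption)
> library(pROC)
> roc_gini <- roc(observed, pred_gini)
> roc_entropy <- roc(observed, pred_entropy)
> plot(roc_gini, col = "red", main = "Chong ROC Curves")
> lines(roc_entropy, col = "blue")
> legend("bottomright", legend = c("Gini", "Entropy"), col = c("red", "blue"), lwd = 2)
> auc(roc_gini)
> auc(roc_entropy)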
Decision Tree for Predicting Loan Default (TARGET_BAD_FLAG):
In this context, a decision tree is employed to determine whether a loan applicant will default (1/0). The importance of a variable within the tree structure indicates how effective it is as a predictor of default. Because the Gini and Entropy trees produce identical variable-importance rankings on this data, either split criterion is a reasonable choice. Insights derived from this tree help identify the pivotal variables that influence loan default, empowering lenders to make well-informed decisions when approving loans.
Decision Tree for Estimating Loss Given Default (TARGET_LOSS_AMT):
This decision tree is specifically tailored for cases where TARGET_BAD_FLAG equals 1, signifying
a default occurrence. Its primary goal is to forecast the potential loss amount
(TARGET_LOSS_AMT) in such instances. Variables with high importance in this tree unveil the
factors that significantly impact the magnitude of losses in default scenarios. This tree aids
lenders in assessing the severity of potential financial losses and provides valuable guidance for
implementing risk mitigation strategies.
Utilizing the TARGET_BAD_FLAG Decision Tree:
The TARGET_BAD_FLAG decision tree can be employed to gain deeper insights into individuals
who are more likely to default on a loan. It can help identify common traits or characteristics
associated with default, such as low credit scores, high outstanding debt, or unstable income,
among others.
Step 3: Regression Decision Tree
Use the rpart library to predict the variable TARGET_LOSS_AMT
Develop two decision trees, one using anova and the other using poisson
All other parameters such as tree depth are up to you.
Plot both decision trees
List the important variables for both trees
Calculate the Root Mean Square Error (RMSE) for both trees
Write a brief summary of the decision trees discussing whether or not they make sense.
Which tree would you recommend using? What factors dictate a large loss of money?
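The preview omits the code that fits the two regression trees plotted below. A minimal sketch, assuming the same data frame df and the same cp = 0.01 used earlier:
> # Sketch: fit regression trees for TARGET_LOSS_AMT (settings are assumptions)
> anova_tree <- rpart(TARGET_LOSS_AMT ~ ., data = df, method = "anova", cp = 0.01)
> poisson_tree <- rpart(TARGET_LOSS_AMT ~ ., data = df, method = "poisson", cp = 0.01)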
> plot(poisson_tree)
> text(poisson_tree, use.n = TRUE, all = TRUE)
> plot(anova_tree)
> text(anova_tree, use.n = TRUE, all = TRUE)
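The variable-importance listings and the RMSE calculations are also cut from the preview; a sketch of those steps:
> # Sketch: important variables for both regression trees
> anova_tree$variable.importance
> poisson_tree$variable.importance
> # Sketch: RMSE for both trees on the training data
> pred_anova <- predict(anova_tree, df)
> pred_poisson <- predict(poisson_tree, df)
> rmse_anova <- sqrt(mean((df$TARGET_LOSS_AMT - pred_anova)^2))
> rmse_poisson <- sqrt(mean((df$TARGET_LOSS_AMT - pred_poisson)^2))
> rmse_anova
> rmse_poisson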
Decision Tree for Predicting Loan Default (TARGET_BAD_FLAG):
In this context, a decision tree is employed to predict whether a loan applicant will default (1/0), and it identifies the significant factors that influence the probability of default. The regression trees built in this step, by contrast, model the size of the loss rather than whether a loss occurs; of the two, the "anova" tree is recommended, as it is effective in capturing the relationships between the features and the loss amount.
Decision Tree for Estimating Loss Given Default (TARGET_LOSS_AMT):
Conversely, the decision tree for TARGET_LOSS_AMT focuses on cases where
TARGET_BAD_FLAG is 1, indicating that applicants have defaulted. Its goal is to predict the
potential loss amount (TARGET_LOSS_AMT) following a default. Variables with higher
importance in this tree serve as indicators of factors that significantly affect the magnitude of
losses when defaults occur.
If our primary objective is to accurately predict loan defaults, the model for TARGET_BAD_FLAG
is more pertinent. However, if our focus is on assessing the potential financial impact of
defaults, the model for TARGET_LOSS_AMT is the preferred choice.
Factors Influencing Significant Financial Losses:
The "anova" decision tree for TARGET_LOSS_AMT serves as a valuable tool for gaining insights
into the factors contributing to substantial financial losses in the event of loan defaults.
Variables with higher importance in this tree, such as the loan amount or loan type, reveal which factors contribute most to large losses. For example, a larger loan amount tends to produce a larger loss when the borrower defaults. By understanding these influential factors, lenders can implement effective risk mitigation strategies.
Step 4: Probability / Severity Model Decision Tree (Optional Bonus Points)
Use the rpart library to predict the variable TARGET_BAD_FLAG
Use the rpart library to predict the variable TARGET_LOSS_AMT using only records
where TARGET_BAD_FLAG is 1.
Plot both decision trees
List the important variables for both trees
Using your models, predict the probability of default and the loss given default.
Multiply the two values together for each record.
Calculate the RMSE value for the Probability / Severity model.
Comment on how this model compares to using the model from Step 3. Which one would
you recommend using?
> # Probability model: classification tree for TARGET_BAD_FLAG
> bad_flag_tree <- rpart(TARGET_BAD_FLAG ~ ., data = df, method = "class", control = rpart.control(cp = 0.01))
> bad_flag_tree
> plot(bad_flag_tree)
> text(bad_flag_tree, use.n = TRUE, all = TRUE)
> # Severity model: regression trees for TARGET_LOSS_AMT using only defaulted records
> loss_amt_tree <- rpart(TARGET_LOSS_AMT ~ ., data = df[df$TARGET_BAD_FLAG == 1, ], method = "poisson", control = rpart.control(cp = 0.01))
> tree_prob_loss <- rpart(TARGET_LOSS_AMT ~ ., data = df[df$TARGET_BAD_FLAG == 1, ], method = "anova")
> plot(tree_prob_loss)
> text(tree_prob_loss, use.n = TRUE, all = TRUE)
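The importance listings for the two Step 4 trees are not visible in the preview; a short sketch:
> # Sketch: important variables for the probability and severity trees
> bad_flag_tree$variable.importance
> tree_prob_loss$variable.importance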
> prob_default <- predict(bad_flag_tree, df, type = "prob")[, 2]
> # Predicted loss given default from the severity tree, scored on all records
> severity <- predict(tree_prob_loss, df)
> # RMSE of the combined Probability / Severity prediction
> rmse <- sqrt(mean(((prob_default * severity) - df$TARGET_LOSS_AMT)^2))
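To compare with Step 3, the combined-model RMSE can be set against the RMSE of the Step 3 anova tree (rmse_anova from the sketch above):
> # Sketch: compare Probability / Severity RMSE with the Step 3 anova tree RMSE
> rmse
> rmse_anova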
Model for TARGET_BAD_FLAG (Default or No Default):
This model is utilized to forecast whether an applicant will default on a loan. It categorizes
applicants into two groups: those likely to default and those unlikely to default. This
categorization is crucial for early risk assessment. If our primary concern is identifying
individuals at risk of loan default and making lending decisions based on default likelihood, this
model should be our preferred choice.
Model for TARGET_LOSS_AMT (Loss Given Default):
The TARGET_LOSS_AMT model is designed to predict the potential financial loss in the event of
an applicant's default. Its primary objective is to evaluate the financial impact of defaults,
providing valuable insights into the extent of potential losses and the contributing factors. If our
main focus is on assessing the financial consequences of loan defaults, understanding the
factors influencing the magnitude of losses, optimizing risk mitigation strategies, and allocating
resources for managing and covering potential losses, this model is the ideal choice.