Week 3: HW3 for R Programming
Chong Ren
Trine University
Course: IS 5213.7D2
Instructor: Fred Lumpkin
Due Date: Nov 9, 2023
Step 1: Read in the Data
Read the data into R
List the structure of the data (str)
Execute a summary of the data
Print the first six records
> PATH = "/Users/chongren/Downloads/HMEQ_Scrubbed"
> FILE_NAME = "HMEQ_Scrubbed.csv"
> OUT_NAME = "HMEQ_Scrubbed_HW3.csv"
> INFILE = paste(PATH, FILE_NAME, sep="/")
> OUTFILE = paste(PATH, OUT_NAME, sep="/")
> df = read.csv( INFILE )
> head(df)
> str(df)
> summary(df)
Step 2: Classification Decision Tree
Use the rpart library to predict the variable TARGET_BAD_FLAG
Develop two decision trees, one using Gini and the other using Entropy
All other parameters such as tree depth are up to you.
Plot both decision trees
List the important variables for both trees
Create a ROC curve for both trees
Write a brief summary of the decision trees discussing whether or not they
make sense. Which tree would you recommend using? What type of
person will default on a loan?
> library(rpart)
> # Build decision trees
> # Adjust tree depth (cp parameter) and other hyperparameters as needed
> gini_tree <- rpart(TARGET_BAD_FLAG ~ ., data = df, method = "class", parms = list(split = "gini"), cp = 0.01)
> entropy_tree <- rpart(TARGET_BAD_FLAG ~ ., data = df, method = "class", parms = list(split = "information"), cp = 0.01)
> library(rpart.plot)
> rpart.plot(gini_tree, main = "Chong Gini Tree")
> rpart.plot(entropy_tree, main = "Chong Entropy Tree")
> # List important variables for both trees
> var_import_gini <- gini_tree$variable.importance
> var_import_entropy <- entropy_tree$variable.importance
> var_import_gini
TARGET_LOSS_AMT   1903.5970
M_DEBTINC          488.3070
IMP_DELINQ         164.9037
M_VALUE            156.8987
IMP_DEBTINC        121.6765
IMP_DEROG          108.8685
> var_import_entropy
TARGET_LOSS_AMT   1903.5970
M_DEBTINC          488.3070
IMP_DELINQ         164.9037
M_VALUE            156.8987
IMP_DEBTINC        121.6765
IMP_DEROG          108.8685
> #Create ROC curve and AUC for both trees
> pred_gini <- predict(gini_tree, df, type="prob")[, 2]
> pred_entropy <- predict(entropy_tree, df, type="prob")[, 2]
> observed <- df$TARGET_BAD_FLAG
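The preview cuts off before the ROC curves themselves are drawn. A minimal sketch of that step, assuming the pROC package (the original library call is not shown), might look like this:
> # Sketch: ROC curves and AUC for both trees using pROC (an assumption)
> library(pROC)
> roc_gini <- roc(observed, pred_gini)
> roc_entropy <- roc(observed, pred_entropy)
> plot(roc_gini, col = "red", main = "Chong ROC Curves")
> lines(roc_entropy, col = "blue")
> legend("bottomright", legend = c("Gini", "Entropy"), col = c("red", "blue"), lwd = 2)
> auc(roc_gini)
> auc(roc_entropy)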
Decision Tree for Predicting Loan Default (TARGET_BAD_FLAG):
In this context, a decision tree is employed to determine whether a loan applicant will default (1/0). The importance of a variable within the tree structure indicates how effective it is as a predictor of default. Because the Gini and Entropy trees produce identical variable-importance rankings on this data, either split criterion is a reasonable choice. Insights derived from this tree help identify the pivotal variables that influence loan default, empowering lenders to make well-informed decisions when approving loans.
Decision Tree for Estimating Loss Given Default (TARGET_LOSS_AMT):
This decision tree is specifically tailored for cases where TARGET_BAD_FLAG equals 1, signifying
a default occurrence. Its primary goal is to forecast the potential loss amount
(TARGET_LOSS_AMT) in such instances. Variables with high importance in this tree unveil the
factors that significantly impact the magnitude of losses in default scenarios. This tree aids
lenders in assessing the severity of potential financial losses and provides valuable guidance for
implementing risk mitigation strategies.
Utilizing the TARGET_BAD_FLAG Decision Tree:
The TARGET_BAD_FLAG decision tree can be employed to gain deeper insights into individuals
who are more likely to default on a loan. It can help identify common traits or characteristics
associated with default, such as low credit scores, high outstanding debt, or unstable income,
among others.
Step 3: Regression Decision Tree
Use the rpart library to predict the variable TARGET_LOSS_AMT
Develop two decision trees, one using anova and the other using poisson
All other parameters such as tree depth are up to you.
Plot both decision trees
List the important variables for both trees
Calculate the Root Mean Square Error (RMSE) for both trees
Write a brief summary of the decision trees discussing whether or not they make sense.
Which tree would you recommend using? What factors dictate a large loss of money?
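The preview omits the code that fits the two regression trees plotted below. A minimal sketch, assuming the same data frame df and the same cp = 0.01 used earlier:
> # Sketch: fit regression trees for TARGET_LOSS_AMT (settings are assumptions)
> anova_tree <- rpart(TARGET_LOSS_AMT ~ ., data = df, method = "anova", cp = 0.01)
> poisson_tree <- rpart(TARGET_LOSS_AMT ~ ., data = df, method = "poisson", cp = 0.01)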
> plot(poisson_tree)
> text(poisson_tree, use.n = TRUE, all = TRUE)
> plot(anova_tree)
> text(anova_tree, use.n = TRUE, all = TRUE)
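The variable-importance listings and the RMSE calculations are also cut from the preview; a sketch of those steps:
> # Sketch: important variables for both regression trees
> anova_tree$variable.importance
> poisson_tree$variable.importance
> # Sketch: RMSE for both trees on the training data
> pred_anova <- predict(anova_tree, df)
> pred_poisson <- predict(poisson_tree, df)
> rmse_anova <- sqrt(mean((df$TARGET_LOSS_AMT - pred_anova)^2))
> rmse_poisson <- sqrt(mean((df$TARGET_LOSS_AMT - pred_poisson)^2))
> rmse_anova
> rmse_poisson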
Decision Tree for Predicting Loan Default (TARGET_BAD_FLAG):
In this context, a decision tree is employed to predict whether a loan applicant will default (1/0), and it identifies the significant factors that influence the probability of default. The regression trees built in this step, by contrast, model the size of the loss rather than whether a loss occurs; of the two, the "anova" tree is recommended, as it is effective in capturing the relationships between the features and the loss amount.
Decision Tree for Estimating Loss Given Default (TARGET_LOSS_AMT):
Conversely, the decision tree for TARGET_LOSS_AMT focuses on cases where
TARGET_BAD_FLAG is 1, indicating that applicants have defaulted. Its goal is to predict the
potential loss amount (TARGET_LOSS_AMT) following a default. Variables with higher
importance in this tree serve as indicators of factors that significantly affect the magnitude of
losses when defaults occur.
If our primary objective is to accurately predict loan defaults, the model for TARGET_BAD_FLAG
is more pertinent. However, if our focus is on assessing the potential financial impact of
defaults, the model for TARGET_LOSS_AMT is the preferred choice.
Factors Influencing Significant Financial Losses:
The "anova" decision tree for TARGET_LOSS_AMT serves as a valuable tool for gaining insights
into the factors contributing to substantial financial losses in the event of loan defaults.
Variables with higher importance in this tree, such as the loan amount or loan type, reveal which factors contribute most to large losses. For example, a larger loan amount tends to produce a larger loss when the borrower defaults. By understanding these influential factors, lenders can implement effective risk mitigation strategies.
Step 4: Probability / Severity Model Decision Tree (Optional Bonus Points)
Use the rpart library to predict the variable TARGET_BAD_FLAG
Use the rpart library to predict the variable TARGET_LOSS_AMT using only records
where TARGET_BAD_FLAG is 1.
Plot both decision trees
List the important variables for both trees
Using your models, predict the probability of default and the loss given default.
Multiply the two values together for each record.
Calculate the RMSE value for the Probability / Severity model.
Comment on how this model compares to using the model from Step 3. Which one would
you recommend using?
> # Probability model: classification tree for TARGET_BAD_FLAG
> bad_flag_tree <- rpart(TARGET_BAD_FLAG ~ ., data = df, method = "class", control = rpart.control(cp = 0.01))
> bad_flag_tree
> plot(bad_flag_tree)
> text(bad_flag_tree, use.n = TRUE, all = TRUE)
> # Severity model: regression trees for TARGET_LOSS_AMT using only defaulted records
> loss_amt_tree <- rpart(TARGET_LOSS_AMT ~ ., data = df[df$TARGET_BAD_FLAG == 1, ], method = "poisson", control = rpart.control(cp = 0.01))
> tree_prob_loss <- rpart(TARGET_LOSS_AMT ~ ., data = df[df$TARGET_BAD_FLAG == 1, ], method = "anova")
> plot(tree_prob_loss)
> text(tree_prob_loss, use.n = TRUE, all = TRUE)
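The importance listings for the two Step 4 trees are not visible in the preview; a short sketch:
> # Sketch: important variables for the probability and severity trees
> bad_flag_tree$variable.importance
> tree_prob_loss$variable.importance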
> prob_default <- predict(bad_flag_tree, df, type = "prob")[, 2]
> # Predicted loss given default from the severity tree, scored on all records
> severity <- predict(tree_prob_loss, df)
> # RMSE of the combined Probability / Severity prediction
> rmse <- sqrt(mean(((prob_default * severity) - df$TARGET_LOSS_AMT)^2))
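To compare with Step 3, the combined-model RMSE can be set against the RMSE of the Step 3 anova tree (rmse_anova from the sketch above):
> # Sketch: compare Probability / Severity RMSE with the Step 3 anova tree RMSE
> rmse
> rmse_anova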
Model for TARGET_BAD_FLAG (Default or No Default):
This model is utilized to forecast whether an applicant will default on a loan. It categorizes
applicants into two groups: those likely to default and those unlikely to default. This
categorization is crucial for early risk assessment. If our primary concern is identifying
individuals at risk of loan default and making lending decisions based on default likelihood, this
model should be our preferred choice.
Model for TARGET_LOSS_AMT (Loss Given Default):
The TARGET_LOSS_AMT model is designed to predict the potential financial loss in the event of
an applicant's default. Its primary objective is to evaluate the financial impact of defaults,
providing valuable insights into the extent of potential losses and the contributing factors. If our
main focus is on assessing the financial consequences of loan defaults, understanding the
factors influencing the magnitude of losses, optimizing risk mitigation strategies, and allocating
resources for managing and covering potential losses, this model is the ideal choice.