MAT 303 Module Five Problem Set Report
Logistic Regression
Brian Tynan
Brian.Tynan@snhu.edu
Southern New Hampshire University
1. Introduction
As a risk analyst, I have been asked to identify the factors that are most predictive of customer credit card default. My analysis uses a credit card data set and statistical techniques including logistic regression, Wald confidence intervals for slope parameters, the Hosmer-Lemeshow goodness of fit (GOF) test, and the receiver operating characteristic (ROC) curve.
2. Data Preparation
The response variable in this data set is default. The predictor variables are credit utilization, education, assets, and missed payments. The data set has 8 columns and 600 rows.
3. First Logistic Regression Model
Reporting Results
The general form of this regression model is as follows:
$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}}$
Here y is 1 for defaulting on credit and 0 for not defaulting, x1 is credit utilization, and x2 and x3 are dummy variables for education.
The natural logarithmic form is as follows:
$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
Here π is the probability that an individual defaults, and 1 - π is the probability that the individual does not default.
The prediction model is as follows:
$\ln(\text{odds}) = -8.8488 + 34.3869 x_1 - 1.4975 x_2 - 4.2540 x_3$
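As a rough illustration, the fitted log-odds equation can be turned into a predicted probability with the inverse logit. The sketch below is not part of the original analysis. It assumes that credit utilization is entered as a proportion (0.35 for 35%), that x2 is the college dummy and x3 the postgraduate dummy, and that high school is the baseline level with both dummies set to 0. The function name is hypothetical.

```python
import math

# Coefficients copied from the fitted first model in this report.
INTERCEPT = -8.8488
B_UTILIZATION = 34.3869  # x1: assumed to be a proportion, e.g. 0.35 for 35%
B_COLLEGE = -1.4975      # x2: assumed dummy, 1 = college education
B_POSTGRAD = -4.2540     # x3: assumed dummy, 1 = postgraduate education


def default_probability_model1(utilization, college=0, postgrad=0):
    """Return the predicted probability of default from the log-odds equation."""
    log_odds = (INTERCEPT
                + B_UTILIZATION * utilization
                + B_COLLEGE * college
                + B_POSTGRAD * postgrad)
    # Inverse logit: converts log odds into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))


# High school is assumed to be the baseline (both dummies equal to 0).
p_high_school = default_probability_model1(0.35)
p_postgrad = default_probability_model1(0.35, postgrad=1)
print(round(p_high_school, 4), round(p_postgrad, 4))
```

Under these assumptions, the two printed values agree with the probabilities of 0.9603 and 0.2559 that this report gives for 35% credit utilization.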
The estimated coefficient for credit utilization is 34.3869. With credit utilization measured as a proportion, each one-percentage-point increase in credit utilization raises the log odds of default by 0.343869 on average, holding the other variables constant. The estimated coefficient for college education is -1.4975. Because this is a dummy variable, the log odds of default are on average 1.4975 lower for an individual with a college education than for an individual with a high school education (the baseline level) at the same credit utilization. The estimated coefficient for postgraduate education is -4.2540, so the log odds of default are on average 4.2540 lower for an individual with a postgraduate education than for an individual with a high school education at the same credit utilization.
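Log-odds coefficients are often easier to read as odds ratios, which are obtained by exponentiating each coefficient. The following is a minimal sketch of that conversion, assuming credit utilization is measured as a proportion so that one percentage point is a change of 0.01. The dictionary keys and the helper name are illustrative, not names from the original analysis.

```python
import math

# Coefficients copied from the first fitted model in this report.
coefficients = {
    "credit_utilization": 34.3869,
    "college": -1.4975,
    "postgraduate": -4.2540,
}


def odds_ratio(coefficient, change=1.0):
    """Multiplicative change in the odds for a given change in the predictor."""
    return math.exp(coefficient * change)


# One percentage point of utilization is a change of 0.01 on the proportion scale.
or_utilization_1pt = odds_ratio(coefficients["credit_utilization"], change=0.01)
# For dummy variables, the change is from 0 (baseline) to 1.
or_college = odds_ratio(coefficients["college"])
or_postgraduate = odds_ratio(coefficients["postgraduate"])
print(or_utilization_1pt, or_college, or_postgraduate)
```

An odds ratio above 1 means the odds of default increase, and a value below 1 means they decrease relative to the baseline.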
The true positive, true negative, false positive, and false negative cells of the confusion matrix are defined as follows:
True Positive (TP): The actual value is 1 (default=1) and the predicted value is 1 (default=1).
True Negative (TN): The actual value is 0 (default=0) and the predicted value is 0 (default=0).
False Positive (FP): The actual value is 0 (default=0) and the predicted value is 1 (default=1). This is also known as a Type I error.
False Negative (FN): The actual value is 1 (default=1) and the predicted value is 0 (default=0). This is also known as a Type II error.
Based on the confusion matrix, the accuracy is 0.9279 (92.79%), the precision is 0.9352 (93.52%), and the recall is 0.9323 (93.23%).
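For reference, accuracy, precision, and recall follow directly from the four confusion matrix cells. The sketch below shows the standard formulas. The counts are hypothetical values chosen only for illustration, because the actual cell counts are not listed in this report.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted defaults that are real
        "recall": tp / (tp + fn),        # share of real defaults that are caught
    }


# Hypothetical counts for illustration only; NOT the counts from this report.
example = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(example)
```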
Evaluating Model Significance
The Hosmer-Lemeshow goodness of fit test is used to evaluate the model.
Null hypothesis: The model fits this data set well.
Alternative hypothesis: The model does not fit this data set well.
The test statistic is X-squared = 31.582 with 48 degrees of freedom, and the p-value is 0.9676. Because the p-value is greater than the significance level of α = 0.05, we fail to reject the null hypothesis: there is not sufficient evidence to conclude that the model fits the data poorly. All of the variables in the model are statistically significant because their p-values are less than the 0.05 level of significance.
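The Hosmer-Lemeshow p-value can be checked by hand, assuming it is the upper-tail probability of a chi-square distribution with the reported degrees of freedom. The sketch below uses a closed-form expression that holds only for even degrees of freedom, which avoids any statistics library. The function name is hypothetical, and this is not necessarily how the original software computed the value.

```python
import math


def chi2_sf_even_df(statistic, df):
    """Upper-tail chi-square probability; valid only for even, positive df.

    For df = 2m, P(X > x) = exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires an even, positive df")
    if statistic < 0:
        raise ValueError("the test statistic must be non-negative")
    half = statistic / 2.0
    term = 1.0   # the i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


ALPHA = 0.05
p_value = chi2_sf_even_df(31.582, 48)   # statistic and df from the first model
reject_null = p_value < ALPHA
print(round(p_value, 4), reject_null)
```

A p-value above 0.05 leads to the same decision stated in the text: fail to reject the null hypothesis that the model fits the data.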
The ROC curve lies close to the top-left corner, which indicates that the model distinguishes well between positive and negative instances.
The AUC is 0.9859. This means there is a 98.59% probability that the model assigns a higher predicted default probability to a randomly chosen defaulting customer than to a randomly chosen non-defaulting customer. An AUC this close to 1 indicates excellent discrimination and suggests that the model is likely to be useful in practice.
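The AUC measures ranking quality and is not the same thing as accuracy. One way to see this is to compute it as the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher predicted probability. The sketch below does this on a small set of hypothetical labels and scores. None of the numbers come from this report's data.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = default) and predicted probabilities, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.10]
example_auc = pairwise_auc(labels, scores)
print(example_auc)
```

In this toy example 8 of the 9 pairs are ranked correctly, so the AUC is 8/9.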
Making Predictions Using Model
For an individual with a credit utilization of 35% and a high school education, the predicted probability of default is 0.9603, a 96.03% chance of defaulting. For an individual with a credit utilization of 35% and a postgraduate education, the predicted probability of default is 0.2559, a 25.59% chance of defaulting.
4. Second Logistic Regression Model
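Reporting Results
The general form of this regression model is as follows:
$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5}}$
Here y is 1 for defaulting on credit and 0 for not defaulting, x1 is credit utilization, x2, x3, and x4 are dummy variables for assets, and x5 is missed payments.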
The natural logarithmic form is as follows:
$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$
The prediction model is as follows:
$\ln(\text{odds}) = -9.2371 + 32.2826 x_1 - 0.4827 x_2 - 3.0334 x_3 - 3.4568 x_4 + 1.4276 x_5$
Confusion Matrix
Based on the confusion matrix, the accuracy is 0.9418 (94.18%), the precision is 0.9352 (93.52%), and the recall is 0.9558 (95.58%).
Evaluating Model Significance
Null hypothesis: This model is a good fit for this data set.
Alternative hypothesis: This model is not a good fit for this data set.
The test statistic is X-squared = 26.733 with df = 48 degrees of freedom, and the p-value is 0.9945. Because the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. There is not sufficient evidence to conclude that the logistic regression model fits this data set poorly.
For asset 2 (car and housing) and the other terms, the variables are statistically significant because their p-values are less than the level of significance. The ROC curve lies closer to the top-left corner, which shows that this model is better at distinguishing positive and negative instances than model 1.
An AUC of 0.9874 means there is a 98.74% probability that the model assigns a higher predicted default probability to a randomly chosen defaulting customer than to a randomly chosen non-defaulting customer. This indicates excellent discrimination and suggests that the model is likely to be very useful in practice.
Making Predictions Using Model
For an individual with a credit utilization of 35% who owns a car and has missed payments within the last three months, the predicted probability of default is 0.9529, a 95.29% chance of defaulting. For an individual with a credit utilization of 35% who owns a car and a house and has not missed a payment within the last three months, the predicted probability of default is 0.2746, a 27.46% chance of defaulting.
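These two probabilities can be reproduced from the second model's log-odds equation. The sketch below rests on several assumptions that the report does not state explicitly: credit utilization is a proportion (0.35 for 35%), x2 is the dummy for owning a car only, x3 is the dummy for owning a car and a house, and x5 equals 1 when payments were missed. The meaning of x4 is not given in the report, so it is left as a generic argument. The function name is hypothetical.

```python
import math

# Coefficients copied from the fitted second model in this report.
B0, B1, B2, B3, B4, B5 = -9.2371, 32.2826, -0.4827, -3.0334, -3.4568, 1.4276


def default_probability_model2(utilization, x2=0, x3=0, x4=0, missed=0):
    """Predicted default probability; the dummy meanings are assumptions (see text)."""
    log_odds = (B0 + B1 * utilization
                + B2 * x2      # assumed: 1 = owns a car only
                + B3 * x3      # assumed: 1 = owns a car and a house
                + B4 * x4      # meaning not stated in the report
                + B5 * missed)  # assumed: 1 = missed payments in last 3 months
    # Inverse logit: converts log odds into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))


p_car_missed = default_probability_model2(0.35, x2=1, missed=1)
p_car_house_no_missed = default_probability_model2(0.35, x3=1, missed=0)
print(round(p_car_missed, 4), round(p_car_house_no_missed, 4))
```

With these assignments the printed values agree with the reported 0.9529 and 0.2746, which is the reason for choosing this mapping of dummy variables.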
5. Conclusion
I strongly recommend the second model. Its AUC of 0.9874 indicates excellent discrimination between customers who default and those who do not, and its accuracy is 94.18%. The analysis shows the key predictors linked with credit card default: the chance of default rises as credit utilization increases, individuals with lower levels of education have a higher chance of defaulting, those with fewer assets are at greater risk, and those who have missed payments in the past are more prone to default.
The analysis of this data set can help credit card companies identify customers who are at risk of defaulting on their debt and take steps to prevent it, such as offering financial counseling or other support. This analysis can also help improve the accuracy of credit scoring models. The models from this data set assess an individual's creditworthiness to determine whether they should qualify for a loan or credit card. The factors identified in this analysis can help card companies make informed decisions about whom to offer services or credit.