Assignment 5

Course: DS 633 (Management), University of Michigan, Dearborn
Date: Feb 20, 2024
Problem 1: Personal Loan Acceptance (revisited).

The file UniversalBank.jmp contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan offered to them in the earlier campaign.

Note that the textbook's data file contains a Validation column. Keep it, because we will use it below. For this problem and the next one, let's use a different validation column: add a Validation 2 column with a 70/30 Training/Validation split, using random seed 123. Next, fit a logistic regression model using all of the available predictors except ID and ZIP Code. Set the target to be Yes (either through the Target Level column property or on the fly in the Fit Model window).

a. Report the Misclassification Rate (under Fit Details) and the Confusion Matrix, for Validation. The Confusion Matrix should be:

    Actual Personal Loan    Predicted Yes    Predicted No
    Yes                     85 (TP)          49 (FN)
    No                      16 (FP)          1350 (TN)

Misclassification Rate = (FN + FP) / (TP + FN + FP + TN) = (49 + 16) / (85 + 49 + 16 + 1350) = 65 / 1500 = 0.0433, or 4.33%

b. For this Validation performance, calculate sensitivity.

Sensitivity = TP / (TP + FN) = 85 / (85 + 49) = 0.6343, or 63.43%

c. Calculate specificity.

Specificity = TN / (FP + TN) = 1350 / (16 + 1350) = 0.9883, or 98.83%
d. Calculate precision.

Precision = TP / (TP + FP) = 85 / (85 + 16) = 0.8416, or 84.16%

e. Calculate recall.

Recall = TP / (TP + FN) = Sensitivity = 85 / (85 + 49) = 0.6343, or 63.43%

f. Calculate the true negative rate (TNR).

True Negative Rate (TNR) = TN / (total negatives) = TN / (TN + FP) = Specificity = 1350 / (1350 + 16) = 0.9883, or 98.83%

g. Calculate the true positive rate (TPR).

True Positive Rate (TPR) = TP / (total positives) = TP / (TP + FN) = Sensitivity = 85 / (85 + 49) = 0.6343, or 63.43%

h. Report the ROC curve and AUC. It should be:

TPR (Sensitivity, or Recall) = TP / (TP + FN) = 85 / (85 + 49) = 0.6343, or 63.43%
FPR = FP / (FP + TN) = 16 / (16 + 1350) = 0.0117, or 1.17%
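All of the metrics in parts (a) through (h) derive from the four confusion-matrix counts. As a minimal sketch (in Python rather than JMP or Excel), the calculations can be checked like this, using the counts reported above:

```python
# Confusion-matrix counts for the Validation set (from part (a) above).
TP, FN, FP, TN = 85, 49, 16, 1350

misclassification = (FN + FP) / (TP + FN + FP + TN)  # part (a)
sensitivity = TP / (TP + FN)                         # (b); also recall and TPR
specificity = TN / (TN + FP)                         # (c); also TNR
precision   = TP / (TP + FP)                         # (d)
fpr         = FP / (FP + TN)                         # 1 - specificity, for the ROC

print(round(misclassification, 4))  # 0.0433
print(round(sensitivity, 4))        # 0.6343
print(round(specificity, 4))        # 0.9883
print(round(precision, 4))          # 0.8416
print(round(fpr, 4))                # 0.0117
```

Note that recall, TPR, and sensitivity are one quantity, and TNR is just specificity, so only four independent ratios are being computed.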
The ROC curve visually shows the trade-off between TPR and FPR at different cutoff thresholds, and the AUC quantifies the overall performance of the model. Together they describe the model's ability to distinguish between the two classes: higher AUC values indicate better discriminative performance, with an AUC of 0.5 representing random chance and an AUC of 1 representing perfect discrimination.

In our scenario: the ROC curve shows the trade-off between correctly identifying customers who accepted personal loans (TPR) and incorrectly classifying customers who did not accept loans as if they did (FPR). At the default cutoff, the TPR of approximately 0.6343 means the model correctly identifies about 63.43% of customers who accepted personal loans, while the FPR of approximately 0.0117 means the model makes relatively few false positive predictions, which is a strength. Since the AUC is well above 0.5, the model performs considerably better than random guessing.

Problem 2: ROC Curve Construction ("Manually").

Recreate the ROC curve above "manually" using Excel. Since we are interested in the target Personal Loan = Yes, we are only interested in the red curve. To recreate this curve, run the model in JMP. Then subset the Validation data and export the propensities for the Validation data to Excel. Use Excel to calculate the Sensitivity and Specificity for some arbitrary cutoff threshold, say t. Then use an Excel Data Table to calculate these over the range of t from 0 to 1 (perhaps incremented by .01 or even .001). Finally, for the Validation set, plot the ROC curve, which should look like the plot above. Copy and paste the Excel curve into a Word doc. Please submit your Excel file as well.
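The manual ROC construction described above can be sketched outside of Excel as well. The following is a minimal illustration in Python of the same procedure: sweep a cutoff t from 0 to 1, classify each record by comparing its propensity to t, and record (FPR, TPR) at each cutoff. The `labels` and `propensity` values below are made-up toy data standing in for the exported JMP propensities:

```python
def roc_points(labels, propensity, step=0.01):
    """Sweep the cutoff t from 0 to 1 and return (FPR, TPR) pairs,
    mirroring the Excel Data Table 'What-If' analysis."""
    points = []
    t = 0.0
    while t <= 1.0:
        pred = [1 if p >= t else 0 for p in propensity]
        tp = sum(1 for y, yhat in zip(labels, pred) if y == 1 and yhat == 1)
        fn = sum(1 for y, yhat in zip(labels, pred) if y == 1 and yhat == 0)
        fp = sum(1 for y, yhat in zip(labels, pred) if y == 0 and yhat == 1)
        tn = sum(1 for y, yhat in zip(labels, pred) if y == 0 and yhat == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
        t += step
    return points

# Toy stand-in data: 1 = accepted the loan, 0 = did not.
labels     = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
propensity = [0.9, 0.1, 0.4, 0.8, 0.2, 0.05, 0.3, 0.7, 0.15, 0.25]
pts = roc_points(labels, propensity)
# At t = 0 every record is classified Yes, so (FPR, TPR) = (1, 1);
# at the top of the sweep nothing is classified Yes, so (FPR, TPR) = (0, 0).
```

Plotting the resulting points (FPR on the x-axis, TPR on the y-axis) traces the same curve the Excel Data Table produces.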
Problem 3

Recall that the ROC curve is useful when we do not have explicit costs. The curve is traced out over all possible propensity cutoff values. The 45-degree line represents average performance, averaged over many trials. Suppose that one implemented the baseline policy. Recall that a baseline policy would not actually involve any data mining (DM) technique. Instead, what is something lazy that we could do? Per the note, the baseline policy would involve some randomness.

For this exercise, simulate a single experiment (1 trial, that is, one column of RANDs) for the baseline policy of the UniversalBank problem above, and create its ROC curve for the Validation data. In JMP Pro, there are random functions available: https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/random-functions.shtml . (One can first create a new column in JMP, Cols -> New Columns, and then right-click on the column and select Formula… Alternatively, one can right-click on a different, pre-existing column and select New Formula Column to transform it.) Or, since you will use Excel anyway to create an ROC curve, you can use Excel's random function RAND(). For example, a 50/50 coin flip would be something like IF(RAND() < .5, 1, 0) for heads (1), otherwise tails (0). The JMP formula would use JMP's uniform random-number function in the same way.

[Figure: Receiver Operating Characteristic (ROC) curve — True Positive Rate (Sensitivity) vs. False Positive Rate (1 − Specificity)]
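As a hedged sketch of what this simulation looks like (in Python rather than Excel or JMP, with the Validation class sizes taken from Problem 1), one frozen column of uniform random "propensities" is scored against the true labels; swept over cutoffs, the resulting TPR and FPR should both track 1 − t, which is exactly the 45-degree line:

```python
import random

# Fix the draws, analogous to pasting the Excel RANDs over by value.
random.seed(123)

# 134 actual acceptors and 1366 non-acceptors, per the Problem 1 Validation data.
labels = [1] * 134 + [0] * 1366
rands  = [random.random() for _ in labels]  # one trial: one column of RANDs

# A few cutoffs from the What-If sweep; under the baseline policy,
# both rates should be near 1 - t, up to simulation noise.
for t in [0.2, 0.5, 0.8]:
    tpr = sum(r >= t for r, y in zip(rands, labels) if y == 1) / 134
    fpr = sum(r >= t for r, y in zip(rands, labels) if y == 0) / 1366
    print(f"t={t}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

Because TPR ≈ FPR at every cutoff, plotting the full sweep gives a jagged curve hugging the diagonal, which is the expected result for a policy with no predictive information.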
Anyway, choose either JMP or Excel for random number generation and simulate the baseline policy. (If you use Excel, you should fix the random number draws. You may take a "snapshot" of the RANDs by copying and pasting over them by value. Otherwise, if they aren't fixed, new RANDs will be drawn for every row of the Data Table, which isn't correct: the column of RANDs should be the same for each cutoff in the What-If analysis.) If the theory is correct, the curve should be close to the 45-degree line, but due to the randomness (that is, simulation/sampling error), it won't be perfectly smooth. Submit your ROC curve and Excel work.

Problem 4: Personal Loan Acceptance.

Again, the file UniversalBank.jmp contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan offered to them in the earlier campaign.

In this exercise we focus on two predictors: Online (whether the customer is an active user of online banking services) and Credit Card (abbreviated CC below; does the customer hold a credit card issued by the bank?), and the outcome Personal Loan (abbreviated Loan below). Check to make sure that the variables are coded as Nominal. (For the Naïve Bayes analysis, older versions of JMP also required that none of the variables have the Value Labels column property, and one would have had to remove this column property as well. In JMP 16, we shouldn't have to do this.)

a. Create a summary of the data using Tabulate, with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the cells should convey the count (how many records fall in that cell). Below, we will use this tabulation to "manually" perform a Bayes analysis.
Note that the data from the textbook are partitioned into Training and Validation data. For this exercise, let's use the textbook's provided Validation column. You should create the tabulation for just the training data. You should get something like the table below. (Use the "Group" by Validation option in the Formula Depot Model Comparison, OR add a Local Data Filter on Validation to switch between Training and Validation.)
b. Consider the task of classifying a customer who owns a bank credit card and actively uses online banking services. Looking at the tabulation, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1).)

P(Loan = 1 | CC = 1, Online = 1) = 62 / (62 + 552) = 62 / 614 = 0.10098, or 10.10%

c. Create two tabular summaries of the data. One will have Loan (rows) as a function of Online (columns), and the other will have Loan (rows) as a function of CC. Paste these tabulations into your solution file. Hopefully, they will look something like this:
d. Compute the following quantities [P(A | B) means "the probability of A given B"]:

i. P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors) = 110 / 346 = 0.3179, or 31.79%
ii. P(Online = 1 | Loan = 1) = 210 / 346 = 0.6069, or 60.69%
iii. P(Loan = 1) (the proportion of loan acceptors) = 346 / 3500 = 0.0988, or 9.88%
iv. P(CC = 1 | Loan = 0) = 911 / 3154 = 0.2888, or 28.88%
v. P(Online = 1 | Loan = 0) = 1882 / 3154 = 0.5967, or 59.67%
vi. P(Loan = 0) = 3154 / 3500 = 0.9011, or 90.11%

e. Use the quantities computed above to compute the Naive Bayes probability P(Loan = 1 | CC = 1, Online = 1):

P(Loan = 1 | CC = 1, Online = 1)
= (0.3179 × 0.6069 × 0.0988) / [(0.3179 × 0.6069 × 0.0988) + (0.2888 × 0.5967 × 0.9011)]
= 0.1093, or 10.93%

f. Compare this value with the one obtained from the tabulation in (b). Are they the same? Are they close? The value from the tabulation in (b) is 0.10098 (10.10%), whereas the value we calculated here is 0.1093 (10.93%). So we can say that the two are not exactly the same, but they are close.

g. So far you have performed "manual" calculations. JMP Pro does have a Naïve Bayes platform. Use it to see if you get the same answer as in part (e). Note: the numbers might not fully agree, since the platform applies an adjustment for small amounts of data; however, there may be enough data here that you won't notice any difference. Does the platform result agree with (f)?
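The part (e) calculation is just the quantities from part (d) combined with Bayes' theorem under the Naive Bayes conditional-independence assumption. A minimal sketch in Python, using the training-set counts reported above:

```python
# Conditional probabilities and priors from the part (d) training-set counts.
p_cc_given_yes     = 110 / 346    # P(CC=1 | Loan=1)
p_online_given_yes = 210 / 346    # P(Online=1 | Loan=1)
p_yes              = 346 / 3500   # P(Loan=1)
p_cc_given_no      = 911 / 3154   # P(CC=1 | Loan=0)
p_online_given_no  = 1882 / 3154  # P(Online=1 | Loan=0)
p_no               = 3154 / 3500  # P(Loan=0)

# Naive Bayes: P(Loan=1 | CC=1, Online=1), assuming CC and Online are
# conditionally independent given Loan.
num   = p_cc_given_yes * p_online_given_yes * p_yes
denom = num + p_cc_given_no * p_online_given_no * p_no
nb    = num / denom
# About 0.1094 with exact fractions (10.93% when using the rounded
# probabilities from part (d)), versus 62/614 = 0.1010 from the direct
# tabulation in part (b).
```

Using exact fractions rather than the four-decimal rounded values shifts the answer only in the fourth decimal place, which is why the manual result and the part (b) tabulation are close but not identical.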
Considering only two predictors (CC and Online), the model does not reproduce the tabulation probability exactly, so its predictions are only approximate. Running JMP's Naïve Bayes platform likewise does not return exactly the same values as the manual calculation, though the results are close.