HWCase2
School
North Carolina State University *
Course
551
Subject
Information Systems
Date
Feb 20, 2024
Type
docx
Pages
6
Uploaded by CaptainDiscovery13338
CASE 2
Instructions: Please use the Stolenrecords and crimebystate datasets to complete the following analysis. Please answer each question fully and supply any supporting analysis or results (e.g., screenshots). Each question is worth 10 points unless otherwise noted.
Background:
Your company recently had a security breach in which millions of customers’ private information was stolen. Your company’s reputation is at risk, so you are interested in providing assistance and guidance to these customers about protecting themselves from identity theft (thieves using the information to open other accounts or commit other illegal acts). You would like to identify which customers are more likely to become victims. You have a file from a previous breach that has information on customers and which of them became victims of identity theft (Stolenrecords). You also have a file of crime statistics by state (crimebystate). Use these two files to answer the following questions:
1. Build a classification tree for Identity theft by determining which variables to include as predictors (fit what you think is the best model).
a. Which variables, if any, did you choose not to include in the model? Why?
b. How many splits are in your final tree? (5 points)
The final tree consists of 379 splits.
c. What is the misclassification rate for this model? Is the model better at predicting victims or non-victims? Explain.
Looking at sensitivity and specificity, we can see that the model is much better at predicting non-victims (95.37%) than victims (32.50%). This means that the model is more likely to correctly identify someone who will not be a victim of identity theft than someone who will be.
Overall, the model is more reliable for identifying people who are not at risk of identity theft. However, with a sensitivity of only 32.50%, it misses roughly two out of every three actual victims, classifying them as low risk when they are actually at high risk. Therefore, it is important to use this model in conjunction with other risk assessment methods.
Actual    Predicted No
0         230407
1         38796
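The sensitivity, specificity, and misclassification rate discussed above all come from the four cells of a confusion matrix. The sketch below shows the arithmetic in Python; the function name and the counts are hypothetical and chosen only for illustration, since they are not the JMP output from this analysis.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Summary rates from confusion-matrix counts.

    tn: actual non-victims predicted as non-victims
    fp: actual non-victims predicted as victims
    fn: actual victims predicted as non-victims
    tp: actual victims predicted as victims
    """
    total = tn + fp + fn + tp
    return {
        # Share of actual victims the model catches
        "sensitivity": tp / (tp + fn),
        # Share of actual non-victims the model clears
        "specificity": tn / (tn + fp),
        # Share of all records classified incorrectly
        "misclassification": (fp + fn) / total,
    }


# Hypothetical counts, NOT the values from the case data
metrics = confusion_metrics(tn=900, fp=100, fn=60, tp=40)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A model can have a low overall misclassification rate while still missing most victims, because non-victims dominate the data; this is why sensitivity and specificity are read separately.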
d. What is the area under the ROC curve for Victims? Interpret this value. Does the model do a better job of classifying victims than a random model?
The area under the ROC curve (AUC) for victims is 0.8124. This can be read as the probability that the model assigns a higher predicted probability to a randomly chosen victim than to a randomly chosen non-victim. A random model has an AUC of 0.5, so the model does a better job of classifying victims than a random model; values closer to 1 indicate better separation between the two classes.
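One way to see what an AUC value measures is to compute it directly as the share of victim/non-victim pairs in which the victim receives the higher score. The following Python sketch does this by brute force; the function name, scores, and labels are made up for illustration and are not taken from the case data.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (victim, non-victim) pairs ranked correctly.

    Ties between a victim score and a non-victim score count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and true labels (1 = victim)
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc_by_pairs(scores, labels), 4))
```

The pairwise loop is quadratic in the number of records, so it is only meant to make the definition concrete on small examples.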
e. What is the lift for the model at portion = 0.1 and at portion = 0.20? Interpret these values.
The lift at portion = 0.1 is 3.5: among the 10% of customers with the highest predicted probability of being a victim, the model identifies 3.5 times as many victims as random selection would. Similarly, at portion = 0.20 the lift is 2.75. This indicates the model's utility in identifying a larger proportion of victims than random guessing.
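Lift at a given portion compares the victim rate among the top-scored share of customers with the overall victim rate. The Python sketch below illustrates the calculation under that definition; the function name and the ten example records are hypothetical and unrelated to the case data.

```python
def lift_at(scores, labels, portion):
    """Lift = victim rate in the top `portion` of scores / overall victim rate."""
    n_top = max(1, int(round(portion * len(scores))))
    # Rank records from highest to lowest predicted probability
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical data: 10 customers, 3 of whom are victims (label 1)
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(lift_at(scores, labels, 0.2), 3))
print(round(lift_at(scores, labels, 0.5), 3))
```

Lift typically declines as the portion grows, because the top-scored group is diluted with lower-risk customers; at portion = 1.0 it is always 1.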
2. Use the Fit Model platform to create a logistic regression model for Victims? using the other variables as predictors (fit what you think is the best model).
a. Which variables are statistically significant in predicting the probability of being a victim? (5 points)
After fitting a logistic regression model using the Fit Model platform in JMP, the following variables were found to be statistically significant predictors of the probability of being an identity theft victim:
ExistAccountCC (p < 0.0001)
ExistAccountBank (p < 0.0001)
PersonalInfo (p < 0.0001)
DirectDeposit (p < 0.0001)
ATMCard (p = 0.0005)
PreauthorizedDebts (p < 0.0001)
ComputerBanking (p < 0.0001)
InPersonBanking (p = 0.0002)
PhoneBanking (p = 0.0004)
MonitorBankStmt (p < 0.0001)
MonitorCCStmt (p < 0.0001)
MonitorCredreport (p < 0.0001)
Murder (p = 0.0166)
Robbery (p = 0.0008)
Aggravated assault (p = 0.0002)
Burglary (p < 0.0001)
Larceny-theft (p < 0.0001)
Motor vehicle theft (p = 0.0026)
b. What is the misclassification rate for your final logistic model?
The misclassification rate for the final logistic model is 14.6%.
c. Compare the misclassification rates for the logistic model and the decision tree created above. Which model is better? Why?
The decision tree model has the higher misclassification rate at 25.1%, whereas the logistic regression model achieves a lower rate of 14.6%. Thus, the logistic regression model demonstrates superior predictive accuracy compared to the decision tree model.
d. Compare this model to the model produced using a classification tree. Which model would be easier to explain to a non-technical person? Why?
The classification tree model would be easier to explain to a non-technical person due to its visual representation of decision rules. However, the logistic regression model offers more detailed insights into the relationships between predictors and the probability of being an identity theft victim. It provides additional information such as odds ratios and confidence intervals for each predictor, which help assess the strength and direction of these relationships.
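As a rough illustration of how an odds ratio and its confidence interval relate to a logistic regression coefficient, the Python sketch below exponentiates a coefficient and its Wald interval limits. The function name, coefficient, and standard error are hypothetical values picked for the example, not estimates from the fitted model.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and approximate 95% Wald interval for a logistic coefficient.

    beta: estimated coefficient on the log-odds scale
    se:   standard error of that estimate
    z:    normal quantile (1.96 for a 95% interval)
    """
    estimate = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return estimate, lower, upper


# Hypothetical coefficient and standard error, for illustration only
or_est, or_lo, or_hi = odds_ratio_with_ci(beta=0.5, se=0.1)
print(f"odds ratio = {or_est:.3f}, 95% CI = ({or_lo:.3f}, {or_hi:.3f})")
```

An odds ratio above 1 with an interval that excludes 1 points to higher odds of victimization as the predictor increases; an odds ratio below 1 points to lower odds.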
e. Based on the models you created (decision tree and logistic), what are your major conclusions about the relationships between these variables and the probability of being an identity theft victim? (20 points)
Based on the models created (decision tree and logistic regression), the following major conclusions can be drawn regarding the relationships between variables and the probability of being an identity theft victim:
Possessing a credit card or bank account is associated with an increased probability of being an identity theft victim.
Providing personal information such as name, address, and social security number elevates the probability of identity theft victimization.
Receiving paychecks via direct deposit correlates with an increased likelihood of being an identity theft victim.
Owning an ATM card, having preauthorized debts, and utilizing computer, in-person, or phone banking services are associated with a higher probability of identity theft victimization.
Regularly monitoring bank statements, credit card statements, and credit reports is linked to a decreased probability of being an identity theft victim.
Residing in areas characterized by higher crime rates, including murder, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft, is correlated with an increased probability of being an identity theft victim.