Assignment 4
University of Michigan, Dearborn
DS 633 (Statistics)
Feb 20, 2024
Uploaded by monikagautam93
Assignment 4
1.
Problem 5.3 of the textbook. Consider Figure 5.17 in which a lift curve for the transaction data model is applied to new data.
a.
Interpret the meaning of the top curve (for Fraudulent = Yes), at portion = 0.2.
The lift at the 20% threshold is approximately 2.2, signifying that in the top 20% of data, sorted by predicted probabilities of fraudulence, there are about 2.2 times as many actual fraudulent records as one would anticipate based on the overall proportion of fraudulence. This is in comparison to randomly selecting records without considering the probabilities.
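As an illustration (not part of the assignment output), lift at a given portion can be computed directly from predicted probabilities. The data, fraud rate, and `lift_at` function below are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical scores and labels; roughly 6% fraud rate, loosely matching the example.
rng = np.random.default_rng(0)
n = 1000
actual = rng.binomial(1, 0.06, size=n)            # 1 = fraudulent
scores = rng.normal(loc=1.5 * actual, scale=1.0)  # an informative but imperfect model

def lift_at(scores, actual, portion):
    """Fraud rate in the top `portion` of scores, divided by the overall fraud rate."""
    k = int(len(scores) * portion)
    top = np.argsort(scores)[::-1][:k]            # indices of the k highest scores
    return actual[top].mean() / actual.mean()

print(lift_at(scores, actual, 0.2))               # > 1 means better than random selection
```

A lift of about 2.2 at portion 0.2 would mean the top 20% of records contains about 2.2 times as many frauds as random selection would yield.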
b.
Explain how you might use this information in practice.
Imagine a tax authority that needs to decide where to focus their efforts when investigating companies that may be submitting fraudulent tax returns. They have the capacity to audit only 20% of the companies. Instead of choosing companies randomly, they can opt to audit the top 20% of companies predicted to have the highest likelihood of fraudulent reporting, as indicated by the lift chart. Alternatively, to ensure fairness and maintain the possibility of auditing any company, they can set varying probabilities for selection, with those in the top segments having a much higher chance of being audited.
c.
Another analyst comments that you could improve the accuracy of the model by classifying everything as nonfraudulent. If you do that, what is the error rate?
If we were to classify everything as nonfraudulent, the resulting error rate would be

    Error rate = (Fraudulent cases misclassified as nonfraudulent) / (Total cases) = 62 / 1040 = 0.0596, or 5.96%,
and the accuracy would increase to 94.04% (1 - 0.0596 = 0.9404). However, it's important to note that in this scenario, the model loses its ability to effectively identify potentially fraudulent transactions. This means that while the error rate decreases and accuracy improves, the model becomes impractical for its intended purpose of detecting fraud.
Here's a summary of the classification confusion matrix:
                          Predicted Class
    Actual Class          1 (fraudulent)    0 (nonfraudulent)
    1 (fraudulent)        0                 32 + 30 = 62
    0 (nonfraudulent)     0                 920 + 58 = 978
In this case, all actual fraudulent transactions are misclassified as nonfraudulent, rendering the model ineffective for fraud detection.
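For completeness, the arithmetic behind the error rate can be checked in a few lines (a sketch using the counts from the confusion matrix above):

```python
# Counts from the confusion matrix: classifying everything as nonfraudulent.
fraud_misclassified = 32 + 30        # actual frauds, all predicted nonfraudulent
nonfraud_correct = 920 + 58          # actual nonfrauds, all predicted nonfraudulent
total = fraud_misclassified + nonfraud_correct   # 1040 records

error_rate = fraud_misclassified / total
accuracy = 1 - error_rate
print(round(error_rate, 4), round(accuracy, 4))  # 0.0596 0.9404
```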
d.
Comment on the usefulness, in this situation, of these two metrics of model performance (error rate and lift).
The primary objective of this analysis is to pinpoint fraudulent records. The overall “error rate” isn't particularly useful for assessing different methods of achieving this goal. What truly matters here is the capability to recognize records with a high likelihood of being fraudulent, and this is precisely what the lift metric quantifies. Using the lift metric allows you to systematically go through the records in order of their probability of being fraudulent. At each step, you gain insight into how much more likely you are to encounter a fraudulent record compared to randomly selecting records. In contrast, the “error rate” metric doesn't provide any insights into the efficiency of identifying fraudulent records. Since the majority of records are non-fraudulent, correctly classifying these non-fraudulent records contributes significantly to the overall error rate. It's entirely possible to achieve a much lower error rate by classifying everything as non-fraudulent, but such an approach is not practically useful for fraud detection.
2.
Consider the Universal Bank example from Assignment 3. Use the same Validation set that you used in Assignment 3: 60/40 Training/Validation set with seed 123.
a.
As you did in Assignment 3, fit a logistic regression model with all predictors except ID and Zip Code. Copy and paste the resulting Lift curve for Validation into Word.
b.
Now, recreate the Lift Curve on Validation Data "manually." That is, use Graph Builder rather than just the standard output. In addition to the final lift curve, also plot the Cumulative 1's vs. Proportion along with the baseline performance line (similar to what we did in class). (The distinction is that the Lift Curve is the ratio of the cumulative 1's to the reference line.) Use JMP for your work.
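The assignment calls for JMP's Graph Builder, but the same manual construction can be sketched in Python; the probabilities and labels below are hypothetical stand-ins for the validation-set output:

```python
import numpy as np

rng = np.random.default_rng(123)
actual = rng.binomial(1, 0.1, size=500)                        # hypothetical 0/1 outcomes
prob = np.clip(0.3 * actual + rng.uniform(0, 0.7, 500), 0, 1)  # hypothetical predicted probabilities

order = np.argsort(prob)[::-1]                  # sort records by predicted probability, descending
cum_ones = np.cumsum(actual[order])             # cumulative count of 1's
portion = np.arange(1, len(actual) + 1) / len(actual)
baseline = portion * actual.sum()               # reference line: 1's expected under random ordering
lift = cum_ones / baseline                      # the lift curve is the ratio of the two

# Plot portion vs. cum_ones (with the baseline) and portion vs. lift in Graph
# Builder or any plotting tool; here we just print a few points of the lift curve.
for p in (0.1, 0.2, 0.5, 1.0):
    i = int(p * len(actual)) - 1
    print(p, round(lift[i], 3))
```

By construction the curve ends at a lift of exactly 1.0 when all records are included.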
3.
Predicting Delayed Flights. The file FlightDelaysBinned_fixed_bins.jmp (the data are in the Assignment, as they are not provided with the textbook) contains information on all commercial flights departing the Washington, DC area and arriving at New York during January 2004. For each flight there is information on the departure and arrival airports, the distance of the route, the scheduled time and date of the flight, and so on. The variable that we are trying to predict, Flight Status, is whether a flight is delayed. A delay is defined as an arrival that is at least 15 minutes later than scheduled.
Data preprocessing (you do not have to reproduce this, and won't be tested on binning, though it is possible that it could be useful for a group project someday): The scheduled departure time (CRS_DEP_TIME) has been binned into 8 bins (CRS_DEP_TIME Groups). This avoids treating the departure time as a continuous predictor, because it is reasonable that delays are related to rush-hour times. (Note that the original data were not stored in JMP with a time format. To bin, there are multiple options in JMP; two options are (1) via the Formula Editor and (2) using the Make Binning Formula column utility.) An alternative to these included utilities is the Interactive Binner at https://community.jmp.com/docs/DOC-6237, which facilitates binning (click Add Cut points to create 8 bins, then use Edit Cut Points from the lower red triangle to edit the starting values for the bins); see figure below.
In this exercise you will compare different data mining approaches. (What is the first thing that should come to your mind then? Validation set! Make sure to create a Validation set. Let’s do a 60/40 Training/Validation set with seed 123.)
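Outside JMP (whose Make Validation Column utility does this for you), an equivalent 60/40 split with a fixed seed can be sketched as follows; the row count is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(123)     # seed 123, as in the assignment
n = 2201                             # hypothetical number of rows in the flight data
is_validation = rng.random(n) < 0.4  # ~40% validation, ~60% training
print(round(is_validation.mean(), 2))
```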
a.
Fit a classification tree to the flight delay variable using all the relevant predictors (use the binned version of the departure time) and the validation column. Do not include DEP TIME (actual departure time) in the model because it is unknown at the time of prediction (unless we are doing our predicting of delays after the plane takes off, which is unlikely).
i.
How many splits are in the final model?
30 Splits.
                 RSquare      N    Number of Splits
    Training       0.436   1321                  30
    Validation    -0.28     880
ii.
How many variables are involved in the splits?
10 Variables.
iii.
Which variables contribute the most to the model? (Hint: Use Column
Contributions.)
TAIL_NUM, CRS_DEP_TIME Groups, DAY_WEEK, Weather, CARRIER, DEST, DISTANCE, ORIGIN, DAY_OF_MONTH, FL_DATE.
    Term                  Number of Splits    G^2           Portion
    TAIL_NUM                     3            365.025697    0.6334
    CRS_DEP_TIME Groups          4             43.8703232   0.0761
    DAY_WEEK                     5             30.7848875   0.0534
    Weather                      1             29.0821506   0.0505
    CARRIER                      4             28.7227262   0.0498
    DEST                         3             19.6411747   0.0341
    DISTANCE                     2             18.6874494   0.0324
    ORIGIN                       4             18.0446013   0.0313
    DAY_OF_MONTH                 3             16.2826699   0.0283
    FL_DATE                      1              6.16176143  0.0107
    FL_NUM                       0              0           0.0000
iv.
Which variables were not involved in any of the splits?
FL_NUM
v.
Express the resulting tree as a set of rules. (That is, just present the Leaf Report.)
vi.
If you needed to fly between DCA and EWR on a Monday at 7 AM, would you be able to use this tree to predict whether the flight will be delayed? What other information would you need? Is this information available in practice? What information is redundant?
With only the route (DCA to EWR), day (Monday), and time (7 AM), the tree cannot be used directly to predict whether the flight will be delayed; values for the other predictors used in the splits are also required. In particular, you would need TAIL_NUM, CARRIER, Weather, DAY_OF_MONTH, and FL_DATE. The carrier, tail number, date, and day of month are available in practice once the flight is scheduled, but Weather is not known at the time of prediction. DISTANCE is redundant, since it is determined by the origin-destination pair.
b.
Fit a logistic regression model to this same data. Describe the approach that you took to building the logistic regression model (did you use any variable selection? It is totally up to you). Which variables are in your model? Also, what is the formula for the log odds of your target variable?
A nominal logistic regression approach was used.
Variables used in logistic regression:
TAIL_NUM, CRS_DEP_TIME Groups, DAY_WEEK, Weather, CARRIER, DEST, DISTANCE, ORIGIN, DAY_OF_MONTH, FL_DATE, FL_NUM.
Formula for log odds :
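The pasted JMP formula is not reproduced here; in general, a nominal logistic fit expresses the log odds of delay as a linear function of the predictors, with coefficients taken from JMP's Parameter Estimates report:

```latex
\log\frac{P(\text{Delayed})}{1-P(\text{Delayed})}
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k
```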
c.
Finally, compare the performance of the classification tree to the logistic regression model. Which performs better? What is your criterion?
Accuracy:

                           Training Data    Validation Data
    Classification Tree        87.21%           77.61%
    Logistic Regression        90.84%           74.89%
The classification tree performs slightly better in terms of validation accuracy: approximately 77.61%, versus approximately 74.89% for the logistic regression (although the logistic regression fits the training data better).
AUC-ROC:
The logistic regression model outperforms the classification tree in terms of AUC-ROC on the validation data: approximately 0.5774, versus approximately 0.5577 for the classification tree.
Misclassification Rate:
The classification tree has a lower misclassification rate on the validation data (approximately 22.34%) compared to the logistic regression model (approximately 25.11%).
RASE (Root Average Squared Error):
The classification tree has a lower RASE on the validation data (approximately 0.4149) than the logistic regression model (approximately 0.4728).
RSquare (U):
The RSquare (U) measure indicates that the classification tree performs better on the validation data, with a value of -0.284 for the classification tree and -2.078 for the logistic regression model. It's important to note that a higher RSquare (U) indicates better performance in this context.
Overall, based on the above criteria, the classification tree tends to outperform the logistic regression model in predicting flight delays, especially on the validation data. However, the choice between these two models should also consider the specific goals of the analysis, interpretability, and other relevant factors.
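The comparison criteria above can be computed from any model's validation predictions; a sketch with hypothetical probabilities and labels (stand-ins for one model's JMP output):

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.binomial(1, 0.2, size=880)       # 880 validation rows, ~20% delayed (hypothetical)
prob = np.clip(0.2 * actual + 0.8 * rng.uniform(size=880), 0, 1)  # hypothetical P(Delayed)
pred = (prob >= 0.5).astype(int)              # default 0.5 cutoff

accuracy = (pred == actual).mean()
misclassification = 1 - accuracy              # error rate
rase = np.sqrt(np.mean((actual - prob) ** 2)) # root average squared error
print(round(accuracy, 3), round(misclassification, 3), round(rase, 3))
```

Running this for each model's validation predictions yields the figures compared above; AUC-ROC can be read from JMP's ROC curve report.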