Assignment 9

DS 633 (Industrial Engineering), University of Michigan-Dearborn, Feb 20, 2024
1. Problem 9.3 Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.jmp contains data on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records with details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section sec-regtrees is a subset of this dataset.)

Data preprocessing. Split the data into training (50%), validation (30%), and test (20%) datasets. Run a regression tree with the output variable Price and input variables Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum split size to 1, and use the Split button repeatedly to create a full tree (hint: use the red triangle options to hide the tree and the graph). As you split, keep an eye on RMSE and RSquare for the training, validation, and test sets.

i. Describe what happens to the RSquare and RMSE for the training, validation and test sets as you continue to split the tree.

Splits   Training R2 / RASE     Validation R2 / RASE    Test R2 / RASE        AICc
  1      0.639 / 2171.4819      0.606 / 2178.3953       0.721 / 2034.3537     13076.7
  5      0.802 / 1608.639       0.781 / 1624.6516       0.850 / 1493.9806     12653.9
 10      0.867 / 1317.707       0.824 / 1456.1092       0.900 / 1219.2657     12377.8
 15      0.880 / 1248.6408      0.829 / 1437.3576       0.905 / 1186.1872     12310.9
 20      0.888 / 1208.1286      0.834 / 1414.5993       0.910 / 1153.4376     12274.1
 25      0.894 / 1173.7013      0.840 / 1390.5749       0.911 / 1147.7832     12243.3
 30      0.900 / 1144.0907      0.841 / 1382.6499       0.919 / 1097.2567     12217.5
 40      0.916 / 1048.2752      0.844 / 1369.8937       0.918 / 1104.0191     12114.2
 50      0.920 / 1021.0167      0.841 / 1384.3096       0.915 / 1120.5823     12099.3
 60      0.925 / 988.27733      0.833 / 1418.0472       0.915 / 1121.0064     12076.1
 80      0.929 / 959.13913      0.835 / 1408.9757       0.911 / 1150.417      12082.7
100      0.933 / 936.55618      0.837 / 1400.0907       0.908 / 1168.0779     12101.2

(N = 718 training, 431 validation, 287 test for every fit.)
As the tree grows from 1 to roughly 30 splits, RSquare rises and RASE (RMSE) falls sharply for all three sets: the added splits are capturing real structure in the data. Beyond about 30-40 splits the training RSquare keeps creeping upward (from 0.900 at 30 splits to 0.933 at 100), but the validation RSquare peaks near 40 splits (0.844) and then declines, and the test RSquare follows the same shape (peaking around 0.918-0.919 before slipping to 0.908 at 100 splits). The extra splits are fitting noise that is specific to the training data, so the model starts to overfit: it keeps getting better on cars it has already seen while its performance on new cars stalls and then deteriorates. The right tree size balances fit on the training data against performance on unseen data.

ii. How does the performance of the test set compare to the training and validation sets on these measures? Why does this occur?

The test set follows the same general pattern as the training and validation sets: large early gains, then diminishing and eventually negative returns after roughly 30-40 splits. Its RSquare is consistently higher (and its RASE lower) than the validation set's, and at moderate tree sizes it even exceeds the training fit, which is most likely just a consequence of how this particular random partition fell out. The important point is that, unlike the training measures, the validation and test measures stop improving once the tree begins modeling idiosyncrasies of the training records, because those details do not generalize to data the model has never seen. Monitoring these patterns ensures the model works well not just on the data it was built on but also on new cars.
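For readers without JMP, the same experiment can be sketched in Python with scikit-learn. This is only an illustrative analogue, not what produced the numbers above: the file name ToyotaCorolla.csv, the reduced predictor list, and the random 50/30/20 split are assumptions, and scikit-learn's CART will not reproduce JMP's Partition output exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("ToyotaCorolla.csv")                      # assumed CSV export of the JMP table
predictors = ["Age_08_04", "KM", "HP", "Doors", "Quarterly_Tax"]   # subset of the inputs, for brevity
X, y = df[predictors], df["Price"]

# 50% training, then 30%/20% validation/test, mirroring the JMP validation column
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=1)

for n_splits in [1, 5, 10, 20, 40, 100]:
    # a binary tree with k splits has k + 1 terminal nodes
    tree = DecisionTreeRegressor(max_leaf_nodes=n_splits + 1, random_state=1).fit(X_train, y_train)
    for name, Xs, ys in [("train", X_train, y_train), ("valid", X_val, y_val), ("test", X_test, y_test)]:
        pred = tree.predict(Xs)
        rmse = mean_squared_error(ys, pred) ** 0.5
        print(f"{n_splits:3d} splits  {name:5s}  R2 = {r2_score(ys, pred):.3f}  RMSE = {rmse:.0f}")
```

The printed train/validation/test curves show the same qualitative behavior as the JMP table: training error keeps falling with more splits while validation and test error bottom out and then rise.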
iii. Based on this tree, which are the most important car specifications for predicting the car's price?

Column Contributions

Term                Number of Splits    SS              Portion
Age_08_04           16                  7685656170      0.8798
HP                  8                   459235381       0.0526
KM                  20                  262445186       0.0300
Powered_Windows     9                   82264183        0.0094
Quarterly_Tax       5                   60911462.2      0.0070
Doors               9                   48787063.1      0.0056
Airco               7                   33903425.1      0.0039
Automatic_airco     1                   31965718.4      0.0037
Mfg_Guarantee       9                   23867564.3      0.0027
Sport_Model         4                   14827478.5      0.0017
CD_Player           2                   11128462.8      0.0013
Tow_Bar             6                   10267467.2      0.0012
Guarantee_Period    2                   5813137.94      0.0007
Automatic           1                   3052618.98      0.0003
Fuel_Type           1                   1732503.71      0.0002

Importance is judged by each variable's SS Portion, the share of the total explained sum of squares attributable to splits on that variable. Age_08_04 dominates, accounting for about 88% of the explained variation (older cars sell for less). Horsepower (HP) is next at roughly 5%, followed by the odometer reading KM at about 3%. Every other specification contributes less than 1%.

iv. Refit this model, and use the Go button to automatically split and prune the tree based on the validation RSquare. Save the prediction formula for this model to the data table.

The refit model was saved in the JMP file.

v. How many splits are in the final tree?

36 splits.

            RSquare    RASE         N      Number of Splits    AICc
Training    0.911      1074.441     718    36                  12140.6
Validation  0.845      1366.0136    431
Test        0.921      1083.7582    287
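The Go-button idea (grow the tree, then keep the size that maximizes validation RSquare) and the column-contribution report can be approximated in the same hypothetical scikit-learn sketch started above. Note that feature_importances_ is an impurity-based analogue of JMP's SS Portion, not the identical quantity.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# grow trees of increasing size and keep the size with the best validation RSquare
best_k, best_r2 = 1, -float("inf")
for k in range(1, 101):
    tree = DecisionTreeRegressor(max_leaf_nodes=k + 1, random_state=1).fit(X_train, y_train)
    r2_val = r2_score(y_val, tree.predict(X_val))
    if r2_val > best_r2:
        best_k, best_r2 = k, r2_val

final_tree = DecisionTreeRegressor(max_leaf_nodes=best_k + 1, random_state=1).fit(X_train, y_train)
print(f"chosen number of splits: {best_k}, validation R2: {best_r2:.3f}")

# impurity-based importances, normalized to sum to 1 (rough analogue of SS Portion)
importances = pd.Series(final_tree.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).round(4))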
vi. Compare RSquare and RMSE for the training, validation and test sets for the reduced model to the full model.

              Full Model                 Reduced Model
              RSquare    RASE            RSquare    RASE
Training      0.933      936.55618       0.911      1074.441
Validation    0.837      1400.0907       0.845      1366.0136
Test          0.908      1168.0779       0.921      1083.7582

On the training data the reduced model has a lower RSquare and a higher RASE than the full model, so the full tree fits the data it was built on more closely and explains more of its variance. On the validation and test sets, however, the reduced model has the higher RSquare and the lower RASE. The reduced model therefore generalizes better to new, unseen data.

vii. Which model is better for making predictions? Why?

The reduced model is preferred for making predictions. It has the lower prediction error (lower RASE/RMSE) on the validation and test sets, so its predictions for new data are more accurate, and it is also the simpler, more parsimonious model, achieving that accuracy without unnecessary complexity.
2. Problem 9.4. Predicting Used Car Prices (Bootstrap Forest and Boosted Trees). Return to the Toyota Corolla data, and refit the partition model. (Hint: use the Recall button in the partition dialog.) This time, choose Bootstrap Forest from the dialog window and use the default settings.

i. Compared to the final reduced tree above, how does the bootstrap forest behave in terms of overall error rate on the test set? Save the prediction formula for this model to the data table.

Reduced tree:       Test RSquare 0.921, Test RASE 1083.7582
Bootstrap forest:   Test RSquare 0.933, Test RASE 995.36793

The bootstrap forest performs slightly better on the test set: it has a higher RSquare (a better fit) and a lower RASE (a smaller prediction error) than the final reduced tree.

ii. Run the same model again, but this time choose Boosted Tree from the partition dialog. Use the default settings.

The boosted tree model was saved in the JMP file.

iii. How does the boosted tree behave in terms of the prediction error rate relative to the reduced model and the bootstrap forest? Save the prediction formula for this model to the data table.

The boosted tree outperforms both other models on all three datasets. Its test RSquare is 0.943, compared with 0.921 for the reduced tree and 0.933 for the bootstrap forest, and its test RASE is 920.85, lower than both the reduced tree (1083.76) and the bootstrap forest (995.37). This boosted tree, fit with 78 layers and a learning rate of 0.097, therefore gives the most accurate price predictions for this data.

iv. To facilitate comparison of error rates for the different models, use the Model Comparison platform (under Analyze > Modeling). To view fit statistics for the different models, put the validation column in the Group field in the Model Comparison dialog.

a. Which model performs best on the test set?

The boosted tree (Price Predictor 3) performs best on the test set. It has the highest test RSquare (0.9428), the lowest test RASE (920.85), and the lowest average absolute error (AAE) of the three models, so it makes the smallest prediction errors on the test data.

b. Explain why this model might have the best performance over the other models you fit.

Boosting builds a sequence of small trees, each one fitted to the residuals of the trees before it, so the ensemble keeps correcting the cases it predicts poorly while the small learning rate keeps it from overfitting. The evidence here is its lower error on the test set: the boosted tree predicts new, unseen cars more accurately than the single pruned tree or the bootstrap forest, which indicates that it captures the price patterns effectively and generalizes well.
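As a rough stand-in for JMP's Bootstrap Forest and Boosted Tree platforms, scikit-learn's random forest and gradient boosting regressors can be compared on the same held-out test data. This continues the hypothetical Corolla sketch; the default settings here are not identical to JMP's, so the numbers will differ from those reported above.

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

models = {
    "bootstrap forest (random forest)": RandomForestRegressor(n_estimators=100, random_state=1),
    "boosted tree (gradient boosting)": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                                                  random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name:33s} test R2 = {r2_score(y_test, pred):.3f}   test RMSE = {rmse:.0f}")
```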
3. Problem 9.5. Predicting Flight Delays (Bootstrap Forest and Boosted Trees). We return to the flight delays data for this exercise, and fit both a bootstrap forest and a boosted tree to the data. Use scheduled departure time (CRS_DEP_TIME) rather than the binned version for these models.

i. Fit a bootstrap forest, with the default settings. Save the formula for this model to the data table.

The model was created and the prediction formula saved in the JMP file.

a. Look at the column contributions report. Which variables were involved in the most splits?

Column Contributions

Term            Number of Splits    G^2            Portion
CRS_DEP_TIME    4252                116.871467     0.3246
DAY_WEEK        4289                103.927386     0.2886
CARRIER         1553                48.2123204     0.1339
Weather         105                 40.6662637     0.1129
DEST            808                 21.7255141     0.0603
DISTANCE        808                 17.0528003     0.0474
ORIGIN          714                 11.6116091     0.0322

Ranked by G^2 portion, CRS_DEP_TIME, DAY_WEEK, CARRIER, Weather, and DEST contribute the most; in raw split counts, DAY_WEEK and CRS_DEP_TIME are used far more often than the others. DISTANCE and ORIGIN are also involved in splits, but they contribute the least.

b. What is the error rate on the Validation set?

The validation misclassification rate is 0.1807.

Overall Statistics

Measure                   Training    Validation    Definition
Entropy RSquare           0.3455      0.0822        1 - Loglike(model)/Loglike(0)
Generalized RSquare       0.4605      0.1242        (1-(L(0)/L(model))^(2/n))/(1-L(0)^(2/n))
Mean -Log p               0.3225      0.4519        Σ -Log(ρ[j])/n
RASE                      0.3199      0.3724        √(Σ(y[j]-ρ[j])²/n)
Mean Abs Dev              0.2249      0.2642        Σ |y[j]-ρ[j]|/n
Misclassification Rate    0.1537      0.1807        Σ (ρ[j]≠ρMax)/n
N                         1321        880           n
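A comparable workflow for the flight-delays classification problem, again as a hedged scikit-learn sketch rather than the JMP analysis actually used: the file name FlightDelays.csv, the "Flight Status" column and its coding, and the 60/40 training/validation split are assumptions (the 60/40 ratio is chosen to match the N = 1321 / 880 counts above).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

flights = pd.read_csv("FlightDelays.csv")                  # assumed CSV export of the JMP table
X = pd.get_dummies(flights[["CRS_DEP_TIME", "DAY_WEEK", "CARRIER", "Weather",
                            "DEST", "DISTANCE", "ORIGIN"]], drop_first=True)
y = (flights["Flight Status"] == "delayed").astype(int)    # assumed target column and coding

# roughly 60/40 training/validation split
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.6, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
misclass = (forest.predict(X_va) != y_va).mean()           # validation error rate
print(f"validation misclassification rate: {misclass:.4f}")

# impurity-based importances as a rough analogue of the column contributions report
print(pd.Series(forest.feature_importances_, index=X.columns)
        .sort_values(ascending=False).head(5))
```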
ii. Fit a boosted tree to the flight delays data, again with the default settings. Save the formula to the data table.

The boosted tree model and its prediction formula were saved in the JMP file.

a. Which variables were involved in the most splits? Is this similar to what you observed with the bootstrap forest model?

Column Contributions

Term            Number of Splits    G^2            Portion
DEST            80                  39831.3746     0.2528
ORIGIN          65                  36443.3695     0.2313
DAY_WEEK        189                 28301.1903     0.1796
CARRIER         115                 16637.304      0.1056
CRS_DEP_TIME    148                 15413.2272     0.0978
DISTANCE        93                  15080.0041     0.0957
Weather         58                  5852.88494     0.0371

The top variables for the boosted tree are DEST, ORIGIN, DAY_WEEK, and CARRIER. This differs from the bootstrap forest, where CRS_DEP_TIME and DAY_WEEK dominated and DEST and ORIGIN contributed relatively little.

b. What is the error rate on the Validation set for this model?

The validation misclassification rate is 0.1818.
iii. Use the Model Comparison platform to compare these models to the final reduced model found earlier (again, put the validation column in the Group field in the Model Comparison dialog).

a. Which model has the lowest overall error rate on the validation set?

Bootstrap forest misclassification rate (validation): 0.1807
Boosted tree misclassification rate (validation): 0.1818
Partition misclassification rate (validation): 0.1807

The bootstrap forest and the partition model are tied for the lowest validation misclassification rate at 0.1807, just below the boosted tree's 0.1818. Taking the other fit measures into account, the bootstrap forest has the best overall error rate on the validation set.

b. Explain why this model might have the best performance over the other models you fit.

The bootstrap forest edges out the other models on the validation data. Beyond its 0.1807 misclassification rate, its Entropy RSquare and Generalized RSquare are higher and its RASE and Mean Absolute Deviation are lower than those of the competing models, indicating better-calibrated probability predictions. Its confusion matrix shows it predicts both delayed and on-time flights well, and it is especially accurate on on-time flights. Because a bootstrap forest averages many trees grown on bootstrap samples, it reduces the variance of any single tree and strikes a good balance between complexity and predictive power without overfitting, which is why it generalizes best to the validation data and is the standout choice among the models considered.
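To mimic what the Model Comparison platform reports, each fitted model can simply be scored on the same validation rows. The sketch below continues the hypothetical flight-delays example above; GradientBoostingClassifier stands in for JMP's Boosted Tree and its defaults differ from JMP's.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# stand-in for JMP's Boosted Tree, trained on the same assumed split as the forest
boosted = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

for name, model in [("bootstrap forest", forest), ("boosted tree", boosted)]:
    pred = model.predict(X_va)
    print(f"{name}: validation misclassification rate = {(pred != y_va).mean():.4f}")
    print(confusion_matrix(y_va, pred))                    # rows = actual class, columns = predicted
```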
4. Problem 13.2. eBay Auctions - Boosting and Bagging. Using the eBay auction data (file eBayAuctions.jmp) with variable Competitive as the target, partition the data into training (50%), validation (30%), and test (20%) sets.

i. Create a classification tree. Use the Go button to create the model. Looking at the test set, what is the overall accuracy? What is the lift at portion = 0.2? Save the prediction formula to the data table.

The overall accuracy on the test set is 1 - misclassification rate = 1 - 0.1041 = 0.8959. The lift at portion = 0.2 is approximately 1.8 for Competitive = 1 and approximately 2.0 for Competitive = 0.
ii. Run the same tree, but first select the Boosted Tree method. Use the default settings. For the test set, what is the overall accuracy? What is the lift at portion = 0.20? Save the prediction formula to the data table.

The overall accuracy on the test set is 1 - 0.1345 = 0.8655. The lift at portion = 0.2 is approximately 1.8 for Competitive = 1 and approximately 2.1 for Competitive = 0.

iii. Now try the same tree with the Bootstrap Forest method selected, and accept the default settings. What is the lift at portion = 0.2? Again, save the prediction formula to the data table.

The overall accuracy on the test set is 1 - misclassification rate = 1 - 0.0964 = 0.9036. The lift at portion = 0.2 is approximately 1.8 for Competitive = 1 and approximately 2.2 for Competitive = 0.
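For reference, "lift at portion = 0.2" can be computed directly from any model's predicted probabilities: rank the cases by the predicted probability of the target class, take the top 20%, and divide the hit rate in that slice by the overall rate. A generic sketch follows (the variable names are placeholders, not columns saved by JMP):

```python
import numpy as np

def lift_at_portion(y_true, p_hat, portion=0.2):
    """Lift of the top `portion` of cases when ranked by predicted probability."""
    y_true, p_hat = np.asarray(y_true), np.asarray(p_hat)
    n_top = int(np.ceil(portion * len(y_true)))
    top = np.argsort(-p_hat)[:n_top]                # indices of the highest-probability cases
    return y_true[top].mean() / y_true.mean()       # hit rate in the top slice vs. the base rate

# e.g. lift_at_portion(y_test, model.predict_proba(X_test)[:, 1]) returning about 1.8 means the
# top 20% of auctions ranked by the model contain roughly 1.8 times the base rate of competitive ones
```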
iv. Compare the three models using the Model Comparison platform under Analyze > Modeling. In the Model Comparison dialog, use the validation column as the By variable, but leave everything else blank. Compare the misclassification rates for the three models. Which model has the best accuracy? Compare the lift curves for the three models. Which model does the best job of sorting the response?

The bootstrap forest is the most accurate model on the validation set, with a misclassification rate of 0.1132, compared with 0.1216 for the partition model and 0.1318 for the boosted tree. The lift curves are broadly comparable across the three models, with lift values of roughly 1.8 to 2.2 at a portion of 0.2, so no model sorts the response dramatically better than the others, though the bootstrap forest is at least as good as the rest. Overall, the bootstrap forest does the best job of predicting whether an auction will be competitive and of ranking auctions by that response.
v. In the Model Comparison analysis window, select Model Averaging from the top red triangle. Return to the data table, and view the formulas for the new columns that have been saved. Describe how the models are being averaged.

Two new columns are produced: "Competitive? 0 Avg Predictor" (the probability that Competitive = 0) and "Competitive? 1 Avg Predictor" (the probability that Competitive = 1). Each formula is the average of the corresponding probability formulas from the three models, so for every row the saved value is the mean of the predicted probabilities produced by the partition, boosted tree, and bootstrap forest models.
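A minimal sketch of what those averaged formula columns compute, assuming three fitted scikit-learn-style models with predict_proba (the model names are hypothetical; JMP stores the average as a column formula rather than code):

```python
import numpy as np

def averaged_probabilities(models, X):
    """Average the models' predicted probabilities, as the saved Avg Predictor formulas do."""
    # each model contributes its P(Competitive = 1); the ensemble value is their plain mean
    p1 = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return np.column_stack([1.0 - p1, p1])          # columns: P(Competitive = 0), P(Competitive = 1)

# e.g. averaged_probabilities([tree_model, boosted_model, forest_model], X_test)
```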