Assignment 9
1.
Problem 9.3
Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.jmp contains the data on used cars (Toyota Corolla) on sale during late summer of 2004 in The Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section sec-regtrees is a subset of this dataset).
Data preprocessing. Split the data into training (50%), validation (30%), and test (20%) datasets.
Run a regression tree with the output variable Price and input variables Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum split size to 1, and use the split button repeatedly to create a full tree (hint: use the red triangle options to hide the tree and the graph). As you split, keep an eye on RMSE and RSquare for the training, validation, and test sets.
i.
Describe what happens to the RSquare and RMSE for the training, validation and test sets as you continue to split the tree.
Fit statistics at selected numbers of splits (N = 718 training, 431 validation, 287 test; Number of Splits and AICc are reported with the training fit):

| Number of Splits | Training RSquare | Training RASE | AICc | Validation RSquare | Validation RASE | Test RSquare | Test RASE |
|---|---|---|---|---|---|---|---|
| 1 | 0.639 | 2171.4819 | 13076.7 | 0.606 | 2178.3953 | 0.721 | 2034.3537 |
| 5 | 0.802 | 1608.639 | 12653.9 | 0.781 | 1624.6516 | 0.850 | 1493.9806 |
| 10 | 0.867 | 1317.707 | 12377.8 | 0.824 | 1456.1092 | 0.900 | 1219.2657 |
| 15 | 0.880 | 1248.6408 | 12310.9 | 0.829 | 1437.3576 | 0.905 | 1186.1872 |
| 20 | 0.888 | 1208.1286 | 12274.1 | 0.834 | 1414.5993 | 0.910 | 1153.4376 |
| 25 | 0.894 | 1173.7013 | 12243.3 | 0.840 | 1390.5749 | 0.911 | 1147.7832 |
| 30 | 0.900 | 1144.0907 | 12217.5 | 0.841 | 1382.6499 | 0.919 | 1097.2567 |
| 40 | 0.916 | 1048.2752 | 12114.2 | 0.844 | 1369.8937 | 0.918 | 1104.0191 |
| 50 | 0.920 | 1021.0167 | 12099.3 | 0.841 | 1384.3096 | 0.915 | 1120.5823 |
| 60 | 0.925 | 988.27733 | 12076.1 | 0.833 | 1418.0472 | 0.915 | 1121.0064 |
| 80 | 0.929 | 959.13913 | 12082.7 | 0.835 | 1408.9757 | 0.911 | 1150.417 |
| 100 | 0.933 | 936.55618 | 12101.2 | 0.837 | 1400.0907 | 0.908 | 1168.0779 |
As the tree grows from 1 to 100 splits, the training RSquare rises steadily (0.639 to 0.933) and the training RASE falls steadily (about 2171 to 937): a larger tree always fits the training data better. (JMP reports RASE, the root average squared error, which plays the role of RMSE here.) The validation and test sets improve quickly at first, but the gains level off after roughly 30 splits; validation RSquare peaks near 40 splits (0.844) and then drifts slightly downward, and test performance flattens in the same range. Beyond that point the additional splits are mostly fitting noise in the training data rather than real structure, which is the signature of overfitting: the model keeps getting better at predicting prices for cars it has already seen while gaining nothing, and eventually losing a little, on new cars. The tree size therefore needs to be chosen to balance training fit against performance on unseen data.
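Outside JMP, the same behavior can be reproduced with any tree library. The sketch below is a minimal illustration, not the JMP workflow itself: it assumes scikit-learn, a hypothetical ToyotaCorolla.csv export of the JMP table, and a handful of the input columns, and it prints training versus validation fit as the tree is allowed more splits.

```python
# Sketch: training vs. validation fit as a regression tree grows (assumes
# scikit-learn and a hypothetical "ToyotaCorolla.csv" export of the JMP table).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Fuel_Type", "Quarterly_Tax"]])
y = df["Price"]

# 50% training, 30% validation, 20% test, mirroring the assignment
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

for leaves in [2, 6, 11, 21, 31, 51, 101]:                  # roughly 1, 5, 10, ... splits
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1)
    tree.fit(X_train, y_train)
    for name, Xs, ys in [("train", X_train, y_train), ("valid", X_val, y_val)]:
        pred = tree.predict(Xs)
        rmse = np.sqrt(mean_squared_error(ys, pred))
        print(f"{leaves - 1:>3} splits  {name}  R2={r2_score(ys, pred):.3f}  RMSE={rmse:.0f}")
```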
ii.
How does the performance of the test set compare to the training and validation sets on these measures? Why does this occur?
On these measures the test set tracks the validation set closely: both improve rapidly up to roughly 30 to 40 splits and then level off, while the training RSquare keeps climbing and the training RASE keeps falling. At larger tree sizes the training set shows the best fit because the splits are chosen to minimize error on the training records themselves; the validation and test records play no role in growing the tree, so their error reflects how well the model generalizes. (In this particular partition the test set happens to score slightly better than the validation set, which is simply sampling variation in the random 50/30/20 split.) Monitoring the validation and test measures as the tree grows is what reveals the point at which further splits stop helping on new data.
iii.
Based on this tree, which are the most important car specifications for predicting the car's price?
Column Contributions

| Term | Number of Splits | SS | Portion |
|---|---|---|---|
| Age_08_04 | 16 | 7685656170 | 0.8798 |
| HP | 8 | 459235381 | 0.0526 |
| KM | 20 | 262445186 | 0.0300 |
| Powered_Windows | 9 | 82264183 | 0.0094 |
| Quarterly_Tax | 5 | 60911462.2 | 0.0070 |
| Doors | 9 | 48787063.1 | 0.0056 |
| Airco | 7 | 33903425.1 | 0.0039 |
| Automatic_airco | 1 | 31965718.4 | 0.0037 |
| Mfg_Guarantee | 9 | 23867564.3 | 0.0027 |
| Sport_Model | 4 | 14827478.5 | 0.0017 |
| CD_Player | 2 | 11128462.8 | 0.0013 |
| Tow_Bar | 6 | 10267467.2 | 0.0012 |
| Guarantee_Period | 2 | 5813137.94 | 0.0007 |
| Automatic | 1 | 3052618.98 | 0.0003 |
| Fuel_Type | 1 | 1732503.71 | 0.0002 |
The most important specifications are the ones with the largest values in the Portion column of the Column Contributions report, which gives each variable's share of the total sum of squares (SS) explained by the tree's splits. Age_08_04 dominates, accounting for about 88% of the explained variation: older cars sell for less. HP contributes about 5% and KM (kilometers driven) about 3%, and every remaining variable contributes under 1%.
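For reference, a tree library's impurity-based importances behave like the Portion column (each is a normalized share of the sum-of-squares reduction). The sketch below is an illustration under the same assumptions as the earlier one (scikit-learn, a hypothetical ToyotaCorolla.csv); it is not JMP's exact calculation.

```python
# Sketch: an importance measure analogous to JMP's Column Contributions "Portion"
# (each variable's normalized share of the sum-of-squares reduction).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Powered_Windows", "Quarterly_Tax"]])
y = df["Price"]

tree = DecisionTreeRegressor(max_leaf_nodes=101, random_state=1).fit(X, y)
portion = pd.Series(tree.feature_importances_, index=X.columns)
print(portion.sort_values(ascending=False).round(4))        # shares sum to 1.0
```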
iv.
Refit this model, and use the Go button to automatically split and prune the tree based on the validation RSquare. Save the prediction formula for this model to the data table.
The refit model and its prediction formula were saved to the JMP data table.
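Conceptually, the Go button grows the tree and keeps the size that maximizes validation RSquare. A minimal sketch of that selection loop, under the same assumptions as the earlier sketches (scikit-learn, a hypothetical ToyotaCorolla.csv, 50/30/20 split), is shown below.

```python
# Sketch of the idea behind the Go button: grow candidate trees of increasing
# size and keep the size with the best validation RSquare.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Quarterly_Tax"]])
y = df["Price"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, _, y_val, _ = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

scores = {}
for leaves in range(2, 101):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1).fit(X_train, y_train)
    scores[leaves] = r2_score(y_val, tree.predict(X_val))

best_leaves = max(scores, key=scores.get)
print("chosen size:", best_leaves - 1, "splits, validation R2 =", round(scores[best_leaves], 3))
```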
v.
How many splits are in the final tree?
36 splits.
| | RSquare | RASE | N | Number of Splits | AICc |
|---|---|---|---|---|---|
| Training | 0.911 | 1074.441 | 718 | 36 | 12140.6 |
| Validation | 0.845 | 1366.0136 | 431 | | |
| Test | 0.921 | 1083.7582 | 287 | | |
vi.
Compare RSquare and RMSE for the training, validation and test sets for the reduced model to the full model.
| | Full Model RSquare | Full Model RASE | Reduced Model RSquare | Reduced Model RASE |
|---|---|---|---|---|
| Training | 0.933 | 936.55618 | 0.911 | 1074.441 |
| Validation | 0.837 | 1400.0907 | 0.845 | 1366.0136 |
| Test | 0.908 | 1168.0779 | 0.921 | 1083.7582 |
In the training data, the reduced model shows a lower RSquare and a higher RMSE compared to the full model, indicating that the full model explains more variance and has a better fit for the known training data. However, for the Validation and Test sets, the reduced model outperforms the full model, showing higher RSquare and lower RMSE. This suggests that the reduced model generalizes better to new, unseen data, making it more effective for predictions on Validation and Test sets.
vii.
Which model is better for making predictions? Why?
The reduced model is preferred for making predictions. It has lower prediction error (lower RASE/RMSE) and higher RSquare on the validation and test sets, so its predictions on new data are more accurate, and it is also simpler (more parsimonious), with 36 splits instead of 100. Since the extra complexity of the full model only improves the fit to the training data, the reduced model is the better choice for practical prediction.
2.
Problem 9.4. Predicting Used Car Prices (Bootstrap Forest and Boosted Trees). Return to the Toyota Corolla data, and refit the partition model. (Hint: Use the recall button in the partition dialog). This time, choose bootstrap forest from the dialog window. Use the default settings.
i.
Compared to final reduced tree above, how does the bootstrap forest behave in terms of overall error rate on the test set? Save the prediction formula for this model to the data table.
Reduced Tree:
• Test RSquare: 0.921
• Test RASE: 1083.7582
Bootstrap Forest:
• Test RSquare: 0.933
• Test RASE: 995.36793
Comparing these values, we observe that the Bootstrap Forest performs slightly better on the test set. It has a higher RSquare (indicating a better fit) and a lower RASE (indicating lower prediction error) compared to the final reduced tree.
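For intuition, a bootstrap forest averages many trees, each grown on a bootstrap sample of the training data with a random subset of predictors considered at each split. The sketch below uses scikit-learn's RandomForestRegressor as a stand-in for JMP's Bootstrap Forest (defaults and results will differ from the JMP run), with the same hypothetical CSV and split as before.

```python
# Sketch of the bootstrap-forest idea: bagged trees on bootstrap samples with
# random feature selection, evaluated on the held-out test set.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Quarterly_Tax"]])
y = df["Price"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)
pred = forest.predict(X_test)
print("test R2  :", round(r2_score(y_test, pred), 3))
print("test RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 1))
```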
ii.
Run the same model again, but this time choose boosted tree from the partition dialog. Use the default settings.
The boosted tree model was saved to the JMP file.
iii.
How does the boosted tree behave in terms of the prediction error rate relative to the reduced model and the bootstrap forest? Save the prediction formula for this model to the data table.
The Boosted Tree model exhibits impressive performance in terms of error rates relative to both the reduced model and the Bootstrap Forest. Across the three datasets (Training, Validation, and Test), the Boosted Tree consistently outperforms the other models. The Test RSquare for the Boosted Tree is notably higher at 0.943, surpassing the reduced model's 0.921 and the Bootstrap Forest's 0.933. Furthermore, the Test RASE (Root Average Squared Error) for the Boosted Tree is lower at 920.85, indicating a reduced prediction error compared to the reduced model (1083.76) and the Bootstrap Forest (995.37). These results suggest that the Boosted Tree, with 78 layers and a learning rate
of 0.097, offers superior predictive accuracy for estimating used car prices based on the given specifications.
iv.
To facilitate comparison of error rates for the different models, use the Model Comparison platform (under Analyze > Modeling). To view fit statistics for the different models, put the validation column in the Group field in the Model Comparison dialog.
a.
Which model performs best on the test set?
The Boosted Tree model (Price Predictor 3) performs the best on the test set. It has the highest Test RSquare (0.9428), indicating a better fit to the test data than the other models, and the lowest Test RASE (920.85), indicating fewer prediction errors on the test set. It also has the lowest AAE (Average Absolute Error). Therefore, the Boosted Tree is the most effective model for predicting used car prices in this context.
b.
Explain why this model might have the best performance over the other models you fit.
The boosted tree builds a sequence of small trees, each one fit to the residual errors of the trees before it, and adds them together with a small learning rate (about 0.1 here). This stagewise correction lets it capture structure a single tree misses while keeping each step small enough to avoid chasing noise, so it tends to reduce both bias and variance relative to one large tree, and in this data it also edges out the bootstrap forest. The result is the lowest prediction error on the test set, which indicates that it generalizes best to new, unseen cars and makes it the strongest performer for predicting used car prices.
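A minimal sketch of that stagewise idea is below, using scikit-learn's GradientBoostingRegressor as a stand-in for JMP's Boosted Tree; the file name, columns, layer count, and learning rate are illustrative assumptions, not the values from the JMP run.

```python
# Sketch of boosting: each new shallow tree is fit to the residuals of the
# current ensemble and added with a small learning rate.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Quarterly_Tax"]])
y = df["Price"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

boost = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=3,
                                  random_state=1).fit(X_train, y_train)
pred = boost.predict(X_test)
print("test R2  :", round(r2_score(y_test, pred), 3))
print("test RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 1))
```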
3.
Problem 9.5.
Predicting Flight Delays (Bootstrap Forest and Boosted Trees). We return to the flight delays data for this exercise, and fit both a bootstrap forest and a boosted tree to the data. Use scheduled departure time (CRS_DEP_TIME) rather than the binned version for these models.
i.
Fit a bootstrap forest, with the default settings. Save the formula for this model to the data table.
The bootstrap forest model was created and its prediction formula saved to the JMP file.
a.
Look at the column contributions report. Which variables were involved in the most splits?
Column Contributions
DAY_WEEK (4289 splits) and CRS_DEP_TIME (4252 splits) are involved in by far the most splits, followed by CARRIER (1553). DEST, DISTANCE, and ORIGIN appear in several hundred splits each, and Weather in only 105, although Weather still accounts for a sizable share of the G^2 contribution. ORIGIN and DISTANCE contribute the least.
b.
What is the error rate on the Validation set?
The error rate (misclassification rate) on the validation set is 0.1807.
Overall Statistics

| Measure | Training | Validation | Definition |
|---|---|---|---|
| Entropy RSquare | 0.3455 | 0.0822 | 1 - Loglike(model)/Loglike(0) |
| Generalized RSquare | 0.4605 | 0.1242 | (1 - (L(0)/L(model))^(2/n)) / (1 - L(0)^(2/n)) |
| Mean -Log p | 0.3225 | 0.4519 | ∑ -Log(ρ[j])/n |
| RASE | 0.3199 | 0.3724 | √(∑(y[j]-ρ[j])²/n) |
| Mean Abs Dev | 0.2249 | 0.2642 | ∑ abs(y[j]-ρ[j])/n |
| Misclassification Rate | 0.1537 | 0.1807 | ∑ (ρ[j]≠ρMax)/n |
| N | 1321 | 880 | n |
Column Contributions

| Term | Number of Splits | G^2 | Portion |
|---|---|---|---|
| CRS_DEP_TIME | 4252 | 116.871467 | 0.3246 |
| DAY_WEEK | 4289 | 103.927386 | 0.2886 |
| CARRIER | 1553 | 48.2123204 | 0.1339 |
| Weather | 105 | 40.6662637 | 0.1129 |
| DEST | 808 | 21.7255141 | 0.0603 |
| DISTANCE | 808 | 17.0528003 | 0.0474 |
| ORIGIN | 714 | 11.6116091 | 0.0322 |
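The definitions in the Overall Statistics table can be computed directly from predicted probabilities. The sketch below is an illustration only: it assumes scikit-learn, a hypothetical FlightDelays.csv export, and a binary Flight_Status column coded as shown, and it reproduces the validation misclassification rate and RASE formulas (the numbers will not match the JMP report).

```python
# Sketch: validation misclassification rate and RASE for a bagged-tree
# classifier, following the definitions in the Overall Statistics table.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("FlightDelays.csv")                        # hypothetical file name
X = pd.get_dummies(df[["CRS_DEP_TIME", "DAY_WEEK", "CARRIER", "Weather",
                       "DEST", "DISTANCE", "ORIGIN"]])
y = (df["Flight_Status"] == "delayed").astype(int)          # hypothetical coding

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.6, random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

p_delayed = forest.predict_proba(X_val)[:, 1]               # rho[j] in the definitions
misclass = np.mean(forest.predict(X_val) != y_val)          # predicted class != actual
rase = np.sqrt(np.mean((y_val - p_delayed) ** 2))           # sqrt(sum((y - rho)^2)/n)
print("validation misclassification:", round(misclass, 4))
print("validation RASE             :", round(rase, 4))
```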
ii.
Fit a boosted tree to the flight delays data, again with the default settings. Save the formula to the data table.
Save the boosted tree model and prediction formula in jmp file.
a.
Which variables were involved in the most splits? Is this similar to what you observed with the bootstrap forest model?
Column Contributions

| Term | Number of Splits | G^2 | Portion |
|---|---|---|---|
| DEST | 80 | 39831.3746 | 0.2528 |
| ORIGIN | 65 | 36443.3695 | 0.2313 |
| DAY_WEEK | 189 | 28301.1903 | 0.1796 |
| CARRIER | 115 | 16637.304 | 0.1056 |
| CRS_DEP_TIME | 148 | 15413.2272 | 0.0978 |
| DISTANCE | 93 | 15080.0041 | 0.0957 |
| Weather | 58 | 5852.88494 | 0.0371 |
By number of splits, DAY_WEEK (189), CRS_DEP_TIME (148), and CARRIER (115) lead, which is broadly similar to the bootstrap forest. By G^2 contribution, however, DEST and ORIGIN dominate here, which differs from the bootstrap forest, where CRS_DEP_TIME and DAY_WEEK carried most of the contribution.
b.
What is the error rate on the Validation set for this model?
The error rate is 0.1818.
iii.
Use the Model Comparison platform to compare these models to the final reduced model found earlier (again, put the validation column in the Group field in the Model Comparison dialog).
a.
Which model has the lowest overall error rate on the validation set?
Bootstrap Forest Misclassification Rate (Validation): 0.1807
Boosted Tree Misclassification Rate (Validation): 0.1818
Partition Misclassification Rate (Validation): 0.1807
The Bootstrap Forest and the Partition model tie for the lowest validation misclassification rate at 0.1807, just below the Boosted Tree at 0.1818. Taking the other fit measures into account, the Bootstrap Forest is treated here as the model with the best overall error rate on the validation set.
b.
Explain why this model might have the best performance over the other models you fit.
The Bootstrap Forest model offers the best overall performance on the validation set. Its misclassification rate of 0.1807 matches the single Partition tree and beats the Boosted Tree (0.1818), and its other fit measures (Entropy RSquare, Generalized RSquare, RASE, and Mean Absolute Deviation) indicate a better fit and lower prediction error than the competing models. The confusion matrix shows that it predicts on-time flights especially well while still identifying a reasonable share of delayed flights. Mechanically, a bootstrap forest averages many trees grown on bootstrap samples of the training data, each considering a random subset of predictors at each split; this averaging reduces the variance of a single tree without greatly increasing bias, so it tends to generalize better and avoid overfitting. That balance of complexity and predictive power is what makes it the standout choice among the models considered.
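A minimal sketch of that side-by-side comparison is shown below, scoring one validation split with a single classification tree, a random forest (bootstrap-forest stand-in), and gradient boosting (boosted-tree stand-in); it assumes the same hypothetical FlightDelays.csv as the earlier sketch, so the numbers will not match the JMP report.

```python
# Sketch: compare validation misclassification rates for three classifiers,
# mirroring what the Model Comparison platform reports.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("FlightDelays.csv")                        # hypothetical file name
X = pd.get_dummies(df[["CRS_DEP_TIME", "DAY_WEEK", "CARRIER", "Weather",
                       "DEST", "DISTANCE", "ORIGIN"]])
y = (df["Flight_Status"] == "delayed").astype(int)          # hypothetical coding
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.6, random_state=1)

models = {
    "partition tree  ": DecisionTreeClassifier(max_depth=5, random_state=1),
    "bootstrap forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "boosted tree    ": GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    error = (model.predict(X_val) != y_val).mean()
    print(f"{name} validation misclassification = {error:.4f}")
```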
4.
Problem 13.2. eBay Auctions—Boosting and Bagging. Using the eBay auction data (file eBayAuctions.jmp) with variable Competitive as the target, partition the data into training (50%), validation (30%), and test sets (20%).
i.
Create a classification tree. Use the Go button to create the model. Looking at the test set, what is the overall accuracy? What is the lift at portion = 0.2? Save the prediction formula to the data table.
The overall accuracy on the test set is 1-Misclassification = 1 - 0.1041 = 0.8959.
The lift at portion = 0.2 for Competitive = 1 is approximately 1.8, and the lift for Competitive = 0 is approximately 2.0.
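Lift at portion 0.2 is the positive rate among the top 20% of records ranked by predicted probability, divided by the overall positive rate. The sketch below illustrates that calculation; it assumes scikit-learn, a hypothetical eBayAuctions.csv with a 0/1 Competitive column, and a simple classification tree, so the exact numbers will differ from the JMP output.

```python
# Sketch: overall accuracy and lift at portion = 0.2 for Competitive = 1.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("eBayAuctions.csv")                        # hypothetical file name
X = pd.get_dummies(df.drop(columns=["Competitive"]))
y = df["Competitive"]                                       # assumed 0/1 target

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)
tree = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X_train, y_train)

scores = pd.Series(tree.predict_proba(X_test)[:, 1], index=y_test.index)
top = scores.sort_values(ascending=False).head(int(0.2 * len(scores))).index
lift = y_test.loc[top].mean() / y_test.mean()               # top-20% rate / overall rate
accuracy = (tree.predict(X_test) == y_test).mean()
print("test accuracy:", round(accuracy, 4), " lift at 0.2:", round(lift, 2))
```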
ii.
Run the same tree, but first select the Boosted Tree method. Use the default settings. For the test set, what is the overall accuracy? What is the lift at portion = 0.20? Save the prediction formula to the data table.
The overall accuracy on the test set is 1 - 0.1345 = 0.8655. The lift at portion = 0.2 for Competitive = 1 is approximately 1.8, and the lift for Competitive = 0 is approximately 2.1.
iii.
Now try the same tree with the Bootstrap Forest method selected, and accept the default settings. What is the lift at portion = 0.2? Again, save the prediction formula to the data table.
The overall accuracy on the test set is 1 - Misclassification = 1 - 0.0964 = 0.9036. The lift at portion = 0.2 for Competitive = 1 is approximately 1.8, and the lift for Competitive = 0 is approximately 2.2.
iv.
Compare the three models using the Model Comparison platform under Analyze > Modeling. In the Model Comparison dialog, use the validation column as the By variable, but leave everything else blank. Compare the misclassification rates for the three models. Which model has the best accuracy? Compare the lift curves for the three models. Which model does the best job of sorting the response?
The Bootstrap Forest model is the most accurate on the validation set, with a misclassification rate of 0.1132, outperforming both the Partition model (0.1216) and the Boosted Tree model (0.1318). The lift curves, which depict each model's ability to sort the response variable, are comparable across the three models, with lift values around 1.8 to 2.2 at a portion of 0.2. These findings underscore the effectiveness of the Bootstrap Forest model in accurately classifying competitive auctions and in ranking records by their predicted response.
v.
In the Model Comparison analysis window, select Model Averaging from the top red triangle. Return to the data table, and view the formulas for the new columns that have been saved. Describe how the models are being averaged.
Two columns are produced: one for the probability that Competitive = 0 and another for the probability that Competitive = 1. Each formula is the average of the corresponding probability formulas from the three models, so for every row in the data table the new values are the simple, equally weighted average of the predicted probabilities from the Partition, Boosted Tree, and Bootstrap Forest models.
New columns: Competitive? 0 Avg Predictor and Competitive? 1 Avg Predictor.
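A minimal sketch of that averaging is shown below; the probability column names are placeholders for the formula columns JMP saves to the data table, not the actual saved names.

```python
# Sketch: build the Avg Predictor columns by averaging the three models'
# saved probability columns row by row (placeholder column names).
import pandas as pd

dt = pd.read_csv("eBayAuctions_with_predictions.csv")       # hypothetical export
prob_cols_1 = ["Prob Partition[1]", "Prob Boosted[1]", "Prob Forest[1]"]  # placeholders
dt["Competitive 1 Avg Predictor"] = dt[prob_cols_1].mean(axis=1)
dt["Competitive 0 Avg Predictor"] = 1 - dt["Competitive 1 Avg Predictor"]
```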