Assignment 9
1.
Problem 9.3
Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.jmp contains the data on used cars (Toyota Corolla) on sale during late summer of 2004 in The Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section sec-regtrees is a subset of this dataset).
Data preprocessing. Split the data into training (50%), validation (30%), and test (20%) datasets.
Run a regression tree with the output variable Price and input variables Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum split size to 1, and use the split button repeatedly to create a full tree (hint: use the red triangle options to hide the tree and the graph). As you split, keep an eye on RMSE and RSquare for the training, validation, and test sets.
i.
Describe what happens to the RSquare and RMSE for the training, validation and test sets as you continue to split the tree.
Fit statistics at selected numbers of splits (N = 718 training, 431 validation, 287 test; Number of Splits and AICc are reported with the training fit):

| Number of Splits | Training RSquare | Training RASE | AICc | Validation RSquare | Validation RASE | Test RSquare | Test RASE |
|---|---|---|---|---|---|---|---|
| 1 | 0.639 | 2171.4819 | 13076.7 | 0.606 | 2178.3953 | 0.721 | 2034.3537 |
| 5 | 0.802 | 1608.639 | 12653.9 | 0.781 | 1624.6516 | 0.850 | 1493.9806 |
| 10 | 0.867 | 1317.707 | 12377.8 | 0.824 | 1456.1092 | 0.900 | 1219.2657 |
| 15 | 0.880 | 1248.6408 | 12310.9 | 0.829 | 1437.3576 | 0.905 | 1186.1872 |
| 20 | 0.888 | 1208.1286 | 12274.1 | 0.834 | 1414.5993 | 0.910 | 1153.4376 |
| 25 | 0.894 | 1173.7013 | 12243.3 | 0.840 | 1390.5749 | 0.911 | 1147.7832 |
| 30 | 0.900 | 1144.0907 | 12217.5 | 0.841 | 1382.6499 | 0.919 | 1097.2567 |
| 40 | 0.916 | 1048.2752 | 12114.2 | 0.844 | 1369.8937 | 0.918 | 1104.0191 |
| 50 | 0.920 | 1021.0167 | 12099.3 | 0.841 | 1384.3096 | 0.915 | 1120.5823 |
| 60 | 0.925 | 988.27733 | 12076.1 | 0.833 | 1418.0472 | 0.915 | 1121.0064 |
| 80 | 0.929 | 959.13913 | 12082.7 | 0.835 | 1408.9757 | 0.911 | 1150.417 |
| 100 | 0.933 | 936.55618 | 12101.2 | 0.837 | 1400.0907 | 0.908 | 1168.0779 |
As the tree grows from 1 to 100 splits, the training RSquare rises steadily (0.639 to 0.933) and the training RASE falls steadily (about 2171 to 937): a larger tree always fits the training data better. (JMP reports RASE, the root average squared error, which plays the role of RMSE here.) The validation and test sets improve quickly at first, but the gains level off after roughly 30 splits; validation RSquare peaks near 40 splits (0.844) and then drifts slightly downward, and test performance flattens in the same range. Beyond that point the additional splits are mostly fitting noise in the training data rather than real structure, which is the signature of overfitting: the model keeps getting better at predicting prices for cars it has already seen while gaining nothing, and eventually losing a little, on new cars. The tree size therefore needs to be chosen to balance training fit against performance on unseen data.
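Outside JMP, the same behavior can be reproduced with any tree library. The sketch below is a minimal illustration, not the JMP workflow itself: it assumes scikit-learn, a hypothetical ToyotaCorolla.csv export of the JMP table, and a handful of the input columns, and it prints training versus validation fit as the tree is allowed more splits.

```python
# Sketch: training vs. validation fit as a regression tree grows (assumes
# scikit-learn and a hypothetical "ToyotaCorolla.csv" export of the JMP table).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Fuel_Type", "Quarterly_Tax"]])
y = df["Price"]

# 50% training, 30% validation, 20% test, mirroring the assignment
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

for leaves in [2, 6, 11, 21, 31, 51, 101]:                  # roughly 1, 5, 10, ... splits
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1)
    tree.fit(X_train, y_train)
    for name, Xs, ys in [("train", X_train, y_train), ("valid", X_val, y_val)]:
        pred = tree.predict(Xs)
        rmse = np.sqrt(mean_squared_error(ys, pred))
        print(f"{leaves - 1:>3} splits  {name}  R2={r2_score(ys, pred):.3f}  RMSE={rmse:.0f}")
```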
ii.
How does the performance of the test set compare to the training and validation sets on these measures? Why does this occur?
On these measures the test set tracks the validation set closely: both improve rapidly up to roughly 30 to 40 splits and then level off, while the training RSquare keeps climbing and the training RASE keeps falling. At larger tree sizes the training set shows the best fit because the splits are chosen to minimize error on the training records themselves; the validation and test records play no role in growing the tree, so their error reflects how well the model generalizes. (In this particular partition the test set happens to score slightly better than the validation set, which is simply sampling variation in the random 50/30/20 split.) Monitoring the validation and test measures as the tree grows is what reveals the point at which further splits stop helping on new data.
iii.
Based on this tree, which are the most important car specifications for predicting the car's price?
Column Contributions

| Term | Number of Splits | SS | Portion |
|---|---|---|---|
| Age_08_04 | 16 | 7685656170 | 0.8798 |
| HP | 8 | 459235381 | 0.0526 |
| KM | 20 | 262445186 | 0.0300 |
| Powered_Windows | 9 | 82264183 | 0.0094 |
| Quarterly_Tax | 5 | 60911462.2 | 0.0070 |
| Doors | 9 | 48787063.1 | 0.0056 |
| Airco | 7 | 33903425.1 | 0.0039 |
| Automatic_airco | 1 | 31965718.4 | 0.0037 |
| Mfg_Guarantee | 9 | 23867564.3 | 0.0027 |
| Sport_Model | 4 | 14827478.5 | 0.0017 |
| CD_Player | 2 | 11128462.8 | 0.0013 |
| Tow_Bar | 6 | 10267467.2 | 0.0012 |
| Guarantee_Period | 2 | 5813137.94 | 0.0007 |
| Automatic | 1 | 3052618.98 | 0.0003 |
| Fuel_Type | 1 | 1732503.71 | 0.0002 |
The most important specifications are the ones with the largest values in the Portion column of the Column Contributions report, which gives each variable's share of the total sum of squares (SS) explained by the tree's splits. Age_08_04 dominates, accounting for about 88% of the explained variation: older cars sell for less. HP contributes about 5% and KM (kilometers driven) about 3%, and every remaining variable contributes under 1%.
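For reference, a tree library's impurity-based importances behave like the Portion column (each is a normalized share of the sum-of-squares reduction). The sketch below is an illustration under the same assumptions as the earlier one (scikit-learn, a hypothetical ToyotaCorolla.csv); it is not JMP's exact calculation.

```python
# Sketch: an importance measure analogous to JMP's Column Contributions "Portion"
# (each variable's normalized share of the sum-of-squares reduction).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Powered_Windows", "Quarterly_Tax"]])
y = df["Price"]

tree = DecisionTreeRegressor(max_leaf_nodes=101, random_state=1).fit(X, y)
portion = pd.Series(tree.feature_importances_, index=X.columns)
print(portion.sort_values(ascending=False).round(4))        # shares sum to 1.0
```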
iv.
Refit this model, and use the Go button to automatically split and prune the tree based on the validation RSquare. Save the prediction formula for this model to the data table.
The refit model and its prediction formula were saved to the JMP data table.
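Conceptually, the Go button grows the tree and keeps the size that maximizes validation RSquare. A minimal sketch of that selection loop, under the same assumptions as the earlier sketches (scikit-learn, a hypothetical ToyotaCorolla.csv, 50/30/20 split), is shown below.

```python
# Sketch of the idea behind the Go button: grow candidate trees of increasing
# size and keep the size with the best validation RSquare.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Quarterly_Tax"]])
y = df["Price"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, _, y_val, _ = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

scores = {}
for leaves in range(2, 101):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1).fit(X_train, y_train)
    scores[leaves] = r2_score(y_val, tree.predict(X_val))

best_leaves = max(scores, key=scores.get)
print("chosen size:", best_leaves - 1, "splits, validation R2 =", round(scores[best_leaves], 3))
```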
v.
How many splits are in the final tree?
36 splits.
| | RSquare | RASE | N | Number of Splits | AICc |
|---|---|---|---|---|---|
| Training | 0.911 | 1074.441 | 718 | 36 | 12140.6 |
| Validation | 0.845 | 1366.0136 | 431 | | |
| Test | 0.921 | 1083.7582 | 287 | | |
vi.
Compare RSquare and RMSE for the training, validation and test sets for the reduced model to the full model.
| | Full Model RSquare | Full Model RASE | Reduced Model RSquare | Reduced Model RASE |
|---|---|---|---|---|
| Training | 0.933 | 936.55618 | 0.911 | 1074.441 |
| Validation | 0.837 | 1400.0907 | 0.845 | 1366.0136 |
| Test | 0.908 | 1168.0779 | 0.921 | 1083.7582 |
In the training data, the reduced model shows a lower RSquare and a higher RMSE compared to the full model, indicating that the full model explains more variance and has a better fit for the known training data. However, for the Validation and Test sets, the reduced model outperforms the full model, showing higher RSquare and lower RMSE. This suggests that the reduced model generalizes better to new, unseen data, making it more effective for predictions on Validation and Test sets.
vii.
Which model is better for making predictions? Why?
The reduced model is preferred for making predictions. It has lower prediction error (lower RASE/RMSE) and higher RSquare on the validation and test sets, so its predictions on new data are more accurate, and it is also simpler (more parsimonious), with 36 splits instead of 100. Since the extra complexity of the full model only improves the fit to the training data, the reduced model is the better choice for practical prediction.
2.
Problem 9.4. Predicting Used Car Prices (Bootstrap Forest and Boosted Trees). Return to the Toyota Corolla data, and refit the partition model. (Hint: Use the recall button in the partition dialog). This time, choose bootstrap forest from the dialog window. Use the default settings.
i.
Compared to final reduced tree above, how does the bootstrap forest behave in terms of overall error rate on the test set? Save the prediction formula for this model to the data table.
Reduced Tree:
• Test RSquare: 0.921
• Test RASE: 1083.7582
Bootstrap Forest:
• Test RSquare: 0.933
• Test RASE: 995.36793
Comparing these values, we observe that the Bootstrap Forest performs slightly better on the test set. It has a higher RSquare (indicating a better fit) and a lower RASE (indicating lower prediction error) compared to the final reduced tree.
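For intuition, a bootstrap forest averages many trees, each grown on a bootstrap sample of the training data with a random subset of predictors considered at each split. The sketch below uses scikit-learn's RandomForestRegressor as a stand-in for JMP's Bootstrap Forest (defaults and results will differ from the JMP run), with the same hypothetical CSV and split as before.

```python
# Sketch of the bootstrap-forest idea: bagged trees on bootstrap samples with
# random feature selection, evaluated on the held-out test set.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Quarterly_Tax"]])
y = df["Price"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)
pred = forest.predict(X_test)
print("test R2  :", round(r2_score(y_test, pred), 3))
print("test RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 1))
```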
ii.
Run the same model again, but this time choose boosted tree from the partition dialog. Use the default settings.
The boosted tree model was saved to the JMP file.
iii.
How does the boosted tree behave in terms of the prediction error rate relative to the reduced model and the bootstrap forest? Save the prediction formula for this model to the data table.
The Boosted Tree model exhibits impressive performance in terms of error rates relative to both the reduced model and the Bootstrap Forest. Across the three datasets (Training, Validation, and Test), the Boosted Tree consistently outperforms the other models. The Test RSquare for the Boosted Tree is notably higher at 0.943, surpassing the reduced model's 0.921 and the Bootstrap Forest's 0.933. Furthermore, the Test RASE (Root Average Squared Error) for the Boosted Tree is lower at 920.85, indicating a reduced prediction error compared to the reduced model (1083.76) and the Bootstrap Forest (995.37). These results suggest that the Boosted Tree, with 78 layers and a learning rate
of 0.097, offers superior predictive accuracy for estimating used car prices based on the given specifications.
iv.
To facilitate comparison of error rates for the different models, use the Model Comparison platform (under Analyze > Modeling). To view fit statistics for the different models, put the validation column in the Group field in the Model Comparison dialog.
a.
Which model performs best on the test set?
The Boosted Tree model (Price Predictor 3) performs the best on the test set. It has the highest Test RSquare (0.9428), indicating a better fit to the test data than the other models, and the lowest Test RASE (920.85), indicating fewer prediction errors on the test set. It also has the lowest AAE (Average Absolute Error). Therefore, the Boosted Tree is the most effective model for predicting used car prices in this context.
b.
Explain why this model might have the best performance over the other models you fit.
The boosted tree builds a sequence of small trees, each one fit to the residual errors of the trees before it, and adds them together with a small learning rate (about 0.1 here). This stagewise correction lets it capture structure a single tree misses while keeping each step small enough to avoid chasing noise, so it tends to reduce both bias and variance relative to one large tree, and in this data it also edges out the bootstrap forest. The result is the lowest prediction error on the test set, which indicates that it generalizes best to new, unseen cars and makes it the strongest performer for predicting used car prices.
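A minimal sketch of that stagewise idea is below, using scikit-learn's GradientBoostingRegressor as a stand-in for JMP's Boosted Tree; the file name, columns, layer count, and learning rate are illustrative assumptions, not the values from the JMP run.

```python
# Sketch of boosting: each new shallow tree is fit to the residuals of the
# current ensemble and added with a small learning rate.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("ToyotaCorolla.csv")                       # hypothetical file name
X = pd.get_dummies(df[["Age_08_04", "KM", "HP", "Quarterly_Tax"]])
y = df["Price"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, train_size=0.6, random_state=1)

boost = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=3,
                                  random_state=1).fit(X_train, y_train)
pred = boost.predict(X_test)
print("test R2  :", round(r2_score(y_test, pred), 3))
print("test RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 1))
```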
3.
Problem 9.5.
Predicting Flight Delays (Bootstrap Forest and Boosted Trees). We return to the flight delays data for this exercise, and fit both a bootstrap forest and a boosted tree to the data. Use scheduled departure time (CRS_DEP_TIME) rather than the binned version for these models.
i.
Fit a bootstrap forest, with the default settings. Save the formula for this model to the data table.
The bootstrap forest model was created and its prediction formula saved to the JMP file.
a.
Look at the column contributions report. Which variables were involved in the most splits?
Column Contributions
DAY_WEEK (4289 splits) and CRS_DEP_TIME (4252 splits) are involved in by far the most splits, followed by CARRIER (1553). DEST, DISTANCE, and ORIGIN appear in several hundred splits each, and Weather in only 105, although Weather still accounts for a sizable share of the G^2 contribution. ORIGIN and DISTANCE contribute the least.
b.
What is the error rate on the Validation set?
The error rate (misclassification rate) on the validation set is 0.1807.
Overall Statistics

| Measure | Training | Validation | Definition |
|---|---|---|---|
| Entropy RSquare | 0.3455 | 0.0822 | 1 - Loglike(model)/Loglike(0) |
| Generalized RSquare | 0.4605 | 0.1242 | (1 - (L(0)/L(model))^(2/n)) / (1 - L(0)^(2/n)) |
| Mean -Log p | 0.3225 | 0.4519 | ∑ -Log(ρ[j])/n |
| RASE | 0.3199 | 0.3724 | √(∑(y[j]-ρ[j])²/n) |
| Mean Abs Dev | 0.2249 | 0.2642 | ∑ abs(y[j]-ρ[j])/n |
| Misclassification Rate | 0.1537 | 0.1807 | ∑ (ρ[j]≠ρMax)/n |
| N | 1321 | 880 | n |
Column Contributions

| Term | Number of Splits | G^2 | Portion |
|---|---|---|---|
| CRS_DEP_TIME | 4252 | 116.871467 | 0.3246 |
| DAY_WEEK | 4289 | 103.927386 | 0.2886 |
| CARRIER | 1553 | 48.2123204 | 0.1339 |
| Weather | 105 | 40.6662637 | 0.1129 |
| DEST | 808 | 21.7255141 | 0.0603 |
| DISTANCE | 808 | 17.0528003 | 0.0474 |
| ORIGIN | 714 | 11.6116091 | 0.0322 |
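The definitions in the Overall Statistics table can be computed directly from predicted probabilities. The sketch below is an illustration only: it assumes scikit-learn, a hypothetical FlightDelays.csv export, and a binary Flight_Status column coded as shown, and it reproduces the validation misclassification rate and RASE formulas (the numbers will not match the JMP report).

```python
# Sketch: validation misclassification rate and RASE for a bagged-tree
# classifier, following the definitions in the Overall Statistics table.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("FlightDelays.csv")                        # hypothetical file name
X = pd.get_dummies(df[["CRS_DEP_TIME", "DAY_WEEK", "CARRIER", "Weather",
                       "DEST", "DISTANCE", "ORIGIN"]])
y = (df["Flight_Status"] == "delayed").astype(int)          # hypothetical coding

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.6, random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

p_delayed = forest.predict_proba(X_val)[:, 1]               # rho[j] in the definitions
misclass = np.mean(forest.predict(X_val) != y_val)          # predicted class != actual
rase = np.sqrt(np.mean((y_val - p_delayed) ** 2))           # sqrt(sum((y - rho)^2)/n)
print("validation misclassification:", round(misclass, 4))
print("validation RASE             :", round(rase, 4))
```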
ii.
Fit a boosted tree to the flight delays data, again with the default settings. Save the formula to the data table.
Save the boosted tree model and prediction formula in jmp file.
a.
Which variables were involved in the most splits? Is this similar to what you observed with the bootstrap forest model?
Column Contributions

| Term | Number of Splits | G^2 | Portion |
|---|---|---|---|
| DEST | 80 | 39831.3746 | 0.2528 |
| ORIGIN | 65 | 36443.3695 | 0.2313 |
| DAY_WEEK | 189 | 28301.1903 | 0.1796 |
| CARRIER | 115 | 16637.304 | 0.1056 |
| CRS_DEP_TIME | 148 | 15413.2272 | 0.0978 |
| DISTANCE | 93 | 15080.0041 | 0.0957 |
| Weather | 58 | 5852.88494 | 0.0371 |
By number of splits, DAY_WEEK (189), CRS_DEP_TIME (148), and CARRIER (115) lead, which is broadly similar to the bootstrap forest. By G^2 contribution, however, DEST and ORIGIN dominate here, which differs from the bootstrap forest, where CRS_DEP_TIME and DAY_WEEK carried most of the contribution.
b.
What is the error rate on the Validation set for this model?
The error rate is 0.1818.
iii.
Use the Model Comparison platform to compare these models to the final reduced model found earlier (again, put the validation column in the Group field in the Model Comparison dialog).
a.
Which model has the lowest overall error rate on the validation set?
Bootstrap Forest Misclassification Rate (Validation): 0.1807
Boosted Tree Misclassification Rate (Validation): 0.1818
Partition Misclassification Rate (Validation): 0.1807
The Bootstrap Forest and the Partition model tie for the lowest validation misclassification rate at 0.1807, just below the Boosted Tree at 0.1818. Taking the other fit measures into account, the Bootstrap Forest is treated here as the model with the best overall error rate on the validation set.
b.
Explain why this model might have the best performance over the other models you fit.
The Bootstrap Forest model offers the best overall performance on the validation set. Its misclassification rate of 0.1807 matches the single Partition tree and beats the Boosted Tree (0.1818), and its other fit measures (Entropy RSquare, Generalized RSquare, RASE, and Mean Absolute Deviation) indicate a better fit and lower prediction error than the competing models. The confusion matrix shows that it predicts on-time flights especially well while still identifying a reasonable share of delayed flights. Mechanically, a bootstrap forest averages many trees grown on bootstrap samples of the training data, each considering a random subset of predictors at each split; this averaging reduces the variance of a single tree without greatly increasing bias, so it tends to generalize better and avoid overfitting. That balance of complexity and predictive power is what makes it the standout choice among the models considered.
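A minimal sketch of that side-by-side comparison is shown below, scoring one validation split with a single classification tree, a random forest (bootstrap-forest stand-in), and gradient boosting (boosted-tree stand-in); it assumes the same hypothetical FlightDelays.csv as the earlier sketch, so the numbers will not match the JMP report.

```python
# Sketch: compare validation misclassification rates for three classifiers,
# mirroring what the Model Comparison platform reports.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("FlightDelays.csv")                        # hypothetical file name
X = pd.get_dummies(df[["CRS_DEP_TIME", "DAY_WEEK", "CARRIER", "Weather",
                       "DEST", "DISTANCE", "ORIGIN"]])
y = (df["Flight_Status"] == "delayed").astype(int)          # hypothetical coding
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.6, random_state=1)

models = {
    "partition tree  ": DecisionTreeClassifier(max_depth=5, random_state=1),
    "bootstrap forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "boosted tree    ": GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    error = (model.predict(X_val) != y_val).mean()
    print(f"{name} validation misclassification = {error:.4f}")
```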
4.
Problem 13.2. eBay Auctions—Boosting and Bagging. Using the eBay auction data (file eBayAuctions.jmp) with variable Competitive as the target, partition the data into training (50%), validation (30%), and test sets (20%).
i.
Create a classification tree. Use the Go button to create the model. Looking at the test set, what is the overall accuracy? What is the lift at portion = 0.2? Save the prediction formula to the data table.
The overall accuracy on the test set is 1-Misclassification = 1 - 0.1041 = 0.8959.
The lift at portion = 0.2 for Competitive = 1 is approximately 1.8, and the lift for Competitive = 0 is approximately 2.0.
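Lift at portion 0.2 is the positive rate among the top 20% of records ranked by predicted probability, divided by the overall positive rate. The sketch below illustrates that calculation; it assumes scikit-learn, a hypothetical eBayAuctions.csv with a 0/1 Competitive column, and a simple classification tree, so the exact numbers will differ from the JMP output.

```python
# Sketch: overall accuracy and lift at portion = 0.2 for Competitive = 1.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("eBayAuctions.csv")                        # hypothetical file name
X = pd.get_dummies(df.drop(columns=["Competitive"]))
y = df["Competitive"]                                       # assumed 0/1 target

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)
tree = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X_train, y_train)

scores = pd.Series(tree.predict_proba(X_test)[:, 1], index=y_test.index)
top = scores.sort_values(ascending=False).head(int(0.2 * len(scores))).index
lift = y_test.loc[top].mean() / y_test.mean()               # top-20% rate / overall rate
accuracy = (tree.predict(X_test) == y_test).mean()
print("test accuracy:", round(accuracy, 4), " lift at 0.2:", round(lift, 2))
```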
ii.
Run the same tree, but first select the Boosted Tree method. Use the default settings. For the test set, what is the overall accuracy? What is the lift at portion = 0.20? Save the prediction formula to the data table.
The overall accuracy on the test set is 1 - 0.1345 = 0.8655. The lift at portion = 0.2 for Competitive = 1 is approximately 1.8, and the lift for Competitive = 0 is approximately 2.1.
iii.
Now try the same tree with the Bootstrap Forest method selected, and accept the default settings. What is the lift at portion = 0.2? Again, save the prediction formula to the data table.
The overall accuracy on the test set is 1 - Misclassification = 1 - 0.0964 = 0.9036. The lift at portion = 0.2 for Competitive = 1 is approximately 1.8, and the lift for Competitive = 0 is approximately 2.2.
iv.
Compare the three models using the Model Comparison platform under Analyze > Modeling. In the Model Comparison dialog, use the validation column as the By variable, but leave everything else blank. Compare the misclassification rates for the three models. Which model has the best accuracy? Compare the lift curves for the three models. Which model does the best job of sorting the response?
The Bootstrap Forest model is the most accurate on the validation set, with a misclassification rate of 0.1132, outperforming both the Partition model (0.1216) and the Boosted Tree model (0.1318). The lift curves, which depict each model's ability to sort the response variable, are comparable across the three models, with lift values around 1.8 to 2.2 at a portion of 0.2. These findings underscore the effectiveness of the Bootstrap Forest model in accurately classifying competitive auctions and in ranking records by their predicted response.
v.
In the Model Comparison analysis window, select Model Averaging from the top red triangle. Return to the data table, and view the formulas for the new columns that have been saved. Describe how the models are being averaged.
Two columns are produced: one for the probability that Competitive = 0 and another for the probability that Competitive = 1. Each formula is the average of the corresponding probability formulas from the three models, so for every row in the data table the new values are the simple, equally weighted average of the predicted probabilities from the Partition, Boosted Tree, and Bootstrap Forest models.
New columns: Competitive? 0 Avg Predictor and Competitive? 1 Avg Predictor.
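A minimal sketch of that averaging is shown below; the probability column names are placeholders for the formula columns JMP saves to the data table, not the actual saved names.

```python
# Sketch: build the Avg Predictor columns by averaging the three models'
# saved probability columns row by row (placeholder column names).
import pandas as pd

dt = pd.read_csv("eBayAuctions_with_predictions.csv")       # hypothetical export
prob_cols_1 = ["Prob Partition[1]", "Prob Boosted[1]", "Prob Forest[1]"]  # placeholders
dt["Competitive 1 Avg Predictor"] = dt[prob_cols_1].mean(axis=1)
dt["Competitive 0 Avg Predictor"] = 1 - dt["Competitive 1 Avg Predictor"]
```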