ISYE 7406 Homework 5: Ensemble Methods: Random Forest and Boosting

Introduction
The cleaned Auto MPG dataset consists of 8 variables describing 392 cars. The binary variable 'mpg01' is created to indicate whether the car's mpg is above or equal to the median (1) or below the median (0). This report aims to determine methods that can be used to guide the manufacturing or buying of high-gas-mileage cars.

Objectives
This report uses the classification methods below to predict mpg01:
1. Random forest
2. Boosting
3. LDA
4. QDA
5. Naïve Bayes
6. Logistic Regression
7. KNN with several values of K

For random forest and boosting, parameter tuning is done using cross-validation on the training data:
o Using B=100 loops, the mean and variance of the train error for each combination of tuning parameters are captured and compared.
o The tuning parameter combination with the lower train error is used to build the model, which is then compared with the other methods.

The models are then evaluated using the methods below:
o Train/Test error = (False Positives + False Negatives) / Number of predictions
o Specificity = TN / (TN + FP)
o Sensitivity = TP / (TP + FN)
o Monte Carlo Cross Validation using B=100 loops:
  ▪ Compare the mean and variance of the train/test errors for each model
  ▪ Compare the mean of the specificity and sensitivity for each model
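For reference, the sketch below shows how these quantities can be computed from a confusion matrix inside each cross-validation loop (object names such as predtest and testtemp are illustrative; with table(predicted, actual), rows index the predicted classes and columns the true classes). The full loops are in the Appendix.

# Per-loop metrics from a confusion matrix (illustrative sketch)
confusion_matrix <- table(predtest, testtemp$mpg01)   # rows = predicted, cols = actual
TP <- confusion_matrix[2, 2]; TN <- confusion_matrix[1, 1]
FN <- confusion_matrix[1, 2]; FP <- confusion_matrix[2, 1]
test_error  <- (FP + FN) / sum(confusion_matrix)      # misclassification rate
specificity <- TN / (TN + FP)
sensitivity <- TP / (TP + FN)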
Exploratory Data Analysis
The 8 variables in the dataset are described below:

Table 1: Overview of the variables in the cleaned dataset
Variable name   Description                                                          Type
mpg01           0 if the car's mpg is below the median, 1 if it is above the median  Binary (True/False)
cylinders       Number of cylinders between 4 and 8                                  Numerical
displacement    Engine displacement (cu. inches)                                     Numerical
horsepower      Engine horsepower                                                    Numerical
weight          Vehicle weight (lbs.)                                                Numerical
acceleration    Time to accelerate from 0 to 60 mph (sec.)                           Numerical
year            Model year (modulo 100)                                              Numerical
origin          Origin of car (1. American, 2. European, 3. Japanese)                Categorical

As the origin variable is categorical, one-hot encoding is applied to create 3 columns, one for each origin category. The median value of the original mpg variable is 22.75. Creating the mpg01 variable based on the median value ensures an equal split of 196 'TRUE' and 196 'FALSE' values, which eliminates the need to balance the dataset for classification modelling.

Figure 1: Original mpg variable, with median line shown.
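A minimal sketch of this preprocessing, condensed from the Appendix code (df is the cleaned Auto data frame read there):

mpg01 <- I(df$mpg >= median(df$mpg))                     # TRUE/FALSE split at the median (22.75)
df <- data.frame(mpg01, df[, -1])                        # replace mpg with mpg01
df$origin <- factor(df$origin)
df <- cbind(df, model.matrix(~ origin - 1, data = df))   # one-hot columns origin1..origin3
df <- df[, !colnames(df) %in% c("origin")]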
Figure 2: Boxplot of the auto dataset

The boxplots show that there are some outliers present in the data, particularly in the variable 'weight.' The scales of the variables also differ: 'weight' is at least 10 times larger in magnitude than the other variables, which could impact the classification models.

Table 2: Descriptive stats of variables
Variable       Min    Median   Max
cylinders      3      4        8
displacement   68     151      455
horsepower     46     93.5     230
weight         1613   2804     5140
acceleration   8      15.5     24.8
year           70     76       82
origin1        0      1        1
origin2        0      0        1
origin3        0      0        1
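Scaling was not applied in this report, but if the distance-based methods used later (e.g., KNN) were re-run on standardized predictors, a minimal sketch would be (df_scaled is an illustrative name):

num_vars <- c("cylinders", "displacement", "horsepower", "weight", "acceleration", "year")
df_scaled <- df
df_scaled[num_vars] <- scale(df[num_vars])   # center and scale each numeric predictor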
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Figure 3: Correlation heatmap of auto dataset

Figure 3 above shows the correlation heatmap of the auto dataset; the underlying correlation matrix is reproduced below. Based on the correlation coefficients:
• mpg01 has strong negative correlation (<-0.7) with:
  o cylinders
  o displacement
  o weight
• mpg01 has moderate correlation with:
  o Moderate positive correlation (0.5 to 0.7) with origin1
  o Moderate negative correlation (-0.5 to -0.7) with horsepower
• mpg01 has low correlation with:
  o acceleration
  o year
  o origin2, origin3

Correlation matrix:
               mpg01  cylinders  displacement  horsepower  weight  acceleration   year  origin1  origin2  origin3
mpg01           1.00      -0.76         -0.75       -0.67   -0.76          0.35   0.43    -0.53     0.27     0.39
cylinders      -0.76       1.00          0.95        0.84    0.90         -0.50  -0.35     0.61    -0.35    -0.40
displacement   -0.75       0.95          1.00        0.90    0.93         -0.54  -0.37     0.66    -0.37    -0.44
horsepower     -0.67       0.84          0.90        1.00    0.86         -0.69  -0.42     0.49    -0.28    -0.32
weight         -0.76       0.90          0.93        0.86    1.00         -0.42  -0.31     0.60    -0.29    -0.45
acceleration    0.35      -0.50         -0.54       -0.69   -0.42          1.00   0.29    -0.26     0.21     0.12
year            0.43      -0.35         -0.37       -0.42   -0.31          0.29   1.00    -0.14    -0.04     0.20
origin1        -0.53       0.61          0.66        0.49    0.60         -0.26  -0.14     1.00    -0.59    -0.65
origin2         0.27      -0.35         -0.37       -0.28   -0.29          0.21  -0.04    -0.59     1.00    -0.23
origin3         0.39      -0.40         -0.44       -0.32   -0.45          0.12   0.20    -0.65    -0.23     1.00

In addition, the correlation heatmap also suggests multicollinearity in the data, as the four variables cylinders, displacement, horsepower and weight have strong positive correlation (>0.7) with each other. Most notably, cylinders and displacement have the strongest correlation (0.95), followed by displacement and weight (0.93) and displacement and horsepower (0.90).
Table 3 shows the Variance Inflation Factors (VIF) calculated to determine whether multicollinearity is present in the data. A VIF above 5 indicates high collinearity and a VIF above 10 indicates very high collinearity. The barplot in Figure 4 clearly shows that 'cylinders' and 'weight' have high collinearity (VIF>5) and 'displacement' has very high collinearity (VIF>10) with the other predictors in the model. Multicollinearity needs to be addressed as it could impact the classification models built, potentially causing instability in the coefficient estimates. There is therefore a need to select features.

Figure 4: Barplot of VIF of all variables

Table 3: VIF of all variables
Variable       VIF
cylinders      5.41
displacement   11.6
horsepower     2.69
weight         6.60
acceleration   2.52
year           2.16
origin1        2.94
origin2        1.96
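The VIFs are obtained with the car package, as in the Appendix; a condensed sketch is shown below (one dummy, origin3, is left out of the model so that the one-hot columns are not perfectly collinear):

library(car)
model_reg <- glm(mpg01 ~ cylinders + displacement + horsepower + weight +
                   acceleration + year + origin1 + origin2, data = df)
vif(model_reg)   # VIF > 5: high collinearity; VIF > 10: very high collinearity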
Variable selection

To select the variables to use in the model, stepwise variable selection using AIC (Akaike Information Criterion) is run for B=100 loops (a sketch of one selection step is shown after Table 4). Figure 5 shows the number of times each variable was selected out of the 100 loops. 'weight' and 'year' are consistently selected, followed by horsepower (91 out of 100 times).

Figure 5: Number of times each variable was selected in the model using stepwise variable selection

From Figure 5, the variables horsepower, weight, year and origin1 were selected in over half of the loops (>50 counts). Referring to Table 4, which shows the VIF values of the model refit with only these selected variables, the VIFs are all <5.0, indicating that there is no problematic multicollinearity amongst the predictors in this reduced model. As such, the models are built using the variables horsepower + weight + year + origin1.

Table 4: VIF only with horsepower, weight, year, origin1 variables
Variable     VIF
horsepower   1.14
weight       1.64
year         1.63
origin1      1.16
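A minimal sketch of one selection loop, condensed from the Appendix (traintemp is that loop's training split; step() performs stepwise selection based on AIC, and trace = 0 suppresses the per-step output):

mod_step <- step(glm(mpg01 ~ ., family = binomial, data = traintemp), trace = 0)
selected_vars <- colnames(mod_step$model)[-1]   # drop the response column, keep the chosen predictors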
Methods: Monte-Carlo Cross validation

As a single split into train and test data is not sufficiently robust to evaluate model performance, single-split results are not used. Instead, Monte Carlo cross-validation with B=100 loops is used for parameter tuning and model comparisons. For each loop, the dataset is randomly split into an 80% train set and a 20% test set (n=314 for the train set and n=78 for the test set). An 80/20 split is used to give a sufficiently large sample for the test set. A seed is set for replicability. Within each loop, the models are trained on the training data of that specific loop. For parameter tuning, performance is evaluated on the train data for that loop, because in real-world applications there is no access to the true response values in the test data. Model comparison is done using the sample mean and variance of the test errors, since the response values of the held-out test data are known for evaluation.
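Each loop uses the same random-split pattern as the Appendix code; a condensed sketch:

set.seed(7406)                      # for replicability
B  <- 100                           # number of Monte Carlo loops
n  <- dim(df)[1]                    # 392 observations
n1 <- round(n / 5)                  # ~20% held out for testing (n1 = 78)
for (b in 1:B) {
  flag      <- sort(sample(1:n, n1))
  traintemp <- df[-flag, ]          # 80% train split for this loop
  testtemp  <- df[flag, ]           # 20% test split for this loop
  # ... fit and evaluate each model on this split ...
}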
Results

Parameter tuning – Random Forest

Table 5 shows the mean of the train errors over the B=100 loops of Monte Carlo Cross validation for models using the selected horsepower + weight + year + origin1 variables only, with different combinations of n.trees, mtry and nodesize.

Table 5: Random Forest: Mean of train errors across the 100 loops of Monte Carlo Cross validation, for different parameters n.trees, mtry and nodesize
                      n.trees = 200             n.trees = 300             n.trees = 500
mtry  nodesize   Avg Train Error (Rank)    Avg Train Error (Rank)    Avg Train Error (Rank)
1     1          0.027 (19)                0.028 (21)                0.027 (20)
1     2          0.029 (22)                0.029 (23)                0.029 (23)
1     5          0.035 (30)                0.034 (26)                0.035 (28)
1     10         0.040 (35)                0.041 (36)                0.040 (34)
2     1          0.000 (6)                 0.000 (4)                 0.000 (5)
2     2          0.005 (12)                0.005 (11)                0.005 (10)
2     5          0.022 (14)                0.023 (15)                0.022 (13)
2     10         0.035 (29)                0.033 (25)                0.034 (27)
3     1          0.000 (1)                 0.000 (1)                 0.000 (1)
3     2          0.002 (9)                 0.001 (7)                 0.001 (8)
3     5          0.024 (17)                0.025 (18)                0.024 (16)
3     10         0.039 (31)                0.039 (33)                0.039 (32)
From Table 5, increasing the number of trees did not result in large reductions in train error. This suggests that increasing the number of trees beyond a certain point may not significantly improve the model's performance on the training data. 'mtry' represents the number of variables randomly sampled as candidates at each split. From Table 5, as mtry increases from 1 to 2, the average train error tends to decrease; however, when mtry increases from 2 to 3, the average train error tends to increase. This suggests that mtry = 2 gives better model performance. 'nodesize' indicates the minimum size of terminal nodes in the trees, with a smaller nodesize allowing the trees to grow deeper and fit the train data more closely. From Table 5, as nodesize increases, the average train error tends to increase, suggesting that a smaller nodesize may perform better. Even though nodesize = 1 tends to have a train error of 0, this could indicate overfitting, so nodesize = 2 is used for building the model. Based on the above, n.trees = 300, mtry = 2 and nodesize = 2 are used as the parameters for Random Forest.
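With these tuned values, the Random Forest fit in each evaluation loop is (condensed from the Appendix):

library(randomForest)
mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                       data = traintemp, ntree = 300, mtry = 2, nodesize = 2)
predtest_rf <- predict(mod_rf, testtemp, type = "class")   # class predictions on the held-out split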
Parameter tuning – Boosting

Table 6: Boosting: Mean of train errors across the 100 loops of Monte Carlo Cross validation, for different parameters n.trees, shrinkage, interaction.depth
                                  n.trees = 200             n.trees = 300             n.trees = 500
shrinkage  interaction.depth  Avg Train Error (Rank)    Avg Train Error (Rank)    Avg Train Error (Rank)
0.01       1                  0.093 (27)                0.077 (26)                0.070 (24)
0.01       2                  0.074 (25)                0.067 (23)                0.059 (21)
0.01       3                  0.064 (22)                0.057 (20)                0.047 (18)
0.1        1                  0.047 (19)                0.041 (17)                0.029 (15)
0.1        2                  0.025 (13)                0.015 (11)                0.001 (7)
0.1        3                  0.013 (10)                0.002 (8)                 0.000 (1)
0.2        1                  0.035 (16)                0.027 (14)                0.017 (12)
0.2        2                  0.008 (9)                 0.000 (6)                 0.000 (1)
0.2        3                  0.000 (5)                 0.000 (1)                 0.000 (1)

Table 6 shows that, overall, increasing the number of trees, the shrinkage value and the interaction.depth decreases the train error. In fact, for the highest shrinkage (0.2) and interaction.depth (3), the average train error is 0 for all n.trees tested (200, 300, 500). However, this could indicate overfitting to the train data. To avoid overfitting, n.trees = 200, shrinkage = 0.1 and interaction.depth = 2 are used as the parameters for the boosting algorithm.
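With these tuned values, the boosting fit in each evaluation loop follows the Appendix pattern (names shortened here; gbm uses the bernoulli distribution for the 0/1 response and a 0.5 cutoff on the predicted probability):

library(gbm)
mod_gbm <- gbm(mpg01 ~ horsepower + weight + year + origin1,
               data = traintemp, distribution = "bernoulli",
               n.trees = 200, shrinkage = 0.1, interaction.depth = 2)
predtest_gbm <- ifelse(predict(mod_gbm, testtemp, n.trees = 200, type = "response") > 0.5, 1, 0)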
Findings – Comparison of Model performance

Using the selected tuned parameters for Random Forest (n.trees = 300, mtry = 2 and nodesize = 2) and Boosting (n.trees = 200, shrinkage = 0.1 and interaction.depth = 2), Monte Carlo Cross Validation is run for 100 loops, with train and test errors plotted below. The table of train and test error means and variances for all models is shown in the Appendix.

Figure 6a: Train errors, 6b: Test errors of all models (horsepower + weight + year + origin1 variables only)

From the train and test error boxplots in Figures 6a and 6b, Random Forest (RF) has the best performance in terms of the smallest train and test errors, followed by Boosting. The variances of the RF and Boosting train and test errors are amongst the lowest of the models, suggesting more consistent performance. The other models have larger train errors than Random Forest and Boosting. Looking at test errors, RF and Boosting have higher test errors than train errors, indicating that the two models might be overfitting to the training data; however, RF and Boosting still perform better than the other models on test data. LDA and Logistic Regression have higher test errors than RF and Boosting, followed by QDA and Naïve Bayes, which are higher still. The KNN models (across all k values) have the highest test errors overall. KNN is a distance-based model and might not be as suitable for this unscaled dataset: from the EDA, the weight variable ranges from 1,613 to 5,140, which is at least 10 times larger in magnitude than the year variable, so scaling the variables could potentially lead to better KNN performance. Notably, LDA and QDA assume normality of the predictor variables. Given that the variables do not follow a normal distribution, LDA and QDA might not be recommended for this dataset; their performance could be improved using transformation methods such as the Box-Cox transformation.
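The comparison boxplots are produced by column-binding the per-loop error vectors of all models, as in the Appendix; a condensed sketch:

# Each cverror_* object holds the B per-loop test errors for one model; CVALL_cv holds the KNN columns (k = 1..15)
TEALL <- cbind(cverror_rf, cverror_gbmsel, cverror_lda, cverror_qda, cverror_nb, cverror_lr, CVALL_cv)
boxplot(TEALL, main = "Test Error Boxplot")
apply(TEALL, 2, mean); apply(TEALL, 2, var)   # mean and variance of test error per model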
Table 7a: Train errors: Comparison of Random Forest vs Other methods (p-values)
Test     Boost      LDA        QDA        NB         L.R.       KNN k=1    k=3        k=5        k=7        k=9        k=11       k=13       k=15
t test   1.43E-54   5.15E-108  1.17E-103  1.14E-108  4.07E-91   6.84E-35   4.15E-93   5.46E-97   1.38E-105  1.42E-106  2.77E-107  5.53E-105  1.50E-106
w test   3.50E-18   3.48E-18   3.61E-18   3.38E-18   3.66E-18   4.12E-17   3.45E-18   3.65E-18   3.53E-18   3.55E-18   3.44E-18   3.44E-18   3.39E-18

Table 7b: Test errors: Comparison of Random Forest vs Other methods (p-values)
Test     Boost      LDA        QDA        NB         L.R.       KNN k=1    k=3        k=5        k=7        k=9        k=11       k=13       k=15
t test   0.313      2.30E-06   1.17E-08   1.03E-10   9.35E-06   1.29E-26   2.21E-19   5.27E-23   9.22E-22   2.52E-21   1.35E-20   8.96E-20   2.62E-19
w test   0.294      1.15E-05   6.59E-08   2.06E-09   3.61E-05   6.82E-17   2.98E-14   1.39E-15   2.60E-15   3.25E-15   6.41E-15   1.64E-14   2.24E-14

Statistical testing is run on both train and test errors to determine whether the Random Forest performance differs significantly from the other methods. Tables 7a and 7b show the results of both the pairwise paired t-test and the Wilcoxon signed-rank test (w test). For train errors (Table 7a), the p-values comparing Random Forest with every other method are <0.05, indicating statistically significantly better performance on the train datasets. For test errors (Table 7b), the differences between Random Forest and LDA, QDA, NB, LR and KNN are statistically significant. However, compared with the Boosting test errors, the p-value is >0.05. This indicates that the performance of Random Forest and Boosting on the test datasets is comparable, and both are suitable models for this classification task.
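The pairwise tests are run on the per-loop error columns collected above; a condensed sketch of one comparison (Random Forest vs. Boosting, the first two columns of TEALL in the Appendix):

t.test(TEALL[, 1], TEALL[, 2], paired = TRUE)$p.value        # paired t-test on the B test errors
wilcox.test(TEALL[, 1], TEALL[, 2], paired = TRUE)$p.value   # Wilcoxon signed-rank test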
Sensitivity and Specificity

Figure 7: Sensitivity of all models (horsepower + weight + year + origin1 only)
Figure 8: Specificity of all models (horsepower + weight + year + origin1 only)

Based on the sensitivity boxplot in Figure 7, Random Forest, followed by Boosting, has the highest sensitivity of all models. This indicates that Random Forest is best able to correctly identify cars with mpg01 = 1, the true positives. Given that the objective is to guide the manufacturing or purchase of high-gas-mileage cars, Random Forest is recommended based on its higher sensitivity.

On the other hand, Figure 8 shows that Random Forest and Boosting have slightly lower average specificity compared with methods like LDA. This means that RF and Boosting are less likely to correctly identify the negatives, the mpg01 = 0 cars. However, given the main objective of identifying high-mileage cars, the low test errors and high sensitivity of RF and Boosting suggest that these are good models for this context.

Conclusion

In this report, various methods were used to classify mpg01. Based on the boxplots and variable selection using stepwise selection (AIC criterion), the variables horsepower + weight + year + origin1 were used to build the models. Parameter tuning of Random Forest and Boosting was done using Monte Carlo Cross Validation with B=100 loops, to obtain parameters with lower train errors. Based on average train and test errors, Random Forest has the lowest train and test error, followed by Boosting, compared with the other methods. RF and Boosting also have low variance, suggesting more consistent predictive performance. The train-data performance of RF is statistically significantly better than that of all other methods. On the test datasets, Random Forest and Boosting have comparable performance, and both differ significantly from the other methods (LDA, QDA, NB, LR, KNN). In addition, given the context of identifying true high-mileage cars (mpg01 = 1), the high sensitivity of Random Forest and Boosting suggests that they are suitable models for this context.
Appendix - Results Table: Mean and variance of train error, test error, specificity and sensitivity from 100 loops of Monte Carlo Cross validation. Models built using: horsepower + weight + year + origin1

                       Train Error         Test Error          Specificity         Sensitivity
Method                 Mean     Variance   Mean     Variance   Mean     Variance   Mean     Variance
Random Forest          0.0052   0.0000     0.0767   0.0008     0.9233   0.0020     0.9237   0.0017
Boosting               0.0252   0.0000     0.0809   0.0008     0.9356   0.0013     0.9052   0.0024
LDA                    0.0899   0.0001     0.0956   0.0009     0.9561   0.0010     0.8649   0.0023
QDA                    0.0932   0.0001     0.1023   0.0010     0.9444   0.0014     0.8621   0.0027
Naïve Bayes            0.1035   0.0001     0.1069   0.0012     0.9444   0.0017     0.8542   0.0026
Logistic Regression    0.0883   0.0001     0.0936   0.0007     0.9204   0.0019     0.8960   0.0019
KNN - k = 1            0.0000   0.0000     0.1388   0.0010     0.8568   0.0025     0.8662   0.0025
KNN - k = 3            0.0735   0.0001     0.1242   0.0011     0.8785   0.0030     0.8743   0.0022
KNN - k = 5            0.0944   0.0001     0.1281   0.0009     0.8668   0.0026     0.8778   0.0020
KNN - k = 7            0.1047   0.0001     0.1272   0.0009     0.8652   0.0026     0.8817   0.0023
KNN - k = 9            0.1109   0.0001     0.1262   0.0009     0.8689   0.0026     0.8796   0.0021
KNN - k = 11           0.1127   0.0001     0.1240   0.0009     0.8738   0.0024     0.8787   0.0021
KNN - k = 13           0.1108   0.0001     0.1238   0.0010     0.8781   0.0025     0.8748   0.0023
KNN - k = 15           0.1107   0.0001     0.1227   0.0010     0.8797   0.0025     0.8756   0.0023
Figure 9: Box plots of variables by mpg01

Appendix

Read Data
## Read data
df <- read.table(file = "Auto.csv", sep = ",", header = TRUE);
n = dim(df)[1];     ### total number of observations
n1 = round(n/5);    ### number of observations randomly selected for testing data
mpg_org <- df$mpg
median(mpg_org)
## [1] 22.75
hist(mpg_org, main = "Histogram of MPG", xlab = "Miles Per Gallon", ylab = "Frequency")
abline(v = median(mpg_org), col = "red", lty = 2)
[Figure: Histogram of MPG (x-axis: Miles Per Gallon, y-axis: Frequency)]
Recode mpg01
mpg01 = I(df$mpg >= median(df$mpg))
df = data.frame(mpg01, df[,-1]);   ## replace column "mpg" by "mpg01".
One hot encoding
df$origin <- factor(df$origin)
# Apply one-hot encoding
df <- cbind(df, model.matrix(~ origin - 1, data = df))
df <- df[, !colnames(df) %in% c('origin')]
EDA
table(df$mpg01)
##
## FALSE  TRUE
##   196   196
boxplot(df, main = "Auto Dataset Boxplot", ylab = "Value", las = 2)
[Figure: Auto Dataset Boxplot (y-axis: Value)]
selected_columns <- df[, 2:10]
# Combine selected columns with mpg01
data_for_boxplot <- cbind(selected_columns, mpg01 = factor(df$mpg01))
# Create box plots for each variable against mpg01
for (col in names(selected_columns)) {
  boxplot(data_for_boxplot[, col] ~ data_for_boxplot$mpg01,
          main = paste("Boxplot of", col, "by mpg01"),
          ylab = col, xlab = "mpg01",
          col = c("#d7658b", "#54bebe"))
}
[Figure: Boxplot of cylinders by mpg01]
[Figure: Boxplot of displacement by mpg01]
[Figure: Boxplot of horsepower by mpg01]
[Figure: Boxplot of weight by mpg01]
[Figure: Boxplot of acceleration by mpg01]
[Figure: Boxplot of year by mpg01]
[Figure: Boxplot of origin1 by mpg01]
[Figure: Boxplot of origin2 by mpg01]
[Figure: Boxplot of origin3 by mpg01]
corr <- cor(df)
VIF
library(car)
## Loading required package: carData
model_reg <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin1 + origin2,
                 data = df)
vif_values <- vif(model_reg)
print(vif_values)
##    cylinders displacement   horsepower       weight acceleration         year
##     5.410047    11.604209     2.694488     6.602953     2.522107     2.163868
##      origin1      origin2
##     2.944230     1.963661
par(mar = c(5, 10, 4, 2) + 0.1)
# Create a bar plot of VIF values
barplot(vif_values, main = "Barplot of VIF Values", col = "steelblue", names.arg = names(vif_values))
abline(h = 5, col = "red", lwd = 2, lty = 2)
[Figure: Barplot of VIF Values]
## Variable selection
########## Monte Carlo: Variable selection ###########
set.seed(7406)
B = 100
cvmod3_sel <- data.frame(matrix(ncol = ncol(df), nrow = B))   # Create a dataframe to store selected variables
for (b in 1:B) {
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,]
  testtemp <- df[flag,]
  # Stepwise logistic regression (AIC)
  mod3_cv <- step(glm(mpg01 ~ ., family = binomial, data = traintemp), trace = 0)
  # Store selected variables in the dataframe
  selected_vars <- colnames(mod3_cv$model)[-1]   # Exclude the response column
  cvmod3_sel[b,] <- 0                            # Initialize the row
  cvmod3_sel[b, selected_vars] <- 1   # Set selected variables to 1
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
selected_cols <- cvmod3_sel[, 11:18]
summary(selected_cols)
##    horsepower       weight        year      origin1      displacement
##  Min.   :0.00    Min.   :1    Min.   :1    Min.   :0.00   Min.   :0.0000
##  1st Qu.:1.00    1st Qu.:1    1st Qu.:1    1st Qu.:0.00   1st Qu.:0.0000
##  Median :1.00    Median :1    Median :1    Median :1.00   Median :0.0000
##  Mean   :0.91    Mean   :1    Mean   :1    Mean   :0.63   Mean   :0.1531
##  3rd Qu.:1.00    3rd Qu.:1    3rd Qu.:1    3rd Qu.:1.00   3rd Qu.:0.0000
##  Max.   :1.00    Max.   :1    Max.   :1    Max.   :1.00   Max.   :1.0000
##                                                           NA's   :2
##     origin2         acceleration       cylinders
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000
##  Median :0.0000   Median :0.00000   Median :0.00000
##  Mean   :0.4286   Mean   :0.08421   Mean   :0.06849
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000
##  NA's   :2        NA's   :5         NA's   :27
library(car)
model_reg_sel <- glm(mpg01 ~ horsepower + weight + year + origin1, family = binomial, data = df)
#model_reg_sel <- lm(mpg01 ~ horsepower + weight + year + origin1, data = df)
vif_values_sel <- vif(model_reg_sel)
print(vif_values_sel)
## horsepower     weight       year    origin1
##   1.142649   1.642401   1.633459   1.162831
par(mar = c(5, 10, 4, 2) + 0.1)
# Create a bar plot of VIF values
barplot(vif_values_sel, main = "Barplot of VIF Values - Selected vars", col = "steelblue",
        names.arg = names(vif_values_sel))
abline(h = 5, col = "red", lwd = 2, lty = 2)
[Figure: Barplot of VIF Values - Selected vars]
Parameter Tuning - Random Forest
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(7406)
B <- 100
cverror_rf <- NULL
cm_rf <- NULL
# Define a grid of hyperparameters for Random Forest
n_trees_rf <- c(100, 200, 300, 500)
m_try_rf <- c(1, 2, 3)
nodesize_rf <- c(1, 2, 5, 10)
# Nested loop for hyperparameter tuning
for (trees in n_trees_rf) {
  for (mtry in m_try_rf) {
    for (nodesize in nodesize_rf) {
      cverrortrain_temp <- NULL
      cverror_temp <- NULL
      cm_temp <- NULL
      specificity_temp <- NULL
      sensitivity_temp <- NULL
      for (b in 1:B) {
        flag <- sort(sample(1:n, n1))
        traintemp <- df[-flag,];
        testtemp <- df[flag,];
        # Build the Random Forest model
        mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                               data = traintemp,
                               ntree = trees, mtry = mtry, nodesize = nodesize)
        # Calculate train error
        predtrain_rf <- predict(mod_rf, traintemp, type = "class")
        trainerror_rf <- mean(predtrain_rf != traintemp$mpg01)
        cverrortrain_temp <- c(cverrortrain_temp, trainerror_rf)
        # Calculate test error
        predtest_rf <- predict(mod_rf, testtemp, type = "class")
        temptesterror_rf <- mean(predtest_rf != testtemp$mpg01)
        cverror_temp <- c(cverror_temp, temptesterror_rf)
        # Confusion Matrix
        confusion_matrix <- table(predtest_rf, testtemp$mpg01)
        TP <- confusion_matrix[2, 2]
        FP <- confusion_matrix[1, 2]
        TN <- confusion_matrix[1, 1]
        FN <- confusion_matrix[2, 1]
        # Calculate specificity and sensitivity
        specificity <- TN / (TN + FP)
        sensitivity <- TP / (TP + FN)
        cm_temp <- rbind(cm_temp, c(specificity, sensitivity))
        specificity_temp <- c(specificity_temp, specificity)
        sensitivity_temp <- c(sensitivity_temp, sensitivity)
      }
      # Store the results
      cverror_rf <- rbind(cverror_rf, c(trees, mtry, nodesize,
                                        mean(cverrortrain_temp), var(cverrortrain_temp),
                                        mean(cverror_temp), var(cverror_temp),
                                        mean(specificity_temp), var(specificity_temp),
                                        mean(sensitivity_temp), var(sensitivity_temp)))
      cm_rf <- rbind(cm_rf, c(trees, mtry, nodesize, colMeans(cm_temp)[1], colMeans(cm_temp)[2]))
    }
  }
}
# Naming the columns of cverror_rf and cm_rf
colnames(cverror_rf) <- c("n.trees", "mtry", "nodesize", "Avg_Train_Error", "Var_Train_Error",
                          "Avg_Test_Error", "Var_Test_Error", "Avg_Specificity", "Var_Specificity",
                          "Avg_Sensitivity", "Var_Sensitivity")
colnames(cm_rf) <- c("n.trees", "mtry", "nodesize", "Specificity", "Sensitivity")
Parameter Tuning - Boosting
set.seed(7406)
library(gbm)
## Warning: package 'gbm' was built under R version 4.2.3
## Loaded gbm 2.1.8.1
B <- 100
cverror_gbm <- NULL
cm_gbm <- NULL
# Define a grid of hyperparameters
n_trees <- c(100, 200, 300, 500, 1000)
shrinkage <- c(0.01, 0.1, 0.2)
interaction_depth <- c(1, 2, 3)
# Nested loop for hyperparameter tuning
for (trees in n_trees) {
  for (shrink in shrinkage) {
    for (depth in interaction_depth) {
      cverror_temp <- NULL
      trainerror_temp <- NULL
      cm_temp <- NULL
      specificity_temp <- NULL
      sensitivity_temp <- NULL
      for (b in 1:B) {
        flag <- sort(sample(1:n, n1))
        traintemp <- df[-flag,];
        testtemp <- df[flag,];
        # Build the gbm model
        modcv_4 <- gbm(mpg01 ~ horsepower + weight + year + origin1,
                       data = traintemp, distribution = "bernoulli",
                       n.trees = trees, shrinkage = shrink, interaction.depth = depth)
        # Calculate train error
        predtrain_4 <- ifelse(predict(modcv_4, traintemp, n.trees = trees, type = "response") > 0.5, 1, 0)
        trainerror_4 <- mean(predtrain_4 != traintemp$mpg01)
        trainerror_temp <- c(trainerror_temp, trainerror_4)
        # Calculate test error
        predtest_4 <- ifelse(predict(modcv_4, testtemp, n.trees = trees, type = "response") > 0.5, 1, 0)
        temptesterror_4 <- mean(predtest_4 != testtemp$mpg01)
        cverror_temp <- c(cverror_temp, temptesterror_4)
        # Confusion Matrix
        confusion_matrix <- table(predtest_4, testtemp$mpg01)
        TP <- confusion_matrix[2, 2]
        FP <- confusion_matrix[1, 2]
        TN <- confusion_matrix[1, 1]
        FN <- confusion_matrix[2, 1]
        # Calculate specificity and sensitivity
        specificity <- TN / (TN + FP)
        sensitivity <- TP / (TP + FN)
        cm_temp <- rbind(cm_temp, c(specificity, sensitivity))
        specificity_temp <- c(specificity_temp, specificity)
        sensitivity_temp <- c(sensitivity_temp, sensitivity)
      }
      # Store the results
      cverror_gbm <- rbind(cverror_gbm, c(trees, shrink, depth,
                                          mean(trainerror_temp), var(trainerror_temp),
                                          mean(cverror_temp), var(cverror_temp),
                                          mean(specificity_temp), var(specificity_temp),
                                          mean(sensitivity_temp), var(sensitivity_temp)))
      cm_gbm <- rbind(cm_gbm, c(trees, shrink, depth, colMeans(cm_temp)[1], colMeans(cm_temp)[2]))
    }
  }
}
# Naming the columns of cverror_gbm
colnames(cverror_gbm) <- c("n.trees", "shrinkage", "interaction.depth", "Avg_Train_Error", "Var_Train_Error",
                           "Avg_Test_Error", "Var_Test_Error", "Avg_Specificity", "Var_Specificity",
                           "Avg_Sensitivity", "Var_Sensitivity")
colnames(cm_gbm) <- c("n.trees", "shrinkage", "interaction.depth", "Specificity", "Sensitivity")
Model evaluation:
## Model 1: RF - with tuned parameters
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
library(randomForest)
cverror_rf <- NULL;
cm_rf <- NULL;
trainerror_rf <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                         data = traintemp, ntree = 300, mtry = 2, nodesize = 2)
  ### Step 3a: Calculate train error
  predtrain_rf <- predict(mod_rf, traintemp, type = "class")
  temptrainerror_rf <- c(mean(predtrain_rf != traintemp$mpg01))
  trainerror_rf <- rbind(trainerror_rf, temptrainerror_rf)
  # Step 3: Calculate test error
  predtest_rf <- predict(mod_rf, testtemp, type = "class")
  temptesterror_rf <- c(mean(predtest_rf != testtemp$mpg01));
  cverror_rf <- rbind(cverror_rf, temptesterror_rf)
  # Step 4: CM
  confusion_matrix <- table(predtest_rf, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_rf <- rbind(cm_rf, c(specificity, sensitivity))
  colnames(cm_rf) <- c("Specificity", "Sensitivity")
  cm_rf <- as.data.frame(cm_rf)
}
library(gbm)
## Model 2: GBM - with tuned parameters
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
cverror_gbmsel <- NULL;
cm_gbmsel <- NULL;
trainerror_gbmsel <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  #print(traintemp[1])
  ## Step 2: Build model with tuned parameters (n.trees = 200, shrinkage = 0.1, interaction.depth = 2, per the report)
  mod_gbmsel <- gbm(mpg01 ~ horsepower + weight + year + origin1,
                    data = traintemp, distribution = "bernoulli",
                    n.trees = 200, shrinkage = 0.1, interaction.depth = 2)
  ### Step 3a: Calculate train error
  predtrain_gbmsel <- ifelse(predict(mod_gbmsel, traintemp, n.trees = 200, type = "response") > 0.5, 1, 0)
  temptrainerror_gbmsel <- c(mean(predtrain_gbmsel != traintemp$mpg01))
  trainerror_gbmsel <- rbind(trainerror_gbmsel, temptrainerror_gbmsel)
  # Step 3: Calculate test error
  predtest_gbmsel <- ifelse(predict(mod_gbmsel, testtemp, n.trees = 200, type = "response") > 0.5, 1, 0)
  temptesterror_gbmsel <- c(mean(predtest_gbmsel != testtemp$mpg01));
  cverror_gbmsel <- rbind(cverror_gbmsel, temptesterror_gbmsel)
  # Step 4: CM
  confusion_matrix <- table(predtest_gbmsel, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_gbmsel <- rbind(cm_gbmsel, c(specificity, sensitivity))
  colnames(cm_gbmsel) <- c("Specificity", "Sensitivity")
  cm_gbmsel <- as.data.frame(cm_gbmsel)
}
## Model 3: LDA
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
library(MASS)
cverror_lda <- NULL;
cm_lda <- NULL;
trainerror_lda <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_1 <- lda(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);
  # Step 3a: Calculate training error
  predtrain_1 <- predict(modcv_1, traintemp[, c(4, 5, 7, 8)])$class;
  trainerror_1 <- c(mean(predtrain_1 != traintemp$mpg01))
  trainerror_lda <- rbind(trainerror_lda, trainerror_1)
  # Step 3: Calculate test error
  predtest_1 <- predict(modcv_1, testtemp[, c(4, 5, 7, 8)])$class;
  temptesterror_1 <- c(mean(predtest_1 != testtemp$mpg01));
  cverror_lda <- rbind(cverror_lda, temptesterror_1)
  # Step 4: CM
  confusion_matrix <- table(predtest_1, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_lda <- rbind(cm_lda, c(specificity, sensitivity))
  colnames(cm_lda) <- c("Specificity", "Sensitivity")
  cm_lda <- as.data.frame(cm_lda)
}
## Model 4: QDA
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
library(MASS)
cverror_qda <- NULL;
cm_qda <- NULL;
trainerror_qda <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_2 <- qda(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);
  # Step 3a: Calculate training error
  predtrain_2 <- predict(modcv_2, traintemp[, c(4, 5, 7, 8)])$class;
  trainerror_2 <- c(mean(predtrain_2 != traintemp$mpg01))
  trainerror_qda <- rbind(trainerror_qda, trainerror_2)
  # Step 3: Calculate test error
  predtest_2 <- predict(modcv_2, testtemp[, c(4, 5, 7, 8)])$class;
  temptesterror_2 <- c(mean(predtest_2 != testtemp$mpg01));
  cverror_qda <- rbind(cverror_qda, temptesterror_2)
  # Step 4: CM
  confusion_matrix <- table(predtest_2, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_qda <- rbind(cm_qda, c(specificity, sensitivity))
  colnames(cm_qda) <- c("Specificity", "Sensitivity")
  cm_qda <- as.data.frame(cm_qda)
}
## Model 5: Naive Bayes
set.seed(7406);    ### set the seed for randomization
library(e1071)
B = 100;           ### number of loops
cverror_nb <- NULL;
cm_nb <- NULL;
trainerror_nb <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_3 <- naiveBayes(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);
  # Step 3a: Calculate training error
  predtrain_3 <- predict(modcv_3, traintemp[, c(4, 5, 7, 8)]);
  trainerror_3 <- c(mean(predtrain_3 != traintemp$mpg01))
  trainerror_nb <- rbind(trainerror_nb, trainerror_3)
  # Step 3: Calculate test error
  predtest_3 <- predict(modcv_3, testtemp[, c(4, 5, 7, 8)]);
  temptesterror_3 <- c(mean(predtest_3 != testtemp$mpg01));
  cverror_nb <- rbind(cverror_nb, temptesterror_3)
  # Step 4: CM
  confusion_matrix <- table(predtest_3, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_nb <- rbind(cm_nb, c(specificity, sensitivity))
  colnames(cm_nb) <- c("Specificity", "Sensitivity")
  cm_nb <- as.data.frame(cm_nb)
}
## Model 6: Logistic Regression
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
cverror_lr <- NULL;
cm_lr <- NULL;
trainerror_lr <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_4 <- glm(mpg01 ~ horsepower + weight + year + origin1, data = traintemp, family = binomial);
  # Step 3a: Calculate training error
  predtrain_4 <- ifelse(predict(modcv_4, traintemp, type = "response") > 0.5, 1, 0);
  trainerror_4 <- c(mean(predtrain_4 != traintemp$mpg01))
  trainerror_lr <- rbind(trainerror_lr, trainerror_4)
  # Step 3: Calculate test error
  predtest_4 <- ifelse(predict(modcv_4, testtemp, type = "response") > 0.5, 1, 0);
  temptesterror_4 <- c(mean(predtest_4 != testtemp$mpg01));
  cverror_lr <- rbind(cverror_lr, temptesterror_4)
  # Step 4: CM
  confusion_matrix <- table(predtest_4, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_lr <- rbind(cm_lr, c(specificity, sensitivity))
  colnames(cm_lr) <- c("Specificity", "Sensitivity")
  cm_lr <- as.data.frame(cm_lr)
}
## Model 7: KNN
library(class);
set.seed(7406);    ### set the seed for randomization
kk <- c(1, 3, 5, 7, 9, 11, 13, 15);
B = 100;           ### number of loops
CVALL_cv_train <- NULL;
CVALL_cv <- NULL;
CVALL_sens_kNN <- NULL;
CVALL_spec_kNN <- NULL;
confusion_matrices <- list()   # Initialize list to store confusion matrices
# Initialize a data frame to store TP, FP, TN, FN for each iteration
result_df <- data.frame(TP = numeric(B), FP = numeric(B), TN = numeric(B), FN = numeric(B))
for (b in 1:B) {
  ## For knn
  cverrortrain_knn <- NULL;
  cverror_knn <- NULL;
  spec_knn <- NULL;
  sens_knn <- NULL;
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  for (i in 1:length(kk)) {
    # KNN
    xnew_knn <- testtemp[, c(4, 5, 7, 8)]
    ypred_knn.train <- knn(traintemp[, c(4, 5, 7, 8)], traintemp[, c(4, 5, 7, 8)], traintemp$mpg01, k = kk[i])
    temptrainerror_knn <- mean(ypred_knn.train != traintemp$mpg01)
    cverrortrain_knn <- cbind(cverrortrain_knn, temptrainerror_knn)
    ypred_knn.test <- knn(traintemp[, c(4, 5, 7, 8)], xnew_knn, traintemp$mpg01, k = kk[i])
    temptesterror_knn <- mean(ypred_knn.test != testtemp$mpg01)
    cverror_knn <- cbind(cverror_knn, temptesterror_knn)
    # Create confusion matrix
    confusion <- table(ypred_knn.test, testtemp$mpg01)
    confusion_matrices <- append(confusion_matrices, list(confusion))
    # Extract TP, FP, TN, FN
    TP <- confusion[2, 2]
    FP <- confusion[1, 2]
    TN <- confusion[1, 1]
    FN <- confusion[2, 1]
    # Calculate specificity and sensitivity
    specificity <- TN / (TN + FP)
    sensitivity <- TP / (TP + FN)
    spec_knn <- cbind(spec_knn, specificity)
    sens_knn <- cbind(sens_knn, sensitivity)
  }
  CVALL_cv_train <- rbind(CVALL_cv_train, cverrortrain_knn)
  CVALL_cv <- rbind(CVALL_cv, cverror_knn)
  CVALL_spec_kNN <- rbind(CVALL_spec_kNN, spec_knn)
  CVALL_sens_kNN <- rbind(CVALL_sens_kNN, sens_knn)
}
# Now, 'confusion_matrices' contains the confusion matrices for each value of k.
# 'result_df' contains TP, FP, TN, FN for each iteration.
Evaluation Metrics:
TrainALL = cbind(trainerror_rf, trainerror_gbmsel, trainerror_lda, trainerror_qda, trainerror_nb,
                 trainerror_lr, CVALL_cv_train)
boxplot(TrainALL, main = "Train Error Boxplot")
[Figure: Train Error Boxplot]
TEALL = cbind(cverror_rf, cverror_gbmsel, cverror_lda, cverror_qda, cverror_nb, cverror_lr, CVALL_cv)
boxplot(TEALL, main = "Test Error Boxplot")
[Figure: Test Error Boxplot]
SpecALL = cbind(cm_rf$Specificity, cm_gbmsel$Specificity, cm_lda$Specificity, cm_qda$Specificity,
                cm_nb$Specificity, cm_lr$Specificity, CVALL_spec_kNN)
boxplot(SpecALL, main = "Specificity Boxplot")
[Figure: Specificity Boxplot]
SensALL = cbind(cm_rf$Sensitivity, cm_gbmsel$Sensitivity, cm_lda$Sensitivity, cm_qda$Sensitivity,
                cm_nb$Sensitivity, cm_lr$Sensitivity, CVALL_sens_kNN)
boxplot(SensALL, main = "Sensitivity Boxplot")
[Figure: Sensitivity Boxplot]
means <- apply(TEALL, 2, mean)
variances <- apply(TEALL, 2, var)
summary_stats <- data.frame(Column = colnames(TEALL), Mean = means, Variance = variances)
print(summary_stats)
##               Column       Mean     Variance
## 1                     0.07666667 0.0007736139
## 2                     0.08089744 0.0008025521
## 3                     0.09564103 0.0008979340
## 4                     0.10230769 0.0009794859
## 5                     0.10692308 0.0011795137
## 6                     0.09358974 0.0007421354
## 7  temptesterror_knn  0.13884615 0.0010129401
## 8  temptesterror_knn  0.12423077 0.0011113601
## 9  temptesterror_knn  0.12807692 0.0008550163
## 10 temptesterror_knn  0.12717949 0.0009486051
## 11 temptesterror_knn  0.12615385 0.0009487379
## 12 temptesterror_knn  0.12397436 0.0009033630
## 13 temptesterror_knn  0.12384615 0.0009902443
## 14 temptesterror_knn  0.12269231 0.0009903605
summary_spec <- data.frame(Column = colnames(SpecALL), Mean = apply(SpecALL, 2, mean),
                           Variance = apply(SpecALL, 2, var))
print(summary_spec)
##         Column      Mean    Variance
## 1              0.9232738 0.001959364
## 2              0.9355589 0.001320847
## 3              0.9561488 0.001036278
## 4              0.9443722 0.001369056
## 5              0.9444319 0.001711889
## 6              0.9204442 0.001929678
## 7  specificity 0.8567752 0.002518453
## 8  specificity 0.8784851 0.002984309
## 9  specificity 0.8667769 0.002605161
## 10 specificity 0.8652357 0.002576972
## 11 specificity 0.8688997 0.002633592
## 12 specificity 0.8737502 0.002421178
## 13 specificity 0.8780773 0.002453583
## 14 specificity 0.8796575 0.002496284
summary_sens <- data.frame(Column = colnames(SensALL), Mean = apply(SensALL, 2, mean),
                           Variance = apply(SensALL, 2, var))
print(summary_sens)
##         Column      Mean    Variance
## 1              0.9236596 0.001738185
## 2              0.9051825 0.002357728
## 3              0.8649023 0.002298335
## 4              0.8621474 0.002666568
## 5              0.8541714 0.002572463
## 6              0.8960307 0.001912041
## 7  sensitivity 0.8661883 0.002493456
## 8  sensitivity 0.8743033 0.002201731
## 9  sensitivity 0.8778338 0.001956006
## 10 sensitivity 0.8816503 0.002320618
## 11 sensitivity 0.8795674 0.002131426
## 12 sensitivity 0.8786573 0.002124985
## 13 sensitivity 0.8748121 0.002348846
## 14 sensitivity 0.8755619 0.002266427
Significance testing

t-test
# t test
p_values_df <- data.frame(Column1 = integer(0), Column2 = integer(0),
                          P_Value = numeric(0), Significant = character(0))
# Generate combinations of column pairs
combinations <- combn(ncol(TEALL), 2)
# Perform paired t-tests for each combination
for (i in seq_len(ncol(combinations))) {
  col1 <- combinations[1, i]
  col2 <- combinations[2, i]
  result <- t.test(TEALL[, col1], TEALL[, col2], paired = TRUE)
  # Add p-value and significance to data frame
  significant <- ifelse(result$p.value < 0.05, "Yes", "No")
  p_values_df <- rbind(p_values_df, data.frame(Column1 = col1, Column2 = col2,
                                               P_Value = result$p.value, Significant = significant))
}
# Print the data frame of p-values
print(p_values_df)
##    Column1 Column2      P_Value Significant
## 1        1       2 3.129546e-01          No
## 2        1       3 2.296811e-06         Yes
## 3        1       4 1.167770e-08         Yes
## 4        1       5 1.030949e-10         Yes
## 5        1       6 9.353614e-06         Yes
## 6        1       7 1.286170e-26         Yes
## 7        1       8 2.209082e-19         Yes
## 8        1       9 5.267922e-23         Yes
## 9        1      10 9.218069e-22         Yes
## 10       1      11 2.515879e-21         Yes
## 11       1      12 1.345765e-20         Yes
## 12       1      13 8.963262e-20         Yes
## 13       1      14 2.624135e-19         Yes
## 14       2       3 4.828501e-04         Yes
## 15       2       4 1.342677e-06         Yes
## 16       2       5 1.413071e-07         Yes
## 17       2       6 1.191119e-03         Yes
## 18       2       7 1.665457e-25         Yes
## 19       2       8 1.107275e-16         Yes
## 20       2       9 1.500643e-20         Yes
## 21       2      10 4.511518e-19         Yes
## 22       2      11 1.221572e-18         Yes
## 23       2      12 1.695142e-17         Yes
## 24       2      13 3.716715e-17         Yes
## 25       2      14 1.563016e-16         Yes
## 26       3       4 5.052609e-04         Yes
## 27       3       5 5.855067e-07         Yes
## 28       3       6 4.004751e-01          No
## 29       3       7 1.879294e-15         Yes
## 30       3       8 1.869527e-08         Yes
## 31       3       9 1.625460e-10         Yes
## 32       3      10 8.125927e-10         Yes
## 33       3      11 1.497783e-09         Yes
## 34       3      12 6.207109e-09         Yes
## 35       3      13 7.390469e-09         Yes
## 36       3      14 2.352131e-08         Yes
## 37       4       5 5.471170e-03         Yes
## 38       4       6 8.644515e-04         Yes
## 39       4       7 2.732528e-12         Yes
## 40       4       8 1.039836e-05         Yes
## 41       4       9 7.617537e-08         Yes
## 42       4      10 4.577721e-07         Yes
## 43       4      11 9.461594e-07         Yes
## 44       4      12 3.931840e-06         Yes
## 45       4      13 8.343329e-06         Yes
## 46       4      14 2.227131e-05         Yes
## 47       5       6 1.261662e-06         Yes
## 48       5       7 1.641560e-09         Yes
## 49       5       8 5.457933e-04         Yes
## 50       5       9 1.471888e-05         Yes
## 51       5      10 5.195035e-05         Yes
## 52       5      11 8.278948e-05         Yes
## 53       5      12 3.250051e-04         Yes
## 54       5      13 4.564791e-04         Yes
## 55       5      14 1.315725e-03         Yes
## 56       6       7 5.182996e-19         Yes
## 57       6       8 1.130561e-10         Yes
## 58       6       9 5.388557e-13         Yes
## 59       6      10 4.119388e-12         Yes
## 60       6      11 1.419441e-11         Yes
## 61       6      12 1.599015e-11         Yes
## 62       6      13 5.998276e-11         Yes
## 63       6      14 3.332584e-10         Yes
## 64       7       8 1.252486e-05         Yes
## 65       7       9 2.558681e-04         Yes
## 66       7      10 1.826809e-04         Yes
## 67       7      11 1.011440e-04         Yes
## 68       7      12 1.214581e-05         Yes
## 69       7      13 1.473735e-05         Yes
## 70       7      14 1.599119e-05         Yes
## 71       8       9 9.054984e-02          No
## 72       8      10 2.700545e-01          No
## 73       8      11 4.485337e-01          No
## 74       8      12 9.219066e-01          No
## 75       8      13 8.874992e-01          No
## 76       8      14 6.039986e-01          No
## 77       9      10 5.715115e-01          No
## 78       9      11 2.208576e-01          No
## 79       9      12 2.359098e-02         Yes
## 80       9      13 3.583216e-02         Yes
## 81       9      14 2.359580e-02         Yes
## 82      10      11 4.310488e-01          No
## 83      10      12 6.429585e-02          No
## 84      10      13 1.003370e-01          No
## 85      10      14 6.139859e-02          No
## 86      11      12 8.433192e-02          No
## 87      11      13 1.424806e-01          No
## 88      11      14 9.354490e-02          No
## 89      12      13 9.099457e-01          No
## 90      12      14 4.138163e-01          No
## 91      13      14 3.480105e-01          No
Wilcox test
# Create an empty data frame to store p-values
wilcox_values_df <- data.frame(Column1 = integer(0), Column2 = integer(0),
                               P_Value = numeric(0), Significant = character(0))
# Generate combinations of column pairs
combinations <- combn(ncol(TEALL), 2)
# Perform paired Wilcoxon tests for each combination
for (i in seq_len(ncol(combinations))) {
  col1 <- combinations[1, i]
  col2 <- combinations[2, i]
  result <- wilcox.test(TEALL[, col1], TEALL[, col2], paired = TRUE)
  # Add p-value and significance to data frame
  significant <- ifelse(result$p.value < 0.05, "Yes", "No")
  wilcox_values_df <- rbind(wilcox_values_df, data.frame(Column1 = col1, Column2 = col2,
                                                         P_Value = result$p.value, Significant = significant))
}
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
# Print the data frame of p-values
print(wilcox_values_df)
##    Column1 Column2      P_Value Significant
## 1        1       2 2.936516e-01          No
## 2        1       3 1.152155e-05         Yes
## 3        1       4 6.586686e-08         Yes
## 4        1       5 2.060770e-09         Yes
## 5        1       6 3.609859e-05         Yes
## 6        1       7 6.816979e-17         Yes
## 7        1       8 2.976688e-14         Yes
## 8        1       9 1.394741e-15         Yes
## 9        1      10 2.604831e-15         Yes
## 10       1      11 3.250826e-15         Yes
## 11       1      12 6.406268e-15         Yes
## 12       1      13 1.636190e-14         Yes
## 13       1      14 2.237652e-14         Yes
## 14       2       3 1.375518e-03         Yes
## 15       2       4 6.228213e-06         Yes
## 16       2       5 5.719007e-07         Yes
## 17       2       6 1.993251e-03         Yes
## 18       2       7 6.700298e-17         Yes
## 19       2       8 8.138846e-13         Yes
## 20       2       9 1.438259e-14         Yes
## 21       2      10 3.862181e-14         Yes
## 22       2      11 1.094281e-13         Yes
## 23       2      12 9.516703e-14         Yes
## 24       2      13 1.908234e-13         Yes
## 25       2      14 6.363138e-13         Yes
## 26       3       4 5.345427e-04         Yes
## 27       3       5 4.731274e-07         Yes
## 28       3       6 3.527105e-01          No
## 29       3       7 3.198704e-12         Yes
## 30       3       8 1.652952e-07         Yes
## 31       3       9 2.455649e-09         Yes
## 32       3      10 1.446520e-08         Yes
## 33       3      11 1.392513e-08         Yes
## 34       3      12 5.629992e-08         Yes
## 35       3      13 1.693237e-08         Yes
## 36       3      14 5.571670e-08         Yes
## 37       4       5 8.610989e-03         Yes
## 38       4       6 8.163688e-04         Yes
## 39       4       7 3.356121e-10         Yes
## 40       4       8 1.074098e-05         Yes
## 41       4       9 1.036627e-07         Yes
## 42       4      10 6.460885e-07         Yes
## 43       4      11 9.895693e-07         Yes
## 44       4      12 5.365999e-06         Yes
## 45       4      13 2.333991e-06         Yes
## 46       4      14 1.031598e-05         Yes
## 47       5       6 1.654251e-06         Yes
## 48       5       7 1.035835e-08         Yes
## 49       5       8 3.542278e-04         Yes
## 50       5       9 5.736968e-06         Yes
## 51       5      10 1.052197e-05         Yes
## 52       5      11 3.681363e-05         Yes
## 53       5      12 1.944498e-04         Yes
## 54       5      13 6.867979e-05         Yes
## 55       5      14 3.713461e-04         Yes
## 56       6       7 6.514968e-14         Yes
## 57       6       8 1.452611e-09         Yes
## 58       6       9 2.456485e-11         Yes
## 59       6      10 2.199424e-10         Yes
## 60       6      11 4.072362e-10         Yes
## 61       6      12 3.579193e-10         Yes
## 62       6      13 3.580718e-10         Yes
## 63       6      14 7.713454e-10         Yes
## 64       7       8 1.971053e-05         Yes
## 65       7       9 4.110505e-04         Yes
## 66       7      10 1.075116e-04         Yes
## 67       7      11 8.182855e-05         Yes
## 68       7      12 2.131455e-05         Yes
## 69       7      13 1.283626e-05         Yes
## 70       7      14 4.425322e-05         Yes
## 71       8       9 8.109189e-02          No
## 72       8      10 3.118137e-01          No
## 73       8      11 4.552619e-01          No
## 74       8      12 8.789112e-01          No
## 75       8      13 8.831379e-01          No
## 76       8      14 8.330633e-01          No
## 77       9      10 2.255099e-01          No
## 78       9      11 1.649483e-01          No
## 79       9      12 2.559919e-02         Yes
## 80       9      13 4.040016e-02         Yes
## 81       9      14 3.743972e-02         Yes
## 82      10      11 3.596937e-01          No
## 83      10      12 1.542088e-01          No
## 84      10      13 3.331840e-01          No
## 85      10      14 3.009194e-01          No
## 86      11      12 1.314828e-01          No
## 87      11      13 3.349624e-01          No
## 88      11      14 3.562930e-01          No
## 89      12      13 8.495031e-01          No
## 90      12      14 8.107766e-01          No
## 91      13      14 9.675252e-01          No