Homework3
2023-10-01
I. Introduction
The data for this homework consists of technical specifications of cars, downloaded from the UCI Machine Learning Repository. Each column is described below:
1) Mpg - miles per gallon (continuous)
2) Cylinders - number of cylinders in the engine (categorical)
3) Displacement - combined volume of the pistons inside the cylinders of the engine (continuous)
4) Horsepower - power produced by the engine (continuous)
5) Weight - mass of the car (continuous)
6) Acceleration - rate of change of the vehicle's velocity (continuous)
7) Year - model year, i.e., how long it has been since the car was manufactured (discrete)
8) Origin - 1 = car made in America, 2 = car made in Europe, 3 = car made in another part of the world (categorical)
Additionally, the mpg column was transformed into a binary variable, mpg01, indicating whether a car's mileage is above the median of mpg.
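A minimal sketch of this transformation in R is shown below; the data frame names `Auto` and `Auto01` are assumptions made for illustration (only `mpg01` itself appears in the output later in this report):

```r
# Sketch (assumed object names): build the binary response mpg01 from the median of mpg.
Auto01 <- Auto                                                  # Auto = UCI auto-mpg data, already loaded
Auto01$mpg01 <- ifelse(Auto01$mpg > median(Auto01$mpg), 1, 0)   # 1 = mileage above the median
Auto01$mpg <- NULL                                              # drop the continuous mpg column
```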
In order to predict whether a given car gets high or low gas mileage based on the seven car attributes (cylinders, displacement, horsepower, weight, acceleration, model year, and origin), we use the classification methods listed below; a sketch of how these models could be fit in R follows the list:
1) LDA (linear discriminant analysis)
2) QDA (quadratic discriminant analysis)
3) Naïve Bayes
4) Logistic regression
5) KNN with several values of k
(Since the PCA-KNN model is optional, I decided not to perform that classification method in this homework.)
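The sketch below shows one way the five classifiers could be fit in R on a single training/testing split. The packages (MASS, e1071, class), the 80/20 split, the seed, and the object names are assumptions made for illustration, not taken from the original code; all predictors are treated as numeric.

```r
library(MASS)    # lda(), qda()
library(e1071)   # naiveBayes()
library(class)   # knn()

set.seed(7406)                                           # arbitrary seed for reproducibility
n         <- nrow(Auto01)
train_idx <- sample(seq_len(n), size = round(0.8 * n))   # assumed 80/20 train/test split
train     <- Auto01[train_idx, ]
test      <- Auto01[-train_idx, ]

# (i) LDA and (ii) QDA
fit_lda <- lda(mpg01 ~ ., data = train)
fit_qda <- qda(mpg01 ~ ., data = train)

# (iii) Naive Bayes with the class label as a factor
fit_nb <- naiveBayes(as.factor(mpg01) ~ ., data = train)

# (iv) Logistic regression
fit_glm <- glm(mpg01 ~ ., data = train, family = binomial)

# (v) KNN: scale the predictors, then classify with k = 3 as an example
x_train <- scale(train[, setdiff(names(train), "mpg01")])
x_test  <- scale(test[, setdiff(names(test), "mpg01")],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
pred_knn <- knn(x_train, x_test, cl = train$mpg01, k = 3)
```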
II. Exploratory Data Analysis
From the scatterplot matrix of all variables, it is hard to see clear relationships between mpg01 and the other variables.
Based on the correlation matrix below, cylinders, displacement, horsepower, and weight have stronger (absolute) correlations with mpg01 than acceleration, year, and origin, although the differences are not dramatic.
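The matrix below could be produced with a call of this form (a sketch, reusing the assumed `Auto01` data frame from the earlier sketch):

```r
# Pairwise correlations of the binary response and all predictors, rounded to two decimals.
round(cor(Auto01[, c("mpg01", "cylinders", "displacement", "horsepower",
                     "weight", "acceleration", "year", "origin")]), 2)
```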
##              mpg01 cylinders displacement horsepower weight acceleration  year origin
## mpg01         1.00     -0.76        -0.75      -0.67  -0.76         0.35  0.43   0.51
## cylinders    -0.76      1.00         0.95       0.84   0.90        -0.50 -0.35  -0.57
## displacement -0.75      0.95         1.00       0.90   0.93        -0.54 -0.37  -0.61
## horsepower   -0.67      0.84         0.90       1.00   0.86        -0.69 -0.42  -0.46
## weight       -0.76      0.90         0.93       0.86   1.00        -0.42 -0.31  -0.59
## acceleration  0.35     -0.50        -0.54      -0.69  -0.42         1.00  0.29   0.21
## year          0.43     -0.35        -0.37      -0.42  -0.31         0.29  1.00   0.18
## origin        0.51     -0.57        -0.61      -0.46  -0.59         0.21  0.18   1.00
From the box plots below, we see that cylinders, weight, displacement, and horsepower show a clear separation between the two mpg01 groups, so these could be selected as predictors. We also see that box plots are not very informative for the categorical variables (cylinders, origin).
Therefore, we include all variables in the classification models.
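A sketch of how such box plots could be drawn with ggplot2 and gridExtra (the package choice and column names are assumptions; the original figures are not reproduced here):

```r
library(ggplot2)
library(gridExtra)

# One box plot per predictor, split by the binary response mpg01.
box_for <- function(var) {
  ggplot(Auto01, aes(x = factor(mpg01), y = .data[[var]])) +
    geom_boxplot() +
    labs(x = "mpg01", y = var)
}
vars <- c("cylinders", "displacement", "horsepower", "weight", "acceleration")
grid.arrange(grobs = lapply(vars, box_for), ncol = 2)
```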
III. Classification methods and results:
From the training and testing errors of each model shown in the table below, we can see that QDA has the lowest training error and the second-lowest testing error. Among the KNN models, k = 3 was chosen based on how the training and testing errors change with k. Based on this single train/test split, QDA performs best.
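For reference, the training and testing errors for the discriminant models could be computed along these lines (a sketch that reuses the assumed `fit_lda`, `train`, and `test` objects from the fitting sketch above; the QDA errors are analogous):

```r
# Misclassification error of an LDA/QDA fit on a given data set.
misclass <- function(fit, data) mean(predict(fit, data)$class != data$mpg01)
c(train = misclass(fit_lda, train), test = misclass(fit_lda, test))
```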
| Model | Train Error | Test Error |
|-------|-------------|------------|
| (i) LDA | 0.08498584 | 0.1025641 |
| (ii) QDA | 0.07082153 | 0.1025641 |
| (iii) Naive Bayes | 0.09915014 | 0.07692308 |
| (iv) Logistic regression | 0.08498584 | 0.1538462 |
| (v) KNN, k=1 | 0.0 | 0.0769 |
| (v) KNN, k=2 | 0.0595 | 0.1026 |
| (v) KNN, k=3 | 0.0822 | 0.1282 |
| (v) KNN, k=4 | 0.0935 | 0.1026 |
| (v) KNN, k=5 | 0.0850 | 0.1026 |
| (v) KNN, k=6 | 0.0935 | 0.1282 |
| (v) KNN, k=7 | 0.1020 | 0.1282 |
| (v) KNN, k=8 | 0.0907 | 0.1282 |
| (v) KNN, k=9 | 0.992 | 0.1026 |
| (v) KNN, k=10 | 0.1020 | 0.1282 |
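The KNN rows of the table could be generated with a loop of roughly this form (a sketch; `x_train`, `x_test`, `train`, and `test` are the assumed objects from the fitting sketch above):

```r
# Training and testing misclassification error of KNN for k = 1, ..., 10.
knn_errors <- t(sapply(1:10, function(k) {
  pred_tr <- knn(x_train, x_train, cl = train$mpg01, k = k)
  pred_te <- knn(x_train, x_test,  cl = train$mpg01, k = k)
  c(k = k,
    train_err = mean(pred_tr != train$mpg01),
    test_err  = mean(pred_te != test$mpg01))
}))
knn_errors
```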
To compare the models more fairly (for KNN, we chose k = 3), we performed Monte Carlo cross-validation with B = 50 random train/test splits and recorded the average testing error (50 repetitions were used due to computational limits).
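The loop below sketches how this Monte Carlo cross-validation might be implemented; the 80/20 split, the helper objects, and the column ordering of the error matrix (assumed to match the model numbering (i)-(v), so that QDA is column 2 of `TE_test_all`) are assumptions made for illustration. The averaged results follow.

```r
library(MASS); library(e1071); library(class)   # assumed packages, as in the earlier sketch

B <- 50                                          # number of random train/test splits
TE_test_all <- matrix(NA, nrow = B, ncol = 5,
                      dimnames = list(NULL, c("LDA", "QDA", "NB", "Logistic", "KNN3")))

for (b in 1:B) {
  idx <- sample(seq_len(nrow(Auto01)), size = round(0.8 * nrow(Auto01)))
  tr  <- Auto01[idx, ]
  te  <- Auto01[-idx, ]

  # (i) LDA and (ii) QDA
  TE_test_all[b, "LDA"] <- mean(predict(lda(mpg01 ~ ., data = tr), te)$class != te$mpg01)
  TE_test_all[b, "QDA"] <- mean(predict(qda(mpg01 ~ ., data = tr), te)$class != te$mpg01)

  # (iii) Naive Bayes
  TE_test_all[b, "NB"] <- mean(predict(naiveBayes(as.factor(mpg01) ~ ., data = tr), te) != te$mpg01)

  # (iv) Logistic regression with a 0.5 cutoff
  p_hat <- predict(glm(mpg01 ~ ., data = tr, family = binomial), te, type = "response")
  TE_test_all[b, "Logistic"] <- mean(as.numeric(p_hat > 0.5) != te$mpg01)

  # (v) KNN, k = 3, on scaled predictors
  xs_tr <- scale(tr[, setdiff(names(tr), "mpg01")])
  xs_te <- scale(te[, setdiff(names(te), "mpg01")],
                 center = attr(xs_tr, "scaled:center"),
                 scale  = attr(xs_tr, "scaled:scale"))
  TE_test_all[b, "KNN3"] <- mean(knn(xs_tr, xs_te, cl = tr$mpg01, k = 3) != te$mpg01)
}

colMeans(TE_test_all)        # mean test error per model
apply(TE_test_all, 2, var)   # test error variance per model
```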
| Model | Train Error Mean | Train Error Variance | Test Error Mean | Test Error Variance |
|-------|------------------|----------------------|-----------------|---------------------|
| (i) LDA | 0.0892 | 0.0055 | 0.0923 | 0.0513 |
| (ii) QDA | 0.0827 | 0.0060 | 0.0913 | 0.0443 |
| (iii) Naive Bayes | 0.0961 | 0.056 | 0.0938 | 0.045 |
| (iv) Logistic regression | 0.1008 | 0.0062 | 0.1072 | 0.0471 |
| (v) KNN, k=3 | 0.0774 | 0.0078 | 0.1282 | 0.0494 |
From the cross-validation results, the QDA model has the smallest mean test error, which suggests that it is the best model.
To support this conclusion, we performed paired t-tests comparing each of the other models' test errors with QDA's. Two of the four comparisons have p-values below the 0.05 threshold (p = 0.037 and p = 0.003), giving statistically significant evidence that QDA outperforms those models, while the other two comparisons (p = 0.72 and p = 0.61) are not statistically significant. QDA therefore remains the preferred model on the basis of its lowest mean test error, although it is not significantly better than every alternative.
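Each paired t-test below compares one model's 50 test errors with QDA's (column 2 of `TE_test_all` under the assumed column ordering), for example:

```r
# Paired t-test of the first model's test errors against QDA's over the 50 splits.
t.test(TE_test_all[, 1], TE_test_all[, 2], paired = TRUE)
```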
## 
##  Paired t-test
## 
## data:  TE_test_all[, 1] and TE_test_all[, 2]
## t = 0.35799, df = 49, p-value = 0.7219
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.007097565  0.010174488
## sample estimates:
## mean difference 
##     0.001538462 
## 
##  Paired t-test
## 
## data:  TE_test_all[, 3] and TE_test_all[, 2]
## t = -0.51258, df = 49, p-value = 0.6106
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.010093411  0.005990847
## sample estimates:
## mean difference 
##    -0.002051282 
## 
##  Paired t-test
## 
## data:  TE_test_all[, 4] and TE_test_all[, 2]
## t = 2.1422, df = 49, p-value = 0.03718
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.0006665655 0.0208718961
## sample estimates:
## mean difference 
##      0.01076923 
## 
##  Paired t-test
## 
## data:  TE_test_all[, 5] and TE_test_all[, 2]
## t = 3.1151, df = 49, p-value = 0.00307
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.008190057 0.037963789
## sample estimates:
## mean difference 
##      0.02307692 