Homework 3

2023-10-01

I. Introduction

The data set contains technical specifications of cars and was downloaded from the UCI Machine Learning Repository. The columns are described below:

1) mpg - miles per gallon (continuous)
2) cylinders - number of cylinders in the engine (categorical)
3) displacement - combined volume of the pistons inside the engine's cylinders (continuous)
4) horsepower - power produced by the engine (continuous)
5) weight - mass of the car (continuous)
6) acceleration - rate of change of the vehicle's velocity (continuous)
7) year - model year of the car (discrete)
8) origin - 1 = car made in America, 2 = car made in Europe, 3 = car made in another part of the world (categorical)

Additionally, the mpg column was transformed into a binary variable, mpg01, based on the median of mpg (a data-preparation sketch follows the method list below). To predict whether a given car gets high or low gas mileage from the 7 car attributes (cylinders, displacement, horsepower, weight, acceleration, model year, and origin), we use the following classification methods:

1) LDA (linear discriminant analysis)
2) QDA (quadratic discriminant analysis)
3) Naive Bayes
4) Logistic regression
5) KNN with several values of k (since the PCA-KNN model is optional, I decided not to fit it in this homework)
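As an illustration of this setup, here is a minimal R sketch of the data preparation. It assumes the cleaned auto-mpg data (392 complete rows with the eight columns above); the ISLR package's `Auto` data frame is used as a stand-in for the UCI download, and the object names (`auto`, `train`, `test`), the seed, and the roughly 10% hold-out split are my assumptions, not necessarily what the report used.

```r
# Minimal data-preparation sketch (assumptions: ISLR's Auto data as a stand-in
# for the UCI download; object names, seed, and split size are illustrative).
library(ISLR)   # provides the Auto data frame (392 complete rows)

auto <- Auto[, c("mpg", "cylinders", "displacement", "horsepower",
                 "weight", "acceleration", "year", "origin")]

# mpg01 = 1 if mpg is above its median, 0 otherwise; then drop mpg itself
auto$mpg01 <- as.numeric(auto$mpg > median(auto$mpg))
auto$mpg   <- NULL

# Random train/test split (a ~10% hold-out is an assumption)
set.seed(7406)
test_id <- sample(nrow(auto), 39)
train   <- auto[-test_id, ]
test    <- auto[test_id, ]
```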
II. Exploratory Data Analysis

From the scatterplot matrix of all variables, it is hard to see the relationships between mpg01 and the other variables. Based on the correlation matrix below, cylinders, displacement, horsepower, and weight have stronger (absolute) correlations with mpg01 than acceleration, year, and origin, though the latter are still moderately correlated with mpg01. A short EDA sketch follows the matrix.

```
##              mpg01 cylinders displacement horsepower weight acceleration  year origin
## mpg01         1.00     -0.76        -0.75      -0.67  -0.76         0.35  0.43   0.51
## cylinders    -0.76      1.00         0.95       0.84   0.90        -0.50 -0.35  -0.57
## displacement -0.75      0.95         1.00       0.90   0.93        -0.54 -0.37  -0.61
## horsepower   -0.67      0.84         0.90       1.00   0.86        -0.69 -0.42  -0.46
## weight       -0.76      0.90         0.93       0.86   1.00        -0.42 -0.31  -0.59
## acceleration  0.35     -0.50        -0.54      -0.69  -0.42         1.00  0.29   0.21
## year          0.43     -0.35        -0.37      -0.42  -0.31         0.29  1.00   0.18
## origin        0.51     -0.57        -0.61      -0.46  -0.59         0.21  0.18   1.00
```
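The sketch below shows one way to produce this exploratory output, reusing the `auto` object from the Section I sketch. It is an assumption about how the figures were made, not a copy of the report's code; it also produces the box plots discussed next (the report loads gridExtra, but base graphics are used here for brevity).

```r
# EDA sketch: scatterplot matrix, rounded correlation matrix, and per-variable
# box plots against mpg01 (object names come from the Section I sketch).
pairs(auto, col = auto$mpg01 + 1)   # scatterplots of all variables, colored by mpg01
round(cor(auto), 2)                 # correlation matrix, as printed above

# Box plots of each predictor split by the two mpg01 groups
par(mfrow = c(2, 4))
for (v in setdiff(names(auto), "mpg01")) {
  boxplot(auto[[v]] ~ auto$mpg01, xlab = "mpg01", ylab = v, main = v)
}
par(mfrow = c(1, 1))
```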
From the box plots, we see that cylinders, weight, displacement, and horsepower show a clear separation between the two mpg01 groups, so these are natural candidate predictors. Box plots are less informative for the categorical variables (cylinders, origin). For the classification models below, we therefore include all variables.

III. Classification Methods and Results

From the training and testing errors shown in the table below, QDA has the lowest training error among models (i)-(iv) and is tied for the second-lowest testing error. For the KNN models, k = 3 was selected as a reasonable compromise between training and testing error (k = 1 drives the training error to zero, which suggests overfitting). On this single train/test split, QDA appears to perform best. A sketch of the model fitting follows the table.

| Model                    | Train Error | Test Error |
|--------------------------|-------------|------------|
| (i) LDA                  | 0.08498584  | 0.1025641  |
| (ii) QDA                 | 0.07082153  | 0.1025641  |
| (iii) Naive Bayes        | 0.09915014  | 0.07692308 |
| (iv) Logistic regression | 0.08498584  | 0.1538462  |
| (v) KNN, k=1             | 0.0         | 0.0769     |
| (v) KNN, k=2             | 0.0595      | 0.1026     |
| (v) KNN, k=3             | 0.0822      | 0.1282     |
| (v) KNN, k=4             | 0.0935      | 0.1026     |
| (v) KNN, k=5             | 0.0850      | 0.1026     |
| (v) KNN, k=6             | 0.0935      | 0.1282     |
| (v) KNN, k=7             | 0.1020      | 0.1282     |
| (v) KNN, k=8             | 0.0907      | 0.1282     |
| (v) KNN, k=9             | 0.992       | 0.1026     |
| (v) KNN, k=10            | 0.1020      | 0.1282     |
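The following hedged sketch shows one way to fit the five classifiers on the single split from Section I and compute the misclassification rates. The helper `err()`, the 0.5 probability cut-off for logistic regression, and the standardisation of predictors for KNN are my assumptions; the report's code may differ and the exact numbers need not match the table.

```r
# Sketch of the five classifiers on the single train/test split
# (libraries, helper, and modelling choices are assumptions).
library(MASS)    # lda(), qda()
library(e1071)   # naiveBayes()
library(class)   # knn()

err <- function(pred, truth) mean(pred != truth)   # misclassification rate

# (i) LDA and (ii) QDA
fit_lda <- lda(mpg01 ~ ., data = train)
fit_qda <- qda(mpg01 ~ ., data = train)
c(lda = err(predict(fit_lda, test)$class, test$mpg01),
  qda = err(predict(fit_qda, test)$class, test$mpg01))

# (iii) Naive Bayes (response coerced to a factor)
fit_nb <- naiveBayes(as.factor(mpg01) ~ ., data = train)
err(predict(fit_nb, test), test$mpg01)

# (iv) Logistic regression with a 0.5 probability cut-off
fit_glm <- glm(mpg01 ~ ., data = train, family = binomial)
err(as.numeric(predict(fit_glm, test, type = "response") > 0.5), test$mpg01)

# (v) KNN for k = 1..10; predictors standardised here (an assumption,
# the report does not say whether it scaled them)
xtr <- scale(train[, names(train) != "mpg01"])
xte <- scale(test[,  names(test)  != "mpg01"],
             center = attr(xtr, "scaled:center"),
             scale  = attr(xtr, "scaled:scale"))
sapply(1:10, function(k)
  err(knn(xtr, xte, cl = factor(train$mpg01), k = k), test$mpg01))
```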
To compare the models more reliably (for KNN we chose k = 3), we performed repeated random splitting (Monte Carlo cross-validation) with n = 50 repetitions and averaged the training and testing errors (50 repetitions were used because of computational limits).

| Model                    | Train Error Mean | Train Error Variance | Test Error Mean | Test Error Variance |
|--------------------------|------------------|----------------------|-----------------|---------------------|
| (i) LDA                  | 0.0892           | 0.0055               | 0.0923          | 0.0513              |
| (ii) QDA                 | 0.0827           | 0.0060               | 0.0913          | 0.0443              |
| (iii) Naive Bayes        | 0.0961           | 0.056                | 0.0938          | 0.045               |
| (iv) Logistic regression | 0.1008           | 0.0062               | 0.1072          | 0.0471              |
| (v) KNN, k=3             | 0.0774           | 0.0078               | 0.1282          | 0.0494              |

From the cross-validation results, QDA has the smallest mean test error, which suggests it is the best model. To check this conclusion, we performed paired t-tests of each other model's test errors against QDA's (output below; a sketch of the procedure follows the output). The p-values for the comparisons with logistic regression (0.03718) and KNN, k = 3 (0.00307) are below the 0.05 threshold, so QDA performs significantly better than those two models; the comparisons with LDA (p = 0.7219) and Naive Bayes (p = 0.6106) are not significant, indicating those models perform comparably to QDA.

```
## 
##  Paired t-test
## 
## data:  TE_test_all[, 1] and TE_test_all[, 2]
## t = 0.35799, df = 49, p-value = 0.7219
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.007097565  0.010174488
## sample estimates:
## mean difference 
##     0.001538462
```
```
## 
##  Paired t-test
## 
## data:  TE_test_all[, 3] and TE_test_all[, 2]
## t = -0.51258, df = 49, p-value = 0.6106
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.010093411  0.005990847
## sample estimates:
## mean difference 
##    -0.002051282
```

```
## 
##  Paired t-test
## 
## data:  TE_test_all[, 4] and TE_test_all[, 2]
## t = 2.1422, df = 49, p-value = 0.03718
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.0006665655 0.0208718961
## sample estimates:
## mean difference 
##      0.01076923
```

```
## 
##  Paired t-test
## 
## data:  TE_test_all[, 5] and TE_test_all[, 2]
## t = 3.1151, df = 49, p-value = 0.00307
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.008190057 0.037963789
## sample estimates:
## mean difference 
##      0.02307692
```
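To make the comparison procedure concrete, here is a hedged sketch of the 50-repetition Monte Carlo cross-validation and the paired t-tests against QDA. It reuses `auto`, `err()`, and the fitting calls from the earlier sketches; the column order of `TE_test_all` (LDA, QDA, Naive Bayes, logistic regression, KNN k = 3), the seed, and the split size are assumptions based on the printed output, not the author's actual code.

```r
# Monte Carlo cross-validation sketch: 50 random splits, with test errors
# collected in a 50 x 5 matrix whose assumed column order matches the table.
set.seed(7406)
B <- 50
TE_test_all <- matrix(NA, nrow = B, ncol = 5,
                      dimnames = list(NULL, c("lda", "qda", "nb", "logit", "knn3")))

for (b in 1:B) {
  test_id <- sample(nrow(auto), 39)        # same ~10% hold-out as before
  tr <- auto[-test_id, ]
  te <- auto[test_id, ]
  x  <- names(auto) != "mpg01"             # predictor columns

  TE_test_all[b, 1] <- err(predict(lda(mpg01 ~ ., data = tr), te)$class, te$mpg01)
  TE_test_all[b, 2] <- err(predict(qda(mpg01 ~ ., data = tr), te)$class, te$mpg01)
  TE_test_all[b, 3] <- err(predict(naiveBayes(as.factor(mpg01) ~ ., data = tr), te),
                           te$mpg01)
  p_hat <- predict(glm(mpg01 ~ ., data = tr, family = binomial),
                   te, type = "response")
  TE_test_all[b, 4] <- err(as.numeric(p_hat > 0.5), te$mpg01)
  # Predictors left unscaled here for brevity; standardising as in the
  # Section III sketch could be added inside the loop.
  TE_test_all[b, 5] <- err(knn(tr[, x], te[, x], cl = factor(tr$mpg01), k = 3),
                           te$mpg01)
}

colMeans(TE_test_all)        # mean test error per model, as tabulated above
apply(TE_test_all, 2, var)   # variability of the test error across the 50 splits

# Paired t-tests of each competitor against QDA (column 2)
for (j in c(1, 3, 4, 5)) {
  print(t.test(TE_test_all[, j], TE_test_all[, 2], paired = TRUE))
}
```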