HW5 Submission 29Oct23

ISYE 7406 Homework 5: Ensemble Methods: Random Forest and Boosting

Introduction

The cleaned Auto MPG dataset consists of 8 variables describing the specifications of 392 cars. The binary variable 'mpg01' is created to indicate whether a car's mpg is at or above the median (1) or below it (0). This report aims to identify methods that can guide the manufacturing or purchase of high-gas-mileage cars.

Objectives

This report uses the following classification methods to predict mpg01:
1. Random forest
2. Boosting
3. LDA
4. QDA
5. Naïve Bayes
6. Logistic regression
7. KNN with several values of k

For random forest and boosting, parameter tuning is done using cross-validation on the training data:
  o Using B = 100 loops, the mean and variance of the train error for each combination of tuning parameters are captured and compared.
  o The tuning-parameter combination with the lowest train error is used to build the model, which is then compared with the other methods.

The models are then evaluated using:
  o Train/test error: (False Positives + False Negatives) / (Number of observations)
  o Sensitivity: TP / (TP + FN)
  o Specificity: TN / (TN + FP)
  o Monte Carlo cross-validation with B = 100 loops:
    - Compare the mean and variance of the train/test errors for each model.
    - Compare the mean sensitivity and specificity for each model.
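For reference, these evaluation metrics can be computed from a confusion matrix in a few lines of R. This is only a minimal sketch: `pred` and `truth` are assumed placeholder objects holding the predicted and observed mpg01 values (TRUE/FALSE) on one train/test split; the full evaluation loops are in the appendix code.

cm <- table(Predicted = pred, Actual = truth)      # rows = predictions, columns = true labels
error_rate <- mean(pred != truth)                  # (FP + FN) / number of observations
TP <- cm["TRUE", "TRUE"];  TN <- cm["FALSE", "FALSE"]
FP <- cm["TRUE", "FALSE"]; FN <- cm["FALSE", "TRUE"]
sensitivity <- TP / (TP + FN)                      # true-positive rate
specificity <- TN / (TN + FP)                      # true-negative rate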
Exploratory Data Analysis

The 8 variables in the dataset are described below.

Table 1: Overview of the variables in the cleaned dataset

Variable name | Description | Type
mpg01         | 0 if the car's mpg is below the median, 1 if it is at or above the median | Binary (True/False)
cylinders     | Number of cylinders, between 4 and 8 | Numerical
displacement  | Engine displacement (cu. inches) | Numerical
horsepower    | Engine horsepower | Numerical
weight        | Vehicle weight (lbs.) | Numerical
acceleration  | Time to accelerate from 0 to 60 mph (sec.) | Numerical
year          | Model year (modulo 100) | Numerical
origin        | Origin of car (1 = American, 2 = European, 3 = Japanese) | Categorical

As the origin variable is categorical, one-hot encoding is applied to create three columns, one for each origin category. The median of the original mpg variable is 22.75; creating the mpg01 variable from this median value ensures an even split of 196 'TRUE' and 196 'FALSE' values, which eliminates the need to balance the dataset for classification modelling. A short sketch of this preprocessing follows Figure 1.

Figure 1: Histogram of the original mpg variable, with the median shown as a vertical line.
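The preprocessing described above can be sketched as below; this mirrors the appendix code, with the data frame name `df` assumed and the raw mpg and origin columns dropped after the recoding.

df$mpg01  <- I(df$mpg >= median(df$mpg))                  # TRUE if mpg is at or above the median (22.75)
df$origin <- factor(df$origin)
df <- cbind(df, model.matrix(~ origin - 1, data = df))    # creates the origin1, origin2, origin3 dummy columns
df <- df[, !colnames(df) %in% c("mpg", "origin")]         # keep mpg01 and the encoded predictors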
Figure 2: Boxplots of the auto dataset

The boxplots show that some outliers are present in the data, particularly in the 'weight' variable. The scales of the variables also differ: 'weight' is at least 10 times larger in magnitude than the other variables, which could affect the classification models.

Table 2: Descriptive statistics of the variables

Variable      | Min  | Median | Max
cylinders     | 3    | 4      | 8
displacement  | 68   | 151    | 455
horsepower    | 46   | 93.5   | 230
weight        | 1613 | 2804   | 5140
acceleration  | 8    | 15.5   | 24.8
year          | 70   | 76     | 82
origin1       | 0    | 1      | 1
origin2       | 0    | 0      | 1
origin3       | 0    | 0      | 1
Figure 3: Correlation heatmap of the auto dataset

Figure 3 above shows the correlation heatmap of the auto dataset. The correlation coefficients are (columns follow the same variable order as the rows):

mpg01          1.00  -0.76  -0.75  -0.67  -0.76   0.35   0.43  -0.53   0.27   0.39
cylinders     -0.76   1.00   0.95   0.84   0.90  -0.50  -0.35   0.61  -0.35  -0.40
displacement  -0.75   0.95   1.00   0.90   0.93  -0.54  -0.37   0.66  -0.37  -0.44
horsepower    -0.67   0.84   0.90   1.00   0.86  -0.69  -0.42   0.49  -0.28  -0.32
weight        -0.76   0.90   0.93   0.86   1.00  -0.42  -0.31   0.60  -0.29  -0.45
acceleration   0.35  -0.50  -0.54  -0.69  -0.42   1.00   0.29  -0.26   0.21   0.12
year           0.43  -0.35  -0.37  -0.42  -0.31   0.29   1.00  -0.14  -0.04   0.20
origin1       -0.53   0.61   0.66   0.49   0.60  -0.26  -0.14   1.00  -0.59  -0.65
origin2        0.27  -0.35  -0.37  -0.28  -0.29   0.21  -0.04  -0.59   1.00  -0.23
origin3        0.39  -0.40  -0.44  -0.32  -0.45   0.12   0.20  -0.65  -0.23   1.00

Based on the correlation coefficients:
- mpg01 has a strong negative correlation (< -0.7) with:
  o cylinders
  o displacement
  o weight
- mpg01 has a moderate correlation with:
  o origin1 (moderate positive correlation, 0.5 to 0.7)
  o horsepower (moderate negative correlation, -0.5 to -0.7)
- mpg01 has a low correlation with:
  o acceleration
  o year
  o origin2, origin3

In addition, the correlation heatmap also suggests multicollinearity in the data, as the four variables cylinders, displacement, horsepower, and weight have strong positive correlations (> 0.7) with each other. Most notably, cylinders and displacement have the strongest
correlation (0.95), followed by displacement and weight (0.93) and displacement and horsepower (0.90).

Table 3 shows the calculated Variance Inflation Factors (VIF), used to determine whether multicollinearity is present in the data. A VIF above 5 indicates high collinearity, and a VIF above 10 indicates very high collinearity. The barplot in Figure 4 clearly shows that 'cylinders' and 'weight' have high collinearity (VIF > 5) and 'displacement' has very high collinearity (VIF > 10) with the other predictors in the model. Multicollinearity needs to be addressed, as it could destabilize the coefficient estimates of the classification models, so feature selection is required (a minimal sketch of the VIF computation is shown after Table 3).

Figure 4: Barplot of VIF of all variables

Table 3: VIF of all variables

Variable     | VIF
cylinders    | 5.41
displacement | 11.6
horsepower   | 2.69
weight       | 6.60
acceleration | 2.52
year         | 2.16
origin1      | 2.94
origin2      | 1.96
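For reference, VIFs like those in Table 3 can be computed with car::vif on a regression of mpg01 on the candidate predictors; the exact call used for Table 3 is in the appendix code, so the following is only a minimal sketch.

library(car)
# origin3 is left out so the three origin dummies are not perfectly collinear
fit_full <- glm(mpg01 ~ cylinders + displacement + horsepower + weight +
                  acceleration + year + origin1 + origin2, data = df)
vif(fit_full)   # VIF > 5: high collinearity; VIF > 10: very high collinearity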
Variable selection

To select the variables to use in the model, stepwise variable selection using the AIC (Akaike Information Criterion) is run for B = 100 loops. Figure 5 shows the number of times each variable was selected out of the 100 loops: 'weight' and 'year' are consistently selected, followed by 'horsepower' (91 out of 100 times).

Figure 5: Number of times each variable was selected by stepwise variable selection across the 100 loops

From Figure 5, the variables horsepower, weight, year, and origin1 were selected in over half of the loops (> 50 counts). Referring to Table 4, the VIFs of the model containing only these selected variables are all below 5.0, indicating that there is no problematic multicollinearity among the predictors in the reduced model. As such, the models are built using the variables horsepower + weight + year + origin1. A sketch of one stepwise-selection run is shown after Table 4.

Table 4: VIF of the model with only the horsepower, weight, year, and origin1 variables

Variable   | VIF
horsepower | 1.14
weight     | 1.64
year       | 1.63
origin1    | 1.16
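A single loop of the stepwise-AIC selection can be sketched as below; the appendix code repeats this B = 100 times and tallies how often each variable is retained (n and n1 are the sample sizes defined in the appendix code).

flag  <- sort(sample(1:n, n1))     # hold out roughly 20% of the rows
train <- df[-flag, ]
fit_step <- step(glm(mpg01 ~ ., family = binomial, data = train), trace = 0)   # AIC-based stepwise selection
names(coef(fit_step))[-1]          # variables kept in this loop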
Methods: Monte Carlo Cross-Validation

As a single train/test split is not sufficiently robust for evaluating model performance, single-split results are not used. Instead, Monte Carlo cross-validation with B = 100 loops is used for parameter tuning and model comparison. In each loop, the dataset is randomly split into an 80% train set and a 20% test set (n = 314 for the train set and n = 78 for the test set). An 80/20 split is used to give a sufficiently large test sample, and a seed is set for replicability. Within each loop, the models are trained on that loop's training data.

For parameter tuning, performance is evaluated on the training data of each loop, because in real-world applications there is no access to the true response values of the test data. Model comparison, however, is done using the sample mean and variance of the test errors, since the response values of the test data are known in this study. A sketch of one Monte Carlo tuning loop is shown after Table 5.

Results

Parameter tuning: Random Forest

Table 5 shows the mean train error across the B = 100 Monte Carlo cross-validation loops for models built with the selected horsepower + weight + year + origin1 variables only, for different combinations of n.trees, mtry, and nodesize.

Table 5: Random Forest: mean train error across the 100 Monte Carlo cross-validation loops for different values of n.trees, mtry, and nodesize; the rank of each train error is shown in parentheses.

mtry  nodesize   n.trees = 200   n.trees = 300   n.trees = 500
1     1          0.027 (19)      0.028 (21)      0.027 (20)
1     2          0.029 (22)      0.029 (23)      0.029 (23)
1     5          0.035 (30)      0.034 (26)      0.035 (28)
1     10         0.040 (35)      0.041 (36)      0.040 (34)
2     1          0.000 (6)       0.000 (4)       0.000 (5)
2     2          0.005 (12)      0.005 (11)      0.005 (10)
2     5          0.022 (14)      0.023 (15)      0.022 (13)
2     10         0.035 (29)      0.033 (25)      0.034 (27)
3     1          0.000 (1)       0.000 (1)       0.000 (1)
3     2          0.002 (9)       0.001 (7)       0.001 (8)
3     5          0.024 (17)      0.025 (18)      0.024 (16)
3     10         0.039 (31)      0.039 (33)      0.039 (32)
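One Monte Carlo tuning loop for the random-forest grid can be sketched as follows (randomForest package; the parameter values shown are one cell of the grid in Table 5, and the object names are assumed to match the appendix code).

library(randomForest)
flag  <- sort(sample(1:n, n1))                    # random 80/20 split of the 392 cars
train <- df[-flag, ];  test <- df[flag, ]
fit_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                       data = train, ntree = 300, mtry = 2, nodesize = 2)
train_err <- mean(predict(fit_rf, train) != train$mpg01)   # averaged over the 100 loops in Table 5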
From Table 5, increasing the number of trees does not produce large reductions in train error, suggesting that growing more trees beyond a certain point may not significantly improve the model's performance on the training data.

'mtry' is the number of variables randomly sampled as candidates at each split. From Table 5, the average train error tends to decrease as mtry increases from 1 to 2, but tends to increase again as mtry increases from 2 to 3. This suggests that mtry = 2 gives the better model performance.

'nodesize' is the minimum size of the terminal nodes in the trees; a smaller nodesize allows the trees to grow deeper and fit the training data more closely. From Table 5, the average train error tends to increase as nodesize increases, suggesting that a smaller nodesize may perform better. Although nodesize = 1 tends to give the lowest train error (0), this could indicate overfitting, so nodesize = 2 is used for building the model.

Based on the above, n.trees = 300, mtry = 2, and nodesize = 2 are used as the Random Forest parameters.

Parameter tuning: Boosting

Table 6: Boosting: mean train error across the 100 Monte Carlo cross-validation loops for different values of n.trees, shrinkage, and interaction.depth; the rank of each train error is shown in parentheses.

shrinkage  interaction.depth   n.trees = 200   n.trees = 300   n.trees = 500
0.01       1                   0.093 (27)      0.077 (26)      0.070 (24)
0.01       2                   0.074 (25)      0.067 (23)      0.059 (21)
0.01       3                   0.064 (22)      0.057 (20)      0.047 (18)
0.1        1                   0.047 (19)      0.041 (17)      0.029 (15)
0.1        2                   0.025 (13)      0.015 (11)      0.001 (7)
0.1        3                   0.013 (10)      0.002 (8)       0.000 (1)
0.2        1                   0.035 (16)      0.027 (14)      0.017 (12)
0.2        2                   0.008 (9)       0.000 (6)       0.000 (1)
0.2        3                   0.000 (5)       0.000 (1)       0.000 (1)

Table 6 shows that, overall, increasing the number of trees, the shrinkage value, and the interaction.depth decreases the train error. In fact, for the highest shrinkage (0.2) and interaction.depth (3), the average train error is 0 for all n.trees tested (200, 300, 500). However, this could indicate overfitting to the training data. To avoid overfitting, n.trees = 200, shrinkage = 0.1, and interaction.depth = 2 are used as the boosting parameters.
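Fitting the boosting model with the chosen parameters can be sketched as below (gbm package; mpg01 is assumed to be coded 0/1 for the Bernoulli loss, and train/test are assumed to be one Monte Carlo split as in the appendix code).

library(gbm)
fit_gbm <- gbm(mpg01 ~ horsepower + weight + year + origin1, data = train,
               distribution = "bernoulli",
               n.trees = 200, shrinkage = 0.1, interaction.depth = 2)
# classify as 1 when the predicted probability exceeds 0.5
pred_test <- ifelse(predict(fit_gbm, test, n.trees = 200, type = "response") > 0.5, 1, 0)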
Findings

Comparison of model performance

Using the tuned parameters for Random Forest (n.trees = 300, mtry = 2, nodesize = 2) and Boosting (n.trees = 200, shrinkage = 0.1, interaction.depth = 2), Monte Carlo cross-validation is run for 100 loops, with the train and test errors plotted below. The table of train and test error means and variances for all models is given in the Appendix.

Figure 6a: Train errors and Figure 6b: Test errors of all models (horsepower + weight + year + origin1 variables only)

From the train and test error boxplots in Figures 6a and 6b, Random Forest (RF) has the best performance in terms of the smallest train and test errors, followed by Boosting. The variances of RF and Boosting in both train and test errors are among the lowest of all the models, suggesting more consistent performance. The other models have larger train errors than Random Forest and Boosting.

Looking at the test errors, RF and Boosting have higher test errors than train errors, indicating that the two models might be overfitting to the training data. However, RF and Boosting still perform better than the other models on the test data. LDA and logistic regression have higher test errors than RF and Boosting, followed by QDA and Naïve Bayes; the KNN models (across all k values) have the highest test errors overall. KNN is a distance-based model and might not be well suited to this unscaled dataset: from the EDA, the weight variable ranges from 1,613 to 5,140, which is at least 10 times larger in magnitude than the year variable. Scaling the variables could potentially improve KNN's performance (a sketch of variable scaling for KNN is shown at the end of this section).

Notably, LDA and QDA assume normality of the predictor variables. Given that the variables do not follow a normal distribution, LDA and QDA might not be recommended models for this dataset. Their performance could potentially be improved using transformation methods such as the Box-Cox transformation.
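The scaling idea mentioned above is not part of the report's fitted models; a minimal sketch of standardizing the predictors before KNN (class package) would look like the following, where the centering and scaling are estimated on the training split only and then applied to the test split.

library(class)
vars <- c("horsepower", "weight", "year", "origin1")
train_s <- scale(train[, vars])                                   # center and scale on the training data
test_s  <- scale(test[, vars],
                 center = attr(train_s, "scaled:center"),
                 scale  = attr(train_s, "scaled:scale"))          # reuse the training centers/scales
pred_knn <- knn(train_s, test_s, cl = train$mpg01, k = 7)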
Table 7a: Train errors: p-values of pairwise tests of Random Forest vs the other methods

Method               t-test      Wilcoxon
Boosting             1.43E-54    3.50E-18
LDA                  5.15E-108   3.48E-18
QDA                  1.17E-103   3.61E-18
Naïve Bayes          1.14E-108   3.38E-18
Logistic Regression  4.07E-91    3.66E-18
KNN k=1              6.84E-35    4.12E-17
KNN k=3              4.15E-93    3.45E-18
KNN k=5              5.46E-97    3.65E-18
KNN k=7              1.38E-105   3.53E-18
KNN k=9              1.42E-106   3.55E-18
KNN k=11             2.77E-107   3.44E-18
KNN k=13             5.53E-105   3.44E-18
KNN k=15             1.50E-106   3.39E-18

Table 7b: Test errors: p-values of pairwise tests of Random Forest vs the other methods

Method               t-test      Wilcoxon
Boosting             0.313       0.294
LDA                  2.30E-06    1.15E-05
QDA                  1.17E-08    6.59E-08
Naïve Bayes          1.03E-10    2.06E-09
Logistic Regression  9.35E-06    3.61E-05
KNN k=1              1.29E-26    6.82E-17
KNN k=3              2.21E-19    2.98E-14
KNN k=5              5.27E-23    1.39E-15
KNN k=7              9.22E-22    2.60E-15
KNN k=9              2.52E-21    3.25E-15
KNN k=11             1.35E-20    6.41E-15
KNN k=13             8.96E-20    1.64E-14
KNN k=15             2.62E-19    2.24E-14

Statistical testing is run on both the train and test errors to determine whether Random Forest's performance is statistically significantly different from that of the other methods. Tables 7a and 7b show the results of the pairwise paired t-test and Wilcoxon signed-rank test. For the train errors (Table 7a), the p-values of Random Forest versus all other methods are below 0.05, indicating statistically significantly better performance on the training data. For the test errors (Table 7b), Random Forest is statistically significantly better than LDA, QDA, Naïve Bayes, logistic regression, and KNN. However, the p-value for the comparison with Boosting is above 0.05, indicating that the test performance of Random Forest and Boosting is comparable and that both are suitable models for this classification task.
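The pairwise comparisons in Tables 7a and 7b can be reproduced for any pair of methods with a paired t-test and a Wilcoxon signed-rank test over the B = 100 loop errors; a minimal sketch, with err_rf and err_boost assumed to hold the stored error vectors for Random Forest and Boosting:

t.test(err_rf, err_boost, paired = TRUE)$p.value        # paired t-test on the 100 loop errors
wilcox.test(err_rf, err_boost, paired = TRUE)$p.value   # Wilcoxon signed-rank test ("w test")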
Sensitivity and Specificity

Figure 7: Sensitivity of all models (horsepower + weight + year + origin1 only)

Figure 8: Specificity of all models (horsepower + weight + year + origin1 only)

Based on the sensitivity boxplots in Figure 7, Random Forest, followed by Boosting, has the highest sensitivity of all the models. This indicates that, of the models considered, Random Forest is best able to correctly identify cars with mpg01 = 1, the true positives. Given that the objective is to guide the manufacturing or purchase of high-gas-mileage cars, Random Forest is recommended on the basis of its higher sensitivity.
On the other hand, Figure 8 shows that Random Forest and Boosting have slightly lower average specificity than methods such as LDA. This means that RF and Boosting are slightly less likely to correctly identify the negatives, the mpg01 = 0 cars. However, given the main objective of identifying high-mileage cars, the low test errors and high sensitivity of RF and Boosting suggest that they are good models for this context.

Conclusion

In this report, various methods were used to classify mpg01. Based on the boxplots and on stepwise variable selection (AIC criterion), the variables horsepower + weight + year + origin1 were used to build the models. Parameter tuning of Random Forest and Boosting was done using Monte Carlo cross-validation with B = 100 loops, to obtain parameters with lower train errors.

Based on the average train and test errors, Random Forest has the lowest train and test errors, followed by Boosting, compared with the other methods. RF and Boosting also have low variance, suggesting more consistent predictive performance. The training performance of RF is statistically significantly better than that of all other methods. On the test data, Random Forest and Boosting have comparable performance, and both are statistically significantly better than the other methods (LDA, QDA, Naïve Bayes, logistic regression, and KNN). In addition, given the context of identifying true high-mileage cars (mpg01 = 1), the high sensitivity of Random Forest and Boosting suggests that they are suitable models for this application.
Appendix - Results

Table: Mean and variance of train error, test error, specificity, and sensitivity from 100 loops of Monte Carlo cross-validation. Models built using horsepower + weight + year + origin1.

                      Train Error        Test Error         Specificity        Sensitivity
Method                Mean     Variance  Mean     Variance  Mean     Variance  Mean     Variance
Random Forest         0.0052   0.0000    0.0767   0.0008    0.9233   0.0020    0.9237   0.0017
Boosting              0.0252   0.0000    0.0809   0.0008    0.9356   0.0013    0.9052   0.0024
LDA                   0.0899   0.0001    0.0956   0.0009    0.9561   0.0010    0.8649   0.0023
QDA                   0.0932   0.0001    0.1023   0.0010    0.9444   0.0014    0.8621   0.0027
Naïve Bayes           0.1035   0.0001    0.1069   0.0012    0.9444   0.0017    0.8542   0.0026
Logistic Regression   0.0883   0.0001    0.0936   0.0007    0.9204   0.0019    0.8960   0.0019
KNN - k = 1           0.0000   0.0000    0.1388   0.0010    0.8568   0.0025    0.8662   0.0025
KNN - k = 3           0.0735   0.0001    0.1242   0.0011    0.8785   0.0030    0.8743   0.0022
KNN - k = 5           0.0944   0.0001    0.1281   0.0009    0.8668   0.0026    0.8778   0.0020
KNN - k = 7           0.1047   0.0001    0.1272   0.0009    0.8652   0.0026    0.8817   0.0023
KNN - k = 9           0.1109   0.0001    0.1262   0.0009    0.8689   0.0026    0.8796   0.0021
KNN - k = 11          0.1127   0.0001    0.1240   0.0009    0.8738   0.0024    0.8787   0.0021
KNN - k = 13          0.1108   0.0001    0.1238   0.0010    0.8781   0.0025    0.8748   0.0023
KNN - k = 15          0.1107   0.0001    0.1227   0.0010    0.8797   0.0025    0.8756   0.0023
Figure 9: Boxplots of the variables by mpg01
Appendix - Code

Read Data

## Read data
df <- read.table(file = "Auto.csv", sep = ",", header = TRUE);
n  = dim(df)[1];    ### total number of observations
n1 = round(n/5);    ### number of observations randomly selected for testing data

mpg_org <- df$mpg
median(mpg_org)
## [1] 22.75

hist(mpg_org, main = "Histogram of MPG", xlab = "Miles Per Gallon", ylab = "Frequency")
abline(v = median(mpg_org), col = "red", lty = 2)

[Figure: Histogram of MPG]
Recode mpg01

mpg01 = I(df$mpg >= median(df$mpg))
df = data.frame(mpg01, df[,-1]);   ## replace column "mpg" by "mpg01"

One hot encoding

df$origin <- factor(df$origin)
# Apply one-hot encoding
df <- cbind(df, model.matrix(~ origin - 1, data = df))
df <- df[, !colnames(df) %in% c('origin')]

EDA

table(df$mpg01)
##
## FALSE  TRUE
##   196   196

boxplot(df, main = "Auto Dataset Boxplot", ylab = "Value", las = 2)
[Figure: Auto Dataset Boxplot]

selected_columns <- df[, 2:10]
# Combine selected columns with mpg01
data_for_boxplot <- cbind(selected_columns, mpg01 = factor(df$mpg01))

# Create box plots for each variable against mpg01
for (col in names(selected_columns)) {
  boxplot(data_for_boxplot[, col] ~ data_for_boxplot$mpg01,
          main = paste("Boxplot of", col, "by mpg01"),
          ylab = col, xlab = "mpg01",
          col = c("#d7658b", "#54bebe"))
}
[Figure: Boxplot of cylinders by mpg01]
[Figure: Boxplot of displacement by mpg01]
[Figure: Boxplot of horsepower by mpg01]
[Figure: Boxplot of weight by mpg01]
[Figure: Boxplot of acceleration by mpg01]
[Figure: Boxplot of year by mpg01]
[Figure: Boxplot of origin1 by mpg01]
[Figure: Boxplot of origin2 by mpg01]
[Figure: Boxplot of origin3 by mpg01]

corr <- cor(df)

VIF

library(car)
## Loading required package: carData

model_reg <- glm(mpg01 ~ cylinders + displacement + horsepower + weight +
                   acceleration + year + origin1 + origin2, data = df)
vif_values <- vif(model_reg)
print(vif_values)
##    cylinders displacement   horsepower       weight acceleration         year
##     5.410047    11.604209     2.694488     6.602953     2.522107     2.163868
##      origin1      origin2
##     2.944230     1.963661
par(mar = c(5, 10, 4, 2) + 0.1)
# Create a bar plot of VIF values
barplot(vif_values, main = "Barplot of VIF Values", col = "steelblue",
        names.arg = names(vif_values))
abline(h = 5, col = "red", lwd = 2, lty = 2)

[Figure: Barplot of VIF Values]

Variable selection

########## Monte Carlo: Variable selection ###########
set.seed(7406)
B = 100
cvmod3_sel <- data.frame(matrix(ncol = ncol(df), nrow = B))  # Create a dataframe to store selected variables

for (b in 1:B) {
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,]
  testtemp  <- df[flag,]

  # Stepwise logistic regression
  mod3_cv <- step(glm(mpg01 ~ ., family = binomial, data = traintemp), trace = 0)

  # Store selected variables in the dataframe
  selected_vars <- colnames(mod3_cv$model)[-1]  # Exclude the intercept term
  cvmod3_sel[b,] <- 0                           # Initialize the row
  cvmod3_sel[b, selected_vars] <- 1             # Set selected variables to 1
}

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

selected_cols <- cvmod3_sel[, 11:18]
summary(selected_cols)
##    horsepower       weight      year      origin1       displacement
##  Min.   :0.00   Min.   :1   Min.   :1   Min.   :0.00   Min.   :0.0000
##  1st Qu.:1.00   1st Qu.:1   1st Qu.:1   1st Qu.:0.00   1st Qu.:0.0000
##  Median :1.00   Median :1   Median :1   Median :1.00   Median :0.0000
##  Mean   :0.91   Mean   :1   Mean   :1   Mean   :0.63   Mean   :0.1531
##  3rd Qu.:1.00   3rd Qu.:1   3rd Qu.:1   3rd Qu.:1.00   3rd Qu.:0.0000
##  Max.   :1.00   Max.   :1   Max.   :1   Max.   :1.00   Max.   :1.0000
##                                                        NA's   :2
##     origin2        acceleration       cylinders
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000
##  Median :0.0000   Median :0.00000   Median :0.00000
##  Mean   :0.4286   Mean   :0.08421   Mean   :0.06849
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000
##  NA's   :2        NA's   :5         NA's   :27

library(car)
model_reg_sel <- glm(mpg01 ~ horsepower + weight + year + origin1, family = binomial, data = df)
#model_reg_sel <- lm(mpg01 ~ horsepower + weight + year + origin1, data = df)
vif_values_sel <- vif(model_reg_sel)
print(vif_values_sel)
## horsepower     weight       year    origin1
##   1.142649   1.642401   1.633459   1.162831
par(mar = c(5, 10, 4, 2) + 0.1)
# Create a bar plot of VIF values
barplot(vif_values_sel, main = "Barplot of VIF Values - Selected vars", col = "steelblue",
        names.arg = names(vif_values_sel))
abline(h = 5, col = "red", lwd = 2, lty = 2)

[Figure: Barplot of VIF Values - Selected vars]

Parameter Tuning - Random Forest

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.

set.seed(7406)
B <- 100
cverror_rf <- NULL
cm_rf <- NULL

# Define a grid of hyperparameters for Random Forest
n_trees_rf  <- c(100, 200, 300, 500)
m_try_rf    <- c(1, 2, 3)
nodesize_rf <- c(1, 2, 5, 10)

# Nested loop for hyperparameter tuning
for (trees in n_trees_rf) {
  for (mtry in m_try_rf) {
    for (nodesize in nodesize_rf) {
      cverrortrain_temp <- NULL
      cverror_temp      <- NULL
      cm_temp           <- NULL
      specificity_temp  <- NULL
      sensitivity_temp  <- NULL

      for (b in 1:B) {
        flag <- sort(sample(1:n, n1))
        traintemp <- df[-flag,];
        testtemp  <- df[flag,];

        # Build the Random Forest model
        mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                               data = traintemp,
                               ntree = trees, mtry = mtry, nodesize = nodesize)

        # Calculate train error
        predtrain_rf <- predict(mod_rf, traintemp, type = "class")
        trainerror_rf <- mean(predtrain_rf != traintemp$mpg01)
        cverrortrain_temp <- c(cverrortrain_temp, trainerror_rf)

        # Calculate test error
        predtest_rf <- predict(mod_rf, testtemp, type = "class")
        temptesterror_rf <- mean(predtest_rf != testtemp$mpg01)
        cverror_temp <- c(cverror_temp, temptesterror_rf)

        # Confusion Matrix
        confusion_matrix <- table(predtest_rf, testtemp$mpg01)
        TP <- confusion_matrix[2, 2]
        FP <- confusion_matrix[1, 2]
        TN <- confusion_matrix[1, 1]
        FN <- confusion_matrix[2, 1]

        # Calculate specificity and sensitivity
        specificity <- TN / (TN + FP)
        sensitivity <- TP / (TP + FN)
        cm_temp <- rbind(cm_temp, c(specificity, sensitivity))
        specificity_temp <- c(specificity_temp, specificity)
        sensitivity_temp <- c(sensitivity_temp, sensitivity)
      }
      # Store the results
      cverror_rf <- rbind(cverror_rf,
                          c(trees, mtry, nodesize,
                            mean(cverrortrain_temp), var(cverrortrain_temp),
                            mean(cverror_temp), var(cverror_temp),
                            mean(specificity_temp), var(specificity_temp),
                            mean(sensitivity_temp), var(sensitivity_temp)))
      cm_rf <- rbind(cm_rf, c(trees, mtry, nodesize, colMeans(cm_temp)[1], colMeans(cm_temp)[2]))
    }
  }
}

# Naming the columns of cm_rf
colnames(cverror_rf) <- c("n.trees", "mtry", "nodesize",
                          "Avg_Train_Error", "Var_Train_Error",
                          "Avg_Test_Error", "Var_Test_Error",
                          "Avg_Specificity", "Var_Specificity",
                          "Avg_Sensitivity", "Var_Sensitivity")
colnames(cm_rf) <- c("n.trees", "mtry", "nodesize", "Specificity", "Sensitivity")

Parameter Tuning - Boosting

set.seed(7406)
library(gbm)
## Warning: package 'gbm' was built under R version 4.2.3
## Loaded gbm 2.1.8.1

B <- 100
cverror_gbm <- NULL
cm_gbm <- NULL

# Define a grid of hyperparameters
n_trees <- c(100, 200, 300, 500, 1000)
shrinkage <- c(0.01, 0.1, 0.2)
interaction_depth <- c(1, 2, 3)

# Nested loop for hyperparameter tuning
for (trees in n_trees) {
  for (shrink in shrinkage) {
    for (depth in interaction_depth) {
      cverror_temp <- NULL
      trainerror_temp <- NULL
      cm_temp <- NULL
      specificity_temp <- NULL
      sensitivity_temp <- NULL
      for (b in 1:B) {
        flag <- sort(sample(1:n, n1))
        traintemp <- df[-flag,];
        testtemp  <- df[flag,];

        # Build the gbm model
        modcv_4 <- gbm(mpg01 ~ horsepower + weight + year + origin1, data = traintemp,
                       distribution = "bernoulli",
                       n.trees = trees, shrinkage = shrink, interaction.depth = depth)

        # Calculate train error
        predtrain_4 <- ifelse(predict(modcv_4, traintemp, n.trees = trees, type = "response") > 0.5, 1, 0)
        trainerror_4 <- mean(predtrain_4 != traintemp$mpg01)
        trainerror_temp <- c(trainerror_temp, trainerror_4)

        # Calculate test error
        predtest_4 <- ifelse(predict(modcv_4, testtemp, n.trees = trees, type = "response") > 0.5, 1, 0)
        temptesterror_4 <- mean(predtest_4 != testtemp$mpg01)
        cverror_temp <- c(cverror_temp, temptesterror_4)

        # Confusion Matrix
        confusion_matrix <- table(predtest_4, testtemp$mpg01)
        TP <- confusion_matrix[2, 2]
        FP <- confusion_matrix[1, 2]
        TN <- confusion_matrix[1, 1]
        FN <- confusion_matrix[2, 1]

        # Calculate specificity and sensitivity
        specificity <- TN / (TN + FP)
        sensitivity <- TP / (TP + FN)
        cm_temp <- rbind(cm_temp, c(specificity, sensitivity))
        specificity_temp <- c(specificity_temp, specificity)
        sensitivity_temp <- c(sensitivity_temp, sensitivity)
      }

      # Store the results
      cverror_gbm <- rbind(cverror_gbm,
                           c(trees, shrink, depth,
                             mean(trainerror_temp), var(trainerror_temp),
                             mean(cverror_temp), var(cverror_temp),
                             mean(specificity_temp), var(specificity_temp),
                             mean(sensitivity_temp), var(sensitivity_temp)))
      cm_gbm <- rbind(cm_gbm, c(trees, shrink, depth, colMeans(cm_temp)[1], colMeans(cm_temp)[2]))
    }
  }
}

# Naming the columns of cm_gbm
colnames(cverror_gbm) <- c("n.trees", "shrinkage", "interaction.depth",
                           "Avg_Train_Error", "Var_Train_Error",
                           "Avg_Test_Error", "Var_Test_Error",
                           "Avg_Specificity", "Var_Specificity",
                           "Avg_Sensitivity", "Var_Sensitivity")
colnames(cm_gbm) <- c("n.trees", "shrinkage", "interaction.depth", "Specificity", "Sensitivity")

Model evaluation:

## Model 1: RF - with tuned parameters
set.seed(7406);   ### set the seed for randomization
B = 100;          ### number of loops
library(randomForest)
cverror_rf <- NULL;
cm_rf <- NULL;
trainerror_rf <- NULL;

for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];

  ## Step 2: Build model
  mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                         data = traintemp, ntree = 300, mtry = 2, nodesize = 2)

  ### Step 3a: Calculate train error
  predtrain_rf <- predict(mod_rf, traintemp, type = "class")
  temptrainerror_rf <- c(mean(predtrain_rf != traintemp$mpg01))
  trainerror_rf <- rbind(trainerror_rf, temptrainerror_rf)

  # Step 3: Calculate test error
  predtest_rf <- predict(mod_rf, testtemp, type = "class")
  temptesterror_rf <- c(mean(predtest_rf != testtemp$mpg01));
  cverror_rf <- rbind(cverror_rf, temptesterror_rf)

  # Step 4: CM
  confusion_matrix <- table(predtest_rf, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]

  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_rf <- rbind(cm_rf, c(specificity, sensitivity))
  colnames(cm_rf) <- c("Specificity", "Sensitivity")
  cm_rf <- as.data.frame(cm_rf)
}

library(gbm)
## Model 2: GBM - with tuned parameters
set.seed(7406);   ### set the seed for randomization
B = 100;          ### number of loops
cverror_gbmsel <- NULL;
cm_gbmsel <- NULL;
trainerror_gbmsel <- NULL;

for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];
  #print(traintemp[1])

  ## Step 2: Build model (tuned parameters from the report)
  mod_gbmsel <- gbm(mpg01 ~ horsepower + weight + year + origin1, data = traintemp,
                    distribution = "bernoulli", n.trees = 200,
                    shrinkage = 0.1, interaction.depth = 2)

  ### Step 3a: Calculate train error
  predtrain_gbmsel <- ifelse(predict(mod_gbmsel, traintemp, n.trees = 200, type = "response") > 0.5, 1, 0)
  temptrainerror_gbmsel <- c(mean(predtrain_gbmsel != traintemp$mpg01))
  trainerror_gbmsel <- rbind(trainerror_gbmsel, temptrainerror_gbmsel)

  # Step 3: Calculate test error
  predtest_gbmsel <- ifelse(predict(mod_gbmsel, testtemp, n.trees = 200, type = "response") > 0.5, 1, 0)
  temptesterror_gbmsel <- c(mean(predtest_gbmsel != testtemp$mpg01));
  cverror_gbmsel <- rbind(cverror_gbmsel, temptesterror_gbmsel)

  # Step 4: CM
  confusion_matrix <- table(predtest_gbmsel, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]

  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_gbmsel <- rbind(cm_gbmsel, c(specificity, sensitivity))
  colnames(cm_gbmsel) <- c("Specificity", "Sensitivity")
  cm_gbmsel <- as.data.frame(cm_gbmsel)
}

## Model 3: LDA
set.seed(7406);   ### set the seed for randomization
B = 100;          ### number of loops
library(MASS)
cverror_lda <- NULL;
cm_lda <- NULL;
trainerror_lda <- NULL;

for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];

  ## Step 2: Build model
  modcv_1 <- lda(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);

  # Step 3a: Calculate training error
  predtrain_1 <- predict(modcv_1, traintemp[, c(4, 5, 7, 8)])$class;
  trainerror_1 <- c(mean(predtrain_1 != traintemp$mpg01))
  trainerror_lda <- rbind(trainerror_lda, trainerror_1)

  # Step 3: Calculate test error
  predtest_1 <- predict(modcv_1, testtemp[, c(4, 5, 7, 8)])$class;
  temptesterror_1 <- c(mean(predtest_1 != testtemp$mpg01));
  cverror_lda <- rbind(cverror_lda, temptesterror_1)

  # Step 4: CM
  confusion_matrix <- table(predtest_1, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]

  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_lda <- rbind(cm_lda, c(specificity, sensitivity))
  colnames(cm_lda) <- c("Specificity", "Sensitivity")
  cm_lda <- as.data.frame(cm_lda)
}

## Model 4: QDA
set.seed(7406);   ### set the seed for randomization
B = 100;          ### number of loops
library(MASS)
cverror_qda <- NULL;
cm_qda <- NULL;
trainerror_qda <- NULL;

for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];

  ## Step 2: Build model
  modcv_2 <- qda(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);

  # Step 3a: Calculate training error
  predtrain_2 <- predict(modcv_2, traintemp[, c(4, 5, 7, 8)])$class;
  trainerror_2 <- c(mean(predtrain_2 != traintemp$mpg01))
  trainerror_qda <- rbind(trainerror_qda, trainerror_2)

  # Step 3: Calculate test error
  predtest_2 <- predict(modcv_2, testtemp[, c(4, 5, 7, 8)])$class;
  temptesterror_2 <- c(mean(predtest_2 != testtemp$mpg01));
  cverror_qda <- rbind(cverror_qda, temptesterror_2)

  # Step 4: CM
  confusion_matrix <- table(predtest_2, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]

  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_qda <- rbind(cm_qda, c(specificity, sensitivity))
  colnames(cm_qda) <- c("Specificity", "Sensitivity")
  cm_qda <- as.data.frame(cm_qda)
}

## Model 5: Naive Bayes
set.seed(7406);   ### set the seed for randomization
library(e1071)
B = 100;          ### number of loops
cverror_nb <- NULL;
cm_nb <- NULL;
trainerror_nb <- NULL;

for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];

  ## Step 2: Build model
  modcv_3 <- naiveBayes(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);

  # Step 3a: Calculate training error
  predtrain_3 <- predict(modcv_3, traintemp[, c(4, 5, 7, 8)]);
  trainerror_3 <- c(mean(predtrain_3 != traintemp$mpg01))
  trainerror_nb <- rbind(trainerror_nb, trainerror_3)

  # Step 3: Calculate test error
  predtest_3 <- predict(modcv_3, testtemp[, c(4, 5, 7, 8)]);
  temptesterror_3 <- c(mean(predtest_3 != testtemp$mpg01));
  cverror_nb <- rbind(cverror_nb, temptesterror_3)

  # Step 4: CM
  confusion_matrix <- table(predtest_3, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]

  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_nb <- rbind(cm_nb, c(specificity, sensitivity))
  colnames(cm_nb) <- c("Specificity", "Sensitivity")
  cm_nb <- as.data.frame(cm_nb)
}

## Model 6: Logistic Regression
set.seed(7406);   ### set the seed for randomization
B = 100;          ### number of loops
cverror_lr <- NULL;
cm_lr <- NULL;
trainerror_lr <- NULL;

for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];

  ## Step 2: Build model
  modcv_4 <- glm(mpg01 ~ horsepower + weight + year + origin1, data = traintemp, family = binomial);

  # Step 3a: Calculate training error
  predtrain_4 <- ifelse(predict(modcv_4, traintemp, type = "response") > 0.5, 1, 0);
  trainerror_4 <- c(mean(predtrain_4 != traintemp$mpg01))
  trainerror_lr <- rbind(trainerror_lr, trainerror_4)

  # Step 3: Calculate test error
  predtest_4 <- ifelse(predict(modcv_4, testtemp, type = "response") > 0.5, 1, 0);
  temptesterror_4 <- c(mean(predtest_4 != testtemp$mpg01));
  cverror_lr <- rbind(cverror_lr, temptesterror_4)

  # Step 4: CM
  confusion_matrix <- table(predtest_4, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]

  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_lr <- rbind(cm_lr, c(specificity, sensitivity))
  colnames(cm_lr) <- c("Specificity", "Sensitivity")
  cm_lr <- as.data.frame(cm_lr)
}

## Model 7: KNN
library(class);
set.seed(7406);   ### set the seed for randomization
kk <- c(1, 3, 5, 7, 9, 11, 13, 15);
B = 100;          ### number of loops
CVALL_cv_train <- NULL;
CVALL_cv <- NULL;
CVALL_sens_kNN <- NULL;
CVALL_spec_kNN <- NULL;
confusion_matrices <- list()   # Initialize list to store confusion matrices
# Initialize a data frame to store TP, FP, TN, FN for each iteration
result_df <- data.frame(TP = numeric(B), FP = numeric(B), TN = numeric(B), FN = numeric(B))

for (b in 1:B) {
  ## For knn
  cverrortrain_knn <- NULL;
  cverror_knn <- NULL;
  spec_knn <- NULL;
  sens_knn <- NULL;

  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp  <- df[flag,];

  ## Step 2: Build model
  for (i in 1:length(kk)) {
    # KNN
    xnew_knn <- testtemp[, c(4, 5, 7, 8)]
    ypred_knn.train <- knn(traintemp[, c(4, 5, 7, 8)], traintemp[, c(4, 5, 7, 8)], traintemp$mpg01, k = kk[i])
    temptrainerror_knn <- mean(ypred_knn.train != traintemp$mpg01)
    cverrortrain_knn <- cbind(cverrortrain_knn, temptrainerror_knn)

    ypred_knn.test <- knn(traintemp[, c(4, 5, 7, 8)], xnew_knn, traintemp$mpg01, k = kk[i])
    temptesterror_knn <- mean(ypred_knn.test != testtemp$mpg01)
    cverror_knn <- cbind(cverror_knn, temptesterror_knn)

    # Create confusion matrix
    confusion <- table(ypred_knn.test, testtemp$mpg01)
    confusion_matrices <- append(confusion_matrices, list(confusion))
    # Extract TP, FP, TN, FN
    TP <- confusion[2, 2]
    FP <- confusion[1, 2]
    TN <- confusion[1, 1]
    FN <- confusion[2, 1]

    # Calculate specificity and sensitivity
    specificity <- TN / (TN + FP)
    sensitivity <- TP / (TP + FN)
    spec_knn <- cbind(spec_knn, specificity)
    sens_knn <- cbind(sens_knn, sensitivity)
  }

  CVALL_cv_train <- rbind(CVALL_cv_train, cverrortrain_knn)
  CVALL_cv <- rbind(CVALL_cv, cverror_knn)
  CVALL_spec_kNN <- rbind(CVALL_spec_kNN, spec_knn)
  CVALL_sens_kNN <- rbind(CVALL_sens_kNN, sens_knn)
}
# Now, 'confusion_matrices' contains the confusion matrices for each value of k.
# 'result_df' contains TP, FP, TN, FN for each iteration.

Evaluation Metrics:

TrainALL = cbind(trainerror_rf, trainerror_gbmsel, trainerror_lda, trainerror_qda,
                 trainerror_nb, trainerror_lr, CVALL_cv_train)
boxplot(TrainALL, main = "Train Error Boxplot")
[Figure: Train Error Boxplot]

TEALL = cbind(cverror_rf, cverror_gbmsel, cverror_lda, cverror_qda, cverror_nb, cverror_lr, CVALL_cv)
boxplot(TEALL, main = "Test Error Boxplot")
[Figure: Test Error Boxplot]

SpecALL = cbind(cm_rf$Specificity, cm_gbmsel$Specificity, cm_lda$Specificity, cm_qda$Specificity,
                cm_nb$Specificity, cm_lr$Specificity, CVALL_spec_kNN)
boxplot(SpecALL, main = "Specificity Boxplot")
[Figure: Specificity Boxplot]

SensALL = cbind(cm_rf$Sensitivity, cm_gbmsel$Sensitivity, cm_lda$Sensitivity, cm_qda$Sensitivity,
                cm_nb$Sensitivity, cm_lr$Sensitivity, CVALL_sens_kNN)
boxplot(SensALL, main = "Sensitivity Boxplot")
[Figure: Sensitivity Boxplot]

means <- apply(TEALL, 2, mean)
variances <- apply(TEALL, 2, var)
summary_stats <- data.frame(Column = colnames(TEALL), Mean = means, Variance = variances)
print(summary_stats)
##               Column       Mean     Variance
## 1                     0.07666667 0.0007736139
## 2                     0.08089744 0.0008025521
## 3                     0.09564103 0.0008979340
## 4                     0.10230769 0.0009794859
## 5                     0.10692308 0.0011795137
## 6                     0.09358974 0.0007421354
## 7  temptesterror_knn  0.13884615 0.0010129401
## 8  temptesterror_knn  0.12423077 0.0011113601
## 9  temptesterror_knn  0.12807692 0.0008550163
## 10 temptesterror_knn  0.12717949 0.0009486051
## 11 temptesterror_knn  0.12615385 0.0009487379
## 12 temptesterror_knn  0.12397436 0.0009033630
## 13 temptesterror_knn  0.12384615 0.0009902443
## 14 temptesterror_knn  0.12269231 0.0009903605
summary_spec <- data.frame(Column = colnames(SpecALL), Mean = apply(SpecALL, 2, mean),
                           Variance = apply(SpecALL, 2, var))
print(summary_spec)
##         Column      Mean    Variance
## 1              0.9232738 0.001959364
## 2              0.9355589 0.001320847
## 3              0.9561488 0.001036278
## 4              0.9443722 0.001369056
## 5              0.9444319 0.001711889
## 6              0.9204442 0.001929678
## 7  specificity 0.8567752 0.002518453
## 8  specificity 0.8784851 0.002984309
## 9  specificity 0.8667769 0.002605161
## 10 specificity 0.8652357 0.002576972
## 11 specificity 0.8688997 0.002633592
## 12 specificity 0.8737502 0.002421178
## 13 specificity 0.8780773 0.002453583
## 14 specificity 0.8796575 0.002496284

summary_sens <- data.frame(Column = colnames(SensALL), Mean = apply(SensALL, 2, mean),
                           Variance = apply(SensALL, 2, var))
print(summary_sens)
##         Column      Mean    Variance
## 1              0.9236596 0.001738185
## 2              0.9051825 0.002357728
## 3              0.8649023 0.002298335
## 4              0.8621474 0.002666568
## 5              0.8541714 0.002572463
## 6              0.8960307 0.001912041
## 7  sensitivity 0.8661883 0.002493456
## 8  sensitivity 0.8743033 0.002201731
## 9  sensitivity 0.8778338 0.001956006
## 10 sensitivity 0.8816503 0.002320618
## 11 sensitivity 0.8795674 0.002131426
## 12 sensitivity 0.8786573 0.002124985
## 13 sensitivity 0.8748121 0.002348846
## 14 sensitivity 0.8755619 0.002266427

Significance testing

t-test

# t test
p_values_df <- data.frame(Column1 = integer(0), Column2 = integer(0),
                          P_Value = numeric(0), Significant = character(0))

# Generate combinations of column pairs
combinations <- combn(ncol(TEALL), 2)

# Perform paired t-tests for each combination
for (i in seq_len(ncol(combinations))) {
  col1 <- combinations[1, i]
  col2 <- combinations[2, i]
  result <- t.test(TEALL[, col1], TEALL[, col2], paired = TRUE)

  # Add p-value and significance to data frame
  significant <- ifelse(result$p.value < 0.05, "Yes", "No")
  p_values_df <- rbind(p_values_df, data.frame(Column1 = col1, Column2 = col2,
                                               P_Value = result$p.value, Significant = significant))
}

# Print the data frame of p-values
print(p_values_df)
##    Column1 Column2      P_Value Significant
## 1        1       2 3.129546e-01          No
## 2        1       3 2.296811e-06         Yes
## 3        1       4 1.167770e-08         Yes
## 4        1       5 1.030949e-10         Yes
## 5        1       6 9.353614e-06         Yes
## 6        1       7 1.286170e-26         Yes
## 7        1       8 2.209082e-19         Yes
## 8        1       9 5.267922e-23         Yes
## 9        1      10 9.218069e-22         Yes
## 10       1      11 2.515879e-21         Yes
## 11       1      12 1.345765e-20         Yes
## 12       1      13 8.963262e-20         Yes
## 13       1      14 2.624135e-19         Yes
## 14       2       3 4.828501e-04         Yes
## 15       2       4 1.342677e-06         Yes
## 16       2       5 1.413071e-07         Yes
## 17       2       6 1.191119e-03         Yes
## 18       2       7 1.665457e-25         Yes
## 19       2       8 1.107275e-16         Yes
## 20       2       9 1.500643e-20         Yes
## 21       2      10 4.511518e-19         Yes
## 22       2      11 1.221572e-18         Yes
## 23       2      12 1.695142e-17         Yes
## 24       2      13 3.716715e-17         Yes
## 25       2      14 1.563016e-16         Yes
## 26       3       4 5.052609e-04         Yes
## 27       3       5 5.855067e-07         Yes
## 28       3       6 4.004751e-01          No
## 29       3       7 1.879294e-15         Yes
## 30       3       8 1.869527e-08         Yes
## 31       3       9 1.625460e-10         Yes
## 32       3      10 8.125927e-10         Yes
## 33       3      11 1.497783e-09         Yes
## 34       3      12 6.207109e-09         Yes
## 35       3      13 7.390469e-09         Yes
## 36       3      14 2.352131e-08         Yes
## 37       4       5 5.471170e-03         Yes
## 38       4       6 8.644515e-04         Yes
## 39       4       7 2.732528e-12         Yes
## 40       4       8 1.039836e-05         Yes
## 41       4       9 7.617537e-08         Yes
## 42       4      10 4.577721e-07         Yes
## 43       4      11 9.461594e-07         Yes
## 44       4      12 3.931840e-06         Yes
## 45       4      13 8.343329e-06         Yes
## 46       4      14 2.227131e-05         Yes
## 47       5       6 1.261662e-06         Yes
## 48       5       7 1.641560e-09         Yes
## 49       5       8 5.457933e-04         Yes
## 50       5       9 1.471888e-05         Yes
## 51       5      10 5.195035e-05         Yes
## 52       5      11 8.278948e-05         Yes
## 53       5      12 3.250051e-04         Yes
## 54       5      13 4.564791e-04         Yes
## 55       5      14 1.315725e-03         Yes
## 56       6       7 5.182996e-19         Yes
## 57       6       8 1.130561e-10         Yes
## 58       6       9 5.388557e-13         Yes
## 59       6      10 4.119388e-12         Yes
## 60       6      11 1.419441e-11         Yes
## 61       6      12 1.599015e-11         Yes
## 62       6      13 5.998276e-11         Yes
## 63       6      14 3.332584e-10         Yes
## 64       7       8 1.252486e-05         Yes
## 65       7       9 2.558681e-04         Yes
## 66       7      10 1.826809e-04         Yes
## 67       7      11 1.011440e-04         Yes
## 68       7      12 1.214581e-05         Yes
## 69       7      13 1.473735e-05         Yes
## 70       7      14 1.599119e-05         Yes
## 71       8       9 9.054984e-02          No
## 72       8      10 2.700545e-01          No
## 73       8      11 4.485337e-01          No
## 74       8      12 9.219066e-01          No
## 75       8      13 8.874992e-01          No
## 76       8      14 6.039986e-01          No
## 77       9      10 5.715115e-01          No
## 78       9      11 2.208576e-01          No
## 79       9      12 2.359098e-02         Yes
## 80       9      13 3.583216e-02         Yes
## 81       9      14 2.359580e-02         Yes
## 82      10      11 4.310488e-01          No
## 83      10      12 6.429585e-02          No
## 84      10      13 1.003370e-01          No
## 85      10      14 6.139859e-02          No
## 86      11      12 8.433192e-02          No
## 87      11      13 1.424806e-01          No
## 88      11      14 9.354490e-02          No
## 89      12      13 9.099457e-01          No
## 90      12      14 4.138163e-01          No
## 91      13      14 3.480105e-01          No
Wilcoxon test

# Create an empty data frame to store p-values
wilcox_values_df <- data.frame(Column1 = integer(0), Column2 = integer(0),
                               P_Value = numeric(0), Significant = character(0))

# Generate combinations of column pairs
combinations <- combn(ncol(TEALL), 2)

# Perform paired Wilcoxon tests for each combination
for (i in seq_len(ncol(combinations))) {
  col1 <- combinations[1, i]
  col2 <- combinations[2, i]
  result <- wilcox.test(TEALL[, col1], TEALL[, col2], paired = TRUE)

  # Add p-value and significance to data frame
  significant <- ifelse(result$p.value < 0.05, "Yes", "No")
  wilcox_values_df <- rbind(wilcox_values_df, data.frame(Column1 = col1, Column2 = col2,
                                                         P_Value = result$p.value, Significant = significant))
}

## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes

# Print the data frame of p-values
print(wilcox_values_df)
##    Column1 Column2      P_Value Significant
## 1        1       2 2.936516e-01          No
## 2        1       3 1.152155e-05         Yes
## 3        1       4 6.586686e-08         Yes
## 4        1       5 2.060770e-09         Yes
## 5        1       6 3.609859e-05         Yes
## 6        1       7 6.816979e-17         Yes
## 7        1       8 2.976688e-14         Yes
## 8        1       9 1.394741e-15         Yes
## 9        1      10 2.604831e-15         Yes
## 10       1      11 3.250826e-15         Yes
## 11       1      12 6.406268e-15         Yes
## 12       1      13 1.636190e-14         Yes
## 13       1      14 2.237652e-14         Yes
## 14       2       3 1.375518e-03         Yes
## 15       2       4 6.228213e-06         Yes
## 16       2       5 5.719007e-07         Yes
## 17       2       6 1.993251e-03         Yes
## 18       2       7 6.700298e-17         Yes
## 19       2       8 8.138846e-13         Yes
## 20       2       9 1.438259e-14         Yes
## 21       2      10 3.862181e-14         Yes
## 22       2      11 1.094281e-13         Yes
## 23       2      12 9.516703e-14         Yes
## 24       2      13 1.908234e-13         Yes
## 25       2      14 6.363138e-13         Yes
## 26       3       4 5.345427e-04         Yes
## 27       3       5 4.731274e-07         Yes
## 28       3       6 3.527105e-01          No
## 29       3       7 3.198704e-12         Yes
## 30       3       8 1.652952e-07         Yes
## 31       3       9 2.455649e-09         Yes
## 32       3      10 1.446520e-08         Yes
## 33       3      11 1.392513e-08         Yes
## 34       3      12 5.629992e-08         Yes
## 35       3      13 1.693237e-08         Yes
## 36       3      14 5.571670e-08         Yes
## 37       4       5 8.610989e-03         Yes
## 38       4       6 8.163688e-04         Yes
## 39       4       7 3.356121e-10         Yes
## 40       4       8 1.074098e-05         Yes
## 41       4       9 1.036627e-07         Yes
## 42       4      10 6.460885e-07         Yes
## 43       4      11 9.895693e-07         Yes
## 44       4      12 5.365999e-06         Yes
## 45       4      13 2.333991e-06         Yes
## 46       4      14 1.031598e-05         Yes
## 47       5       6 1.654251e-06         Yes
## 48       5       7 1.035835e-08         Yes
## 49       5       8 3.542278e-04         Yes
## 50       5       9 5.736968e-06         Yes
## 51       5      10 1.052197e-05         Yes
## 52       5      11 3.681363e-05         Yes
## 53       5      12 1.944498e-04         Yes
## 54       5      13 6.867979e-05         Yes
## 55       5      14 3.713461e-04         Yes
## 56       6       7 6.514968e-14         Yes
## 57       6       8 1.452611e-09         Yes
## 58       6       9 2.456485e-11         Yes
## 59       6      10 2.199424e-10         Yes
## 60       6      11 4.072362e-10         Yes
## 61       6      12 3.579193e-10         Yes
## 62       6      13 3.580718e-10         Yes
## 63       6      14 7.713454e-10         Yes
## 64       7       8 1.971053e-05         Yes
## 65       7       9 4.110505e-04         Yes
## 66       7      10 1.075116e-04         Yes
## 67       7      11 8.182855e-05         Yes
## 68       7      12 2.131455e-05         Yes
## 69       7      13 1.283626e-05         Yes
## 70       7      14 4.425322e-05         Yes
## 71       8       9 8.109189e-02          No
## 72       8      10 3.118137e-01          No
## 73       8      11 4.552619e-01          No
## 74       8      12 8.789112e-01          No
## 75       8      13 8.831379e-01          No
## 76       8      14 8.330633e-01          No
## 77       9      10 2.255099e-01          No
## 78       9      11 1.649483e-01          No
## 79       9      12 2.559919e-02         Yes
## 80       9      13 4.040016e-02         Yes
## 81       9      14 3.743972e-02         Yes
## 82      10      11 3.596937e-01          No
## 83      10      12 1.542088e-01          No
## 84      10      13 3.331840e-01          No
## 85      10      14 3.009194e-01          No
## 86      11      12 1.314828e-01          No
## 87      11      13 3.349624e-01          No
## 88      11      14 3.562930e-01          No
## 89      12      13 8.495031e-01          No
## 90      12      14 8.107766e-01          No
## 91      13      14 9.675252e-01          No