ISYE 7406 Homework 5: Ensemble Methods: Random Forest and Boosting

Introduction
The cleaned Auto MPG dataset consists of 8 variables describing 392 cars. The binary variable 'mpg01' is created to indicate whether the car's mpg is above or equal to the median (1) or below the median (0). This report aims to determine methods that can be used to guide the manufacturing or buying of high-gas-mileage cars.

Objectives
This report uses the classification methods below to predict mpg01:
1. Random forest
2. Boosting
3. LDA
4. QDA
5. Naïve Bayes
6. Logistic Regression
7. KNN with several values of K

For random forest and boosting, parameter tuning is done using cross-validation on the training data:
o Using B=100 loops, the mean and variance of the train error for each combination of tuning parameters are captured and compared.
o The tuning parameter combination with the lower train error is used to build the model, which is then compared with the other methods.

The models are then evaluated using the methods below:
o Train/Test error = (False Positives + False Negatives) / Number of predictions
o Specificity = TN / (TN + FP)
o Sensitivity = TP / (TP + FN)
o Monte Carlo Cross Validation using B=100 loops:
  ▪ Compare the mean and variance of the train/test errors for each model
  ▪ Compare the mean of the specificity and sensitivity for each model
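For reference, the sketch below shows how these quantities can be computed from a confusion matrix inside each cross-validation loop (object names such as predtest and testtemp are illustrative; with table(predicted, actual), rows index the predicted classes and columns the true classes). The full loops are in the Appendix.

# Per-loop metrics from a confusion matrix (illustrative sketch)
confusion_matrix <- table(predtest, testtemp$mpg01)   # rows = predicted, cols = actual
TP <- confusion_matrix[2, 2]; TN <- confusion_matrix[1, 1]
FN <- confusion_matrix[1, 2]; FP <- confusion_matrix[2, 1]
test_error  <- (FP + FN) / sum(confusion_matrix)      # misclassification rate
specificity <- TN / (TN + FP)
sensitivity <- TP / (TP + FN)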
Exploratory Data Analysis
The 8 variables in the dataset are described below:

Table 1: Overview of the variables in the cleaned dataset
Variable name   Description                                                          Type
mpg01           0 if the car's mpg is below the median, 1 if it is above the median  Binary (True/False)
cylinders       Number of cylinders between 4 and 8                                  Numerical
displacement    Engine displacement (cu. inches)                                     Numerical
horsepower      Engine horsepower                                                    Numerical
weight          Vehicle weight (lbs.)                                                Numerical
acceleration    Time to accelerate from 0 to 60 mph (sec.)                           Numerical
year            Model year (modulo 100)                                              Numerical
origin          Origin of car (1. American, 2. European, 3. Japanese)                Categorical

As the origin variable is categorical, one-hot encoding is applied to create 3 columns, one for each origin category. The median value of the original mpg variable is 22.75. Creating the mpg01 variable based on the median value ensures an equal split of 196 'TRUE' and 196 'FALSE' values, which eliminates the need to balance the dataset for classification modelling.

Figure 1: Original mpg variable, with median line shown.
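A minimal sketch of this preprocessing, condensed from the Appendix code (df is the cleaned Auto data frame read there):

mpg01 <- I(df$mpg >= median(df$mpg))                     # TRUE/FALSE split at the median (22.75)
df <- data.frame(mpg01, df[, -1])                        # replace mpg with mpg01
df$origin <- factor(df$origin)
df <- cbind(df, model.matrix(~ origin - 1, data = df))   # one-hot columns origin1..origin3
df <- df[, !colnames(df) %in% c("origin")]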
Figure 2: Boxplot of the auto dataset

The boxplots show that there are some outliers present in the data, particularly in the variable 'weight.' The scales of the variables also differ: 'weight' is at least 10 times larger in magnitude than the other variables, which could impact the classification models.

Table 2: Descriptive stats of variables
Variable       Min    Median   Max
cylinders      3      4        8
displacement   68     151      455
horsepower     46     93.5     230
weight         1613   2804     5140
acceleration   8      15.5     24.8
year           70     76       82
origin1        0      1        1
origin2        0      0        1
origin3        0      0        1
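Scaling was not applied in this report, but if the distance-based methods used later (e.g., KNN) were re-run on standardized predictors, a minimal sketch would be (df_scaled is an illustrative name):

num_vars <- c("cylinders", "displacement", "horsepower", "weight", "acceleration", "year")
df_scaled <- df
df_scaled[num_vars] <- scale(df[num_vars])   # center and scale each numeric predictor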
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Figure 3: Correlation heatmap of auto dataset

Figure 3 above shows the correlation heatmap of the auto dataset; the underlying correlation matrix is reproduced below. Based on the correlation coefficients:
• mpg01 has strong negative correlation (<-0.7) with:
  o cylinders
  o displacement
  o weight
• mpg01 has moderate correlation with:
  o Moderate positive correlation (0.5 to 0.7) with origin1
  o Moderate negative correlation (-0.5 to -0.7) with horsepower
• mpg01 has low correlation with:
  o acceleration
  o year
  o origin2, origin3

Correlation matrix:
               mpg01  cylinders  displacement  horsepower  weight  acceleration   year  origin1  origin2  origin3
mpg01           1.00      -0.76         -0.75       -0.67   -0.76          0.35   0.43    -0.53     0.27     0.39
cylinders      -0.76       1.00          0.95        0.84    0.90         -0.50  -0.35     0.61    -0.35    -0.40
displacement   -0.75       0.95          1.00        0.90    0.93         -0.54  -0.37     0.66    -0.37    -0.44
horsepower     -0.67       0.84          0.90        1.00    0.86         -0.69  -0.42     0.49    -0.28    -0.32
weight         -0.76       0.90          0.93        0.86    1.00         -0.42  -0.31     0.60    -0.29    -0.45
acceleration    0.35      -0.50         -0.54       -0.69   -0.42          1.00   0.29    -0.26     0.21     0.12
year            0.43      -0.35         -0.37       -0.42   -0.31          0.29   1.00    -0.14    -0.04     0.20
origin1        -0.53       0.61          0.66        0.49    0.60         -0.26  -0.14     1.00    -0.59    -0.65
origin2         0.27      -0.35         -0.37       -0.28   -0.29          0.21  -0.04    -0.59     1.00    -0.23
origin3         0.39      -0.40         -0.44       -0.32   -0.45          0.12   0.20    -0.65    -0.23     1.00

In addition, the correlation heatmap also suggests multicollinearity in the data, as the four variables cylinders, displacement, horsepower and weight have strong positive correlation (>0.7) with each other. Most notably, cylinders and displacement have the strongest correlation (0.95), followed by displacement and weight (0.93) and displacement and horsepower (0.90).
Table 3 shows the Variance Inflation Factors (VIF) calculated to determine whether multicollinearity is present in the data. A VIF above 5 indicates high collinearity and a VIF above 10 indicates very high collinearity. The barplot in Figure 4 clearly shows that 'cylinders' and 'weight' have high collinearity (VIF>5) and 'displacement' has very high collinearity (VIF>10) with the other predictors in the model. Multicollinearity needs to be addressed as it could impact the classification models built, potentially causing instability in the coefficient estimates. There is therefore a need to select features.

Figure 4: Barplot of VIF of all variables

Table 3: VIF of all variables
Variable       VIF
cylinders      5.41
displacement   11.6
horsepower     2.69
weight         6.60
acceleration   2.52
year           2.16
origin1        2.94
origin2        1.96
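The VIFs are obtained with the car package, as in the Appendix; a condensed sketch is shown below (one dummy, origin3, is left out of the model so that the one-hot columns are not perfectly collinear):

library(car)
model_reg <- glm(mpg01 ~ cylinders + displacement + horsepower + weight +
                   acceleration + year + origin1 + origin2, data = df)
vif(model_reg)   # VIF > 5: high collinearity; VIF > 10: very high collinearity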
Variable selection

To select the variables to use in the model, stepwise variable selection using AIC (Akaike Information Criterion) is run for B=100 loops (a sketch of one selection step is shown after Table 4). Figure 5 shows the number of times each variable was selected out of the 100 loops. 'weight' and 'year' are consistently selected, followed by horsepower (91 out of 100 times).

Figure 5: Number of times each variable was selected in the model using stepwise variable selection

From Figure 5, the variables horsepower, weight, year and origin1 were selected in over half of the loops (>50 counts). Referring to Table 4, which shows the VIF values of the model refit with only these selected variables, the VIFs are all <5.0, indicating that there is no problematic multicollinearity amongst the predictors in this reduced model. As such, the models are built using the variables horsepower + weight + year + origin1.

Table 4: VIF only with horsepower, weight, year, origin1 variables
Variable     VIF
horsepower   1.14
weight       1.64
year         1.63
origin1      1.16
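A minimal sketch of one selection loop, condensed from the Appendix (traintemp is that loop's training split; step() performs stepwise selection based on AIC, and trace = 0 suppresses the per-step output):

mod_step <- step(glm(mpg01 ~ ., family = binomial, data = traintemp), trace = 0)
selected_vars <- colnames(mod_step$model)[-1]   # drop the response column, keep the chosen predictors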
Methods: Monte-Carlo Cross validation

As a single split into train and test data is not sufficiently robust to evaluate model performance, single-split results are not used. Instead, Monte Carlo cross-validation with B=100 loops is used for parameter tuning and model comparisons. For each loop, the dataset is randomly split into an 80% train set and a 20% test set (n=314 for the train set and n=78 for the test set). An 80/20 split is used to give a sufficiently large sample for the test set. A seed is set for replicability. Within each loop, the models are trained on the training data of that specific loop. For parameter tuning, performance is evaluated on the train data for that loop, because in real-world applications there is no access to the true response values in the test data. Model comparison is done using the sample mean and variance of the test errors, since the response values of the held-out test data are known for evaluation.
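Each loop uses the same random-split pattern as the Appendix code; a condensed sketch:

set.seed(7406)                      # for replicability
B  <- 100                           # number of Monte Carlo loops
n  <- dim(df)[1]                    # 392 observations
n1 <- round(n / 5)                  # ~20% held out for testing (n1 = 78)
for (b in 1:B) {
  flag      <- sort(sample(1:n, n1))
  traintemp <- df[-flag, ]          # 80% train split for this loop
  testtemp  <- df[flag, ]           # 20% test split for this loop
  # ... fit and evaluate each model on this split ...
}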
Results

Parameter tuning – Random Forest

Table 5 shows the mean of the train errors over the B=100 loops of Monte Carlo Cross validation for models using the selected horsepower + weight + year + origin1 variables only, with different combinations of n.trees, mtry and nodesize.

Table 5: Random Forest: Mean of train errors across the 100 loops of Monte Carlo Cross validation, for different parameters n.trees, mtry and nodesize
                      n.trees = 200             n.trees = 300             n.trees = 500
mtry  nodesize   Avg Train Error (Rank)    Avg Train Error (Rank)    Avg Train Error (Rank)
1     1          0.027 (19)                0.028 (21)                0.027 (20)
1     2          0.029 (22)                0.029 (23)                0.029 (23)
1     5          0.035 (30)                0.034 (26)                0.035 (28)
1     10         0.040 (35)                0.041 (36)                0.040 (34)
2     1          0.000 (6)                 0.000 (4)                 0.000 (5)
2     2          0.005 (12)                0.005 (11)                0.005 (10)
2     5          0.022 (14)                0.023 (15)                0.022 (13)
2     10         0.035 (29)                0.033 (25)                0.034 (27)
3     1          0.000 (1)                 0.000 (1)                 0.000 (1)
3     2          0.002 (9)                 0.001 (7)                 0.001 (8)
3     5          0.024 (17)                0.025 (18)                0.024 (16)
3     10         0.039 (31)                0.039 (33)                0.039 (32)
From Table 5, increasing the number of trees did not result in large reductions in train error. This suggests that increasing the number of trees beyond a certain point may not significantly improve the model's performance on the training data. 'mtry' represents the number of variables randomly sampled as candidates at each split. From Table 5, as mtry increases from 1 to 2, the average train error tends to decrease; however, when mtry increases from 2 to 3, the average train error tends to increase. This suggests that mtry = 2 gives better model performance. 'nodesize' indicates the minimum size of terminal nodes in the trees, with a smaller nodesize allowing the trees to grow deeper and fit the train data more closely. From Table 5, as nodesize increases, the average train error tends to increase, suggesting that a smaller nodesize may perform better. Even though nodesize = 1 tends to have a train error of 0, this could indicate overfitting, so nodesize = 2 is used for building the model. Based on the above, n.trees = 300, mtry = 2 and nodesize = 2 are used as the parameters for Random Forest.
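With these tuned values, the Random Forest fit in each evaluation loop is (condensed from the Appendix):

library(randomForest)
mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                       data = traintemp, ntree = 300, mtry = 2, nodesize = 2)
predtest_rf <- predict(mod_rf, testtemp, type = "class")   # class predictions on the held-out split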
Parameter tuning – Boosting

Table 6: Boosting: Mean of train errors across the 100 loops of Monte Carlo Cross validation, for different parameters n.trees, shrinkage, interaction.depth
                                  n.trees = 200             n.trees = 300             n.trees = 500
shrinkage  interaction.depth  Avg Train Error (Rank)    Avg Train Error (Rank)    Avg Train Error (Rank)
0.01       1                  0.093 (27)                0.077 (26)                0.070 (24)
0.01       2                  0.074 (25)                0.067 (23)                0.059 (21)
0.01       3                  0.064 (22)                0.057 (20)                0.047 (18)
0.1        1                  0.047 (19)                0.041 (17)                0.029 (15)
0.1        2                  0.025 (13)                0.015 (11)                0.001 (7)
0.1        3                  0.013 (10)                0.002 (8)                 0.000 (1)
0.2        1                  0.035 (16)                0.027 (14)                0.017 (12)
0.2        2                  0.008 (9)                 0.000 (6)                 0.000 (1)
0.2        3                  0.000 (5)                 0.000 (1)                 0.000 (1)

Table 6 shows that, overall, increasing the number of trees, the shrinkage value and the interaction.depth decreases the train error. In fact, for the highest shrinkage (0.2) and interaction.depth (3), the average train error is 0 for all n.trees tested (200, 300, 500). However, this could indicate overfitting to the train data. To avoid overfitting, n.trees = 200, shrinkage = 0.1 and interaction.depth = 2 are used as the parameters for the boosting algorithm.
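With these tuned values, the boosting fit in each evaluation loop follows the Appendix pattern (names shortened here; gbm uses the bernoulli distribution for the 0/1 response and a 0.5 cutoff on the predicted probability):

library(gbm)
mod_gbm <- gbm(mpg01 ~ horsepower + weight + year + origin1,
               data = traintemp, distribution = "bernoulli",
               n.trees = 200, shrinkage = 0.1, interaction.depth = 2)
predtest_gbm <- ifelse(predict(mod_gbm, testtemp, n.trees = 200, type = "response") > 0.5, 1, 0)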
Findings – Comparison of Model performance

Using the selected tuned parameters for Random Forest (n.trees = 300, mtry = 2 and nodesize = 2) and Boosting (n.trees = 200, shrinkage = 0.1 and interaction.depth = 2), Monte Carlo Cross Validation is run for 100 loops, with train and test errors plotted below. The table of train and test error means and variances for all models is shown in the Appendix.

Figure 6a: Train errors, 6b: Test errors of all models (horsepower + weight + year + origin1 variables only)

From the train and test error boxplots in Figures 6a and 6b, Random Forest (RF) has the best performance in terms of the smallest train and test errors, followed by Boosting. The variances of the RF and Boosting train and test errors are amongst the lowest of the models, suggesting more consistent performance. The other models have larger train errors than Random Forest and Boosting. Looking at test errors, RF and Boosting have higher test errors than train errors, indicating that the two models might be overfitting to the training data; however, RF and Boosting still perform better than the other models on test data. LDA and Logistic Regression have higher test errors than RF and Boosting, followed by QDA and Naïve Bayes, which are higher still. The KNN models (across all k values) have the highest test errors overall. KNN is a distance-based model and might not be as suitable for this unscaled dataset: from the EDA, the weight variable ranges from 1,613 to 5,140, which is at least 10 times larger in magnitude than the year variable, so scaling the variables could potentially lead to better KNN performance. Notably, LDA and QDA assume normality of the predictor variables. Given that the variables do not follow a normal distribution, LDA and QDA might not be recommended for this dataset; their performance could be improved using transformation methods such as the Box-Cox transformation.
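The comparison boxplots are produced by column-binding the per-loop error vectors of all models, as in the Appendix; a condensed sketch:

# Each cverror_* object holds the B per-loop test errors for one model; CVALL_cv holds the KNN columns (k = 1..15)
TEALL <- cbind(cverror_rf, cverror_gbmsel, cverror_lda, cverror_qda, cverror_nb, cverror_lr, CVALL_cv)
boxplot(TEALL, main = "Test Error Boxplot")
apply(TEALL, 2, mean); apply(TEALL, 2, var)   # mean and variance of test error per model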
Table 7a: Train errors: Comparison of Random Forest vs Other methods (p-values)
Test     Boost      LDA        QDA        NB         L.R.       KNN k=1    k=3        k=5        k=7        k=9        k=11       k=13       k=15
t test   1.43E-54   5.15E-108  1.17E-103  1.14E-108  4.07E-91   6.84E-35   4.15E-93   5.46E-97   1.38E-105  1.42E-106  2.77E-107  5.53E-105  1.50E-106
w test   3.50E-18   3.48E-18   3.61E-18   3.38E-18   3.66E-18   4.12E-17   3.45E-18   3.65E-18   3.53E-18   3.55E-18   3.44E-18   3.44E-18   3.39E-18

Table 7b: Test errors: Comparison of Random Forest vs Other methods (p-values)
Test     Boost      LDA        QDA        NB         L.R.       KNN k=1    k=3        k=5        k=7        k=9        k=11       k=13       k=15
t test   0.313      2.30E-06   1.17E-08   1.03E-10   9.35E-06   1.29E-26   2.21E-19   5.27E-23   9.22E-22   2.52E-21   1.35E-20   8.96E-20   2.62E-19
w test   0.294      1.15E-05   6.59E-08   2.06E-09   3.61E-05   6.82E-17   2.98E-14   1.39E-15   2.60E-15   3.25E-15   6.41E-15   1.64E-14   2.24E-14

Statistical testing is run on both train and test errors to determine whether the Random Forest performance differs significantly from the other methods. Tables 7a and 7b show the results of both the pairwise paired t-test and the Wilcoxon signed-rank test (w test). For train errors (Table 7a), the p-values comparing Random Forest with every other method are <0.05, indicating statistically significantly better performance on the train datasets. For test errors (Table 7b), the differences between Random Forest and LDA, QDA, NB, LR and KNN are statistically significant. However, compared with the Boosting test errors, the p-value is >0.05. This indicates that the performance of Random Forest and Boosting on the test datasets is comparable, and both are suitable models for this classification task.
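The pairwise tests are run on the per-loop error columns collected above; a condensed sketch of one comparison (Random Forest vs. Boosting, the first two columns of TEALL in the Appendix):

t.test(TEALL[, 1], TEALL[, 2], paired = TRUE)$p.value        # paired t-test on the B test errors
wilcox.test(TEALL[, 1], TEALL[, 2], paired = TRUE)$p.value   # Wilcoxon signed-rank test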
Sensitivity and Specificity

Figure 7: Sensitivity of all models (horsepower + weight + year + origin1 only)
Figure 8: Specificity of all models (horsepower + weight + year + origin1 only)

Based on the sensitivity boxplot in Figure 7, Random Forest, followed by Boosting, has the highest sensitivity of all models. This indicates that Random Forest is best able to correctly identify cars with mpg01 = 1, the true positives. Given that the objective is to guide the manufacturing or purchase of high-gas-mileage cars, Random Forest is recommended based on its higher sensitivity.

On the other hand, Figure 8 shows that Random Forest and Boosting have slightly lower average specificity compared with methods like LDA. This means that RF and Boosting are less likely to correctly identify the negatives, the mpg01 = 0 cars. However, given the main objective of identifying high-mileage cars, the low test errors and high sensitivity of RF and Boosting suggest that these are good models for this context.

Conclusion

In this report, various methods were used to classify mpg01. Based on the boxplots and variable selection using stepwise selection (AIC criterion), the variables horsepower + weight + year + origin1 were used to build the models. Parameter tuning of Random Forest and Boosting was done using Monte Carlo Cross Validation with B=100 loops, to obtain parameters with lower train errors. Based on average train and test errors, Random Forest has the lowest train and test error, followed by Boosting, compared with the other methods. RF and Boosting also have low variance, suggesting more consistent predictive performance. The train-data performance of RF is statistically significantly better than that of all other methods. On the test datasets, Random Forest and Boosting have comparable performance, and both differ significantly from the other methods (LDA, QDA, NB, LR, KNN). In addition, given the context of identifying true high-mileage cars (mpg01 = 1), the high sensitivity of Random Forest and Boosting suggests that they are suitable models for this context.
Appendix - Results Table: Mean and variance of train error, test error, specificity and sensitivity from 100 loops of Monte Carlo Cross validation. Models built using: horsepower + weight + year + origin1

                       Train Error         Test Error          Specificity         Sensitivity
Method                 Mean     Variance   Mean     Variance   Mean     Variance   Mean     Variance
Random Forest          0.0052   0.0000     0.0767   0.0008     0.9233   0.0020     0.9237   0.0017
Boosting               0.0252   0.0000     0.0809   0.0008     0.9356   0.0013     0.9052   0.0024
LDA                    0.0899   0.0001     0.0956   0.0009     0.9561   0.0010     0.8649   0.0023
QDA                    0.0932   0.0001     0.1023   0.0010     0.9444   0.0014     0.8621   0.0027
Naïve Bayes            0.1035   0.0001     0.1069   0.0012     0.9444   0.0017     0.8542   0.0026
Logistic Regression    0.0883   0.0001     0.0936   0.0007     0.9204   0.0019     0.8960   0.0019
KNN - k = 1            0.0000   0.0000     0.1388   0.0010     0.8568   0.0025     0.8662   0.0025
KNN - k = 3            0.0735   0.0001     0.1242   0.0011     0.8785   0.0030     0.8743   0.0022
KNN - k = 5            0.0944   0.0001     0.1281   0.0009     0.8668   0.0026     0.8778   0.0020
KNN - k = 7            0.1047   0.0001     0.1272   0.0009     0.8652   0.0026     0.8817   0.0023
KNN - k = 9            0.1109   0.0001     0.1262   0.0009     0.8689   0.0026     0.8796   0.0021
KNN - k = 11           0.1127   0.0001     0.1240   0.0009     0.8738   0.0024     0.8787   0.0021
KNN - k = 13           0.1108   0.0001     0.1238   0.0010     0.8781   0.0025     0.8748   0.0023
KNN - k = 15           0.1107   0.0001     0.1227   0.0010     0.8797   0.0025     0.8756   0.0023
Figure 9: Box plots of variables by mpg01

Appendix

Read Data
## Read data
df <- read.table(file = "Auto.csv", sep = ",", header = TRUE);
n = dim(df)[1];     ### total number of observations
n1 = round(n/5);    ### number of observations randomly selected for testing data
mpg_org <- df$mpg
median(mpg_org)
## [1] 22.75
hist(mpg_org, main = "Histogram of MPG", xlab = "Miles Per Gallon", ylab = "Frequency")
abline(v = median(mpg_org), col = "red", lty = 2)
[Figure: Histogram of MPG (x-axis: Miles Per Gallon, y-axis: Frequency)]
Recode mpg01
mpg01 = I(df$mpg >= median(df$mpg))
df = data.frame(mpg01, df[,-1]);   ## replace column "mpg" by "mpg01".
One hot encoding
df$origin <- factor(df$origin)
# Apply one-hot encoding
df <- cbind(df, model.matrix(~ origin - 1, data = df))
df <- df[, !colnames(df) %in% c('origin')]
EDA
table(df$mpg01)
##
## FALSE  TRUE
##   196   196
boxplot(df, main = "Auto Dataset Boxplot", ylab = "Value", las = 2)
[Figure: Auto Dataset Boxplot (y-axis: Value)]
selected_columns <- df[, 2:10]
# Combine selected columns with mpg01
data_for_boxplot <- cbind(selected_columns, mpg01 = factor(df$mpg01))
# Create box plots for each variable against mpg01
for (col in names(selected_columns)) {
  boxplot(data_for_boxplot[, col] ~ data_for_boxplot$mpg01,
          main = paste("Boxplot of", col, "by mpg01"),
          ylab = col, xlab = "mpg01",
          col = c("#d7658b", "#54bebe"))
}
[Figure: Boxplot of cylinders by mpg01]
[Figure: Boxplot of displacement by mpg01]
[Figure: Boxplot of horsepower by mpg01]
[Figure: Boxplot of weight by mpg01]
[Figure: Boxplot of acceleration by mpg01]
[Figure: Boxplot of year by mpg01]
[Figure: Boxplot of origin1 by mpg01]
[Figure: Boxplot of origin2 by mpg01]
[Figure: Boxplot of origin3 by mpg01]
corr <- cor(df)
VIF
library(car)
## Loading required package: carData
model_reg <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin1 + origin2,
                 data = df)
vif_values <- vif(model_reg)
print(vif_values)
##    cylinders displacement   horsepower       weight acceleration         year
##     5.410047    11.604209     2.694488     6.602953     2.522107     2.163868
##      origin1      origin2
##     2.944230     1.963661
par(mar = c(5, 10, 4, 2) + 0.1)
# Create a bar plot of VIF values
barplot(vif_values, main = "Barplot of VIF Values", col = "steelblue", names.arg = names(vif_values))
abline(h = 5, col = "red", lwd = 2, lty = 2)
[Figure: Barplot of VIF Values]
## Variable selection
########## Monte Carlo: Variable selection ###########
set.seed(7406)
B = 100
cvmod3_sel <- data.frame(matrix(ncol = ncol(df), nrow = B))   # Create a dataframe to store selected variables
for (b in 1:B) {
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,]
  testtemp <- df[flag,]
  # Stepwise logistic regression (AIC)
  mod3_cv <- step(glm(mpg01 ~ ., family = binomial, data = traintemp), trace = 0)
  # Store selected variables in the dataframe
  selected_vars <- colnames(mod3_cv$model)[-1]   # Exclude the response column
  cvmod3_sel[b,] <- 0                            # Initialize the row
  cvmod3_sel[b, selected_vars] <- 1   # Set selected variables to 1
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
selected_cols <- cvmod3_sel[, 11:18]
summary(selected_cols)
##    horsepower       weight        year      origin1      displacement
##  Min.   :0.00    Min.   :1    Min.   :1    Min.   :0.00   Min.   :0.0000
##  1st Qu.:1.00    1st Qu.:1    1st Qu.:1    1st Qu.:0.00   1st Qu.:0.0000
##  Median :1.00    Median :1    Median :1    Median :1.00   Median :0.0000
##  Mean   :0.91    Mean   :1    Mean   :1    Mean   :0.63   Mean   :0.1531
##  3rd Qu.:1.00    3rd Qu.:1    3rd Qu.:1    3rd Qu.:1.00   3rd Qu.:0.0000
##  Max.   :1.00    Max.   :1    Max.   :1    Max.   :1.00   Max.   :1.0000
##                                                           NA's   :2
##     origin2         acceleration       cylinders
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000
##  Median :0.0000   Median :0.00000   Median :0.00000
##  Mean   :0.4286   Mean   :0.08421   Mean   :0.06849
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000
##  NA's   :2        NA's   :5         NA's   :27
library(car)
model_reg_sel <- glm(mpg01 ~ horsepower + weight + year + origin1, family = binomial, data = df)
#model_reg_sel <- lm(mpg01 ~ horsepower + weight + year + origin1, data = df)
vif_values_sel <- vif(model_reg_sel)
print(vif_values_sel)
## horsepower     weight       year    origin1
##   1.142649   1.642401   1.633459   1.162831
par(mar = c(5, 10, 4, 2) + 0.1)
# Create a bar plot of VIF values
barplot(vif_values_sel, main = "Barplot of VIF Values - Selected vars", col = "steelblue",
        names.arg = names(vif_values_sel))
abline(h = 5, col = "red", lwd = 2, lty = 2)
[Figure: Barplot of VIF Values - Selected vars]
Parameter Tuning - Random Forest
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(7406)
B <- 100
cverror_rf <- NULL
cm_rf <- NULL
# Define a grid of hyperparameters for Random Forest
n_trees_rf <- c(100, 200, 300, 500)
m_try_rf <- c(1, 2, 3)
nodesize_rf <- c(1, 2, 5, 10)
# Nested loop for hyperparameter tuning
for (trees in n_trees_rf) {
  for (mtry in m_try_rf) {
    for (nodesize in nodesize_rf) {
      cverrortrain_temp <- NULL
      cverror_temp <- NULL
      cm_temp <- NULL
      specificity_temp <- NULL
      sensitivity_temp <- NULL
      for (b in 1:B) {
        flag <- sort(sample(1:n, n1))
        traintemp <- df[-flag,];
        testtemp <- df[flag,];
        # Build the Random Forest model
        mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                               data = traintemp,
                               ntree = trees, mtry = mtry, nodesize = nodesize)
        # Calculate train error
        predtrain_rf <- predict(mod_rf, traintemp, type = "class")
        trainerror_rf <- mean(predtrain_rf != traintemp$mpg01)
        cverrortrain_temp <- c(cverrortrain_temp, trainerror_rf)
        # Calculate test error
        predtest_rf <- predict(mod_rf, testtemp, type = "class")
        temptesterror_rf <- mean(predtest_rf != testtemp$mpg01)
        cverror_temp <- c(cverror_temp, temptesterror_rf)
        # Confusion Matrix
        confusion_matrix <- table(predtest_rf, testtemp$mpg01)
        TP <- confusion_matrix[2, 2]
        FP <- confusion_matrix[1, 2]
        TN <- confusion_matrix[1, 1]
        FN <- confusion_matrix[2, 1]
        # Calculate specificity and sensitivity
        specificity <- TN / (TN + FP)
        sensitivity <- TP / (TP + FN)
        cm_temp <- rbind(cm_temp, c(specificity, sensitivity))
        specificity_temp <- c(specificity_temp, specificity)
        sensitivity_temp <- c(sensitivity_temp, sensitivity)
      }
      # Store the results
      cverror_rf <- rbind(cverror_rf, c(trees, mtry, nodesize,
                                        mean(cverrortrain_temp), var(cverrortrain_temp),
                                        mean(cverror_temp), var(cverror_temp),
                                        mean(specificity_temp), var(specificity_temp),
                                        mean(sensitivity_temp), var(sensitivity_temp)))
      cm_rf <- rbind(cm_rf, c(trees, mtry, nodesize, colMeans(cm_temp)[1], colMeans(cm_temp)[2]))
    }
  }
}
# Naming the columns of cverror_rf and cm_rf
colnames(cverror_rf) <- c("n.trees", "mtry", "nodesize", "Avg_Train_Error", "Var_Train_Error",
                          "Avg_Test_Error", "Var_Test_Error", "Avg_Specificity", "Var_Specificity",
                          "Avg_Sensitivity", "Var_Sensitivity")
colnames(cm_rf) <- c("n.trees", "mtry", "nodesize", "Specificity", "Sensitivity")
Parameter Tuning - Boosting
set.seed(7406)
library(gbm)
## Warning: package 'gbm' was built under R version 4.2.3
## Loaded gbm 2.1.8.1
B <- 100
cverror_gbm <- NULL
cm_gbm <- NULL
# Define a grid of hyperparameters
n_trees <- c(100, 200, 300, 500, 1000)
shrinkage <- c(0.01, 0.1, 0.2)
interaction_depth <- c(1, 2, 3)
# Nested loop for hyperparameter tuning
for (trees in n_trees) {
  for (shrink in shrinkage) {
    for (depth in interaction_depth) {
      cverror_temp <- NULL
      trainerror_temp <- NULL
      cm_temp <- NULL
      specificity_temp <- NULL
      sensitivity_temp <- NULL
      for (b in 1:B) {
        flag <- sort(sample(1:n, n1))
        traintemp <- df[-flag,];
        testtemp <- df[flag,];
        # Build the gbm model
        modcv_4 <- gbm(mpg01 ~ horsepower + weight + year + origin1,
                       data = traintemp, distribution = "bernoulli",
                       n.trees = trees, shrinkage = shrink, interaction.depth = depth)
        # Calculate train error
        predtrain_4 <- ifelse(predict(modcv_4, traintemp, n.trees = trees, type = "response") > 0.5, 1, 0)
        trainerror_4 <- mean(predtrain_4 != traintemp$mpg01)
        trainerror_temp <- c(trainerror_temp, trainerror_4)
        # Calculate test error
        predtest_4 <- ifelse(predict(modcv_4, testtemp, n.trees = trees, type = "response") > 0.5, 1, 0)
        temptesterror_4 <- mean(predtest_4 != testtemp$mpg01)
        cverror_temp <- c(cverror_temp, temptesterror_4)
        # Confusion Matrix
        confusion_matrix <- table(predtest_4, testtemp$mpg01)
        TP <- confusion_matrix[2, 2]
        FP <- confusion_matrix[1, 2]
        TN <- confusion_matrix[1, 1]
        FN <- confusion_matrix[2, 1]
        # Calculate specificity and sensitivity
        specificity <- TN / (TN + FP)
        sensitivity <- TP / (TP + FN)
        cm_temp <- rbind(cm_temp, c(specificity, sensitivity))
        specificity_temp <- c(specificity_temp, specificity)
        sensitivity_temp <- c(sensitivity_temp, sensitivity)
      }
      # Store the results
      cverror_gbm <- rbind(cverror_gbm, c(trees, shrink, depth,
                                          mean(trainerror_temp), var(trainerror_temp),
                                          mean(cverror_temp), var(cverror_temp),
                                          mean(specificity_temp), var(specificity_temp),
                                          mean(sensitivity_temp), var(sensitivity_temp)))
      cm_gbm <- rbind(cm_gbm, c(trees, shrink, depth, colMeans(cm_temp)[1], colMeans(cm_temp)[2]))
    }
  }
}
# Naming the columns of cverror_gbm
colnames(cverror_gbm) <- c("n.trees", "shrinkage", "interaction.depth", "Avg_Train_Error", "Var_Train_Error",
                           "Avg_Test_Error", "Var_Test_Error", "Avg_Specificity", "Var_Specificity",
                           "Avg_Sensitivity", "Var_Sensitivity")
colnames(cm_gbm) <- c("n.trees", "shrinkage", "interaction.depth", "Specificity", "Sensitivity")
Model evaluation:
## Model 1: RF - with tuned parameters
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
library(randomForest)
cverror_rf <- NULL;
cm_rf <- NULL;
trainerror_rf <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  mod_rf <- randomForest(as.factor(mpg01) ~ horsepower + weight + year + origin1,
                         data = traintemp, ntree = 300, mtry = 2, nodesize = 2)
  ### Step 3a: Calculate train error
  predtrain_rf <- predict(mod_rf, traintemp, type = "class")
  temptrainerror_rf <- c(mean(predtrain_rf != traintemp$mpg01))
  trainerror_rf <- rbind(trainerror_rf, temptrainerror_rf)
  # Step 3: Calculate test error
  predtest_rf <- predict(mod_rf, testtemp, type = "class")
  temptesterror_rf <- c(mean(predtest_rf != testtemp$mpg01));
  cverror_rf <- rbind(cverror_rf, temptesterror_rf)
  # Step 4: CM
  confusion_matrix <- table(predtest_rf, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_rf <- rbind(cm_rf, c(specificity, sensitivity))
  colnames(cm_rf) <- c("Specificity", "Sensitivity")
  cm_rf <- as.data.frame(cm_rf)
}
library(gbm)
## Model 2: GBM - with tuned parameters
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
cverror_gbmsel <- NULL;
cm_gbmsel <- NULL;
trainerror_gbmsel <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  #print(traintemp[1])
  ## Step 2: Build model with tuned parameters (n.trees = 200, shrinkage = 0.1, interaction.depth = 2, per the report)
  mod_gbmsel <- gbm(mpg01 ~ horsepower + weight + year + origin1,
                    data = traintemp, distribution = "bernoulli",
                    n.trees = 200, shrinkage = 0.1, interaction.depth = 2)
  ### Step 3a: Calculate train error
  predtrain_gbmsel <- ifelse(predict(mod_gbmsel, traintemp, n.trees = 200, type = "response") > 0.5, 1, 0)
  temptrainerror_gbmsel <- c(mean(predtrain_gbmsel != traintemp$mpg01))
  trainerror_gbmsel <- rbind(trainerror_gbmsel, temptrainerror_gbmsel)
  # Step 3: Calculate test error
  predtest_gbmsel <- ifelse(predict(mod_gbmsel, testtemp, n.trees = 200, type = "response") > 0.5, 1, 0)
  temptesterror_gbmsel <- c(mean(predtest_gbmsel != testtemp$mpg01));
  cverror_gbmsel <- rbind(cverror_gbmsel, temptesterror_gbmsel)
  # Step 4: CM
  confusion_matrix <- table(predtest_gbmsel, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_gbmsel <- rbind(cm_gbmsel, c(specificity, sensitivity))
  colnames(cm_gbmsel) <- c("Specificity", "Sensitivity")
  cm_gbmsel <- as.data.frame(cm_gbmsel)
}
## Model 3: LDA
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
library(MASS)
cverror_lda <- NULL;
cm_lda <- NULL;
trainerror_lda <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_1 <- lda(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);
  # Step 3a: Calculate training error
  predtrain_1 <- predict(modcv_1, traintemp[, c(4, 5, 7, 8)])$class;
  trainerror_1 <- c(mean(predtrain_1 != traintemp$mpg01))
  trainerror_lda <- rbind(trainerror_lda, trainerror_1)
  # Step 3: Calculate test error
  predtest_1 <- predict(modcv_1, testtemp[, c(4, 5, 7, 8)])$class;
  temptesterror_1 <- c(mean(predtest_1 != testtemp$mpg01));
  cverror_lda <- rbind(cverror_lda, temptesterror_1)
  # Step 4: CM
  confusion_matrix <- table(predtest_1, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_lda <- rbind(cm_lda, c(specificity, sensitivity))
  colnames(cm_lda) <- c("Specificity", "Sensitivity")
  cm_lda <- as.data.frame(cm_lda)
}
## Model 4: QDA
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
library(MASS)
cverror_qda <- NULL;
cm_qda <- NULL;
trainerror_qda <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_2 <- qda(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);
  # Step 3a: Calculate training error
  predtrain_2 <- predict(modcv_2, traintemp[, c(4, 5, 7, 8)])$class;
  trainerror_2 <- c(mean(predtrain_2 != traintemp$mpg01))
  trainerror_qda <- rbind(trainerror_qda, trainerror_2)
  # Step 3: Calculate test error
  predtest_2 <- predict(modcv_2, testtemp[, c(4, 5, 7, 8)])$class;
  temptesterror_2 <- c(mean(predtest_2 != testtemp$mpg01));
  cverror_qda <- rbind(cverror_qda, temptesterror_2)
  # Step 4: CM
  confusion_matrix <- table(predtest_2, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_qda <- rbind(cm_qda, c(specificity, sensitivity))
  colnames(cm_qda) <- c("Specificity", "Sensitivity")
  cm_qda <- as.data.frame(cm_qda)
}
## Model 5: Naive Bayes
set.seed(7406);    ### set the seed for randomization
library(e1071)
B = 100;           ### number of loops
cverror_nb <- NULL;
cm_nb <- NULL;
trainerror_nb <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_3 <- naiveBayes(traintemp[, c(4, 5, 7, 8)], traintemp[, 1]);
  # Step 3a: Calculate training error
  predtrain_3 <- predict(modcv_3, traintemp[, c(4, 5, 7, 8)]);
  trainerror_3 <- c(mean(predtrain_3 != traintemp$mpg01))
  trainerror_nb <- rbind(trainerror_nb, trainerror_3)
  # Step 3: Calculate test error
  predtest_3 <- predict(modcv_3, testtemp[, c(4, 5, 7, 8)]);
  temptesterror_3 <- c(mean(predtest_3 != testtemp$mpg01));
  cverror_nb <- rbind(cverror_nb, temptesterror_3)
  # Step 4: CM
  confusion_matrix <- table(predtest_3, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_nb <- rbind(cm_nb, c(specificity, sensitivity))
  colnames(cm_nb) <- c("Specificity", "Sensitivity")
  cm_nb <- as.data.frame(cm_nb)
}
## Model 6: Logistic Regression
set.seed(7406);    ### set the seed for randomization
B = 100;           ### number of loops
cverror_lr <- NULL;
cm_lr <- NULL;
trainerror_lr <- NULL;
for (b in 1:B) {
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  modcv_4 <- glm(mpg01 ~ horsepower + weight + year + origin1, data = traintemp, family = binomial);
  # Step 3a: Calculate training error
  predtrain_4 <- ifelse(predict(modcv_4, traintemp, type = "response") > 0.5, 1, 0);
  trainerror_4 <- c(mean(predtrain_4 != traintemp$mpg01))
  trainerror_lr <- rbind(trainerror_lr, trainerror_4)
  # Step 3: Calculate test error
  predtest_4 <- ifelse(predict(modcv_4, testtemp, type = "response") > 0.5, 1, 0);
  temptesterror_4 <- c(mean(predtest_4 != testtemp$mpg01));
  cverror_lr <- rbind(cverror_lr, temptesterror_4)
  # Step 4: CM
  confusion_matrix <- table(predtest_4, testtemp[, 1])
  # Extract TP, FP, TN, FN
  TP <- confusion_matrix[2, 2]
  FP <- confusion_matrix[1, 2]
  TN <- confusion_matrix[1, 1]
  FN <- confusion_matrix[2, 1]
  # Calculate specificity and sensitivity
  specificity <- TN / (TN + FP)
  sensitivity <- TP / (TP + FN)
  cm_lr <- rbind(cm_lr, c(specificity, sensitivity))
  colnames(cm_lr) <- c("Specificity", "Sensitivity")
  cm_lr <- as.data.frame(cm_lr)
}
## Model 7: KNN
library(class);
set.seed(7406);    ### set the seed for randomization
kk <- c(1, 3, 5, 7, 9, 11, 13, 15);
B = 100;           ### number of loops
CVALL_cv_train <- NULL;
CVALL_cv <- NULL;
CVALL_sens_kNN <- NULL;
CVALL_spec_kNN <- NULL;
confusion_matrices <- list()   # Initialize list to store confusion matrices
# Initialize a data frame to store TP, FP, TN, FN for each iteration
result_df <- data.frame(TP = numeric(B), FP = numeric(B), TN = numeric(B), FN = numeric(B))
for (b in 1:B) {
  ## For knn
  cverrortrain_knn <- NULL;
  cverror_knn <- NULL;
  spec_knn <- NULL;
  sens_knn <- NULL;
  ## Step 1: Split to temp train, test set
  flag <- sort(sample(1:n, n1))
  traintemp <- df[-flag,];
  testtemp <- df[flag,];
  ## Step 2: Build model
  for (i in 1:length(kk)) {
    # KNN
    xnew_knn <- testtemp[, c(4, 5, 7, 8)]
    ypred_knn.train <- knn(traintemp[, c(4, 5, 7, 8)], traintemp[, c(4, 5, 7, 8)], traintemp$mpg01, k = kk[i])
    temptrainerror_knn <- mean(ypred_knn.train != traintemp$mpg01)
    cverrortrain_knn <- cbind(cverrortrain_knn, temptrainerror_knn)
    ypred_knn.test <- knn(traintemp[, c(4, 5, 7, 8)], xnew_knn, traintemp$mpg01, k = kk[i])
    temptesterror_knn <- mean(ypred_knn.test != testtemp$mpg01)
    cverror_knn <- cbind(cverror_knn, temptesterror_knn)
    # Create confusion matrix
    confusion <- table(ypred_knn.test, testtemp$mpg01)
    confusion_matrices <- append(confusion_matrices, list(confusion))
    # Extract TP, FP, TN, FN
    TP <- confusion[2, 2]
    FP <- confusion[1, 2]
    TN <- confusion[1, 1]
    FN <- confusion[2, 1]
    # Calculate specificity and sensitivity
    specificity <- TN / (TN + FP)
    sensitivity <- TP / (TP + FN)
    spec_knn <- cbind(spec_knn, specificity)
    sens_knn <- cbind(sens_knn, sensitivity)
  }
  CVALL_cv_train <- rbind(CVALL_cv_train, cverrortrain_knn)
  CVALL_cv <- rbind(CVALL_cv, cverror_knn)
  CVALL_spec_kNN <- rbind(CVALL_spec_kNN, spec_knn)
  CVALL_sens_kNN <- rbind(CVALL_sens_kNN, sens_knn)
}
# Now, 'confusion_matrices' contains the confusion matrices for each value of k.
# 'result_df' contains TP, FP, TN, FN for each iteration.
Evaluation Metrics:
TrainALL = cbind(trainerror_rf, trainerror_gbmsel, trainerror_lda, trainerror_qda, trainerror_nb,
                 trainerror_lr, CVALL_cv_train)
boxplot(TrainALL, main = "Train Error Boxplot")
[Figure: Train Error Boxplot]
TEALL = cbind(cverror_rf, cverror_gbmsel, cverror_lda, cverror_qda, cverror_nb, cverror_lr, CVALL_cv)
boxplot(TEALL, main = "Test Error Boxplot")
[Figure: Test Error Boxplot]
SpecALL = cbind(cm_rf$Specificity, cm_gbmsel$Specificity, cm_lda$Specificity, cm_qda$Specificity,
                cm_nb$Specificity, cm_lr$Specificity, CVALL_spec_kNN)
boxplot(SpecALL, main = "Specificity Boxplot")
[Figure: Specificity Boxplot]
SensALL = cbind(cm_rf$Sensitivity, cm_gbmsel$Sensitivity, cm_lda$Sensitivity, cm_qda$Sensitivity,
                cm_nb$Sensitivity, cm_lr$Sensitivity, CVALL_sens_kNN)
boxplot(SensALL, main = "Sensitivity Boxplot")
[Figure: Sensitivity Boxplot]
means <- apply(TEALL, 2, mean)
variances <- apply(TEALL, 2, var)
summary_stats <- data.frame(Column = colnames(TEALL), Mean = means, Variance = variances)
print(summary_stats)
##               Column       Mean     Variance
## 1                     0.07666667 0.0007736139
## 2                     0.08089744 0.0008025521
## 3                     0.09564103 0.0008979340
## 4                     0.10230769 0.0009794859
## 5                     0.10692308 0.0011795137
## 6                     0.09358974 0.0007421354
## 7  temptesterror_knn  0.13884615 0.0010129401
## 8  temptesterror_knn  0.12423077 0.0011113601
## 9  temptesterror_knn  0.12807692 0.0008550163
## 10 temptesterror_knn  0.12717949 0.0009486051
## 11 temptesterror_knn  0.12615385 0.0009487379
## 12 temptesterror_knn  0.12397436 0.0009033630
## 13 temptesterror_knn  0.12384615 0.0009902443
## 14 temptesterror_knn  0.12269231 0.0009903605
summary_spec <- data.frame(Column = colnames(SpecALL), Mean = apply(SpecALL, 2, mean),
                           Variance = apply(SpecALL, 2, var))
print(summary_spec)
##         Column      Mean    Variance
## 1              0.9232738 0.001959364
## 2              0.9355589 0.001320847
## 3              0.9561488 0.001036278
## 4              0.9443722 0.001369056
## 5              0.9444319 0.001711889
## 6              0.9204442 0.001929678
## 7  specificity 0.8567752 0.002518453
## 8  specificity 0.8784851 0.002984309
## 9  specificity 0.8667769 0.002605161
## 10 specificity 0.8652357 0.002576972
## 11 specificity 0.8688997 0.002633592
## 12 specificity 0.8737502 0.002421178
## 13 specificity 0.8780773 0.002453583
## 14 specificity 0.8796575 0.002496284
summary_sens <- data.frame(Column = colnames(SensALL), Mean = apply(SensALL, 2, mean),
                           Variance = apply(SensALL, 2, var))
print(summary_sens)
##         Column      Mean    Variance
## 1              0.9236596 0.001738185
## 2              0.9051825 0.002357728
## 3              0.8649023 0.002298335
## 4              0.8621474 0.002666568
## 5              0.8541714 0.002572463
## 6              0.8960307 0.001912041
## 7  sensitivity 0.8661883 0.002493456
## 8  sensitivity 0.8743033 0.002201731
## 9  sensitivity 0.8778338 0.001956006
## 10 sensitivity 0.8816503 0.002320618
## 11 sensitivity 0.8795674 0.002131426
## 12 sensitivity 0.8786573 0.002124985
## 13 sensitivity 0.8748121 0.002348846
## 14 sensitivity 0.8755619 0.002266427
Significance testing

t-test
# t test
p_values_df <- data.frame(Column1 = integer(0), Column2 = integer(0),
                          P_Value = numeric(0), Significant = character(0))
# Generate combinations of column pairs
combinations <- combn(ncol(TEALL), 2)
# Perform paired t-tests for each combination
for (i in seq_len(ncol(combinations))) {
  col1 <- combinations[1, i]
  col2 <- combinations[2, i]
  result <- t.test(TEALL[, col1], TEALL[, col2], paired = TRUE)
  # Add p-value and significance to data frame
  significant <- ifelse(result$p.value < 0.05, "Yes", "No")
  p_values_df <- rbind(p_values_df, data.frame(Column1 = col1, Column2 = col2,
                                               P_Value = result$p.value, Significant = significant))
}
# Print the data frame of p-values
print(p_values_df)
##    Column1 Column2      P_Value Significant
## 1        1       2 3.129546e-01          No
## 2        1       3 2.296811e-06         Yes
## 3        1       4 1.167770e-08         Yes
## 4        1       5 1.030949e-10         Yes
## 5        1       6 9.353614e-06         Yes
## 6        1       7 1.286170e-26         Yes
## 7        1       8 2.209082e-19         Yes
## 8        1       9 5.267922e-23         Yes
## 9        1      10 9.218069e-22         Yes
## 10       1      11 2.515879e-21         Yes
## 11       1      12 1.345765e-20         Yes
## 12       1      13 8.963262e-20         Yes
## 13       1      14 2.624135e-19         Yes
## 14       2       3 4.828501e-04         Yes
## 15       2       4 1.342677e-06         Yes
## 16       2       5 1.413071e-07         Yes
## 17       2       6 1.191119e-03         Yes
## 18       2       7 1.665457e-25         Yes
## 19       2       8 1.107275e-16         Yes
## 20       2       9 1.500643e-20         Yes
## 21       2      10 4.511518e-19         Yes
## 22       2      11 1.221572e-18         Yes
## 23       2      12 1.695142e-17         Yes
## 24       2      13 3.716715e-17         Yes
## 25       2      14 1.563016e-16         Yes
## 26       3       4 5.052609e-04         Yes
## 27       3       5 5.855067e-07         Yes
## 28       3       6 4.004751e-01          No
## 29       3       7 1.879294e-15         Yes
## 30       3       8 1.869527e-08         Yes
## 31       3       9 1.625460e-10         Yes
## 32       3      10 8.125927e-10         Yes
## 33       3      11 1.497783e-09         Yes
## 34       3      12 6.207109e-09         Yes
## 35       3      13 7.390469e-09         Yes
## 36       3      14 2.352131e-08         Yes
## 37       4       5 5.471170e-03         Yes
## 38       4       6 8.644515e-04         Yes
## 39       4       7 2.732528e-12         Yes
## 40       4       8 1.039836e-05         Yes
## 41       4       9 7.617537e-08         Yes
## 42       4      10 4.577721e-07         Yes
## 43       4      11 9.461594e-07         Yes
## 44       4      12 3.931840e-06         Yes
## 45       4      13 8.343329e-06         Yes
## 46       4      14 2.227131e-05         Yes
## 47       5       6 1.261662e-06         Yes
## 48       5       7 1.641560e-09         Yes
## 49       5       8 5.457933e-04         Yes
## 50       5       9 1.471888e-05         Yes
## 51       5      10 5.195035e-05         Yes
## 52       5      11 8.278948e-05         Yes
## 53       5      12 3.250051e-04         Yes
## 54       5      13 4.564791e-04         Yes
## 55       5      14 1.315725e-03         Yes
## 56       6       7 5.182996e-19         Yes
## 57       6       8 1.130561e-10         Yes
## 58       6       9 5.388557e-13         Yes
## 59       6      10 4.119388e-12         Yes
## 60       6      11 1.419441e-11         Yes
## 61       6      12 1.599015e-11         Yes
## 62       6      13 5.998276e-11         Yes
## 63       6      14 3.332584e-10         Yes
## 64       7       8 1.252486e-05         Yes
## 65       7       9 2.558681e-04         Yes
## 66       7      10 1.826809e-04         Yes
## 67       7      11 1.011440e-04         Yes
## 68       7      12 1.214581e-05         Yes
## 69       7      13 1.473735e-05         Yes
## 70       7      14 1.599119e-05         Yes
## 71       8       9 9.054984e-02          No
## 72       8      10 2.700545e-01          No
## 73       8      11 4.485337e-01          No
## 74       8      12 9.219066e-01          No
## 75       8      13 8.874992e-01          No
## 76       8      14 6.039986e-01          No
## 77       9      10 5.715115e-01          No
## 78       9      11 2.208576e-01          No
## 79       9      12 2.359098e-02         Yes
## 80       9      13 3.583216e-02         Yes
## 81       9      14 2.359580e-02         Yes
## 82      10      11 4.310488e-01          No
## 83      10      12 6.429585e-02          No
## 84      10      13 1.003370e-01          No
## 85      10      14 6.139859e-02          No
## 86      11      12 8.433192e-02          No
## 87      11      13 1.424806e-01          No
## 88      11      14 9.354490e-02          No
## 89      12      13 9.099457e-01          No
## 90      12      14 4.138163e-01          No
## 91      13      14 3.480105e-01          No
Wilcox test
# Create an empty data frame to store p-values
wilcox_values_df <- data.frame(Column1 = integer(0), Column2 = integer(0),
                               P_Value = numeric(0), Significant = character(0))
# Generate combinations of column pairs
combinations <- combn(ncol(TEALL), 2)
# Perform paired Wilcoxon tests for each combination
for (i in seq_len(ncol(combinations))) {
  col1 <- combinations[1, i]
  col2 <- combinations[2, i]
  result <- wilcox.test(TEALL[, col1], TEALL[, col2], paired = TRUE)
  # Add p-value and significance to data frame
  significant <- ifelse(result$p.value < 0.05, "Yes", "No")
  wilcox_values_df <- rbind(wilcox_values_df, data.frame(Column1 = col1, Column2 = col2,
                                                         P_Value = result$p.value, Significant = significant))
}
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(TEALL[, col1], TEALL[, col2], paired = TRUE):
## cannot compute exact p-value with zeroes
# Print the data frame of p-values
print(wilcox_values_df)
##    Column1 Column2      P_Value Significant
## 1        1       2 2.936516e-01          No
## 2        1       3 1.152155e-05         Yes
## 3        1       4 6.586686e-08         Yes
## 4        1       5 2.060770e-09         Yes
## 5        1       6 3.609859e-05         Yes
## 6        1       7 6.816979e-17         Yes
## 7        1       8 2.976688e-14         Yes
## 8        1       9 1.394741e-15         Yes
## 9        1      10 2.604831e-15         Yes
## 10       1      11 3.250826e-15         Yes
## 11       1      12 6.406268e-15         Yes
## 12       1      13 1.636190e-14         Yes
## 13       1      14 2.237652e-14         Yes
## 14       2       3 1.375518e-03         Yes
## 15       2       4 6.228213e-06         Yes
## 16       2       5 5.719007e-07         Yes
## 17       2       6 1.993251e-03         Yes
## 18       2       7 6.700298e-17         Yes
## 19       2       8 8.138846e-13         Yes
## 20       2       9 1.438259e-14         Yes
## 21       2      10 3.862181e-14         Yes
## 22       2      11 1.094281e-13         Yes
## 23       2      12 9.516703e-14         Yes
## 24       2      13 1.908234e-13         Yes
## 25       2      14 6.363138e-13         Yes
## 26       3       4 5.345427e-04         Yes
## 27       3       5 4.731274e-07         Yes
## 28       3       6 3.527105e-01          No
## 29       3       7 3.198704e-12         Yes
## 30       3       8 1.652952e-07         Yes
## 31       3       9 2.455649e-09         Yes
## 32       3      10 1.446520e-08         Yes
## 33       3      11 1.392513e-08         Yes
## 34       3      12 5.629992e-08         Yes
## 35       3      13 1.693237e-08         Yes
## 36       3      14 5.571670e-08         Yes
## 37       4       5 8.610989e-03         Yes
## 38       4       6 8.163688e-04         Yes
## 39       4       7 3.356121e-10         Yes
## 40       4       8 1.074098e-05         Yes
## 41       4       9 1.036627e-07         Yes
## 42       4      10 6.460885e-07         Yes
## 43       4      11 9.895693e-07         Yes
## 44       4      12 5.365999e-06         Yes
## 45       4      13 2.333991e-06         Yes
## 46       4      14 1.031598e-05         Yes
## 47       5       6 1.654251e-06         Yes
## 48       5       7 1.035835e-08         Yes
## 49       5       8 3.542278e-04         Yes
## 50       5       9 5.736968e-06         Yes
## 51       5      10 1.052197e-05         Yes
## 52       5      11 3.681363e-05         Yes
## 53       5      12 1.944498e-04         Yes
## 54       5      13 6.867979e-05         Yes
## 55       5      14 3.713461e-04         Yes
## 56       6       7 6.514968e-14         Yes
## 57       6       8 1.452611e-09         Yes
## 58       6       9 2.456485e-11         Yes
## 59       6      10 2.199424e-10         Yes
## 60       6      11 4.072362e-10         Yes
## 61       6      12 3.579193e-10         Yes
## 62       6      13 3.580718e-10         Yes
## 63       6      14 7.713454e-10         Yes
## 64       7       8 1.971053e-05         Yes
## 65       7       9 4.110505e-04         Yes
## 66       7      10 1.075116e-04         Yes
## 67       7      11 8.182855e-05         Yes
## 68       7      12 2.131455e-05         Yes
## 69       7      13 1.283626e-05         Yes
## 70       7      14 4.425322e-05         Yes
## 71       8       9 8.109189e-02          No
## 72       8      10 3.118137e-01          No
## 73       8      11 4.552619e-01          No
## 74       8      12 8.789112e-01          No
## 75       8      13 8.831379e-01          No
## 76       8      14 8.330633e-01          No
## 77       9      10 2.255099e-01          No
## 78       9      11 1.649483e-01          No
## 79       9      12 2.559919e-02         Yes
## 80       9      13 4.040016e-02         Yes
## 81       9      14 3.743972e-02         Yes
## 82      10      11 3.596937e-01          No
## 83      10      12 1.542088e-01          No
## 84      10      13 3.331840e-01          No
## 85      10      14 3.009194e-01          No
## 86      11      12 1.314828e-01          No
## 87      11      13 3.349624e-01          No
## 88      11      14 3.562930e-01          No
## 89      12      13 8.495031e-01          No
## 90      12      14 8.107766e-01          No
## 91      13      14 9.675252e-01          No