Lab 4
Western Michigan University · Course 5821 (Industrial Engineering) · Apr 3, 2024
Comparing KNN, LDA, and Logistic Regression
Maisha Maliha
10/04/2023
# Load necessary libraries
library(class)
library(Metrics)
library(MASS)
library(nnet)

# Load and preprocess your dataset
data("iris")

# Replace 'X' and 'y' with your feature matrix and target variable
X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Species
# Split the dataset into training (70%) and testing (30%) sets
set.seed(123)
splitIndex <- sample(1:nrow(X), 0.7 * nrow(X))
train_data <- X[splitIndex, ]
test_data <- X[-splitIndex, ]
train_labels <- y[splitIndex]
test_labels <- y[-splitIndex]

# Convert to binary classification (versicolor vs. the rest)
train_labels_binary <- ifelse(train_labels == "versicolor", 1, 0)
test_labels_binary <- ifelse(test_labels == "versicolor", 1, 0)
# K-Nearest Neighbors (KNN)
best_k <- NULL
best_accuracy <- 0

# Iterate through different values of K to find the optimal K
for (k in 1:20) {
  knn_model <- knn(train_data, test_data, train_labels, k = k)
  accuracy <- sum(knn_model == test_labels) / length(test_labels)
  if (accuracy > best_accuracy) {
    best_accuracy <- accuracy
    best_k <- k
  }
}
# Train the final KNN model with the best K value
final_knn_model <- knn(train_data, test_data, train_labels_binary, k = best_k)

# Evaluate KNN
cm <- table(Actual = test_labels_binary, Predicted = final_knn_model)
knn_accuracy <- sum(diag(cm)) / sum(cm)
knn_precision <- cm[2, 2] / sum(cm[, 2])
knn_recall <- cm[2, 2] / sum(cm[2, ])
knn_f1_score <- 2 * (knn_precision * knn_recall) / (knn_precision + knn_recall)
# Linear Discriminant Analysis (LDA) + Logistic Regression
lda_model <- lda(train_labels ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = train_data)

# LDA scores for the training data using the predict function
lda_scores_train <- predict(lda_model, train_data)$x

# Train logistic regression model on the first two LD scores
logistic_model <- glm(train_labels_binary ~ lda_scores_train[, 1] + lda_scores_train[, 2],
                      family = binomial)

# Predict probabilities for the test data
lda_scores_test <- predict(lda_model, test_data)$x
predicted_probabilities <- predict(logistic_model, newdata = data.frame(lda_scores_test),
                                   type = "response")
## Warning: 'newdata' had 45 rows but variables found have 105 rows

predicted_labels <- rep(0, length(test_labels_binary))
predicted_labels[1:45] <- ifelse(predicted_probabilities > 0.5, 1, 0)
## Warning in predicted_labels[1:45] <- ifelse(predicted_probabilities > 0.5, :
## number of items to replace is not a multiple of replacement length
# Evaluate LDA + Logistic Regression
conf_matrix <- table(Actual = test_labels_binary, Predicted = predicted_labels)
lda_accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
lda_precision <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
lda_recall <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
lda_f1_score <- 2 * (lda_precision * lda_recall) / (lda_precision + lda_recall)
# Logistic Regression
# Fit logistic regression model
lr_model <- glm(train_labels_binary ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = train_data, family = binomial)

# Predict on the test data using the logistic regression model
lr_predicted <- predict(lr_model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions
lr_predicted_binary <- ifelse(lr_predicted > 0.5, 1, 0)

# Evaluate Logistic Regression
Actual <- test_labels_binary
Predicted <- lr_predicted_binary
cm_lr <- as.table(table(Actual, Predicted))
log_accuracy <- sum(diag(cm_lr)) / sum(cm_lr)
log_precision <- cm_lr[2, 2] / sum(cm_lr[, 2])
log_recall <- cm_lr[2, 2] / sum(cm_lr[2, ])
log_f1_score <- 2 * (log_precision * log_recall) / (log_precision + log_recall)
Results and Discussion
After conducting the experiments, we obtained the following results:

K-Nearest Neighbors (KNN)
cat(best_k, "\n")         # optimal K
## 1
cat(knn_accuracy, "\n")   # accuracy
## 0.9777778
cat(knn_precision, "\n")  # precision
## 1
cat(knn_recall, "\n")     # recall
## 0.9444444
cat(knn_f1_score, "\n")   # F1 score
## 0.9714286
Linear Discriminant Analysis (LDA) + Logistic Regression
cat("Accuracy: ", lda_accuracy, "\n")
## Accuracy: 0.5111111
cat("Precision: ", lda_precision, "\n")
## Precision: 0.3
cat("Recall: ", lda_recall, "\n")
## Recall: 0.1666667
cat("F1 Score: ", lda_f1_score, "\n")
## F1 Score: 0.2142857
Logistic Regression
cat("Accuracy: ", log_accuracy, "\n")
## Accuracy: 0.7111111
cat("Precision: ", log_precision, "\n")
## Precision: 0.7272727
cat("Recall: ", log_recall, "\n")
## Recall: 0.4444444
cat("F1 Score: ", log_f1_score, "\n")
## F1 Score: 0.5517241
Based on the evaluation metrics, we can discuss each model's performance on the target class "versicolor" and its strengths and weaknesses:
KNN stands out with high accuracy (97.78%), perfect precision, and strong recall: every sample it labeled versicolor truly was versicolor, and it captured most of the versicolor samples in the test set. The F1 score of 0.97 confirms this strong overall performance.
The combination of Linear Discriminant Analysis (LDA) and Logistic Regression performed far worse than KNN, with low accuracy, precision, recall, and F1 score. Part of this is likely a scoring problem rather than a modeling one: the warning "'newdata' had 45 rows but variables found have 105 rows" indicates that the logistic model's formula referenced the stored training scores directly, so predict() reused them instead of the test scores. Beyond that, LDA may simply not be the best feature-reduction technique for this dataset, and the pipeline would need further tuning or different feature-selection methods to improve.
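One way that warning could be avoided (a sketch, assuming the same 70/30 iris split and the MASS package, not the lab's exact objects) is to keep the LD scores in data frames whose column names, LD1 and LD2, are shared between training and test, so that predict() genuinely scores the held-out rows:

```r
library(MASS)

# Recreate a 70/30 split like the lab's (assumption: same seed and data)
data("iris")
set.seed(123)
idx   <- sample(1:nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]
train_bin <- ifelse(train$Species == "versicolor", 1, 0)
test_bin  <- ifelse(test$Species == "versicolor", 1, 0)

lda_model <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = train)

# Named LD-score columns (LD1, LD2) shared by train and test
train_scores <- as.data.frame(predict(lda_model, train)$x)
test_scores  <- as.data.frame(predict(lda_model, test)$x)

# Fit on training scores, then score the 45 test rows by column name
fit  <- glm(train_bin ~ LD1 + LD2, data = train_scores, family = binomial)
prob <- predict(fit, newdata = test_scores, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
```

Even with the scoring fixed, a single linear boundary in LD space can still struggle to isolate versicolor, which lies between the other two species along LD1.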
Logistic Regression demonstrates moderate performance, with an accuracy of 71.11% and a reasonable balance between precision and recall. Its F1 score is higher than that of the LDA + Logistic Regression pipeline but still well below KNN's. Logistic Regression is a simple and interpretable model, making it a good choice for cases where model interpretability is essential.
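That interpretability can be made concrete: each coefficient of a binomial glm() is a change in log-odds, so exponentiating gives odds ratios. A minimal sketch (fit on the full iris data for brevity, not the lab's 70/30 split):

```r
data("iris")
versicolor_bin <- ifelse(iris$Species == "versicolor", 1, 0)
fit <- glm(versicolor_bin ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris, family = binomial)

# Multiplicative change in the odds of versicolor per 1 cm increase in each feature
odds_ratios <- exp(coef(fit))
```

An odds ratio above 1 means the feature raises the odds of versicolor, below 1 means it lowers them, which is the kind of per-feature reading KNN cannot provide.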
Conclusion
When selecting a model, it is critical to take the specific use case, dataset properties, and computational budget into account. Logistic Regression offers a straightforward, interpretable option, while KNN shines in situations where complex decision boundaries matter. Further experimentation and hyperparameter optimization could improve the performance of all of these models.
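As one concrete example of that hyperparameter optimization: in the lab, K was chosen by accuracy on the test set, which leaks test information into model selection. A sketch of choosing K by leave-one-out cross-validation on the training split instead, assuming the class package's knn.cv() and the same seed as the lab:

```r
library(class)

# Training split as in the lab (assumption: same seed and features)
data("iris")
set.seed(123)
idx          <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data   <- iris[idx, 1:4]
train_labels <- iris$Species[idx]

# Leave-one-out CV accuracy for each candidate K, using only training data
cv_accuracy <- sapply(1:20, function(k) {
  mean(knn.cv(train_data, train_labels, k = k) == train_labels)
})
best_k_cv <- which.max(cv_accuracy)
```

The test set is then touched only once, to report the final accuracy of the model trained with best_k_cv.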