CSE6242, Georgia State University
Subject: Industrial Engineering. Date: Dec 6, 2023. Uploaded by DrSummerHornet24.
Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier:
(a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and
(b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).
A.
Using cross-validation (train.kknn) with kmax = 42, the model reported a best k of 12. This agrees with the answer I received in the previous homework, where we used LOOCV, even though I gave train.kknn a randomized training set containing 70% of the rows of credit_card_data.
I then took the k that train.kknn selected, k = 12, and fit a kknn model the same way as in HW 1, except this time with separate training and test sets. I first ran the model on the training data to check accuracy, and it returned exactly the accuracy I obtained in Homework 1 for k = 12, which is 84.68%. I then evaluated the kknn model with k = 12 on the test set and received a lower accuracy of 80%. This was well within what I expected: the model had never seen the test data, so fitting to the random noise in the training data made the training accuracy look overly optimistic, as the drop on the test set shows.
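The write-up works in R with train.kknn; as a sketch of the same part (a) workflow, here is the analogous procedure in Python with scikit-learn. This is not the author's code: the synthetic data below is a stand-in for credit_card_data.txt, and the 10-fold CV choice is an assumption (train.kknn uses leave-one-out by default).

```python
# Part (a) sketch: pick k by cross-validation on a 70% training split,
# then compare training accuracy against held-out test accuracy.
# Synthetic data stands in for credit_card_data.txt (not reproduced here).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# ~654 rows with 10 predictors, roughly the shape of the credit card data.
X, y = make_classification(n_samples=654, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)

# Search k = 1..42 (kmax = 42 in the write-up) by 10-fold CV on the training set.
cv_acc = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_tr, y_tr, cv=10).mean()
          for k in range(1, 43)}
best_k = max(cv_acc, key=cv_acc.get)

# Refit with the chosen k; training accuracy is typically optimistic
# relative to the test accuracy, as observed in the write-up.
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
train_acc = knn.score(X_tr, y_tr)
test_acc = knn.score(X_te, y_te)
```

The key point the sketch mirrors is that k is chosen using only the training split; the test set is touched once, at the end, to estimate generalization.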
B.
The process here was very similar to the last part of Homework 1, except we essentially ran multiple iterations on subsets of the full credit card data set. For the training, validation, and test sets I used a 60-20-20 percentage split, respectively. Checking accuracies on the training set, I found K=10 through K=15 to be the optimal values, so I carried those into the validation step. When validating the model, I used two for loops: one to iterate through the six K values (K=10 to K=15) and one to iterate through all rows while excluding the selected data point. All six returned essentially identical validation accuracy (84.94%), so, as I learned in office hours, Occam's razor applies and we choose the lowest K, which is K = 10.
Having selected the best K value, I ran my model on the test set to get an accuracy that should be an unbiased estimate of the model's performance. I repeated the same process as for the training and validation sets, except this time there is no need for a loop over different K values. The most surprising part is that I ran this entire process three times. Initially I received a test accuracy of 55%, which made me think something was seriously wrong. I re-did the entire process and received a test accuracy of 87%, higher than the training and validation accuracies! With only 20% of the rows held out, the test set is small, so the accuracy estimate can swing substantially with the random split. I have attached my code for review. Overall, K=10 and K=12 gave me the best results.
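The part (b) procedure above can also be sketched in Python with scikit-learn. Again this is an illustration, not the author's attached R code: the data is synthetic, and the per-row leave-one-out loop described in the write-up is simplified here to scoring each candidate K directly on the validation set.

```python
# Part (b) sketch: 60-20-20 train/validation/test split, choose K on the
# validation set, then report test accuracy exactly once.
# Synthetic data stands in for credit_card_data.txt (not reproduced here).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=654, n_features=10, random_state=7)
# Carve off 60% for training, then split the remainder evenly into 20%/20%.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.6,
                                              random_state=7)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=7)

# Score the candidate K values (10..15 in the write-up) on the validation set.
val_acc = {k: KNeighborsClassifier(n_neighbors=k)
                  .fit(X_tr, y_tr).score(X_val, y_val)
           for k in range(10, 16)}

# Occam's razor tie-break: among the highest validation accuracies,
# prefer the smallest K (the simpler model).
best = min(val_acc, key=lambda k: (-val_acc[k], k))

# One final evaluation on the untouched test set.
test_acc = KNeighborsClassifier(n_neighbors=best).fit(X_tr, y_tr).score(X_te, y_te)
```

Because the test set here is only ~130 rows, reruns with different random splits will move `test_acc` noticeably, which matches the 55% vs. 87% swing the write-up reports.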