Question 3.1

Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: (a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and (b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).

A. Using cross-validation (train.kknn) with kmax = 42, the model reported a best k of 12. This is consistent with my answer from the previous homework, where we used LOOCV, even though I gave train.kknn a randomized training set containing 70% of the rows of credit_card_data. Taking the k = 12 that train.kknn selected, I built a kknn model the same way as in Homework 1, except this time with separate training and test sets. Running the model on the training data first, I got exactly the accuracy I saw in Homework 1 for k = 12: 84.68%. Running the kknn model with k = 12 on the test set then gave a lower accuracy of 80%. This was well within what I expected: the model had never seen the test data, so the fit to random noise in the training data made the training accuracy look overly optimistic, as the drop on the test set shows. (A sketch of this workflow appears after part B.)

B. The process here was very similar to the last part of Homework 1, except we essentially ran multiple iterations on subsets of the full credit card data set. I used a 60/20/20 split for the training, validation, and test sets, respectively. Checking accuracies on the training set, I found k = 10 through k = 15 to be the best candidates, so I carried those values forward to the validation step. To validate the model, I used two for loops: one over the six candidate values (k = 10 through k = 15) and one over every row, excluding the selected data point each time. All six values returned exactly the same validation accuracy (84.94%), so, as I learned in office hours, Occam's razor applies and we choose the lowest k, which is k = 10. With the optimal k selected, I ran the model on the test set to get what should be an unbiased estimate of the model's accuracy; this repeats the training/validation procedure, except there is no need for a loop over different k values. The most surprising part is that I ran this entire process three times: initially I got a test accuracy of 55%, which made me think something was really wrong, but when I redid the process I got a test accuracy of 87%, higher than both the training and validation accuracies! I have attached my code for review. Overall, k = 10 and k = 12 gave me the best results. (A simplified sketch of this split-and-validate procedure appears below.)
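Below is a minimal sketch of the part (a) workflow, assuming credit_card_data-headers.txt is in the working directory and that the response column is named R1 (as in the course data set); the seed and exact split are illustrative, not necessarily the values from my run.

```r
# Part (a): LOOCV via train.kknn on a 70% training split, then a test-set check.
library(kknn)

set.seed(42)                                   # illustrative seed
data <- read.table("credit_card_data-headers.txt", header = TRUE)

# Randomized 70/30 train/test split
train_idx <- sample(nrow(data), size = round(0.7 * nrow(data)))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Leave-one-out cross-validation over k = 1..42 on the training rows
cv_model <- train.kknn(R1 ~ ., data = train, kmax = 42, scale = TRUE)
best_k   <- cv_model$best.parameters$k         # k = 12 in the run described above

# Refit with the chosen k and score the held-out test set
knn_test <- kknn(R1 ~ ., train, test, k = best_k, scale = TRUE)
pred     <- round(fitted(knn_test))            # threshold the 0/1 response at 0.5
mean(pred == test$R1)                          # test accuracy
```

Because the response is coded 0/1, fitted() returns a continuous estimate, so the predictions are thresholded at 0.5 by rounding.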
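And a simplified sketch of the part (b) procedure. It scores each candidate k on the whole validation set in one pass rather than with the row-by-row loop described above; the seed, split indices, and the R1 column name are again assumptions.

```r
# Part (b): 60/20/20 train/validation/test split, choose k on validation,
# then report accuracy on the untouched test set.
library(kknn)

set.seed(1)                                    # illustrative seed
data <- read.table("credit_card_data-headers.txt", header = TRUE)

n   <- nrow(data)
idx <- sample(n)                               # shuffle row indices
train <- data[idx[1:round(0.6 * n)], ]
valid <- data[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
test  <- data[idx[(round(0.8 * n) + 1):n], ]

# Validation accuracy for each candidate k
ks <- 10:15
val_acc <- sapply(ks, function(k) {
  fit <- kknn(R1 ~ ., train, valid, k = k, scale = TRUE)
  mean(round(fitted(fit)) == valid$R1)
})

# which.max returns the FIRST maximum, so ties break toward the smallest k
# (Occam's razor, as described above)
best_k <- ks[which.max(val_acc)]

# Final, unbiased estimate on the test set
fit_test <- kknn(R1 ~ ., train, test, k = best_k, scale = TRUE)
mean(round(fitted(fit_test)) == test$R1)       # test accuracy
```

The 55% vs. 87% swing between reruns reported above is consistent with a test estimate that is sensitive to the random split: fixing a seed makes any single run reproducible, and repeating the split several times and averaging the test accuracies gives a more stable estimate.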