Homework 2 2023-09-06

Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: (a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and (b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).

Question 3.1.a
For part a, I used cross-validation and identified the optimal k value as 12, with an associated accuracy of 0.8517. I built a function that runs a candidate value of k through the cv.kknn function, which produces a vector of fitted values expressing the model's confidence for each row of the original dataset. I then used an if-else statement to round values of 0.5 or above to 1 and values below 0.5 to 0, compared the rounded predictions against the response column of the dataset to count the matches, and divided by the number of rows to get the accuracy.

Final Answer: k = 12 / Accuracy: 0.8517

rm(list = ls())

# Importing data & loading the kknn package
data1 <- read.table("credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
library(kknn)

# Question 3.1.a
# Function to evaluate a value of k using 10-fold KNN cross-validation
kOptimal <- function(kTest) {
  set.seed(1)
  testArray <- c()
  model1 <- cv.kknn(V11 ~ ., data1, kcv = 10, k = kTest, scale = TRUE)
  results1 <- model1[[1]][, 2]  # pulling the fitted values from the cv output

  # Round each fitted value: >= 0.5 becomes 1 and < 0.5 becomes 0, so the
  # predictions can be matched against the V11 response column of the data set
  for (i in 1:nrow(data1)) {
    if (results1[i] >= 0.5) {
      pred <- 1
    } else {
      pred <- 0
    }
    testArray <- c(testArray, pred)  # storing outputs into a new vector
  }

  # Number of times the model outputs match those of the data set, divided by
  # the number of rows, giving the percentage of correct matches
  kOptVal <- sum(testArray == data1[, 11]) / nrow(data1)
  return(kOptVal)
}

# For loop testing k values from 2 to 20 in the function built above
kOptTestValStorage <- c()
for (kTest in 2:20) {
  save1 <- kOptimal(kTest)
  kOptTestValStorage <- c(kOptTestValStorage, save1)
}
# The best k is 12, with an output of 0.8517
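As a side note, the rounding loop inside kOptimal can be collapsed into a vectorized form. This is a minimal sketch of an equivalent computation, not part of the graded solution; results1 refers to the fitted values pulled from cv.kknn inside the function above.

# Vectorized equivalent of the rounding-and-matching loop (sketch)
predicted <- as.integer(results1 >= 0.5)    # threshold at 0.5 to get 0/1 labels
accuracy <- mean(predicted == data1[, 11])  # fraction of rows that match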
Question 3.1.b
For this problem, I split the data using a 60/20/20 training/validation/test breakdown, with rows sampled randomly without replacement so that the three sets are disjoint. The for loop below tests values of k from 2 to 20 using the kknn function with the training and validation data. The highest output was 116 correct (out of 131 validation values), corresponding to k = 2, 3, and 4; I chose 3. Using that k, I ran the kknn function again on the validation set and got an accuracy of ~89%. I then completed the problem by running the kknn function on the training and test sets, which gave an accuracy of ~90%.

Final answer: k = 3, Accuracy = ~90%

# Question 3.1.b
set.seed(2)

# 60/20/20 train/validate/test split: shuffling the row indices once so the
# three sets are sampled without replacement and do not overlap
idx <- sample(nrow(data1))
train1 <- data1[idx[1:392], ]
validate1 <- data1[idx[393:523], ]
test1 <- data1[idx[524:654], ]

# Testing values of k from 2 to 20. The fitted values are rounded to the
# nearest 0 or 1 and compared to the response column of the validation set.
# The highest count was 116, corresponding to k = 2, 3, 4; I chose 3.
kTest2Storage <- c()
for (kTest2 in 2:20) {
  kknn1 <- kknn(V11 ~ ., train1, validate1, k = kTest2, distance = 2,
                kernel = "optimal", scale = TRUE)
  kknn1Eval <- round(fitted(kknn1)) == validate1$V11
  kknn1EvalVal <- sum(kknn1Eval)
  kTest2Storage <- c(kTest2Storage, kknn1EvalVal)
}

# Using the optimal k found above: a KNN model trained on the 60% training
# set and evaluated on the 20% validation set
kknn2 <- kknn(V11 ~ ., train1, validate1, k = 3, distance = 2,
              kernel = "optimal", scale = TRUE)
kknn2Eval <- round(fitted(kknn2)) == validate1$V11
kknn2EvalVal <- sum(kknn2Eval)
kknn2EvalValPerformance <- kknn2EvalVal / nrow(validate1)  # Accuracy = 89%

# Using the optimal k: a KNN model trained on the 60% training set and
# evaluated on the 20% test set
kknnTest <- kknn(V11 ~ ., train1, test1, k = 3, distance = 2,
                 kernel = "optimal", scale = TRUE)
kknnTestEval <- round(fitted(kknnTest)) == test1$V11
kknnTestEvalVal <- sum(kknnTestEval)
kknnTestEvalValPerformance <- kknnTestEvalVal / nrow(test1)  # Accuracy = 90%
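The chosen k can also be recovered from the storage vector programmatically rather than by eye. A minimal sketch, assuming kTest2Storage from the loop above; note that which.max returns the first maximum, so the tie among k = 2, 3, and 4 would resolve to 2 here, whereas I picked 3 by hand.

# Recover the best-scoring k from the validation match counts (sketch)
kCandidates <- 2:20
kBest <- kCandidates[which.max(kTest2Storage)]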
Question 4.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.

A) I am currently serving on Active Duty in the Air Force. All members are required to take a physical fitness assessment (PFA) twice a year, scored out of a maximum of 100. A clustering model could help reveal how other factors relate to PFA scores. For example, clustering members by height and weight may help predict a person's PFA score. Other attributes, including a member's job, age, and body-fat percentage, could also be used as predictors in the clustering analysis; a rough sketch of this idea follows below.
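The sketch below only illustrates the mechanics of the idea: the data frame, column names, and values are all made-up stand-ins, not real Air Force data, and the PFA scores themselves would be compared against the resulting clusters separately.

# Hypothetical illustration only: synthetic stand-ins for the proposed predictors
set.seed(4)
pfa <- data.frame(
  height = rnorm(200, 69, 3),    # inches (made-up values)
  weight = rnorm(200, 175, 20),  # pounds (made-up values)
  age = sample(18:55, 200, replace = TRUE),
  bodyfat = rnorm(200, 20, 5)    # percent (made-up values)
)

# Scale the predictors so no single unit dominates the distances, then cluster
pfaClusters <- kmeans(scale(pfa), centers = 3, nstart = 5)
table(pfaClusters$cluster)  # cluster sizes; mean PFA score per cluster would follow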
Question 4.2
The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers, and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a specific method performed and should not be used to build the model. Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.

To begin, I tried to find the optimal value of k. I created a for loop that ran cluster-center counts from 1 to 10 through the kmeans function to get the within-cluster sum of squares (WSS) values, then used the WSS values and the number of centers to build an elbow diagram. The kink in the elbow diagram is at 3, meaning the optimal number of clusters is 3. Using k = 3, I ran the kmeans function again, once on sepal length vs. width and once on petal length vs. width. Using the table function, I could see that petal length vs. width captured more of the correct flower types: cluster 1 classified 50 Iris-setosa, cluster 2 captured 2 Iris-versicolor and 46 Iris-virginica, and cluster 3 captured 48 Iris-versicolor and 4 Iris-virginica.

Final Answer: k = 3; Best combination: Petal Length vs. Petal Width; Performance: Cluster 1: 50; Cluster 2: 46; Cluster 3: 48

rm(list = ls())
data2 <- read.csv("iris.data", header = FALSE)

table(data2$V5)  # previewing the count of each type of flower

##
##     Iris-setosa Iris-versicolor  Iris-virginica
##              50              50              50

library(ggplot2)

# Previewing the data, comparing V1 to V2 and V3 to V4
ggplot(data2, aes(V1, V2, color = V5)) + geom_point()

[Scatter plot: V1 (sepal length) vs. V2 (sepal width), colored by flower type V5]

ggplot(data2, aes(V3, V4, color = V5)) + geom_point()
[Scatter plot: V3 (petal length) vs. V4 (petal width), colored by flower type V5]

set.seed(3)

# For loop testing the number of cluster centers from 1 to 10. The main goal
# is to build an elbow diagram, with the number of centers on the x axis and
# the WSS value on the y axis
centarray <- c()
centVal <- c()
for (cent in 1:10) {
  clusterTest <- kmeans(data2[, 1:4], centers = cent, nstart = 5)
  wssVal <- clusterTest$tot.withinss
  centVal <- c(centVal, cent)
  centarray <- c(centarray, wssVal)
}
plot(centVal, centarray)  # optimal k is 3

[Elbow diagram: WSS (centarray) vs. number of centers (centVal); the kink is at 3]
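One design choice worth flagging: the clustering above runs on the raw measurements. Since the four predictors are on somewhat different scales, a standardized variant makes a useful sanity check. This is a minimal sketch added here, not part of the original solution.

# Optional check: cluster on standardized predictors so that no single
# measurement dominates the distance computation
clusterScaled <- kmeans(scale(data2[, 1:4]), centers = 3, nstart = 5)
table(clusterScaled$cluster, data2$V5)  # compare against the unscaled result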
# Running the kmeans function with the optimal number of cluster centers
cluster_optimal <- kmeans(data2[, 1:4], centers = 3, nstart = 5)

# Comparing model output to the response column in the iris data
table(cluster_optimal$cluster, data2$V5)

##
##     Iris-setosa Iris-versicolor Iris-virginica
##   1           0               2             36
##   2           0              48             14
##   3          50               0              0

# Model on sepal length and sepal width, compared to the iris responses
cluster_optimalSepal <- kmeans(data2[, 1:2], centers = 3, nstart = 5)
table(cluster_optimalSepal$cluster, data2$V5)

##
##     Iris-setosa Iris-versicolor Iris-virginica
##   1           0              12             35
##   2          50               0              0
##   3           0              38             15

# Model on petal length and petal width, compared to the iris responses
cluster_optimalPetal <- kmeans(data2[, 3:4], centers = 3, nstart = 5)
table(cluster_optimalPetal$cluster, data2$V5)

##
##     Iris-setosa Iris-versicolor Iris-virginica
##   1          50               0              0
##   2           0               2             46
##   3           0              48              4
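To condense the final table into a single number, each cluster's dominant flower type can be counted as correct: (50 + 46 + 48) / 150 = 0.96, i.e. the petal-based clustering matches 96% of the flowers. A minimal sketch of that calculation, assuming cluster_optimalPetal from above:

# Sum each cluster's dominant class count and divide by the number of points
conf <- table(cluster_optimalPetal$cluster, data2$V5)
accuracyPetal <- sum(apply(conf, 1, max)) / nrow(data2)  # (50+46+48)/150 = 0.96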