CSC522_HW2 (PDF, 3 pages)
School: North Carolina State University
Course: CSC 522, Computer Science
Date: Feb 20, 2024
Homework 2
Automated Learning and Data Analysis
Dr. Pankaj Telang, Spring 2024

Instructions

Due Date: February 21, 2024 at 11:45 PM
Total Points: 30

Submission checklist:
- Clearly list each team member's name and Unity ID at the top of your submission.
- Your submission should be a single PDF file containing your answers.
- Name your file: G(homework group number) HW(homework number), e.g. G1 HW2.
- If a question asks you to explain or justify your answer, give a brief explanation using your own ideas, not a reference to the textbook or an online source.
- Submit your PDF through Gradescope under the HW2 assignment (see instructions on Moodle). Note: make sure to add your group members at the end of the upload process.
- Submit the programming portion of the homework individually through JupyterHub.
ADLA – Spring 2024, Homework 2 (Last Updated: February 15, 2024)

1 Evaluation Measures & Pruning (15 points)

This analysis pertains to the Titanic Survival Prediction dataset, which includes attributes about the survival status of individual passengers on the Titanic (Yes/No). To predict survival, consider the decision tree shown in Figure 1, which involves Ticket Price, Gender, Pclass (passenger class), and Age. Complete the following tasks:

Figure 1: Decision Tree

1a) (4 points) Use the decision tree above to classify the provided dataset hw2q1.csv. Construct a confusion matrix and report the test Accuracy, Error Rate, Precision, Recall, and F1 score. Use "Yes" as the positive class in the confusion matrix.

1b) (4 points) Calculate the optimistic training classification error before and after splitting on Age. Consider only the subtree starting with the Age node. If we want to minimize the optimistic error rate, should the node's children be pruned?

1c) (4 points) Calculate the pessimistic training errors before and after splitting on Age. Consider only the subtree starting with the Age node. When calculating pessimistic error, use a leaf node error penalty of 0.8. If we want to minimize the pessimistic error rate, should the node's children be pruned?

1d) (3 points) Assuming that the "Age" node is pruned, recalculate the test Error Rate using hw2q1.csv. Based on your evaluation using the dataset in hw2q1.csv, was the original tree (with the Age node) over-fitting? Why or why not?
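For 1a–1c, the metric definitions can be sketched in code. This is a hedged illustration only: the confusion-matrix counts fed in at the bottom are made-up placeholders (not values from hw2q1.csv or Figure 1), and the `pessimistic_error` helper assumes one common convention in which a fixed penalty per leaf is added to the training errors before dividing by the number of training records.

```python
# Hedged sketch for 1a-1c: metric formulas only, not the answers for
# hw2q1.csv. All counts used below are made-up placeholders.

def metrics(tp, fp, fn, tn):
    """Binary-classification metrics with "Yes" as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    error_rate = 1.0 - accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, error_rate, precision, recall, f1

def pessimistic_error(train_errors, n_leaves, n_records, penalty=0.8):
    # Assumed convention: pessimistic estimate adds a fixed penalty per
    # leaf node to the training errors, then divides by the record count.
    return (train_errors + penalty * n_leaves) / n_records

# Placeholder usage (counts are NOT from hw2q1.csv):
acc, err, prec, rec, f1 = metrics(tp=40, fp=10, fn=5, tn=45)
```

Comparing `pessimistic_error(...)` for the unsplit node (one leaf) against the split subtree (one penalty per child leaf, summed errors) gives the pruning decision asked for in 1c.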
2 1-NN & Cross-Validation (15 points)

Consider the following dataset (9 instances) with two continuous attributes (x1 and x2) that have been scaled to the same range, and a class attribute y, shown in Table 1. For this question, we will consider a 1-Nearest-Neighbor (1-NN) classifier that uses Euclidean distance.

Table 1: 1-NN dataset

ID   x1     x2    Class
1    5.56   1.25  +
2    3.61   3.33  -
3    8.06   5.00  -
4    3.89   4.17  +
5    10.00  7.50  -
6    2.78   7.08  -
7    1.94   0.00  +
8    2.22   6.25  +
9    6.11   4.17  -

2a) Calculate the distance matrix for the dataset using Euclidean distance. Tip: you can write a simple program to do this for you; there is an example of how to do this in the programming portion of this homework.

2b) By hand, evaluate the 1-NN classifier, calculating the confusion matrix and test accuracy (show your work by labeling each data object with the predicted class). Tip: you can scan a row or column of the distance matrix to easily find the closest neighbor. Use the following evaluation methods:
i) A holdout test dataset consisting of the last 4 instances.
ii) 3-fold cross-validation, using the folds with IDs [1,2,3], [4,5,6], and [7,8,9], respectively.
iii) Leave-one-out cross-validation (LOOCV).

2c) For a data analysis homework, you are asked to perform an experiment with a binary classification algorithm that uses a "simple majority vote classifier," which always predicts the majority class in the training dataset (if there is no majority, one of the classes is chosen at random). You are given a dataset with 50 instances and a class attribute that can be either Positive or Negative. The dataset includes 25 positive and 25 negative instances. You use three different validation methods: holdout (with a random 30/20 training/validation split), 5-fold cross-validation (with random folds), and LOOCV. You expect the simple majority classifier to achieve approximately 50% validation accuracy, but for one of these evaluation methods you get 0% validation accuracy. Which evaluation method gives this result and why?
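The tips in 2a and the setup in 2c can both be sanity-checked with a short script. This is a hedged sketch, not the official solution: the data tuples are copied from Table 1, the nearest-neighbor loop implements only the LOOCV variant of 2b, and the 2c simulation uses hypothetical 'P'/'N' labels standing in for Positive/Negative.

```python
# Hedged sketch (not the official solution). Part 1 builds the 2a
# distance matrix and 2b-iii LOOCV 1-NN predictions from Table 1;
# part 2 simulates the 2c majority-vote classifier under LOOCV.
import math

# (ID, x1, x2, class) copied from Table 1
data = [
    (1, 5.56, 1.25, '+'), (2, 3.61, 3.33, '-'), (3, 8.06, 5.00, '-'),
    (4, 3.89, 4.17, '+'), (5, 10.00, 7.50, '-'), (6, 2.78, 7.08, '-'),
    (7, 1.94, 0.00, '+'), (8, 2.22, 6.25, '+'), (9, 6.11, 4.17, '-'),
]

def dist(a, b):
    # Euclidean distance on the (x1, x2) attributes
    return math.hypot(a[1] - b[1], a[2] - b[2])

# 2a: full 9x9 distance matrix
D = [[dist(a, b) for b in data] for a in data]

# 2b-iii: LOOCV 1-NN -- classify each point by its nearest neighbor
# among the remaining eight
preds = []
for i in range(len(data)):
    j = min((k for k in range(len(data)) if k != i), key=lambda k: D[i][k])
    preds.append(data[j][3])
loocv_accuracy = sum(p == a[3] for p, a in zip(preds, data)) / len(data)

# 2c: LOOCV with a majority-vote classifier on a balanced 25/25 dataset
# (hypothetical labels 'P'/'N'). Holding out one instance leaves the
# *other* class in the majority, so every held-out point is misclassified.
labels = ['P'] * 25 + ['N'] * 25
correct = 0
for i, y in enumerate(labels):
    rest = labels[:i] + labels[i + 1:]
    pred = 'P' if rest.count('P') > rest.count('N') else 'N'
    correct += (pred == y)
# correct stays at 0, i.e. 0% LOOCV validation accuracy
```

For 2b-i and 2b-ii, the same `D` matrix can be scanned while restricting the candidate neighbors `k` to the relevant training fold instead of "all but `i`".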