mahesh20_hw4

Theoretical Questions (8+20+14+20=62 pts) Please submit your answers on Gradescope. Ql (8 pts): True or False questions Answer the following as True or False with a justification or example. Points are uniformly distributed wihtin the questions. (a) (4 pts) Underfitting happens to a model with high variance and low bias. (b) (4 pts) When doing cross-validation, you do not need to split your dataset and fix your training, validation and test sets priorly. 3
Q2 (20 pts): Cross-validation

Figure 1: Example of training curves showing training and validation loss for two hypothetical models M1 and M2. (a) Training and validation loss of model M1 over 500 iterations; (b) training and validation loss of model M2 over 100 iterations. In both panels the y-axis is the loss and the x-axis is the iteration.

1. (6 pts) Review the training and validation loss curve in Figure 1a (model M1). The y-axis shows a loss (lower is better) for the training and validation data. The x-axis shows the iterations used to update the model parameters (e.g., update steps of the Perceptron algorithm or steps of gradient descent for logistic regression). Hint: note that these are training curves, not learning curves.

(a) (3 pts) What are the best parameters for model M1: the parameters at 500 iterations or at 100 iterations? Please justify your answer.

(b) (3 pts) What is the disadvantage of choosing the model with parameters obtained after 30 iterations?
2. (4 pts) Review the training and validation loss curve in Figure 1b (model M2). Which of the following explanations and decisions are appropriate for the observed behavior in Figure 1b? Choose all answers that are correct. You don't need to justify your answer.

(A) The validation data is scarce and not very representative.
(B) The model has not converged yet at 100 iterations and many more iterations will likely improve the training loss.
(C) Getting more validation data will not reduce the training loss.
(D) The training loss has converged, so obtaining more and better training data will not reduce the training loss further.

3. (4 pts) Which statement about k-fold cross-validation is incorrect?

(A) When using k-fold cross-validation, using a smaller k means it will take more time to run the evaluation.
(B) With larger k, the training set gets larger, while the test set gets smaller.
(C) When splitting data into folds, it is desirable to minimize the variance across folds.
(D) When k = N, where N is the size of the data set, k-fold cross-validation is not the same as leave-one-out cross-validation.
4. (6 pts) Given the supervised learning dataset with 3 training examples (X1, Y1) = (0, 1), (X2, Y2) = (1, -1), (X3, Y3) = (2, 1), use the Perceptron algorithm to learn a classifier. If needed, use sign(0) = 1. The Perceptron classifier will have two parameters w = (w0, w1) (the first parameter, w0, acts as the bias). What is the average 0-1 loss of the Perceptron algorithm evaluated by 3-fold cross-validation? (Please also describe, for each fold/iteration of the cross-validation, what training data is used and what loss is obtained.) Note: without a bias term for the Perceptron, the data may not be linearly separable. A simple way to add a bias term to a linear classifier is to add an extra feature with value 1 for all data points. This way, the new data points become X' = {(0, 1), (1, 1), (2, 1)}.
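As an illustration of the bias-feature trick described in the note above (not the graded hand computation), the following minimal Python sketch appends a constant feature of 1 to each scalar input and runs the standard Perceptron update. The perceptron_train helper, the epoch cap, and visiting the points in index order are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=10):
    """Standard Perceptron: w <- w + y_i * x_i on each mistake.

    X is an (n, d) array that already includes the constant bias feature;
    y is an (n,) array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w >= 0 else -1   # sign(0) treated as +1, as in the question
            if pred != yi:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:                      # converged: a full pass with no mistakes
            break
    return w

# Dataset from the question, with the extra bias feature of 1 appended.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1, -1, 1])

# 3-fold CV on 3 points is leave-one-out: train on two points, test on the held-out one.
losses = []
for i in range(len(X)):
    train_idx = [j for j in range(len(X)) if j != i]
    w = perceptron_train(X[train_idx], y[train_idx])
    pred = 1 if X[i] @ w >= 0 else -1
    losses.append(int(pred != y[i]))           # 0-1 loss on the held-out point
print("per-fold 0-1 losses:", losses, "average:", np.mean(losses))
```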
Q3 (14 pts): Ensemble methods (AdaBoost)

Consider the dataset in the following table:

Index (i) | Feature 1 (X_i1) | Feature 2 (X_i2) | Class label (Y_i)
    1     |       -1         |        1         |        +1
    2     |        1         |        1         |        -1
    3     |        1         |       -1         |        +1
    4     |       -1         |       -1         |        +1

For this question, you have to show the first few steps of the AdaBoost algorithm. The weak learner can be one of the four possible decision stumps illustrated in Fig. 2; i.e., at each step you must choose the classifier in the figure that minimizes the weighted empirical error rate, defined as

\epsilon_t = \sum_{i=1}^{n} D_t(i)\, \mathbb{1}[h_t(x_i) \neq y_i],

where t denotes the round and D_t(i) denotes the weight of data point i at round t; break any ties by choosing the classifier with the lower number. You may refer to this link https://drive.google.com/file/d/10WABd0_gvk0cDgt3WsiNx_b-ho0U0vXM/view?usp=sharing for details of AdaBoost.

Figure 2: Choices of classifiers for the dataset. The solid red areas are assigned label -1 and the hatched blue areas are assigned label +1. [Four panels, Classifier 1 through Classifier 4, each plotting Feature 2 against Feature 1 over the range -1 to 1.]

1. (6 pts) Complete the following table with the correct values computed during the execution of the AdaBoost algorithm, for t = 1, ..., 5. Please refer to the lecture slides for details of AdaBoost.

t | ε_t | α_t | D_t(i) for i = 1, ..., 4 | h_t (chosen classifier) | 1[h_t(x_i) ≠ y_i] for i = 1, ..., 4
(rows t = 1, 2, 3, 4, 5 to be filled in)

2. (4 pts) Complete the following table with the predictions computed by the AdaBoost algorithm for the training data, at the end of training (i.e., after T = 5 rounds).
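For reference, here is a minimal Python sketch of the reweighting loop that the table above tracks. It is not the graded answer: the generic stump list, the helper names, and the use of the standard update D_{t+1}(i) ∝ D_t(i) exp(-α_t y_i h_t(x_i)) are assumptions about implementation detail consistent with the usual AdaBoost formulation.

```python
import numpy as np

def adaboost(X, y, stumps, rounds=5):
    """Minimal AdaBoost loop over a fixed set of candidate stumps.

    stumps is a list of functions h(x) -> {-1, +1}; ties on the weighted
    error are broken by taking the lowest-index stump, as in the question.
    """
    n = len(y)
    D = np.full(n, 1.0 / n)                          # D_1(i) = 1/n
    ensemble = []                                     # list of (alpha_t, h_t)
    for t in range(rounds):
        preds = np.array([[h(x) for x in X] for h in stumps])
        errors = (preds != y).astype(float) @ D       # weighted error of each stump
        best = int(np.argmin(errors))                 # argmin keeps the lowest index on ties
        eps = errors[best]
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, stumps[best]))
        # Reweight: increase weight on misclassified points, then normalize.
        D = D * np.exp(-alpha * y * preds[best])
        D = D / D.sum()
    return ensemble

def predict(ensemble, x):
    """Final strong classifier: sign of the alpha-weighted vote."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```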
[Handwritten answer for Q3.1: a table of ε_t, α_t, D_t(i), the chosen classifier h_t, and the misclassification indicators for t = 1, ..., 5; the scanned values are largely illegible in this copy.]
3. (4 pts) Assuming we stopped the algorithm after finishing step t = 4, what is the predicted label for a new sample x = (0, -1)? If the true label of this new test example is +1, what is the 0-1 loss for this example?
Q4 (20 pts): SVM

Consider an SVM that obtains a decision boundary through the maximization optimization in Equation (1), which is equivalent to the minimization optimization in Equation (2), where x_i is the i-th training example, y_i ∈ {-1, 1} the class label for the i-th training example, β and β_0 denote the parameters, M the margin, and N the number of instances.

\max_{\beta,\, \beta_0,\, \|\beta\| = 1} M \quad \text{subject to } y_i(x_i^{\top}\beta + \beta_0) \ge M,\; i = 1, \ldots, N.    (1)

\min_{\beta,\, \beta_0} \tfrac{1}{2}\|\beta\|_2^2 \quad \text{subject to } y_i(x_i^{\top}\beta + \beta_0) \ge 1,\; i = 1, \ldots, N.    (2)

Mark true or false for the following assertions (Q4.1, Q4.2, Q4.3) and justify your answer. Questions with no justification will receive 0 points.

1. (4 pts) This SVM objective assumes the data is linearly separable.

2. (4 pts) After obtaining the decision boundary in Equation 1, if we modify β while maintaining the restriction ||β||_2 = 1, this would modify the minimum distance from the decision boundary to the origin.

3. (4 pts) Assume the maximum margin obtained by the model was M = 4 = d_1 + d_2, where d_1 = 2 and d_2 = 2 denote the distances from the decision boundary to the hyperplanes H_1 and H_2 defined by the support vectors. If we add 1 to β_0, the new maximum margin would be M + 1.
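The equivalence between Equations (1) and (2) is the standard rescaling argument; the short LaTeX sketch below spells it out as a reading aid. The rescaled symbols \tilde{\beta} and \tilde{\beta}_0 are introduced here only for this sketch and do not appear in the original hand-out.

```latex
% Standard rescaling argument linking (1) and (2).
% \tilde{\beta}, \tilde{\beta}_0 are auxiliary symbols introduced for this sketch.
\[
\|\beta\| = 1,\; y_i(x_i^{\top}\beta + \beta_0) \ge M
\;\xrightarrow{\;\tilde{\beta} = \beta/M,\;\; \tilde{\beta}_0 = \beta_0/M\;}\;
y_i(x_i^{\top}\tilde{\beta} + \tilde{\beta}_0) \ge 1,
\qquad \|\tilde{\beta}\|_2 = \frac{1}{M}.
\]
\[
\text{Hence maximizing } M \text{ is equivalent to minimizing } \tfrac{1}{2}\|\tilde{\beta}\|_2^2,
\text{ which is (2); under this rescaling the attained value satisfies } M = \tfrac{1}{\|\tilde{\beta}\|_2}.
\]
```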
4. (4 pts) SVM decision boundary. Consider again an SVM whose decision boundary is obtained by Equations 1 and 2, as in the previous questions. Now, consider the training dataset with two training classes (signaled as squares and triangles) given by the scatter plot in the following figure. Draw the decision boundary for this dataset obtained by our SVM on it. The drawing needs to correctly separate the examples, but there are multiple possible solutions. Each point of the line you draw must match the points of the true decision boundary within an error margin of 10 unit intervals along the x-axis and y-axis.

[Scatter plot of the two-class training data (squares and triangles); both the x-axis and y-axis range from 0 to 100 in steps of 5.]
5. (4 pts) SVM decision boundary with removed example. Now pick a single data point x_i that, if removed, would yield the maximum possible increase in the margin of a new SVM decision boundary over the dataset without x_i. Indicate which point is x_i in the figure below (a reproduction of the last question's figure). Draw the new decision boundary after x_i's removal on it. The drawing needs to correctly separate the remaining examples, but there are multiple possible solutions. Each point of the line you draw must match the points of the true new decision boundary within an error margin of 10 unit intervals along the x-axis and y-axis.

[Reproduction of the scatter plot from the previous question; both the x-axis and y-axis range from 0 to 100 in steps of 5.]
Programming Part (20+18=38 pts)

There are some basic requirements that you should follow when you submit your code:

• Make sure you don't import any package that you are not allowed to use.
• Make sure your returned data types follow the given data types.
• Make sure your code does not contain matplotlib popups, such as plt.show().

Q5 (20 pts): Cross-Validation

In this assignment, you will use a k-fold cross-validation technique to measure the performance and tune hyperparameters of some algorithms that you have learned or implemented, including KNN, Decision Tree, Logistic Regression, and SVM. You will still be using the dating dataset that you have seen in HW2. Again, it is a part of the whole dataset, while the test data is not published to you and will be used for grading. Please refer to the previous homework if you have questions about the datasets.

A k-fold cross-validation is implemented as follows (the figure in the hand-out illustrates k = 3: Fold 1 uses [Valid | Train | Train], Fold 2 uses [Train | Valid | Train], Fold 3 uses [Train | Train | Valid]):

• Partition the training data into k parts.
• For i = 1, ..., k: use the i-th part as the validation set and the rest for training.
• Report the validation error averaged over the k rounds.

1. Implementation

Implement the required functions in the file cross_validation.py (a minimal sketch of these three functions is given after this list):

(a) (3 pts) Implement the create_folds() method, which takes in data samples X and their labels y. You should partition the data into k folds and return a list of tuples of (X, y), with each element in the tuple being a fold. Please refer to the code comments for a sample output.

(b) (3 pts) Implement the train_valid_split() method, which takes in a list of folds and an integer index at which the fold will be used as the validation set. Combine the rest of the folds to get the training set. Return two tuples: (X_train, y_train) and (X_val, y_val). Please refer to the code comments for a sample output.

(c) (3 pts) Implement the cross_val_score() method, which takes in data samples X, their labels y, and a classifier clf as input, and should return two lists, train_accs and val_accs, where each element represents the accuracy on the training or validation set for a specific fold split. Please refer to the code comments for a sample output.
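The exact signatures and return formats are specified in the starter code's comments, which are not reproduced here. The following is only a minimal sketch of the three functions, assuming folds are plain (X, y) tuples, clf follows the usual sklearn fit/predict interface, and k is passed explicitly (the k=5 default is an arbitrary choice for illustration).

```python
import numpy as np

def create_folds(X, y, k):
    """Sketch: partition (X, y) into k folds, returned as a list of (X_fold, y_fold) tuples."""
    X, y = np.asarray(X), np.asarray(y)
    idx_chunks = np.array_split(np.arange(len(X)), k)   # near-equal fold sizes
    return [(X[idx], y[idx]) for idx in idx_chunks]

def train_valid_split(folds, i):
    """Sketch: fold i is the validation set; the remaining folds are concatenated for training."""
    X_val, y_val = folds[i]
    rest = [f for j, f in enumerate(folds) if j != i]
    X_train = np.concatenate([f[0] for f in rest])
    y_train = np.concatenate([f[1] for f in rest])
    return (X_train, y_train), (X_val, y_val)

def cross_val_score(X, y, clf, k=5):
    """Sketch: fit clf on each training split and record train/validation accuracy."""
    folds = create_folds(X, y, k)
    train_accs, val_accs = [], []
    for i in range(k):
        (X_tr, y_tr), (X_va, y_va) = train_valid_split(folds, i)
        clf.fit(X_tr, y_tr)
        train_accs.append(np.mean(clf.predict(X_tr) == y_tr))
        val_accs.append(np.mean(clf.predict(X_va) == y_va))
    return train_accs, val_accs
```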
Implement the required functions in the file evaluation.py:

(d) (3 pts) Implement the find_best_param() method, which uses k-fold cross-validation to find the best parameter of a model. Models can be Logistic Regression, Decision Tree, and SVM. It takes the pre-processed dataset (X, y), the number of folds k, the model name ('logistic', 'decision_tree' or 'svm'), and the corresponding parameters as input, and should return the training and validation accuracy dictionaries and also the best parameter (a hedged sketch of this function is given at the end of this question).

• For the Logistic Regression model, you will be using the previous code and tune the hyperparameter lr (learning rate).
• For Decision Tree, you will be using the sklearn library, sklearn.tree.DecisionTreeClassifier, and tune the hyperparameter max_depth.
• For SVM, you will be using the sklearn library sklearn.svm.LinearSVC, which is a linear SVM, and tune the hyperparameter C, which is a regularization parameter and basically determines the influence of misclassification of the data samples. The larger the value of C is, the less tolerant your model will be of misclassified samples.

Please refer to the code comments for a sample output.

2. Evaluation

This part is simple. We have implemented many functions for you, including the plotting of learning curves, the cross-validation processes, and the main function. Please refer to the code for more details. If your implementations are correct, you will be able to get all the results. What you need to do is just attach your results to the final report.

(a) (2 pts) Print out your best parameters for the three models that you tuned by using k-fold cross-validation. The format of your output should be:

The best learning rate for Logistic Regression is ...
The best max_depth for Decision Tree is ...
The best C for SVM is ...

(b) (6 pts) Attach the learning curves of Logistic Regression, Decision Tree, and SVM in your report, which can be automatically generated under your working directory (3 figures). Briefly discuss your observations.
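As noted in item (d), here is a minimal sketch of find_best_param. The dictionary-keyed-by-parameter return format, the import of the cross_val_score sketch from cross_validation.py, and the NotImplementedError placeholder standing in for the course's own logistic-regression class are all assumptions, since the real starter code and its comments are not reproduced here.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

from cross_validation import cross_val_score   # the sketch shown earlier

def find_best_param(X, y, k, model_name, params):
    """Sketch: try each candidate hyperparameter value with k-fold CV and keep the best one.

    Returns (train_acc_dict, val_acc_dict, best_param), where each dict maps a
    candidate parameter value to its list of per-fold accuracies.
    """
    train_acc_dict, val_acc_dict = {}, {}
    for p in params:
        if model_name == 'decision_tree':
            clf = DecisionTreeClassifier(max_depth=p)
        elif model_name == 'svm':
            clf = LinearSVC(C=p)
        else:
            # 'logistic' would instantiate the course's own logistic-regression
            # class with learning rate p; that class is not reproduced here.
            raise NotImplementedError("plug in the HW logistic regression with lr=p")
        train_accs, val_accs = cross_val_score(X, y, clf, k)
        train_acc_dict[p], val_acc_dict[p] = train_accs, val_accs
    # Best parameter = highest mean validation accuracy.
    best_param = max(val_acc_dict, key=lambda p: sum(val_acc_dict[p]) / len(val_acc_dict[p]))
    return train_acc_dict, val_acc_dict, best_param
```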
[Figure: Learning curve of logistic; x-axis is lr from 10^-4 to 10^-1 on a log scale, y-axis is accuracy between roughly 0.50 and 0.70, showing Training and validation curves.]
[Figure: Learning curve of decision_tree; x-axis is max_depths from 2 to 11, y-axis is accuracy between roughly 0.75 and 0.95, showing Training and validation curves.]
[Figure: Learning curve of svm; x-axis is C, y-axis is accuracy between roughly 0.765 and 0.800, showing Training and validation curves.]
Q6 (18 pts): Bagging and Boosting

Bagging is generally used to help stabilize classifiers with unstable learning algorithms (optimal score searching algorithms). A classifier has a stable learning algorithm if, by changing the training data, the predicted class labels in the test data don't change. For instance, the predictions of a decision tree might change significantly with a small change in the training data. This definition depends on the amount of data, of course: classifiers that are unstable with 10^3 training examples may be stable with 10^9 examples. Bagging works by aggregating the answers of unstable classifiers trained over multiple training datasets. These multiple datasets are often not independent, generally sampled with replacement from the same training data (a short sketch of this procedure is given after question 1 below).

Boosting works by converting weak classifiers (very simple models) into strong ones (models that can describe complex relationships between the inputs and the class labels). A weak learner is a classifier whose output for a test example with attributes X_i is only slightly correlated with its true class t_i; that is, the weak learner classifies the data better than random, but not much better than random. In boosting, weak learners are trained sequentially in a way that the current learner gives more emphasis to the examples that past learners made mistakes on.

1. (8 pts) Suppose we decide to use a large deep feedforward network as a classifier with a small training dataset. Assume the network can perfectly fit the training data, but we want to make sure it is accurate on our test data (without having access to the test data). Would you use boosting or bagging to help improve the classification accuracy? Describe what would be the problem of using the other approach.
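For concreteness, here is a minimal Python sketch of the bootstrap-aggregation idea described in the preamble above (it is illustration only, not an answer to question 1). The choice of DecisionTreeClassifier as the unstable base learner, the majority-vote aggregation, and the assumption that labels are in {-1, +1} are all assumptions of this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Sketch of bagging: train each tree on a bootstrap sample, then majority-vote.

    Assumes class labels are in {-1, +1}, so the sign of the summed votes is
    the majority prediction.
    """
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)              # bootstrap: sample with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(np.asarray(X_test)))
    vote_sum = np.sum(votes, axis=0)
    return np.where(vote_sum >= 0, 1, -1)             # majority vote (ties go to +1)
```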
2. (10 pts) Download the code mistery_classifier at the link https://www.dropbox.com/s/swdrapsebhad6tb/mistery_classifier.zip?dl=0. Read the code and answer the following questions.

Note: you don't need to run the code, but if you want to run it on the scholar.rcac.purdue.edu cluster, you will need the command module load anaconda/5.1.0-py36.

(a) (5 pts) Which classifier is this? Why are the trees used in this classifier so shallow? Describe how the classifier works using pseudo-code. Hint: it uses decision trees, but this is not a decision tree classifier.

(b) (5 pts) What happens if we have mislabeled data? Why could mislabeled data be a problem? Hint: we are looking for an answer that uses the training weights of the examples at each iteration as a justification.