DATA 100
Spring 2021 Final Exam

INSTRUCTIONS

This is your exam. Complete it either at exam.cs61a.org or, if that doesn't work, by emailing course staff with your solutions before the exam deadline.

This exam is intended for the student with email address <EMAILADDRESS>. If this is not your email address, notify course staff immediately, as each exam is different.

Do not distribute this exam PDF even after the exam ends, as some students may be taking the exam in a different time zone.

For questions with circular bubbles, you should select exactly one choice.
○ You must choose either this option
○ Or this one, but not both!

For questions with square checkboxes, you may select multiple choices.
□ You could select this choice.
□ You could select this one too!

You may start your exam now. Your exam is due at <DEADLINE> Pacific Time. Go to the next page to begin.
1. (7.0 points)

(a) (2.0 pt) Recall the tips dataset that we worked with on assignments in the past, which includes data about the tip on a restaurant bill as well as the day of week and the sex of the individual. The plot below attempts to examine patterns between the tip as a percentage of the bill and the sex of the individual by the day of week (DOW).

Select the best reason below for why the data visualization is misleading or poorly constructed.
○ the y-axis should be log transformed
○ the clustering of bars doesn't allow a key comparison to be made
○ the plot suffers from overplotting
○ the bars for each day of week should be stacked on top of each other (e.g. the bar for "Thur" would have a total height of approximately 0.3)
(b) (2.0 pt) Consider the surface whose contour plot is provided below. Which of the following gradient fields most likely corresponds to the surface shown above?

A gradient field is a plot that shows the direction and relative magnitude of the gradient of a surface on a 2-dimensional plot, where each point has a vector pointing from it in the direction of the gradient at that point, and the length of that vector is proportional to the magnitude of the gradient.
○ [gradient field plot]
○ [gradient field plot]
○ [gradient field plot]
○ [gradient field plot]

(c) (3.0 points) We have read in some data as the dataframe df. Consider a subset of df below, which contains some information on the background of various individuals in the US.
i. (2.0 pt) Suppose we want to observe the relationship between, and the distributions of, the AFQT (an intelligence metric, with units percentile) and log_earn_1999 (log of the individual's earnings in 1999) variables based on whether the individual's parents both went to college. Select the line of code below that generates the best plot to observe this relationship.

A: sns.kdeplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
B: sns.scatterplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
C: sns.lineplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
D: sns.kdeplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
E: sns.scatterplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
F: sns.lineplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)

Hint: Consider overplotting.

○ A
○ B
○ C
○ D
○ E
○ F
ii. (1.0 pt) Suppose we want to understand the relationship between weeks_worked_1999 and the sex of the individual. We run the following code to generate a plot:

df2 = df.groupby("zip_code").mean().reset_index()
sns.lineplot("zip_code", "log_earn_1999", data=df2)

Select the reason below for why this plot would represent a bad data visualization.
○ treats a categorical variable as a continuous variable
○ treats a continuous variable as a categorical variable
○ represents a density with a feature other than area
○ does not show the relationship between the variables of interest
2. (9.0 points)

(a) (4.0 points) Recall that a random forest is created from a number of decision trees, with each decision tree created from a bootstrapped version of the original training set. One hyperparameter of a random forest is the number of decision trees we train to create the random forest.

Define T to be the number of decision trees used to create the random forest. Let's say we have two candidate values for T: var1 and var2. We want to perform var3-fold cross-validation to determine the optimal value of T. Assume var1, var2, and var3 are integers.

i. (2.0 pt) In this cross-validation process, how many random forests will we train? Your answer should be in terms of var1, var2, and/or var3 and should be an integer.

ii. (2.0 pt) In this cross-validation process, how many decision trees will we train? Your answer should be in terms of var1, var2, and/or var3 and should be an integer.
(b) (2.0 pt) Let's say we pick three hyperparameters to tune with cross-validation. We have 5 candidate values for hyperparameter 1, 6 candidate values for hyperparameter 2, and 7 candidate values for hyperparameter 3. We perform 4-fold cross-validation to find the optimal combination of hyperparameters, across all possible combinations. In this cross-validation process, how many random forests will we train?

Your answer can be left as a product of multiple integers, e.g. "1 * 2 * 3", or simplified to a single integer, e.g. "6". (These are not the correct answers to the problem.)

(c) (3.0 pt) Here is some code that attempts to implement the cross-validation procedure described above. However, it is buggy. In one sentence, describe the bug below.

You may assume the following: X_train is a pd.DataFrame that contains our design matrix, and Y_train is a pd.Series that contains our response variable, both for the full training set. Assume ensemble.RandomForestClassifier(**args) creates a random forest with the appropriate hyperparameter values; the bug is not on this line. The candidate values for each hyperparameter have been loaded into the lists cands1, cands2, and cands3, respectively.

from sklearn.model_selection import KFold
from sklearn import ensemble
import numpy as np
import pandas as pd

kf = KFold(n_splits=4)
cv_scores = []
for cand1 in cands1:
    for cand2 in cands2:
        for cand3 in cands3:
            validation_accuracies = []
            for train_idx, valid_idx in kf.split(X_train):
                split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
                split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]
                model = ensemble.RandomForestClassifier(**args)
                model.fit(X_train, Y_train)
                accuracy = np.mean(model.predict(split_X_valid) == split_Y_valid)
                validation_accuracies.append(accuracy)
            cv_scores.append(np.mean(validation_accuracies))
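As background on the KFold API used above, here is a small illustrative sketch of what kf.split yields on a tiny made-up dataframe (X_demo and its contents are hypothetical, not part of the exam):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

X_demo = pd.DataFrame({"feature": np.arange(8)})   # hypothetical 8-row training set

kf = KFold(n_splits=4)
for train_idx, valid_idx in kf.split(X_demo):
    # each of the 4 folds holds out a different quarter of the rows as the validation set
    print(train_idx, valid_idx)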
3. (14.0 points)

We are trying to train a decision tree for a classification task where 0 is the negative class and 1 is the positive class. We are given 8 data points, each as a pair of (x_1, x_2) features.

(a) (3.0 pt)

x_1  x_2  y
 3    4   1
 2    1   0
 1    3   1
 5    9   0
 9    6   1
 7    2   1
 4    7   0
 8    8   1

What is the entropy at the root of the tree? Round to 4 decimal places.

(b) (2.0 pt) What is the Gini impurity at the root of the tree? Note that the formula for Gini impurity is 1 - Σ_{i=1}^{c} p_i², where p_i is the fraction of items labelled with class i and c is the total number of classes.
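For reference, a small sketch of how the entropy and Gini impurity formulas above can be evaluated for an arbitrary array of class labels (the labels used here are made up for illustration; they are not the table's y column):

import numpy as np

def entropy(labels):
    # -sum_i p_i * log2(p_i) over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    # 1 - sum_i p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

labels = np.array([1, 1, 0, 1, 0])        # illustrative labels only
print(round(entropy(labels), 4))          # 0.971
print(round(gini_impurity(labels), 4))    # 0.48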
(c) (4.0 pt) Suppose we decide to split the root node with the rule x_i ≤ β, where i = 1 or 2. Which of the following minimizes the weighted entropy of the two resulting child nodes?
○ x_1 ≤ 6
○ x_1 ≤ 3.5
○ x_2 ≤ 5
○ x_2 ≤ 3.5
○ x_2 ≤ 6.5

(d) (2.0 points) We have decided to create a food recommendation system using a decision tree! We would like to run our decision tree to see what food it recommends in certain scenarios. If you have trouble reading the above tree, please go to this link: https://i.imgur.com/9Z40cYP.png

i. (1.0 pt) Bob wants to eat some unhealthy food, specifically at a fast food restaurant. When asked what he's in the mood for, he replies with "Mediterranean". Which of the following restaurants could the decision tree recommend for Bob?
○ Chipotle
○ Taco Bell
○ Dyars Cuisine
○ IBs Burgers

ii. (1.0 pt) Larry would like to eat some unhealthy food as well! However, he got a salary bonus from his job, so he does not want to eat at a fast food restaurant. When asked how much he would like to pay, he replies with "I have no preference". Which of the following restaurants could the decision tree recommend for Larry?
□ Olive Garden
□ Cheesecake Factory
□ Super Dupers Burger
□ Flemings Prime Steakhouse
(e) (3.0 pt) Joey and Andrew are each training their own decision tree for a classification task. Joey decides to limit the depth of his decision tree to depth 3, while Andrew decides not to set a limit on the depth of his decision tree. When plotting the training error, Joey's error seems to be much higher than Andrew's error. However, when plotting the validation error, Andrew's error seems to be much higher than his training error as well as Joey's error. Andrew is confused and surmises that there must be a bug in his code that is causing this to happen.

What happened? Explain. What can he do to improve it? Name at least 3 things he can do to improve his error. Please limit your response to 2 sentences per reason.
4. (16.0 points)

(a) (3.0 pt) Suppose we are modeling the number of calls to MangoBot food delivery service per minute. We believe that there are likely more calls around lunch time. Which of the following feature encodings of the time of day (0.0 to 24.0, exclusive of both ends) would capture this assumption? Select all that apply.
□ time_of_day ** 2
□ np.log(12 * time_of_day)
□ 1 - np.cos(np.pi * time_of_day / 12)
□ np.exp(-(time_of_day - 12) ** 2)

(b) (4.0 pt) Recall that in a binary classification task, we want our data to become linearly separable so that we can maximize the performance of our classifier. In many cases, however, our data are not directly linearly separable. As a result, we want to apply some transformation to our data so they will become linearly separable afterwards. For the following dataset, select all transformations that can make the data linearly separable.
□ (x_1, x_2) → (x_1², x_2)
□ (x_1, x_2) → (x_1, x_2²)
□ (x_1, x_2) → (x_1², x_2²)
□ (x_1, x_2) → (x_2², x_1²)
(c) (3.0 pt) One way to transform textual data into features is to count the frequencies of all of the words in the text. Consider the following preprocessing steps:

i. Remove all punctuation (".", ",", ":", ...).
ii. Remove all stopwords ("did", "the", ...). Note that stopwords do not include words that negate things, such as "no", "not", ...
iii. Lowercase the sentence, and keep only words that consist of the letters a-z.
iv. Encode the sentence as a vector containing the frequencies of all the unique words in the text.

Suppose we use the frequency vector from the steps above as our feature to train a logistic regression model that predicts the sentiment of a sentence (positive, negative). In 1-2 sentences, describe a case where our model would fail and make a false prediction. Your answer must be specific to the preprocessing steps and include an example sentence to earn credit.
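A minimal sketch of the preprocessing steps described above, assuming a small hypothetical stopword list and vocabulary (STOPWORDS, featurize, and vocab are illustrative names, not part of the exam):

import re
from collections import Counter

STOPWORDS = {"did", "the", "a", "is"}   # illustrative stopword list; negations like "not" are kept

def featurize(sentence, vocabulary):
    # steps i and iii: drop punctuation, lowercase, keep only runs of letters a-z
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # step ii: drop stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # step iv: frequency vector over a fixed vocabulary
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

vocab = ["data", "science", "clearly", "good", "bad"]
print(featurize("Data Science is clearly science.", vocab))  # [1, 2, 1, 0, 0]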
(d) (3.0 pt) Recall that in the housing assignment, if we want to include a categorical variable in our linear model, we need to convert it into a collection of dummy variables with values 0 and 1. Suppose we have a dataframe housing that contains a subset of the Cook County data. We are interested in one-hot encoding the categorical variable floor_material and using the dummy columns as the sole features to build an ordinary least squares model to predict the sale price of the houses. Specifically, we create the design matrix X with the following line of code:

X = pd.get_dummies(housing['floor_material']).to_numpy()

In addition, running the code housing['floor_material'].value_counts() gives us the following output:

Which of the following statements are true about the design matrix X? Select all that apply. Note: define θ̂ to be the vector containing the optimal parameters.
□ X has a dimension of 3 columns and 120 rows.
□ We can add a bias column of all 1's to X and still find a unique solution for the optimal parameters.
□ X^T X is a diagonal matrix (zeros everywhere except along the main diagonal).
□ All of the entries in X^T X add up to 120.
□ The optimal parameter vector θ̂ contains the average sale price for each type of floor material.

(e) (3.0 pt) When building your models, one way to select features is to consider the pairwise relationship between each column and the response variable (i.e. the column you are trying to predict). Consider the following approach:

i. Compute the pairwise correlation coefficient between each column and the response variable in the dataframe.
ii. Sort the correlation coefficients in descending order.
iii. Pick the top k coefficients and select the corresponding columns as the features.

In 1-2 sentences, describe how the approach above can result in multicollinearity and issues with feature diversity. Your answer must explain why multicollinearity and lack of feature diversity could potentially occur to earn credit for this question.
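For reference, a minimal sketch of how pd.get_dummies produces a one-hot design matrix (the floor materials and row count below are made up for illustration, not the exam's housing data):

import numpy as np
import pandas as pd

# hypothetical column with three floor-material categories
housing_demo = pd.DataFrame({'floor_material': ['carpet', 'hardwood', 'tile', 'hardwood', 'carpet']})

# cast to ints so the matrix algebra below counts rather than returning booleans
X = pd.get_dummies(housing_demo['floor_material']).to_numpy().astype(int)

print(X.shape)            # (5, 3): one row per house, one column per floor material
print(X.sum())            # 5: each row contains exactly one 1, so all entries sum to the number of rows
print(np.diag(X.T @ X))   # [2 2 1]: the per-category counts sit on the diagonal of X^T X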
5. (9.0 points)

Suppose we are modelling some response using our data X. For a given observation we have 3 features, x_1, x_2, x_3. Note that the subscript does not refer to the first, second, and third observations, respectively. For a given data point x, we come up with a model of the form

f(x) = θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3.

We use the squared error function, denoted L(y, ŷ), to calculate the error for each observation and additionally use L2 regularization, denoted R(θ), with penalty λ. You may assume that λ > 0. Thus our objective function is of the form L(y, ŷ) + λ R(θ).

(a) (3.0 pt) For a single observation x having response y and features x_1, x_2, x_3, compute the gradient ∇_θ to be used in gradient descent:

○ [ -2(y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3))(x_1 + θ_2 x_2 + 2θ_1 x_3) - λθ_1 ,
    (y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3))(θ_1 x_2) - λθ_2 ]
○ [ 2(x_1 + θ_2 x_2 + 2θ_1 x_3 - y)(θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3) + 2λθ_1 ,
    2(θ_1 x_2 - y)(θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3) + 2λθ_2 ]
○ [ -2(y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3))(x_1 + θ_2 x_2 + 2θ_1 x_3) ,
    -2(y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3))(θ_1 x_2) ]
○ 2 [ (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3 - y)(θ_1) + λθ_1 ,
      (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3 - y)(θ_1 θ_2) + λθ_1 θ_2 ,
      (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3 - y)(θ_1²) + λθ_1² ]
○ [ (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3 - y)²(θ_1) ,
    (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3 - y)²(θ_2) ,
    (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3 - y)²(θ_3) ]
○ [ (y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3)) + λ R(θ_1) ,
    (y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3)) + λ R(θ_2) ,
    (y - (θ_1 x_1 + θ_1 θ_2 x_2 + θ_1² x_3)) ]

(b) (2.0 pt) Suppose that you and your friend are implementing gradient descent. Just for fun, your friend chooses a negative learning rate α and asks you to fix their code. Which of the following expressions will always result in the same update as the conventional gradient descent algorithm? You may assume that the gradient ∇ is correctly computed and you do not need to worry about the magnitude of α.
□ θ^(t+1) = θ^(t) - α∇
□ θ^(t+1) = θ^(t) + α∇
□ θ^(t+1) = θ^(t) - |α|∇
□ θ^(t+1) = θ^(t) + |α|∇
□ None of the above

(c) (4.0 points)
i. (2.0 pt) We seek to optimize a given loss function using stochastic gradient descent, where 1 < batch size < n and n is the total number of data points. We initialize all model parameters as 0 and use a constant learning rate α(t) = α. Based on the contour plot below, which of the following will most likely result in better minimization of the loss function?
○ Fewer iterations
○ Greater learning rate
○ Smaller batch size
○ Greater batch size
ii. (2.0 pt) We seek to optimize a given convex loss function using gradient descent with a decaying learning rate, where at time t the learning rate is α(t) = α / (t + 1), with α > 0. Based on the contour plot below, which of the following will most likely result in better minimization of the loss function?
□ Fewer iterations
□ Greater iterations
□ Negate α
□ Smaller α
□ Greater α
□ α(t) = α / √(t + 1)
□ α(t) = α / (t + 1)²
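To make the decaying schedule concrete, here is a minimal numpy-free sketch of gradient descent with α(t) = α / (t + 1) on a made-up one-dimensional quadratic loss (the loss, starting point, and base rate are illustrative assumptions, not part of the exam):

def grad(theta):
    # gradient of the illustrative loss L(theta) = (theta - 3) ** 2
    return 2 * (theta - 3)

alpha = 0.5          # base learning rate, alpha > 0
theta = 0.0          # parameters initialized to 0
for t in range(100):
    lr = alpha / (t + 1)          # decaying schedule alpha(t) = alpha / (t + 1)
    theta = theta - lr * grad(theta)

print(theta)  # converges toward the minimizer 3 as iterations increase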
6. (6.0 points)
(a) (6.0 pt) Leif wants to do a study on the number of flowers in people's gardens. He collects data on 100 different gardens, classifying each of them into three different sizes: 'small', 'medium', and 'large', and counts every flower in each person's garden. The following are the first five rows of the data he collected:

Leif then asks you to construct the following table using the data he collected. The table represents the total flowers in each category. For example, there are 1700 Hyacinths in "large" gardens.

Write code below such that the above table is generated. Assume the data Leif collected is placed in a Pandas DataFrame assigned to the variable inputdf. The resulting table should be named outputdf. Please follow the template below (you must use pd.pivot_table).

outputdf = pd.pivot_table(_______________________________________________)
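For reference, a minimal sketch of how pd.pivot_table aggregates a long-format table into totals per category (the column names and numbers below are made up for illustration; they are not Leif's data):

import pandas as pd

# hypothetical long-format records: one row per (garden size, flower) observation
records = pd.DataFrame({
    'garden_size': ['small', 'small', 'large', 'large'],
    'flower':      ['Rose',  'Tulip', 'Rose',  'Tulip'],
    'count':       [10,      5,       40,      25],
})

# total flowers per flower type within each garden size
totals = pd.pivot_table(records, index='flower', columns='garden_size',
                        values='count', aggfunc='sum')
print(totals)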
7. (8.0 points)

(a) (8.0 pt) Kunal has a large dataset of Irish poems. He wants to analyze the many different sentences in each of the poems. He has a list of words he is particularly interested in:

words = ['artist', 'dinner', 'data', 'pay', 'color', 'science', 'clearly', 'run']

Kunal creates a Pandas Series of 100 sentences from these poems and wants to build a frequency array for the above words for each of the 100 sentences. The frequency array should capture how many times a word in the above list of words appears in a certain sentence. For example, the sentence "Data Science is clearly science." would yield an array like:

[0, 0, 1, 0, 0, 2, 1, 0]

Note: 'science' was recorded twice, even though the first letter has different capitalization. You may assume all collected sentences have no punctuation.

Define a function that takes in a Series of sentences and a list of words as an input, and outputs a frequency DataFrame where each row represents a frequency array for each sentence, as described above. Please start your code with the following method signature:

def funcname(ser, words):

Note: Please limit your response to 6 lines. The staff's solution was done in 3 lines, for reference (including the function signature).

Hint: Try using one of the following str methods: str.contains, str.get, str.count, str.split, str.find.
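As background on the hint, a small sketch of how the pandas Series.str.count accessor behaves on made-up sentences (this illustrates the API only; it is not the intended exam solution):

import pandas as pd

sentences = pd.Series(["Data Science is clearly science",
                       "run artist run"])

# case-insensitive count of one word per sentence
print(sentences.str.lower().str.count("science"))  # 2, 0
print(sentences.str.lower().str.count("run"))      # 0, 2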
8. (8.0 points)

(a) (3.0 pt) Suppose I am given a dataset [-2, 4, 1, 3, 4] and I am using the constant model ŷ_i = θ. To find the optimal θ, I have the following two loss functions to work with: L_A(y_i, θ) = (3y_i - θ)² and L_B(y_i, θ) = |2y_i - θ|. Let θ_A, θ_B be the optimal parameters found for loss functions L_A, L_B respectively. What is the relationship between θ_A and θ_B?
○ θ_A > θ_B
○ θ_A = θ_B
○ θ_A < θ_B
○ Need more information to tell

(b) (2.0 pt) Which of the following models is linear in the parameters?
□ f(x) = θ_1 x + θ_2 x² + θ_3 e^{cos(x)}
□ f(x) = sin(θ_1) x + θ_2 x
□ f(x) = θ_1
□ f(x) = θ_1 log(x⁴ + 5x³ + 6) + 3θ_2 x
□ f(x) = (θ_1 x² + 1) / (θ_1 + θ_2)

(c) (3.0 pt) Which of the following is true regarding MSE (Mean Squared Error) and MAE (Mean Absolute Error) in linear regression?
□ There is a closed-form solution for the optimal parameters when using MAE.
□ If our data contain many corrupted outliers, MAE loss is a better metric than MSE loss.
□ MAE encourages sparsity in the parameters (a lot of parameters are set to 0), which allows non-relevant features to not be included.
□ The median minimizes the MSE for a constant model.
□ The optimal parameters found with an MSE loss function and an MAE loss function will never be equal.
□ The MAE loss is not differentiable everywhere, which makes it impossible to take the gradient at those points.
9. (16.0 points)

In this question we will be focusing on predicting whether a tweet is happy (1) or sad (0) using logistic regression. Rather than using a bag-of-words featurization, we will simply count the number of positive emojis (":-)", ";-)", ...) and negative emojis (":-[", ":-(", ...). Assume you are given a training dataframe training of the following form (the first 4 rows are shown):

Tweet                                                                                  HappyEmojiCount  SadEmojiCount  isHappy
Woke up to a sunny day :-). Life is good :)                                            2                0              1
Stuck in traffic for 1 hr on my way to work today (._.)                                0                1              0
Found a new album that really slaps =^_^= check it out on my Spotify                   1                0              1
Grinding on this paper until 2am last night (-_-), but last paper of the semester :)   1                1              1

You fit a logistic regression model using the following block of code:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(fit_intercept=True)
lr.fit(training[["HappyEmojiCount", "SadEmojiCount"]], training["isHappy"])

(a) (3.0 pt) Given a new tweet transformed into a numpy array containing the same set of features, with the array assigned to the variable x_test, which of the following expressions computes

P(Sad | x_test) = P(Y = 0 | X = x_test)

under the logistic regression model (note the label Sad is the same as Not isHappy here)? Note: θ is the vector containing the trained parameters from the logistic regression model, σ is the sigmoid function, and 1{x} is a function that returns 1 if x is true and 0 otherwise.
○ σ(θ^T x_test)
○ 1 - σ(θ^T x_test)
○ σ(1 - θ^T x_test)
○ 1{σ(θ^T x_test) > 0.5}
○ 1{σ(θ^T x_test) < 0.5}

(b) (4.0 pt) Using the same model, we are now interested in seeing how much more likely our model will classify a tweet as happy rather than sad. We will use the following metric (note the log here is in natural base):

log [ P(Y = 1 | X = x) / P(Y = 0 | X = x) ]

Suppose we have the following tweet and the corresponding features:

Tweet                                                                             HappyEmojiCount  SadEmojiCount
Weather is a bit dry today (=_=) :( stay hydrated during your workout () :) :-)  3                2

Our model has the following set of trained parameters:

Features     Intercept Term  HappyEmojiCount  SadEmojiCount
Coefficient  0.2             0.7              -0.5
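For part (b), it may help to recall how the log-odds metric simplifies under a logistic regression model; this is the standard identity (not an exam-specific result), writing θ for the parameter vector including the intercept:

\[
P(Y=1 \mid X=x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}},
\qquad
P(Y=0 \mid X=x) = 1 - \sigma(\theta^\top x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}},
\]
\[
\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \log e^{\theta^\top x} = \theta^\top x .
\]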
What is the value of the metric for this tweet?

(c) (4.0 points) Consider a small subset of tweets scattered as points (HappyEmojiCount, SadEmojiCount) in the 2-dimensional plane. We will use the shape and color of a point to indicate whether the tweet is actually happy or sad. Suppose for the following subset of tweets, we want to train a logistic regression model (intercept included) with L2 regularization on one of its parameters. Recall that a logistic regression model with L2 regularization has a loss function of the following form:

Loss(Y, Ŷ, θ) = CrossEntropyLoss(Y, Ŷ) + λθ_i²,

where θ_i is some parameter from the model. Consider the figures below depicting three possible decision boundaries applied on the dataset.

Figure A
Figure B

Figure C

For each of the following parameters, determine, by matching with one of the figures above and comparing to a model with no regularization, what would happen to the slope of the decision boundary that divides the two classes and to the training error if we set λ to be a really large value.

i. (1.0 pt) Let θ_i = the coefficient for the sad emoji count. Which of the figures above best depicts the decision boundary in this case?
○ Figure A
○ Figure B
○ Figure C
○ None of the above

ii. (1.0 pt) Let θ_i = the coefficient for the sad emoji count. What will happen to the training error in this case?
○ Increase
○ Decrease
○ Remain the same
○ Cannot be determined
iii. (1.0 pt) Let θ_i = the coefficient for the happy emoji count. Which of the figures above best depicts the decision boundary in this case?
○ Figure A
○ Figure B
○ Figure C
○ None of the above

iv. (1.0 pt) Let θ_i = the coefficient for the happy emoji count. What will happen to the training error in this case?
○ Increase
○ Decrease
○ Remain the same
○ Cannot be determined
(d) (2.0 pt) We test our logistic regression model on a subset of tweets with a 70-30 class imbalance. In other words, 70% of the tweets are happy and the remaining 30% are sad. Unfortunately, it turns out our model only has a 40% accuracy. Suppose we invert the predictions from our model, i.e. if the model predicts happy, output sad instead, and vice versa. In percentages, what would be the new accuracy of our model? Please round your answer to the nearest integer between 0 and 100.

(e) (3.0 points) Consider the following 3 logistic regression models with different features trained on the same dataset. Let the models be denoted A, B, and C respectively. Shown below are the corresponding ROC curves for these models:

i. (2.0 pt) Suppose we fix the decision threshold for all 3 logistic regression models to be such that we get a FPR of 0.8 when we evaluate our model. In order of most to least preferred, rank each of the models given their ROC curves.
○ C > B > A
○ C > A > B
○ B > C > A
○ B > A > C
○ Cannot be determined

ii. (1.0 pt) Which of the following models is a strictly worse classifier?
○ A
○ B
○ C
○ Cannot be determined
10. (7.0 points)

Let us say that we have some model y_1 = θ_1 X (notice no intercept term), where θ_1 ∈ ℝ and X, y_1 ∈ ℝ^{n×1} (vectors), which takes in n data points with the same common y-value c. In other words, the model is fitted to the data points (x_1, c), (x_2, c), ..., (x_n, c).

Let us also say that we have some model y_2 = θ_2 X (notice no intercept term), where θ_2 ∈ ℝ and X, y_2 ∈ ℝ^{n×1} (vectors), which takes in the same x-values of the n data points, except with the common y-value c′. In other words, the model is fitted to the data points (x_1, c′), (x_2, c′), ..., (x_n, c′).

(a) (3.0 pt) If we run OLS on our data, what will be the value of θ_1? Note that in the answer choices, x̄ = (1/n) Σ_{i=1}^{n} x_i.
○ c
○ c · (Σ_{i=1}^{n} x_i²) / (Σ_{i=1}^{n} x_i)
○ c · (Σ_{i=1}^{n} x_i) / (Σ_{i=1}^{n} x_i²)
○ c · Σ_{i=1}^{n} (x_i - x̄)²
○ None of the above

(b) (4.0 pt) We now choose to combine all our data points and fit to a combined model:

y_comb = θ_comb X_comb,

where θ_comb ∈ ℝ and X_comb, y_comb ∈ ℝ^{2n×1}. In other words, the model is fitted to the data points (x_1, c), ..., (x_n, c), (x_1, c′), ..., (x_n, c′), which are fed into y_comb. Which of the following is the relationship between θ_comb, θ_1, and θ_2?
○ θ_comb = (θ_1 + θ_2) / 2
○ θ_comb = (cθ_1 + c′θ_2) / (c + c′)
○ θ_comb = (cθ_1) / (Σ_{i=1}^{n} x_i) + (c′θ_2) / (Σ_{i=1}^{n} x_i)
○ None of the above
11. (6.0 points)

Suppose we are given three datasets A, B, and C ∈ ℝ^{100×2}, i.e. each dataset consists of 100 data points in two dimensions. We visualize the datasets using scatterplots, labelled Plot A, Plot B, and Plot C, respectively:

(a) (1.0 pt) If we applied PCA to each of the above datasets and used only the first principal component, which dataset(s) would have the lowest reconstruction error?
□ Dataset A
□ Dataset B
□ Dataset C
□ Cannot be determined

(b) (2.0 pt) If we applied PCA to each of the above datasets and used the first two principal components, which dataset(s) would have the lowest reconstruction error?
□ Dataset A
□ Dataset B
□ Dataset C
□ Cannot be determined

(c) (3.0 pt) Suppose we are taking the SVD of one of the three datasets, which we will name dataset X. We run the following piece of code:

X_bar = X - np.mean(X, axis=0)
U, Sigma, V_T = np.linalg.svd(X_bar)

We get the following output for Sigma:

array([15.59204498, 3.85871854])

and the following output for V_T:

array([[ 0.89238775, -0.45126944],
       [ 0.45126944,  0.89238775]])

Based on the given plots and the SVD, which of the following datasets does dataset X most closely resemble?
○ Dataset A
○ Dataset B
○ Dataset C
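A minimal sketch of the centering-then-SVD pattern used in part (c), run on synthetic data (the dataset below is randomly generated purely to show what Sigma and V_T describe; it is none of datasets A, B, or C):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])  # synthetic, correlated 2-D data

X_bar = X - np.mean(X, axis=0)              # center each column
U, Sigma, V_T = np.linalg.svd(X_bar, full_matrices=False)

# Sigma[i] ** 2 is proportional to the variance captured by the i-th principal component;
# a large gap between Sigma[0] and Sigma[1] means the points lie close to a single direction.
print(Sigma)
print(Sigma ** 2 / np.sum(Sigma ** 2))      # fraction of variance captured by each component

# rank-1 reconstruction using only the first principal component
X_rank1 = (U[:, :1] * Sigma[:1]) @ V_T[:1, :]
print(np.mean((X_bar - X_rank1) ** 2))      # reconstruction error with one component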
12. (6.0 points)

Suppose you are estimating a true function g(z) = Az² for z ∈ ℝ (i.e. z is a scalar) with ordinary least squares linear regression, where the model is f(z) = θz and θ ∈ ℝ. We train the model with just one training data point (x, y), generated according to Y = g(x) + ε. Assume ε ~ N(0, σ²) (i.e. normally distributed with mean 0 and variance σ²). For the rest of the question, assume x = 1.

(a) (2.0 pt) For this part of the question, assume σ² = 0. In other words, there is no noise in the labels. What is the bias and variance of the model f at a test point z = 1?

Hint: Start by solving for the optimal choice of θ given x.
○ bias = 0, variance = 0
○ bias = A, variance = σ²
○ bias = A, variance = 0
○ bias = 0, variance = σ²

(b) (4.0 pt) For this part of the question, let σ² > 0. Select the correct statement about the bias and variance of the model f at a test point z = 1.
○ bias = 0, variance = 0
○ bias = A, variance = σ²
○ bias = A, variance = 0
○ bias = 0, variance = σ²
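As a reminder, the bias and variance referred to in this question are the standard definitions below, writing f̂ for the fitted model and taking the expectation over the random training label Y (this is background, not part of the exam statement):

\[
\operatorname{Bias}\big[\hat{f}(z)\big] = \mathbb{E}\big[\hat{f}(z)\big] - g(z),
\qquad
\operatorname{Var}\big[\hat{f}(z)\big] = \mathbb{E}\Big[\big(\hat{f}(z) - \mathbb{E}[\hat{f}(z)]\big)^{2}\Big].
\]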
13. (5.0 points)

(a) (1.0 pt) Alice is training a model and finds that as she adds more features, her training error is decreasing along with her validation error. What should she do?
○ Increase regularization.
○ Decrease regularization.
○ No additional regularization changes are needed.

(b) (1.0 pt) Alice is training a model and her training error is rapidly decreasing but her validation error is increasing. What should she do?
○ Increase regularization.
○ Decrease regularization.
○ No additional regularization changes are needed.

(c) (1.0 pt) Suppose you are interested in finding the minimal set of explanatory features. Which form of regularization would be most appropriate?
○ L1 regularization.
○ L2 regularization.
○ No regularization.

(d) (1.0 pt) Suppose Alice finds that her model is overfitting and she decides to add L2 regularization with regularization coefficient λ to her model. As she increases the regularization coefficient λ, which of the following are true?
□ Bias increases
□ Bias decreases
□ Variance increases
□ Variance decreases

(e) (1.0 pt) Consider a simple setting in which we are predicting the height of a person in centimeters based on their weight. Suppose we include the weight measured in kilograms (kg) and milligrams (mg) as two separate features, and we tune the coefficient of the L1 regularization to include only one feature. Without normalizing the data before training, which feature would be selected after the model is trained?
○ Weight in mg
○ Weight in kg
14. (8.0 points)

There are 10 blue, 15 red, and 25 green balls in a bag, from which we sample uniformly at random without replacement. Let I_i be the indicator that the i-th ball drawn will be red. Calculate the following terms:

(a) (1.0 pt) E[I_1]

(b) (1.0 pt) Var[I_1]

(c) (2.0 pt) E[I_1 + I_50]

(d) (2.0 pt) E[Σ_{i=1}^{50} I_i]

(e) (2.0 pt) Var[Σ_{i=1}^{50} I_i]