DATA 100
Final Exam
Spring 2021
INSTRUCTIONS

This is your exam. Complete it either at exam.cs61a.org or, if that doesn't work, by emailing course staff with your solutions before the exam deadline.

This exam is intended for the student with email address <EMAILADDRESS>. If this is not your email address, notify course staff immediately, as each exam is different. Do not distribute this exam PDF even after the exam ends, as some students may be taking the exam in a different time zone.
For questions with circular bubbles, you should select exactly one choice.

○ You must choose either this option
○ Or this one, but not both!

For questions with square checkboxes, you may select multiple choices.

□ You could select this choice.
□ You could select this one too!

You may start your exam now. Your exam is due at <DEADLINE> Pacific Time. Go to the next page to begin.
1. (7.0 points)
(a) (2.0 pt) Recall the tips dataset that we worked with on assignments in the past, which includes data about the tip on a restaurant bill as well as the day of week and the sex of the individual. The plot below attempts to examine patterns between the tip as a percentage of the bill and the sex of the individual by the day of week (DOW).

Select the best reason below for why the data visualization is misleading or poorly constructed.

○ the y-axis should be log transformed
○ the clustering of bars doesn't allow a key comparison to be made
○ the plot suffers from overplotting
○ the bars for each day of week should be stacked on top of each other (e.g. the bar for "Thur" would have a total height of approximately 0.3)
(b) (2.0 pt) Consider the surface whose contour plot is provided below. Which of the following gradient fields most likely corresponds to the surface shown above?

A gradient field is a plot that shows the direction and relative magnitude of the gradient of a surface on a 2-dimensional plot, where each point has a vector pointing from it in the direction of the gradient at that point, and the length of that vector is proportional to the magnitude of the gradient.
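For intuition, a minimal sketch of drawing a gradient field with matplotlib (the surface f(x, y) = x^2 + y^2 here is a made-up example, not the surface in the question):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical surface f(x, y) = x^2 + y^2; its gradient is (2x, 2y).
xs, ys = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
grad_x, grad_y = 2 * xs, 2 * ys

# Each arrow points in the direction of the gradient at that point,
# with length proportional to the gradient's magnitude.
plt.contour(xs, ys, xs ** 2 + ys ** 2, levels=10)
plt.quiver(xs, ys, grad_x, grad_y)
plt.show()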
○ [Gradient field plot 1]
○ [Gradient field plot 2]
○ [Gradient field plot 3]
○ [Gradient field plot 4]
(c) (3.0 points) We have read in some data as the dataframe df. Consider a subset of df below, which contains some information on the background of various individuals in the US.
i. (2.0 pt) Suppose we want to observe the relationship between, and the distributions of, the AFQT (an intelligence metric, with units percentile) and log_earn_1999 (log of the individual's earnings in 1999) variables, based on whether the individual's parents both went to college. Select the line of code below that generates the best plot to observe this relationship.
• A: sns.kdeplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
• B: sns.scatterplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
• C: sns.lineplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
• D: sns.kdeplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
• E: sns.scatterplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
• F: sns.lineplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)

Hint: Consider overplotting.

○ A
○ B
○ C
○ D
○ E
○ F
ii. (1.0 pt) Suppose we want to understand the relationship between weeks_worked_1999 and the sex of the individual. We run the following code to generate a plot:

df2 = df.groupby("zip_code").mean().reset_index()
sns.lineplot("zip_code", "log_earn_1999", data=df2)

Select the reason below for why this plot would represent a bad data visualization.

○ treats a categorical variable as a continuous variable
○ treats a continuous variable as a categorical variable
○ represents a density with a feature other than area
○ does not show the relationship between the variables of interest
2. (9.0 points)
(a) (4.0 points)

Recall that a random forest is created from a number of decision trees, with each decision tree created from a bootstrapped version of the original training set. One hyperparameter of a random forest is the number of decision trees we train to create the random forest.

Define T to be the number of decision trees used to create the random forest. Let's say we have two candidate values for T: var1 and var2. We want to perform var3-fold cross-validation to determine the optimal value of T. Assume var1, var2, and var3 are integers.
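For reference, a minimal sketch of how this hyperparameter is set in scikit-learn (the value 100 here is only illustrative):

from sklearn.ensemble import RandomForestClassifier

# T is the number of decision trees (the n_estimators hyperparameter);
# each tree is fit on a bootstrapped sample of the training set.
T = 100
model = RandomForestClassifier(n_estimators=T)
# model.fit(X_train, Y_train)   # fit on some training set X_train, Y_train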
i. (2.0 pt) In this cross-validation process, how many random forests will we train? Your answer should be in terms of var1, var2, and/or var3 and should be an integer.

ii. (2.0 pt) In this cross-validation process, how many decision trees will we train? Your answer should be in terms of var1, var2, and/or var3 and should be an integer.
(b) (2.0 pt) Let's say we pick three hyperparameters to tune with cross-validation. We have 5 candidate values for hyperparameter 1, 6 candidate values for hyperparameter 2, and 7 candidate values for hyperparameter 3. We perform 4-fold cross-validation to find the optimal combination of hyperparameters, across all possible combinations.

In this cross-validation process, how many random forests will we train? Your answer can be left as a product of multiple integers, e.g. "1 * 2 * 3", or simplified to a single integer, e.g. "6". (These are not the correct answers to the problem.)
(c) (3.0 pt) Here is some code that attempts to implement the cross-validation procedure described above. However, it is buggy. In one sentence, describe the bug below.

You may assume the following:

• X_train is a pd.DataFrame that contains our design matrix, and Y_train is a pd.Series that contains our response variable, both for the full training set.
• Assume ensemble.RandomForestClassifier(**args) creates a random forest with the appropriate hyperparameter values. The bug is not on this line.
• The candidate values for each hyperparameter have been loaded into the lists cands1, cands2, and cands3, respectively.
from sklearn.model_selection import KFold
from sklearn import ensemble
import numpy as np
import pandas as pd

kf = KFold(n_splits=4)
cv_scores = []
for cand1 in cands1:
    for cand2 in cands2:
        for cand3 in cands3:
            validation_accuracies = []
            for train_idx, valid_idx in kf.split(X_train):
                split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
                split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]
                model = ensemble.RandomForestClassifier(**args)
                model.fit(X_train, Y_train)
                accuracy = np.mean(model.predict(split_X_valid) == split_Y_valid)
                validation_accuracies.append(accuracy)
            cv_scores.append(np.mean(validation_accuracies))
3. (14.0 points)
We are trying to train a decision tree for a classification task where 0 is the negative class and 1 is the positive class. We are given 8 data points, each in pairs of $(x_1, x_2)$ features.
(a) (3.0 pt)

x1 | x2 | y
 3 |  4 | 1
 2 |  1 | 0
 1 |  3 | 1
 5 |  9 | 0
 9 |  6 | 1
 7 |  2 | 1
 4 |  7 | 0
 8 |  8 | 1

What is the entropy at the root of the tree? Round to 4 decimal places.
(b) (2.0 pt) What is the Gini impurity at the root of the tree? Note that the formula for Gini impurity is $1 - \sum_{i=1}^{c} p_i^2$, where $p_i$ is the fraction of items labelled with class $i$ and $c$ is the total number of classes.
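For reference, a minimal sketch of computing both impurity measures for an arbitrary array of labels (the labels below are made up, not the table above):

import numpy as np

def entropy(labels):
    # Entropy: -sum_i p_i * log2(p_i) over the class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    # Gini impurity: 1 - sum_i p_i^2 over the class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(entropy([0, 0, 1, 1]), gini_impurity([0, 0, 1, 1]))   # 1.0 0.5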
(c) (4.0 pt) Suppose we decide to split the root node with the rule $x_i \geq \beta$, where $i = 1$ or $2$. Which of the following minimizes the weighted entropy of the two resulting child nodes?

○ $x_1 \geq 6$
○ $x_1 \geq 3.5$
○ $x_2 \geq 5$
○ $x_2 \geq 3.5$
○ $x_2 \geq 6.5$
(d) (2.0 points)

We have decided to create a food recommendation system using a decision tree! We would like to run our decision tree to see what food it recommends in certain scenarios.

If you have trouble reading the above tree, please go to this link: https://i.imgur.com/9Z40cYP.png

i. (1.0 pt) Bob wants to eat some unhealthy food, specifically at a fast food restaurant. When asked what he's in the mood for, he replies with "Mediterranean". Which of the following restaurants could the decision tree recommend for Bob?

○ Chipotle
○ Taco Bell
○ Dyars Cuisine
○ IBs Burgers
ii. (1.0 pt) Larry would like to eat some unhealthy food as well! However, he got a salary bonus from his job, so he does not want to eat at a fast food restaurant. When asked how much he would like to pay, he replies with "I have no preference". Which of the following restaurants could the decision tree recommend for Larry?

□ Olive Garden
□ Cheesecake Factory
□ Super Dupers Burger
□ Flemings Prime Steakhouse
(e) (3.0 pt) Joey and Andrew are each training their own decision tree for a classification task. Joey decides to limit the depth of his decision tree to depth 3, while Andrew decides not to set a limit on the depth of his decision tree. When plotting the training error, Joey's error seems to be much higher than Andrew's error. However, when plotting the validation error, Andrew's error seems to be much higher than his training error as well as Joey's error. Andrew is confused and surmises that there must be a bug in his code that is causing this to happen. What happened? Explain. What can he do to improve it? Name at least 3 things he can do to improve his error. Please limit your response to 2 sentences per reason.
4. (16.0 points)
(a) (3.0 pt) Suppose we are modeling the number of calls to MangoBot food delivery service per minute. We believe that there are likely more calls around lunch time.

Which of the following feature encodings of the time of day (0.0 to 24.0, exclusive of both ends) would capture this assumption? Select all that apply.

□ time_of_day ** 2
□ np.log(12 * time_of_day)
□ 1 - np.cos(np.pi * time_of_day / 12)
□ np.exp(-(time_of_day - 12) ** 2)
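One way to reason about encodings like these is to evaluate each candidate over a grid of times and check where it peaks; a minimal sketch, using the expressions above verbatim:

import numpy as np
import matplotlib.pyplot as plt

time_of_day = np.linspace(0.01, 23.99, 500)   # open interval (0, 24)
candidates = {
    "time_of_day ** 2": time_of_day ** 2,
    "np.log(12 * time_of_day)": np.log(12 * time_of_day),
    "1 - np.cos(np.pi * time_of_day / 12)": 1 - np.cos(np.pi * time_of_day / 12),
    "np.exp(-(time_of_day - 12) ** 2)": np.exp(-(time_of_day - 12) ** 2),
}
for label, values in candidates.items():
    plt.plot(time_of_day, values, label=label)
plt.axvline(12, linestyle="--")   # lunch time
plt.legend()
plt.show()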
(b) (4.0 pt) Recall that in a binary classification task, we want our data to become linearly separable so that we can maximize the performance of our classifier. In many cases, however, our data are not directly linearly separable. As a result, we want to apply some transformation to our data so they will become linearly separable afterwards.

For the following dataset, select all transformations that can make the data linearly separable.

□ $(x_1, x_2) \rightarrow (x_1^2, x_2)$
□ $(x_1, x_2) \rightarrow (x_1, x_2^2)$
□ $(x_1, x_2) \rightarrow (x_1^2, x_2^2)$
□ $(x_1, x_2) \rightarrow (x_2^2, x_1^2)$
(c) (3.0 pt) One way to transform textual data into features is to count the frequencies for all of the words in the text.

Consider the following preprocessing steps:

i. Remove all punctuation (".", ",", ":", . . . ).
ii. Remove all stopwords ("did", "the", . . . ). Note that stopwords do not include words that negate things, such as "no", "not", . . .
iii. Lowercase the sentence, and keep only words that consist of the letters a-z.
iv. Encode the sentence as a vector containing the frequencies for all the unique words in the text.
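A minimal sketch of steps i.-iv. on a toy sentence (the stopword set below is a simplified assumption):

import re
from collections import Counter

stopwords = {"did", "the", "a", "is"}   # simplified; negations like "no"/"not" are kept

def frequency_vector(sentence, vocabulary):
    # i.-iii.: lowercase and keep only words made of the letters a-z, then drop stopwords
    tokens = re.findall(r"[a-z]+", sentence.lower())
    tokens = [t for t in tokens if t not in stopwords]
    # iv.: frequencies for the given vocabulary
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

print(frequency_vector("The movie was not good.", ["good", "not", "movie"]))   # [1, 1, 1]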
Suppose we use the frequency vector from the steps above as our feature to train a logistic regression model that predicts the sentiment of a sentence (positive, negative). In 1-2 sentences, describe a case where our model would fail and make a false prediction. Your answer must be specific to the preprocessing steps and include an example sentence to earn credit.
(d) (3.0 pt) Recall that in the housing assignment, if we want to include a categorical variable in our linear model, we need to convert it into a collection of dummy variables of values 0 and 1. Suppose we have a dataframe housing that contains a subset of the Cook County data.

We are interested in one-hot-encoding the categorical variable floor_material and using the dummy columns as the sole features to build an ordinary least squares model to predict the sale price of the houses. Specifically, we create the design matrix X with the following block of code:

X = pd.get_dummies(housing['floor_material']).to_numpy()

In addition, running the code housing['floor_material'].value_counts() gives us the following output:
Which of the following statements are true about the design matrix X? Select all that apply. Note: define $\theta^*$ to be the vector containing the optimal parameters.

□ X has a dimension of 3 columns and 120 rows.
□ We can add a bias column of all 1's to X and still find a unique solution for the optimal parameters.
□ $X^\top X$ is a diagonal matrix (zeros everywhere except along the main diagonal).
□ All of the entries in $X^\top X$ add up to 120.
□ The optimal parameter vector $\theta^*$ contains the average sale price for each type of floor material.
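For intuition about the structure of such a design matrix, a minimal sketch of pd.get_dummies on a made-up column (the categories below are not the actual floor materials):

import pandas as pd

floor = pd.Series(["Carpet", "Hardwood", "Tile", "Carpet"], name="floor_material")
dummies = pd.get_dummies(floor)
print(dummies)
# One column per category ("Carpet", "Hardwood", "Tile"); each row has exactly
# one 1 (or True) and 0s elsewhere, so every row of the design matrix sums to 1.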
(e) (3.0 pt) When building your models, one way to select features is to consider the pair-wise relationship between each column and the response variable (i.e. the column you are trying to predict). Consider the following approach:

i. Compute the pairwise correlation coefficient between each column and the response variable in the dataframe.
ii. Sort the correlation coefficients in descending order.
iii. Pick the top k coefficients and select the corresponding columns as the features.

In 1-2 sentences, describe how the approach above can result in multicollinearity and issues with feature diversity. Your answer must explain why multicollinearity and lack of feature diversity could potentially occur to earn credit for this question.
5. (9.0 points)
Suppose we are modelling some response using our data $X$. For a given observation we have 3 features, $x_1$, $x_2$, $x_3$. Note that the subscript does not refer to the first, second, and third observations, respectively. For a given data point $x$, we come up with a model of the form $f_\theta(x) = \theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3$. We use the squared error function, denoted $L(y, \hat{y})$, to calculate the error for each observation and additionally use L2 regularization, denoted $R(\theta)$, with penalty $\lambda$. You may assume that $\lambda > 0$. Thus our objective function is of the form $L(y, \hat{y}) + \lambda R(\theta)$.
(a) (3.0 pt) For a single observation $x$ having response $y$ and features $x_1$, $x_2$, $x_3$, compute the gradient to be used in gradient descent:

○ $\begin{bmatrix} -2(y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3))(x_1 + \theta_2 x_2 + 2\theta_1 x_3) - \lambda\theta_1 \\ (y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3))(\theta_1 x_2) - \lambda\theta_2 \end{bmatrix}$

○ $\begin{bmatrix} 2(x_1 + \theta_2 x_2 + 2\theta_1 x_3 - y)(\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3) + 2\lambda\theta_1 \\ 2(\theta_1 x_2 - y)(\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3) + 2\lambda\theta_2 \end{bmatrix}$

○ $\begin{bmatrix} -2(y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3))(x_1 + \theta_2 x_2 + 2\theta_1 x_3) \\ -2(y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3))(\theta_1 x_2) \end{bmatrix}$

○ $2\begin{bmatrix} (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3 - y)(\theta_1) + \lambda\theta_1 \\ (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3 - y)(\theta_1\theta_2) + \lambda\theta_1\theta_2 \\ (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3 - y)(\theta_1^2) + \lambda\theta_1^2 \end{bmatrix}$

○ $\begin{bmatrix} (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3 - y)^2(\theta_1) \\ (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3 - y)^2(\theta_2) \\ (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3 - y)^2(\theta_3) \end{bmatrix}$

○ $\begin{bmatrix} (y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3)) + \lambda R(\theta_1) \\ (y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3)) + \lambda R(\theta_2) \\ (y - (\theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3)) \end{bmatrix}$
(b) (2.0 pt) Suppose that you and your friend are implementing gradient descent. Just for fun, your friend chooses a negative learning rate $\alpha$ and asks you to fix their code. Which of the following expressions will always result in the same update as the conventional gradient descent algorithm? You may assume that the gradient $\nabla$ is correctly computed and you do not need to worry about the magnitude of $\alpha$.

□ $\theta^{(t+1)} = \theta^{(t)} - \alpha\nabla$
□ $\theta^{(t+1)} = \theta^{(t)} + \alpha\nabla$
□ $\theta^{(t+1)} = \theta^{(t)} - |\alpha|\nabla$
□ $\theta^{(t+1)} = \theta^{(t)} + |\alpha|\nabla$
□ None of the above
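For reference, a minimal sketch of the conventional update $\theta^{(t+1)} = \theta^{(t)} - \alpha\nabla$, assuming a made-up gradient function:

import numpy as np

def gradient(theta):
    # Hypothetical gradient of f(theta) = ||theta||^2, i.e. 2 * theta.
    return 2 * theta

alpha = 0.1                     # conventional (positive) learning rate
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = theta - alpha * gradient(theta)   # theta^(t+1) = theta^(t) - alpha * gradient
print(theta)                    # approaches the minimizer [0, 0]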
(c) (4.0 points)
i. (2.0 pt) We seek to optimize a given loss function using stochastic gradient descent where 1 < batch size < $n$, where $n$ is the total number of data points. We initialize all model parameters as 0 and use a constant learning rate $\eta(t) = \alpha$. Based on the contour plot below, which of the following will most likely result in better minimization of the loss function:

○ Fewer iterations
○ Greater learning rate
○ Smaller batch size
○ Greater batch size
ii. (2.0 pt) We seek to optimize a given convex loss function using gradient descent with a decaying learning rate where, at time $t$, the learning rate is $\eta(t) = \frac{\alpha}{t+1}$, where $\alpha > 0$. Based on the contour plot below, which of the following will most likely result in better minimization of the loss function:

□ Fewer iterations
□ Greater iterations
□ Negate $\alpha$
□ Smaller $\alpha$
□ Greater $\alpha$
□ $\eta(t) = \frac{\alpha}{\sqrt{t+1}}$
□ $\eta(t) = \frac{\alpha}{(t+1)^2}$
6. (6.0 points)
(a) (6.0 pt) Leif wants to do a study on the number of flowers in people's gardens. He collects data on 100 different gardens, classifying each of them into three different sizes: 'small', 'medium', and 'large', and counts every flower in each person's garden. The following is the first five rows of the data he collected:

Leif then asks you to construct the following table using the data he collected. The table represents the total flowers in each category. For example, there are 1700 Hyacinths in "large" gardens.

Write code below such that the above table is generated. Assume the data Leif collected is placed in a Pandas DataFrame assigned to the variable inputdf. The resulting table should be named outputdf. Please follow the template below (you must use pd.pivot_table).

outputdf = pd.pivot_table(_______________________________________________)
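A minimal sketch of pd.pivot_table on a made-up dataframe (the column names here are assumptions, not necessarily those in inputdf):

import pandas as pd

# Hypothetical input: one row per observation with a garden size, a flower type, and a count.
toy = pd.DataFrame({
    "garden_size": ["small", "small", "large", "large"],
    "flower":      ["Rose",  "Tulip", "Rose",  "Rose"],
    "count":       [3,       5,       10,      7],
})
# Totals of "count" for each (garden_size, flower) combination.
table = pd.pivot_table(toy, values="count", index="garden_size", columns="flower", aggfunc="sum")
print(table)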
7. (8.0 points)
(a) (8.0 pt) Kunal has a large dataset of Irish poems. He wants to analyze the many different sentences in each of the poems. He has a list of words he is particularly interested in:

words = ['artist', 'dinner', 'data', 'pay', 'color', 'science', 'clearly', 'run']
Kunal creates a Pandas Series of 100 sentences from these poems and wants to build a frequency array for the above words for each of the 100 sentences. The frequency array should capture how many times a word in the above list of words appears in a certain sentence.

For example, the sentence "Data Science is clearly science." would yield an array like:

[0, 0, 1, 0, 0, 2, 1, 0]

Note: 'science' was recorded twice, even though the first letter has different capitalization. You may assume all collected sentences have no punctuation.

Define a function that takes in a Series of sentences and a list of words as an input, and outputs a frequency DataFrame where each row represents a frequency array for each sentence, as described above. Please start your code with the following method signature:

def funcname(ser, words):

Note: Please limit your response to 6 lines. The staff's solution was done in 3 lines, for reference (including the function signature).

Hint: Try using one of the following str methods: str.contains, str.get, str.count, str.split, str.find.
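For reference, a minimal sketch of Series.str.count on a toy Series (lowercasing first so that matches are case-insensitive):

import pandas as pd

sentences = pd.Series(["Data Science is clearly science", "run data run"])
# Count occurrences of a single word per sentence; lowercase first so "Science" matches "science".
print(sentences.str.lower().str.count("science"))   # [2, 0]
print(sentences.str.lower().str.count("run"))       # [0, 2]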
8. (8.0 points)
(a) (3.0 pt) Suppose I am given a dataset [-2, 4, 1, 3, 4] and I am using the constant model $\hat{y}_i = \theta$. To find the optimal $\theta^*$, I have the following two loss functions to work with: $L_A(y_i, \theta) = (3y_i - \theta)^2$ and $L_B(y_i, \theta) = |2y_i - \theta|$. Let $\theta^*_A$, $\theta^*_B$ be the optimal parameters found for loss functions $L_A$, $L_B$ respectively.

What is the relationship between $\theta^*_A$ and $\theta^*_B$?

○ $\theta^*_A > \theta^*_B$
○ $\theta^*_A = \theta^*_B$
○ $\theta^*_A < \theta^*_B$
○ Need more information to tell
(b) (2.0 pt) Which of the following models is linear in the parameters?

□ $f_\theta(x) = \theta_1 x + \theta_2 x^2 + \theta_3 e^{\cos(x)}$
□ $f_\theta(x) = \sin(\theta_1)x + \theta_2 x^2$
□ $f_\theta(x) = \theta_1$
□ $f_\theta(x) = \theta_1 \log(x^4 + 5x^3 + 6) + \theta_3 2^x$
□ $f_\theta(x) = \frac{1}{x^2+1}\theta_1 + \theta_2$
(c) (3.0 pt) Which of the following is true regarding MSE (Mean Squared Error) & MAE (Mean Absolute Error) in Linear Regression?

□ There is a closed form solution to the optimal parameters when using MAE.
□ If our data contains many corrupted outliers, MAE loss is a better metric than MSE loss.
□ MAE encourages sparsity in the parameters (a lot of parameters are set to 0), which allows for non-relevant features to not be included.
□ The median minimizes the MSE for a constant model.
□ The optimal parameters found in an MSE loss function and an MAE loss function will never be equal.
□ The MAE loss is not differentiable everywhere, which makes it impossible to take the gradient at those points.
9. (16.0 points)
In this question we will be focusing on predicting whether a tweet is happy (1) or sad (0) using logistic regression. Rather than using a bag-of-words featurization, we will simply count the number of positive emojis (":-)", ";-)", . . . ) and negative emojis (":-[", ":-(", . . . ).

Assume you are given a training dataframe training of the following form (the first 4 rows are shown):

Tweet                                                                                 | HappyEmojiCount | SadEmojiCount | isHappy
Woke up to a sunny day :-). Life is good :)                                           |               2 |             0 |       1
Stuck in traffic for 1 hr on my way to work today (._.)                               |               0 |             1 |       0
Found a new album that really slaps =ˆ_ˆ= check it out on my Spotify                  |               1 |             0 |       1
Grinding on this paper until 2am last night (-_-), but last paper of the semester :)  |               1 |             1 |       1
You fit a logistic regression model using the following block of code:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(fit_intercept=True)
lr.fit(training[["HappyEmojiCount", "SadEmojiCount"]], training["isHappy"])
(a) (3.0 pt) Given a new tweet transformed into a numpy array containing the same set of features, and the array is assigned to the variable x_test, which of the following expressions computes $P(\text{Sad} \mid x_{\text{test}}) = P(Y = 0 \mid X = x_{\text{test}})$ under the logistic regression model (note the label Sad is the same as Not isHappy here)?

Note: $\theta$ is the vector containing the trained parameters from the logistic regression model. $\sigma$ is the sigmoid function. $\mathbb{1}\{x\}$ is a function that returns 1 if $x$ is true and 0 otherwise.

○ $\sigma(\theta^\top x_{\text{test}})$
○ $1 - \sigma(\theta^\top x_{\text{test}})$
○ $\sigma(1 - \theta^\top x_{\text{test}})$
○ $\mathbb{1}\{\sigma(\theta^\top x_{\text{test}}) > 0.5\}$
○ $\mathbb{1}\{\sigma(\theta^\top x_{\text{test}}) < 0.5\}$
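For reference, a minimal numpy sketch of the sigmoid function used above (the parameter and feature values below are made up):

import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^(-t))
    return 1 / (1 + np.exp(-t))

theta = np.array([0.5, -0.25])   # hypothetical trained parameters
x_test = np.array([2, 1])        # hypothetical feature vector
print(sigmoid(theta @ x_test))   # P(Y = 1 | X = x_test) under the logistic regression model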
(b) (4.0 pt) Using the same model, we are now interested in seeing how much more likely our model will classify a tweet as happy rather than sad. We will use the following metric (note the log here is in natural base):

$\log\left(\frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)}\right)$

Suppose we have the following tweet and the corresponding features:

Tweet                                                                            | HappyEmojiCount | SadEmojiCount
Weather is a bit dry today (=_=) :( stay hydrated during your workout () :) :-)  |               3 |             2

Our model has the following set of trained parameters:

Features    | Intercept Term | HappyEmojiCount | SadEmojiCount
Coefficient |            0.2 |             0.7 |          -0.5
What is the value of the metric for this tweet?
(c) (4.0 points)

Consider a small subset of tweets scattered as points (HappyEmojiCount, SadEmojiCount) in the 2-dimensional plane. We will use the shape and color of a point to indicate whether the tweet is actually happy or sad.

Suppose for the following subset of tweets, we want to train a logistic regression model (intercept included) with L2 regularization on one of its parameters.

Recall that a logistic regression model with L2 regularization has the loss function of the following form:

$\text{Loss}(Y, \hat{Y}, \theta) = \text{CrossEntropyLoss}(Y, \hat{Y}) + \lambda\theta_i^2,$

where $\theta_i$ is some parameter from the model.

Consider the figures below depicting three possible decision boundaries applied on the dataset.

Figure A
Figure B
Figure C
For each of the following parameters, explain, by matching with one of the figures above, when compared to a model with no regularization, what would happen to the slope of the decision boundary that divides the two classes and the training error if we set $\lambda$ to be a really large value.

i. (1.0 pt) Let $\theta_i$ = coefficient for the sad emoji count. Which of the figures above best depicts the decision boundary in this case?

○ Figure A
○ Figure B
○ Figure C
○ None of the above
ii. (1.0 pt) Let $\theta_i$ = coefficient for the sad emoji count. What will happen to the training error in this case?

○ Increase
○ Decrease
○ Remain the same
○ Cannot be determined
iii. (1.0 pt) Let $\theta_i$ = coefficient for the happy emoji count. Which of the figures above best depicts the decision boundary in this case?

○ Figure A
○ Figure B
○ Figure C
○ None of the above
iv. (1.0 pt) Let $\theta_i$ = coefficient for the happy emoji count. What will happen to the training error in this case?

○ Increase
○ Decrease
○ Remain the same
○ Cannot be determined
(d) (2.0 pt) We test our logistic regression model on a subset of tweets with 70-30 class imbalance. In other words, 70% of the tweets are happy and the remaining 30% are sad. Unfortunately, it turns out our model only has a 40% accuracy. Suppose we invert the predictions from our model, i.e. if the model predicts happy, output sad instead, and vice versa.

In percentages, what would be the new accuracy of our model? Please round your answer to the nearest integer between 0 and 100.
(e) (3.0 points)

Consider the following 3 logistic regression models with different features trained on the same dataset. Let the models be denoted A, B, and C respectively. Shown below are the corresponding ROC curves for these models:

i. (2.0 pt) Suppose we fix the decision threshold for all 3 logistic regression models to be such that we get a FPR of 0.8 when we evaluate our model. In order of most to least preferred, rank each of the models given their ROC curves.

○ C > B > A
○ C > A > B
○ B > C > A
○ B > A > C
○ Cannot be determined

ii. (1.0 pt) Which of the following models is a strictly worse classifier?

○ A
○ B
○ C
○ Cannot be determined
10. (7.0 points)
Let us say that we have some model $y_1 = \theta_1 X$ (notice no intercept term), where $\theta_1 \in \mathbb{R}$ and $X, y_1 \in \mathbb{R}^{n \times 1}$ (vectors), which takes in $n$ data points with the same common $y$-value $c$. In other words, the model is fitted to the data points $(x_1, c), (x_2, c), \ldots, (x_n, c)$.

Let us also say that we have some model $y_2 = \theta_2 X$ (notice no intercept term), where $\theta_2 \in \mathbb{R}$ and $X, y_2 \in \mathbb{R}^{n \times 1}$ (vectors), which takes in the same $x$-values of the $n$ data points, except with the common $y$-value $c'$. In other words, the model is fitted to the data points $(x_1, c'), (x_2, c'), \ldots, (x_n, c')$.
(a) (3.0 pt) If we run OLS on our data, what will be the value of $\theta_1$? Note that in the answer choices, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

○ $c$
○ $c \cdot \left(\frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} x_i}\right)$
○ $c \cdot \left(\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}\right)$
○ $c \cdot \left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)$
○ None of the above
(b) (4.0 pt) We now choose to combine all our data points and fit to a combined model: $y_{\text{comb}} = \theta_{\text{comb}} X_{\text{comb}}$, where $\theta_{\text{comb}} \in \mathbb{R}$ and $X_{\text{comb}}, y_{\text{comb}} \in \mathbb{R}^{2n \times 1}$.

In other words, the model is fitted to the data points $(x_1, c), \ldots, (x_n, c), (x_1, c'), \ldots, (x_n, c')$, which are fed into $y_{\text{comb}}$.

Which of the following is the relationship between $\theta_{\text{comb}}$, $\theta_1$, and $\theta_2$?

○ $\theta_{\text{comb}} = \frac{\theta_1 + \theta_2}{2}$
○ $\theta_{\text{comb}} = \frac{c\theta_1 + c'\theta_2}{c + c'}$
○ $\theta_{\text{comb}} = c\left(\frac{\theta_1}{\sum_{i=1}^{n} x_i}\right) + c'\left(\frac{\theta_2}{\sum_{i=1}^{n} x_i}\right)$
○ None of the above
11. (6.0 points)
Suppose we are given three datasets A, B, and C $\in \mathbb{R}^{100 \times 2}$, i.e. each dataset consists of 100 data points in two dimensions. We visualize the datasets using scatterplots, labelled Plot A, Plot B, and Plot C, respectively:
(a) (1.0 pt) If we applied PCA to each of the above datasets and used only the first principal component, which dataset(s) would have the lowest reconstruction error?

□ Dataset A
□ Dataset B
□ Dataset C
□ Cannot be determined
(b) (2.0 pt) If we applied PCA to each of the above datasets and used the first two principal components, which dataset(s) would have the lowest reconstruction error?

□ Dataset A
□ Dataset B
□ Dataset C
□ Cannot be determined
(c) (3.0 pt) Suppose we are taking the SVD of one of the three datasets, which we will name dataset X. We run the following piece of code:

X_bar = X - np.mean(X, axis=0)
U, Sigma, V_T = np.linalg.svd(X_bar)

We get the following output for Sigma:

array([15.59204498,  3.85871854])

and the following output for V_T:

array([[ 0.89238775, -0.45126944],
       [ 0.45126944,  0.89238775]])

Based on the given plots and the SVD, which of the following datasets does dataset X most closely resemble:

○ Dataset A
○ Dataset B
○ Dataset C
12. (6.0 points)
Suppose you are estimating a true function $g(z) = Az^2$ for $z \in \mathbb{R}$ (i.e. $z$ is a scalar) with ordinary least squares linear regression, where the model is $f_\theta(z) = \theta z$ and $\theta \in \mathbb{R}$. We train the model with just one training data point $x, y$, generated according to $Y = g(x) + \epsilon$. Assume $\epsilon \sim N(0, \sigma^2)$ (i.e. $\epsilon$ is normally distributed with mean 0 and variance $\sigma^2$).

For the rest of the question, assume $x = 1$.
(a) (2.0 pt) For this part of the question, assume $\sigma^2 = 0$. In other words, there is no noise in the labels. What is the bias and variance of the model $f_\theta$ at a test point $z = 1$?

Hint: Start by solving for the optimal choice of $\theta$ given $x$.

○ bias = 0, variance = 0
○ bias = $A$, variance = $\sigma^2$
○ bias = $A$, variance = 0
○ bias = 0, variance = $\sigma^2$
(b) (4.0 pt) For this part of the question, let $\sigma^2 > 0$. Select the correct statement about the bias and variance of the model $f_\theta$ at a test point $z = 1$.

○ bias = 0, variance = 0
○ bias = $A$, variance = $\sigma^2$
○ bias = $A$, variance = 0
○ bias = 0, variance = $\sigma^2$
13. (5.0 points)
(a) (1.0 pt) Alice is training a model and finds that as she adds more features her training error is decreasing along with her validation error. What should she do?

○ Increase regularization.
○ Decrease regularization.
○ No additional regularization changes are needed.

(b) (1.0 pt) Alice is training a model and her training error is rapidly decreasing but her validation error is increasing. What should she do?

○ Increase regularization.
○ Decrease regularization.
○ No additional regularization changes are needed.

(c) (1.0 pt) Suppose you are interested in finding the minimal set of explanatory features. Which form of regularization would be most appropriate?

○ L1 regularization.
○ L2 regularization.
○ No regularization.
(d) (1.0 pt) Suppose Alice finds that her model is overfitting and she decides to add L2 regularization with regularization coefficient $\lambda$ to her model. As she increases the regularization coefficient $\lambda$, which of the following are true?

□ Bias increases
□ Bias decreases
□ Variance increases
□ Variance decreases
(e) (1.0 pt) Consider a simple setting in which we are predicting the height of a person in centimeters based on their weight. Suppose we include the weight measured in kilograms (kg) and milligrams (mg) as two separate features, and we tune the coefficient of the L1 regularization to include only one feature. Without normalizing the data before training, which feature would be selected after the model is trained?

○ Weight in mg
○ Weight in kg
14. (8.0 points)
There are 10 blue, 15 red, and 25 green balls in a bag from which we sample uniformly at random without replacement. Let $I_i$ be the indicator that the $i$-th ball drawn will be red. Calculate the following terms:

(a) (1.0 pt) $E[I_1]$

(b) (1.0 pt) $Var[I_1]$

(c) (2.0 pt) $E[I_1 + I_{50}]$

(d) (2.0 pt) $E\left[\sum_{i=1}^{50} I_i\right]$

(e) (2.0 pt) $Var\left[\sum_{i=1}^{50} I_i\right]$