Spring2022Exam2Solution
PURDUE
Section:

Instructions

You have 60 minutes to complete this 6-question exam (Question 0 is the honor pledge). That gives about 10 minutes per question, so be sure to pace yourself accordingly. The exam is worth 100 points total, and each question is worth between 15 and 18 points.

You are free to consult printed-out materials from the course website and handwritten notes for the exam. However, you are not permitted to use a computer, calculator, or any other resources.

It is critical that you follow the exact template we have provided in the exam packet. This includes:
1) Writing all of your work on the exact sheets of paper provided for each question. Please note that we have used the front and back of the pages. For each question, you must show any work you used to arrive at your answer.
2) Writing your name and PUID at the top of every page. To facilitate this, the first 5 minutes (before the 60-minute counter starts) will be used solely to fill out your name/ID on each page.

Good luck!

NAME ____________________________________ PUID ________________________________________

ECE 20875: Python for Data Science
Spring 2022 Exam #2
Qiang Qiu (SEC001), Mahsa Ghasemi (SEC003), and Murat Kocaoglu (SEC002)
Question 0: Honor Pledge and Acknowledgment
Please sign the honor pledge below with a signature of your full legal name. I understand and acknowledge the above instructions and notes. I also affirm that the answers given on this test are mine and mine alone. I did not receive help from any person or material (other than those explicitly allowed).
X___________________________________________
Question 1: Confidence intervals [18 points]

A chemist is interested in assessing the amount of calcium carbonate produced at the conclusion of a particular chemical reaction. She performs 100 reactions and records an average yield of 50 grams with a standard deviation of 5 grams. [Note: Use the provided z-table for this problem.]

(a) [9 points]
Using the table above, calculate the 80% two-sided confidence interval for the average calcium carbonate yield (i.e., calculate the upper and lower bounds of the 80% confidence interval).

$CI_{80} = \left(\bar{x} - z_{80} \cdot \frac{s}{\sqrt{n}},\ \bar{x} + z_{80} \cdot \frac{s}{\sqrt{n}}\right)$

$CI_{80} = \left(50 - 1.28 \cdot \frac{5}{\sqrt{100}},\ 50 + 1.28 \cdot \frac{5}{\sqrt{100}}\right)$

$CI_{80} = \left(50 - 1.28 \cdot \frac{1}{2},\ 50 + 1.28 \cdot \frac{1}{2}\right)$

$CI_{80} = (50 - 0.64,\ 50 + 0.64)$

$CI_{80} = (49.36,\ 50.64)$
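The arithmetic above can be checked with a short Python sketch. The function name `z_confidence_interval` is hypothetical, and the critical value 1.28 is taken directly from the worked solution rather than computed.

```python
import math

def z_confidence_interval(mean, std, n, z):
    """Two-sided interval: mean -/+ z * std / sqrt(n)."""
    margin = z * std / math.sqrt(n)
    return mean - margin, mean + margin

# Values from the problem statement; z = 1.28 is the 80% table value used above.
lower, upper = z_confidence_interval(mean=50, std=5, n=100, z=1.28)
print(round(lower, 2), round(upper, 2))  # 49.36 50.64
```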
(b) [3 points] Which of the following interpretations of the calculated 80% confidence interval is incorrect? [Select one]
a. If we repeat the experiment a large number of times, 80% of the confidence intervals would contain the population mean.
b. There is an 80% probability that the population mean is contained in the calculated confidence interval.
c. Before we run the experiment, there is an 80% chance that the population mean will fall within the computed confidence interval.
d. If the population mean is inside the 80% confidence interval, it would not be statistically significant.

(c) [3 points] How would a 70% confidence interval compare to the 80% interval calculated in Part (a), assuming everything else stays constant? [Select one]
a. The 70% confidence interval would be narrower than the 80% confidence interval.
b. The 70% confidence interval would be wider than the 80% confidence interval.
c. It is impossible to determine their relationship given this information.

(d) [3 points] How would a 70% confidence interval compare to the 80% interval calculated in Part (a) above, assuming both the number of reactions and the standard deviation increase? [Select one]
a. The 70% confidence interval would be narrower than the 80% confidence interval.
b. The 70% confidence interval would be wider than the 80% confidence interval.
c. It is impossible to determine their relationship given this information.
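As a rough illustration of how interval width depends on the confidence level, the sample size, and the standard deviation, the sketch below computes widths with the standard library's `statistics.NormalDist` instead of a z-table. The function `ci_width` and the changed values of `std` and `n` are hypothetical and chosen only for illustration; the sketch does not settle which reading of part (d) the exam intends.

```python
import math
from statistics import NormalDist

def ci_width(confidence, std, n):
    """Full width of a two-sided z-interval at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * std / math.sqrt(n)

# Same n and std as in the problem: only the confidence level changes.
width_80 = ci_width(0.80, std=5, n=100)
width_70 = ci_width(0.70, std=5, n=100)
print(width_70 < width_80)  # True

# Hypothetical cases where both std and n increase; the width scales
# with std / sqrt(n), so the comparison can go either way.
wider = ci_width(0.70, std=10, n=121)
narrower = ci_width(0.70, std=6, n=400)
print(wider > width_80, narrower < width_80)  # True True
```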
Question 2: Regular expressions [15 points]

(a) [4 points] Circle all the strings below that would provide a valid match for the following regex (i.e., the beginning of the string matches the regex pattern). [Select one or more]

[a-z]*\d+@\w+\.(com|net|org|edu)

'aly@gmail.com'
'adime10@verizon.net'
'abcde1@test.org'
'gmail@gmail.gmail.com'

(b) [3 points] Circle the regex that provides a valid match for all strings below. [Select one]

'hello there!'
'general kenobi'
'i hate sand'

a. (\s+)\!+
b. ([a-z]+)(\!|\.|\?)*
c. ([a-z]+ )*[a-z]+\!*

(c) [4 points] What would the following statement print?

print(re.sub(r'ee', 'o', 'freezy breeze'))

Answer: frozy broze

(d) [4 points] What would the following statement print?

print(re.sub(r'([aeiou]+)([^aeiou]+)', r'\2\1', 'eat'))

Answer: tea
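The regex behavior in parts (a), (c), and (d) can be reproduced with Python's `re` module. This is a minimal sketch: it uses `re.match`, which anchors only at the start of the string, as the parenthetical in part (a) describes. The variable names are hypothetical.

```python
import re

pattern = r'[a-z]*\d+@\w+\.(com|net|org|edu)'
candidates = [
    'aly@gmail.com',
    'adime10@verizon.net',
    'abcde1@test.org',
    'gmail@gmail.gmail.com',
]

# re.match succeeds only if the pattern matches at the beginning of the string.
matches = {s: re.match(pattern, s) is not None for s in candidates}
for s, ok in matches.items():
    print(s, ok)

# Part (c): every non-overlapping 'ee' is replaced by 'o'.
sub_c = re.sub(r'ee', 'o', 'freezy breeze')
print(sub_c)  # frozy broze

# Part (d): group 1 (vowels) and group 2 (non-vowels) are swapped.
sub_d = re.sub(r'([aeiou]+)([^aeiou]+)', r'\2\1', 'eat')
print(sub_d)  # tea
```

The pattern requires at least one digit before the `@`, so strings without a digit in that position do not match.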
Question 3: Linear regression (regularization and cross validation) [16 points]

(a) [6 points] The following plots illustrate different regression models fitted to the same data set. Label each plot as either overfitting, proper fitting, or underfitting the data.

[Plots not recoverable from the source.]

Answer (in plot order): Proper fitting, Overfitting, Underfitting

(b) [4 points] Suppose we train the model $\hat{y}(x) = -12x_1^2 + 8x_1 + 25x_2 - 19$ using linear regression without regularization from a data set. Which of the following models are a possible outcome if we run the regression algorithm with regularization over the same data set? Select all that apply.

a. $\hat{y}(x) = -1.5x_1 + 0.4x_2 + 10$
b. $\hat{y}(x) = -5x_1^2 + 8x_1 + 7x_2 - 11$
c. $\hat{y}(x) = 20x_1^2 + 40.5x_1 - 28x_2 + 22.6$
d. $\hat{y}(x) = -100x_1^2 - 9x_1 - 34x_2 + 42$
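To give a feel for what regularization does to fitted coefficients, here is a minimal sketch of ridge regression in the simplest possible setting: one feature and no intercept, where the closed form is a = Σxy / (Σx² + λ). This is a deliberately simpler model than the two-feature quadratic in part (b), and the toy data and the function name `ridge_slope` are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y ~ a*x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0.0 is ordinary least squares; larger lam shrinks the slope toward zero.
slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print([round(a, 3) for a in slopes])
```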
(c) [6 points] Suppose we want to pick the best regularization parameter $\lambda$ for a ridge regression problem using 3-fold cross-validation. The following table shows the training and testing mean squared error (MSE) for each fold and each candidate $\lambda$. Select the best value for $\lambda$ among these four values and provide justification for your selection.
       | λ = 0.0                        | λ = 0.1                        | λ = 1.0                        | λ = 10.0
Fold 1 | MSE_train = 10, MSE_test = 7   | MSE_train = 13, MSE_test = 4.5 | MSE_train = 15, MSE_test = 2.5 | MSE_train = 31.5, MSE_test = 6
Fold 2 | MSE_train = 12.5, MSE_test = 11.5 | MSE_train = 13.5, MSE_test = 10 | MSE_train = 24, MSE_test = 4   | MSE_train = 26.5, MSE_test = 15.5
Fold 3 | MSE_train = 16, MSE_test = 23  | MSE_train = 21, MSE_test = 9   | MSE_train = 31.5, MSE_test = 8.5 | MSE_train = 43, MSE_test = 19.5
Your selection and justification here: $\lambda = 1.0$ is the best regularization parameter because it provides the lowest average $MSE_{test}$.
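The selection rule can be sketched in a few lines of Python: average the per-fold test MSE for each candidate and pick the smallest. The test MSE values are copied from the table; the variable names are hypothetical.

```python
# Per-fold test MSE for each candidate lambda (folds 1, 2, 3), from the table.
test_mse = {
    0.0: [7, 11.5, 23],
    0.1: [4.5, 10, 9],
    1.0: [2.5, 4, 8.5],
    10.0: [6, 15.5, 19.5],
}

# Average across folds, then choose the lambda with the lowest average.
avg_test_mse = {lam: sum(v) / len(v) for lam, v in test_mse.items()}
best_lambda = min(avg_test_mse, key=avg_test_mse.get)

for lam, avg in avg_test_mse.items():
    print(lam, round(avg, 2))
print("best:", best_lambda)  # best: 1.0
```

Training MSE is not used for the choice, since it generally rises as λ grows and would always favor λ = 0.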
Question 4: n-grams and tf-idf [18 points]

Consider three documents (doc1, doc2, and doc3) with the following contents:

doc1: The dog is red
doc2: A dog hit a dog
doc3: The dog is heavy

(a) [4 points] List all the word-based 2-grams of doc3 (convert all characters to lowercase too).

"the dog"
"dog is"
"is heavy"

(b) [4 points] List the words remaining in doc1, doc2, and doc3 after removing all stop words.

doc1: dog red
doc2: dog hit dog
doc3: dog heavy
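The steps in parts (a) and (b) can be sketched in plain Python. The stop-word set below is an assumption that covers only the stop words appearing in these three documents; a real list would be much longer. The helper names `tokenize` and `word_ngrams` are hypothetical. The last step counts the remaining words per document, which is what a word-document matrix is built from.

```python
docs = {
    "doc1": "The dog is red",
    "doc2": "A dog hit a dog",
    "doc3": "The dog is heavy",
}

# Assumption: a tiny stop-word set sufficient for these documents only.
STOP_WORDS = {"the", "is", "a"}

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def word_ngrams(tokens, n):
    """All contiguous word n-grams, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams_doc3 = word_ngrams(tokenize(docs["doc3"]), 2)
print(bigrams_doc3)  # ['the dog', 'dog is', 'is heavy']

filtered = {
    name: [w for w in tokenize(text) if w not in STOP_WORDS]
    for name, text in docs.items()
}
print(filtered)

# 1-gram counts per document over a fixed vocabulary order.
vocab = ["dog", "red", "hit", "heavy"]
matrix = {name: [words.count(w) for w in vocab] for name, words in filtered.items()}
print(matrix)
```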
(c) [4 points] Complete the following word-doc matrix (co-occurrence matrix) based on (b) with word-based 1-grams.

     | dog | red | hit | heavy
doc1 | 1   | 1   | 0   | 0
doc2 | 2   | 0   | 1   | 0
doc3 | 1   | 0   | 0   | 1

(d) Use the following word-doc matrix to answer the following questions.

           | blinder | machine | learning | block | chain | money
Document A | 1       | 1       | 2        | 0     | 0     | 0
Document B | 5       | 5       | 5        | 10    | 0     | 0
Document C | 0       | 3       | 0        | 0     | 2     | 3
Document D | 10      | 10      | 2        | 0     | 4     | 0

i. [2 points] In which document is the term frequency score for the word "learning" the highest?

Document A

ii. [2 points] Which words have the highest inverse document frequency score?

"block" and "money"
iii. [2 points] By the tf-idf measure, which word is the most important in Document C?

"money"
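The three answers in part (d) can be reproduced with a short sketch. It assumes one common definition of tf-idf: term frequency is the word's count divided by the total word count of the document, and inverse document frequency is ln(N / df), where N is the number of documents and df is the number of documents containing the word. Other variants exist and the course may define the scores slightly differently; the variable names are hypothetical.

```python
import math

vocab = ["blinder", "machine", "learning", "block", "chain", "money"]
counts = {
    "A": [1, 1, 2, 0, 0, 0],
    "B": [5, 5, 5, 10, 0, 0],
    "C": [0, 3, 0, 0, 2, 3],
    "D": [10, 10, 2, 0, 4, 0],
}

def term_freq(row):
    """Assumed tf: count normalized by the document's total word count."""
    total = sum(row)
    return [c / total for c in row]

# Assumed idf: ln(N / df), df = number of documents containing the word.
n_docs = len(counts)
doc_freq = [sum(1 for row in counts.values() if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / d) for d in doc_freq]

tf = {doc: term_freq(row) for doc, row in counts.items()}
tfidf = {doc: [t * w for t, w in zip(tf[doc], idf)] for doc in counts}

j = vocab.index("learning")
best_tf_doc = max(tf, key=lambda d: tf[d][j])
max_idf_words = [w for w, v in zip(vocab, idf) if v == max(idf)]
top_word_c = vocab[tfidf["C"].index(max(tfidf["C"]))]

print(best_tf_doc)    # A
print(max_idf_words)  # ['block', 'money']
print(top_word_c)     # money
```

With raw counts instead of normalized term frequency, "learning" would score highest in Document B, so the stated answer of Document A is consistent with the normalized definition assumed here.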
Question 5: Linear regression [18 points]

Taylor has the hypothesis that Desla's stock price ($y$) on a day can be predicted by the number of its employees that quit the day before ($x$). Taylor collects the following data set over 5 days.

Day | Desla's stock price (y) | # employees that quit the day before (x) | Testing or Training
1   | 1020                    | 10                                       | Training
2   | 1040                    | 20                                       | Training
3   | 1010                    | 30                                       | Training
4   | 1041                    | 20                                       | Testing
5   | 1032                    | 10                                       | Testing

[Note: Taylor plans to use the first three observations for training and the remaining two observations for testing.]

(a)
[2 points] Taylor wants to learn a quadratic function $\hat{y}_i = a_1 x_i^2 + a_2 x_i + b$ using linear regression (without regularization). Form the feature matrix $X$.

$X = \begin{bmatrix} 100 & 10 & 1 \\ 400 & 20 & 1 \\ 900 & 30 & 1 \end{bmatrix}$
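A minimal sketch of how this feature matrix is built from the three training observations: each row holds the squared feature, the feature itself, and a constant 1 for the intercept term. The variable names are hypothetical.

```python
# Training inputs: employees who quit the day before (days 1 to 3).
x_train = [10, 20, 30]

# One row per observation: [x^2, x, 1], matching the quadratic model's terms.
X = [[x ** 2, x, 1] for x in x_train]

for row in X:
    print(row)
```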
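(b) [4 points] Given the following means and standard deviations, form the normalized feature matrix $X_{normalized}$.

$Mean_{x} = 5$, $Standard\ Deviation_{x} = 5$
$Mean_{x^2} = 300$, $Standard\ Deviation_{x^2} = 100$

$X_{normalized} = \begin{bmatrix} -2 & 1 & 1 \\ 1 & 3 & 1 \\ 6 & 5 & 1 \end{bmatrix}$

(c) [3 points] Using the normalized feature matrix $X_{normalized}$, write down the least squares equation for finding the optimal parameter vector $\beta = [a_1, a_2, b]^T$, without solving. All components of the equation except the parameter vector should be in numerical form.

$\beta = \left( \begin{bmatrix} -2 & 1 & 6 \\ 1 & 3 & 5 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} -2 & 1 & 1 \\ 1 & 3 & 1 \\ 6 & 5 & 1 \end{bmatrix} \right)^{-1} \begin{bmatrix} -2 & 1 & 6 \\ 1 & 3 & 5 \\ 1 & 1 & 1 \end{bmatrix} \begin{pmatrix} 1020 \\ 1040 \\ 1010 \end{pmatrix}$

OR

$\left( \begin{bmatrix} -2 & 1 & 6 \\ 1 & 3 & 5 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} -2 & 1 & 1 \\ 1 & 3 & 1 \\ 6 & 5 & 1 \end{bmatrix} \right) \beta = \begin{bmatrix} -2 & 1 & 6 \\ 1 & 3 & 5 \\ 1 & 1 & 1 \end{bmatrix} \begin{pmatrix} 1020 \\ 1040 \\ 1010 \end{pmatrix}$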
(d) [5 points] Taylor solved your least squares equation and found the optimal parameters to be $\beta = [1, 2, 1030]^T$. Given Taylor's model, calculate the predicted Desla's stock price $\hat{y}$ for the two testing observations.

$x_{4,normalized} = \frac{20 - 5}{5} = 3$, $x^2_{4,normalized} = \frac{400 - 300}{100} = 1$

$\hat{y}_1 = 1 \cdot 1 + 2 \cdot 3 + 1030 = 1037$

$x_{5,normalized} = \frac{10 - 5}{5} = 1$, $x^2_{5,normalized} = \frac{100 - 300}{100} = -2$

$\hat{y}_2 = 1 \cdot (-2) + 2 \cdot 1 + 1030 = 1030$
(e) [4 points] Calculate the testing mean squared error.

$MSE_{test} = \frac{(1037 - 1041)^2 + (1030 - 1032)^2}{2}$

$MSE_{test} = \frac{(-4)^2 + (-2)^2}{2}$

$MSE_{test} = \frac{20}{2} = 10$
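Parts (d) and (e) can be checked with a short sketch that normalizes each test input with the given means and standard deviations, applies Taylor's parameter vector, and averages the squared errors. The constant and function names are hypothetical; the numbers come from the problem.

```python
# Normalization statistics given in part (b).
MEAN_X, STD_X = 5, 5
MEAN_X2, STD_X2 = 300, 100

# Taylor's parameters: [coefficient of x^2, coefficient of x, intercept].
beta = [1, 2, 1030]

def predict(x):
    """Normalize x^2 and x, then evaluate the quadratic model."""
    x2_norm = (x ** 2 - MEAN_X2) / STD_X2
    x_norm = (x - MEAN_X) / STD_X
    return beta[0] * x2_norm + beta[1] * x_norm + beta[2]

# Testing observations (days 4 and 5).
test_x = [20, 10]
test_y = [1041, 1032]

preds = [predict(x) for x in test_x]
mse_test = sum((p - y) ** 2 for p, y in zip(preds, test_y)) / len(test_y)

print(preds)     # [1037.0, 1030.0]
print(mse_test)  # 10.0
```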
Question 6: Objects and classes [15 points]

Write a class called Student so that the following piece of code can be used to calculate the letter grades of each student and the class average. Assume that names and exam1, exam2 and final are already defined previously as a list of names of students and lists of their grades for exam 1, exam 2 and the final, respectively, for the students enrolled in the Cobra for Data Science class. All exams are out of 100 points. Suppose students pass or fail the class with a threshold of 40 (i.e., students need a score of 40 or more to pass). Assume each exam is equally weighted and there are no other assignments in the class.

NOTE: Make sure that no new instance or class variables are introduced by any of the lines below, i.e., all the necessary class and instance variables should be predefined to receive full credit. In addition, note that calc_point_grade() should return the average score of exam1, exam2 and final, whereas calc_letter_grade() should return either 'Pass' or 'Fail' depending on whether the score was greater than or equal to 40 or less than 40, respectively.

    roster = [Student(i) for i in names]
    for i,j in enumerate(roster):
        j.exam1 = exam1[i]
        j.exam2 = exam2[i]
        j.final = final[i]
    for i in roster:
        i.calc_point_grade()
        i.calc_letter_grade()
    grade_total = 0
    for i in roster:
        grade_total = i + grade_total
    Student.class_avg = grade_total / len(roster)

[Write your solution in the box on the next page.]
    class Student:
        class_avg = 0

        def __init__(self,name):
            self.name = name
            self.exam1 = 0
            self.exam2 = 0
            self.final = 0
            self.point_grade = 0

        def calc_point_grade(self):
            self.point_grade = (self.exam1+self.exam2+self.final)/3
            return self.point_grade

        def calc_letter_grade(self):
            my_score = (self.exam1+self.exam2+self.final)/3
            if(my_score<40):
                return 'Fail'
            else:
                return 'Pass'

        def __add__(self,i):
            return self.point_grade+i
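To see the solution class work with the driver code from the question, here is a self-contained sketch. The class body follows the solution above; the names and scores are hypothetical sample data, and the `letters` list is added only to display results. Because `i + grade_total` puts the Student on the left, Python calls `Student.__add__`, which returns a plain number, so `grade_total` stays numeric throughout the loop.

```python
class Student:
    class_avg = 0  # class variable, predefined so the driver adds nothing new

    def __init__(self, name):
        self.name = name
        self.exam1 = 0
        self.exam2 = 0
        self.final = 0
        self.point_grade = 0

    def calc_point_grade(self):
        self.point_grade = (self.exam1 + self.exam2 + self.final) / 3
        return self.point_grade

    def calc_letter_grade(self):
        my_score = (self.exam1 + self.exam2 + self.final) / 3
        return 'Fail' if my_score < 40 else 'Pass'

    def __add__(self, other):
        # Supports `student + number` by adding the student's point grade.
        return self.point_grade + other

# Hypothetical sample data.
names = ["Ada", "Grace", "Alan"]
exam1 = [90, 30, 60]
exam2 = [80, 40, 60]
final = [70, 20, 60]

roster = [Student(i) for i in names]
for i, j in enumerate(roster):
    j.exam1 = exam1[i]
    j.exam2 = exam2[i]
    j.final = final[i]

letters = []
for i in roster:
    i.calc_point_grade()
    letters.append(i.calc_letter_grade())

grade_total = 0
for i in roster:
    grade_total = i + grade_total
Student.class_avg = grade_total / len(roster)

print(letters)
print(Student.class_avg)
```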