School: University of California, Berkeley
Course: C8
Subject: Computer Science
Date: Dec 6, 2023
hw10
November 30, 2023
[1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")
1 Homework 10: Linear Regression
Helpful Resource:
• Python Reference: Cheat sheet of helpful array & table methods used in Data 8!
Recommended Readings:
• Correlation
• The Regression Line
• Method of Least Squares
• Least Squares Regression
Please complete this notebook by filling in the cells provided. Before you begin, execute the cell below to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.
For all problems that you must write explanations and sentences for, you must provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!
Deadline: This assignment is due Wednesday, 11/8 at 11:00pm PT. Turn it in by Tuesday, 11/7 at 11:00pm PT for 5 extra credit points. Late work will not be accepted as per the policies page.
Note: This homework has hidden tests on it. That means even though tests may say
100% passed, it doesn’t mean your final grade will be 100%. We will be running more
tests for correctness once everyone turns in the homework.
Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.
You should start early so that you have time to get help if you’re stuck. Office hours are held Monday through Friday in Warren Hall 101B. The office hours schedule appears here.
[2]:
# Run this cell to set up the notebook, but please don't change it.
import numpy as np
from datascience import *
# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from datetime import datetime
1.1 1. Linear Regression Setup
When performing linear regression, we need to compute several important quantities which will be used throughout our analysis. Unless otherwise specified, when asked to make a prediction, please assume we are predicting y from x throughout this assignment. To help with our later analysis, we will begin by writing some of these functions and understanding what they can do for us.
Question 1.1. Define a function standard_units that converts a given array to standard units. (3 points)
Hint: You may find the np.mean and np.std functions helpful.
[3]:
def standard_units(data):
    return ((data - np.mean(data)) / np.std(data))
[4]:
grader.check("q1_1")
[4]:
q1_1 results: All test cases passed!
Question 1.2. Which of the following are true about standard units? Assume we have converted an array of data into standard units using the function above. (5 points)
1. The unit of all our data when converted into standard units is the same as the unit of the original data.
2. The sum of all our data when converted into standard units is 0.
3. The standard deviation of all our data when converted into standard units is 1.
4. Adding a constant, C, to our original data has no impact on the resultant data when converted to standard units.
5. Multiplying our original data by a positive constant, C (>0), has no impact on the resultant data when converted to standard units.
Assign standard_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign standard_array to make_array(1, 3, 5).
[5]:
standard_array = make_array(2, 3, 4, 5)
[6]:
grader.check("q1_2")
[6]:
q1_2 results: All test cases passed!
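The properties listed in Question 1.2 can be checked numerically. The following is a minimal sketch in plain NumPy; the heights array is made-up illustration data, not part of the assignment, and the variable names are assumptions.

```python
import numpy as np

def standard_units(data):
    """Convert an array to standard units: (value - mean) / SD."""
    return (data - np.mean(data)) / np.std(data)

# Hypothetical heights in centimeters (made-up values for illustration).
heights = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
su = standard_units(heights)

# Standard units are centered at 0, so they sum to 0, and they have SD 1.
sum_is_zero = np.isclose(np.sum(su), 0)
sd_is_one = np.isclose(np.std(su), 1)

# Shifting by a constant or rescaling by a positive constant
# leaves the standard units unchanged.
shift_invariant = np.allclose(standard_units(heights + 10), su)
scale_invariant = np.allclose(standard_units(heights * 2.54), su)

print(sum_is_zero, sd_is_one, shift_invariant, scale_invariant)
```

Because the mean is subtracted and the result is divided by the SD, the original unit (here, centimeters) cancels out, which is why standard units are unitless.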
Question 1.3. Define a function correlation that computes the correlation between 2 arrays of data in original units. (3 points)
Hint: Feel free to use functions you have defined previously.
[7]:
def correlation(x, y):
    return (np.mean(standard_units(x) * standard_units(y)))
[8]:
grader.check("q1_3")
[8]:
q1_3 results: All test cases passed!
Question 1.4. Which of the following are true about the correlation coefficient 𝑟? (5 points)
1. The correlation coefficient measures the strength of a linear relationship.
2. A correlation coefficient of 1.0 means an increase in one variable always means an increase in the other variable.
3. The correlation coefficient is the slope of the regression line in standard units.
4. The correlation coefficient stays the same if we invert our axes.
5. If we add a constant, C, to our original data, our correlation coefficient will increase by the same C.
Assign r_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign r_array to make_array(1, 3, 5).
[9]:
r_array = make_array(1, 2, 3, 4)
[10]:
grader.check("q1_4")
[10]:
q1_4 results: All test cases passed!
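Two of the statements in Question 1.4 (swapping the axes, and adding a constant) are easy to test directly. The sketch below uses made-up arrays and compares the result against np.corrcoef as an independent check; nothing here comes from the assignment's data.

```python
import numpy as np

def standard_units(data):
    return (data - np.mean(data)) / np.std(data)

def correlation(x, y):
    """r is the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r = correlation(x, y)
r_swapped = correlation(y, x)        # swap the roles of x and y
r_shifted = correlation(x + 100, y)  # add a constant to x
r_numpy = np.corrcoef(x, y)[0, 1]    # NumPy's built-in correlation

print(r, r_swapped, r_shifted, r_numpy)
```

All four printed values should agree up to floating-point rounding: r is symmetric in its arguments and does not change when a constant is added to either variable.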
Question 1.5. Define a function slope that computes the slope of our line of best fit (to predict y given x), given two arrays of data in original units. Assume we want to create a line of best fit in original units. (3 points)
Hint: Feel free to use functions you have defined previously.
[11]:
def slope(x, y):
    r = correlation(x, y)
    return (r * (np.std(y) / np.std(x)))
[12]:
grader.check("q1_5")
[12]:
q1_5 results: All test cases passed!
Question 1.6. Which of the following are true about the slope of our line of best fit? Assume x refers to the value of one variable that we use to predict the value of y. (5 points)
1. In original units, the slope has the unit: unit of x / unit of y.
2. In standard units, the slope is unitless.
3. In original units, the slope is unchanged by swapping x and y.
4. In standard units, a slope of 1 means our data is perfectly linearly correlated.
5. In original units and standard units, the slope always has the same positive or negative sign.
Assign slope_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign slope_array to make_array(1, 3, 5).
[13]:
slope_array = make_array(2, 4, 5)
[14]:
grader.check("q1_6")
[14]:
q1_6 results: All test cases passed!
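As a rough check on the slope formula, the sketch below computes r * SD(y) / SD(x) on made-up data, compares it with np.polyfit (used here only as an independent reference, not as part of the assignment), and shows that swapping x and y generally gives a different slope.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r = np.corrcoef(x, y)[0, 1]

# Slope for predicting y from x: units of y per unit of x.
slope_y_on_x = r * np.std(y) / np.std(x)
# Slope for predicting x from y: units of x per unit of y.
slope_x_on_y = r * np.std(x) / np.std(y)

# Degree-1 least-squares fit; the first coefficient is the slope.
polyfit_slope = np.polyfit(x, y, 1)[0]

print(slope_y_on_x, slope_x_on_y, polyfit_slope)
```

Since both standard deviations are positive, the slope in original units always carries the sign of r, which is also the slope in standard units.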
Question 1.7. Define a function intercept that computes the intercept of our line of best fit (to predict y given x), given 2 arrays of data in original units. Assume we want to create a line of best fit in original units. (3 points)
Hint: Feel free to use functions you have defined previously.
[15]:
def intercept(x, y):
    return (np.mean(y) - (slope(x, y) * np.mean(x)))
[16]:
grader.check("q1_7")
[16]:
q1_7 results: All test cases passed!
Question 1.8. Which of the following are true about the intercept of our line of best fit? Assume x refers to the value of one variable that we use to predict the value of y. (5 points)
1. In original units, the intercept has the same unit as the y values.
2. In original units, the intercept has the same unit as the x values.
3. In original units, the slope and intercept have the same unit.
4. In standard units, the intercept for the regression line is 0.
5. In original units and standard units, the intercept always has the same magnitude.
Assign intercept_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign intercept_array to make_array(1, 3, 5).
[17]:
intercept_array = make_array(1, 4)
[18]:
grader.check("q1_8")
[18]:
q1_8 results: All test cases passed!
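The intercept formula and the claim that the intercept is 0 in standard units can both be checked with a small experiment. This is a sketch on made-up data; np.polyfit serves as an independent reference fit and is an assumption of this example, not something the assignment uses.

```python
import numpy as np

def standard_units(data):
    return (data - np.mean(data)) / np.std(data)

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Least-squares line in original units.
slope, intercept = np.polyfit(x, y, 1)
# The regression line passes through the point of averages.
formula_intercept = np.mean(y) - slope * np.mean(x)

# Least-squares line after converting both variables to standard units.
su_slope, su_intercept = np.polyfit(standard_units(x), standard_units(y), 1)

print(intercept, formula_intercept)
print(su_slope, su_intercept)
```

In standard units both means are 0, so the intercept vanishes and the slope reduces to r.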
Question 1.9. Define a function predict that takes in a table and 2 column names, and returns an array of predictions. The predictions should be created using a fitted regression line. We are predicting "col2" from "col1", both in original units. (5 points)
Hint 1: Feel free to use functions you have defined previously.
Hint 2: Re-reading 15.2 might be helpful here.
Note: The public tests are quite comprehensive for this question, so passing them means that your function most likely works correctly.
[19]:
def predict(tbl, col1, col2):
    x = tbl.column(col1)
    y = tbl.column(col2)
    return ((slope(x, y) * x) + intercept(x, y))
[20]:
grader.check("q1_9")
[20]:
q1_9 results: All test cases passed!
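A table-free variant of the same idea may help when testing outside the notebook. The function name fitted_values and the arrays below are hypothetical; the sketch simply applies the slope and intercept formulas to plain NumPy arrays and checks two properties every least-squares line should have.

```python
import numpy as np

def fitted_values(x, y):
    """Regression predictions of y from x (arrays in original units)."""
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope * x + intercept

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

preds = fitted_values(x, y)
residuals = y - preds

# Residuals of the regression line average out to 0, so the predictions
# have the same mean as the observed y values.
print(np.mean(residuals), np.mean(preds), np.mean(y))
```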
1.2 2. FIFA Predictions
The following data was scraped from sofifa.com, a website dedicated to collecting information from FIFA video games. The dataset consists of all players in FIFA 22 and their corresponding attributes. We have truncated the dataset to a limited number of rows (100) to simplify our visualizations and analysis.
Since we’re learning about linear regression, we will look specifically for a linear association between various player attributes. To help with understanding where the line of best fit generated in linear regression comes from, please do not use the fit_line argument in .scatter at any point on question 2 unless the code was provided for you.
Feel free to read more about the video game on Wikipedia.
[21]:
# Run this cell to load the data
fifa = Table.read_table('fifa22.csv')
# Select a subset of columns to analyze (there are 110 columns in the original dataset)
fifa = fifa.select("short_name", "overall", "value_eur", "wage_eur", "age",
                   "pace", "shooting", "passing", "attacking_finishing")
fifa.show(5)
<IPython.core.display.HTML object>
Question 2.1. Before jumping into any statistical techniques, it’s important to see what the data looks like, because data visualizations allow us to uncover patterns in our data that would have otherwise been much more difficult to see. (3 points)
Create a scatter plot with age on the x-axis (“age”), and the player’s value in Euros (“value_eur”) on the y-axis.
[22]:
fifa.scatter("age", "value_eur")
Question 2.2. Does the correlation coefficient r for the data in our scatter plot in 2.1 look closest to 0, 0.75, or -0.75? (3 points)
Assign r_guess to one of 0, 0.75, or -0.75.
[23]:
r_guess = -0.75
[24]:
grader.check("q2_2")
[24]:
q2_2 results: All test cases passed!
Question 2.3. Create a scatter plot with player age (“age”) along the x-axis and both real player value (“value_eur”) and predicted player value along the y-axis. The predictions should be created using a fitted regression line. The color of the dots for the real player values should be different from the color for the predicted player values. (8 points)
Hint 1: Feel free to use functions you have defined previously.
Hint 2: 15.2 has examples of creating such scatter plots.
[25]:
predictions = predict(fifa, "age", "value_eur")
fifa_with_predictions = fifa.with_columns("preds", predictions).select("age", "value_eur", "preds")
fifa_with_predictions.scatter("age")
Question 2.4. Looking at the scatter plot you produced above, is linear regression a good model to use? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable? (5 points)
Yes, because on average, the real values are not far off from the values predicted by the regression line, and there are roughly as many values with positive residuals as there are with negative residuals.
Question 2.5. In 2.3, we created a scatter plot in original units. Now, create a scatter plot with player age in standard units along the x-axis and both real and predicted player value in standard units along the y-axis. The color of the dots of the real and predicted values should be different. (8 points)
Hint: Feel free to use functions you have defined previously.
[26]:
predictions_su = standard_units(fifa.column("age")) * correlation(fifa.column("age"), fifa.column("value_eur"))
fifa_su = fifa.with_columns("agesu", standard_units(fifa.column("age")),
                            "valsu", standard_units(fifa.column("value_eur")),
                            "predsu", predictions_su).select("agesu", "valsu", "predsu")
fifa_su.scatter("agesu")
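The standard-unit predictions (r times x in standard units) and the original-unit predictions describe the same line. The sketch below illustrates this on made-up arrays, not the fifa table: converting the standard-unit predictions back to original units should reproduce the slope-and-intercept predictions.

```python
import numpy as np

def standard_units(data):
    return (data - np.mean(data)) / np.std(data)

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r = np.mean(standard_units(x) * standard_units(y))

# Prediction in standard units: y_su is estimated by r * x_su.
preds_su = r * standard_units(x)
# Undo the standardization of y to get back to original units.
preds_from_su = preds_su * np.std(y) + np.mean(y)

# Prediction in original units from the slope and intercept formulas.
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
preds_original = slope * x + intercept

print(preds_from_su)
print(preds_original)
```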
Question 2.6. Compare your plots in 2.3 and 2.5. What similarities do they share? What differences do they have? (5 points)
The data has the same exact shape and pattern in both plots, and the regression line fits the data well in both plots. However, the values of the data points change in each plot. In the plot in 2.3, the x values range from ~ 20 to 40 and the y values range from ~ 0 to 2*10^8, but in the plot in 2.5 both the x and y values only range from ~ -2.5 to 2.5. The values for the age and value in euros seem to be centered around 0 for the plot in 2.5, but this is not true for the plot in 2.3.
Question 2.7. Define a function rmse that takes in two arguments: a slope and an intercept for a potential regression line. The function should return the root mean squared error between the values predicted by a regression line with the given slope and intercept and the actual outcomes. (6 points)
Assume we are still predicting “value_eur” from “age” in original units from the fifa table.
[27]:
def rmse(slope, intercept):
    predictions = (fifa.column("age") * slope) + intercept
    errors = (fifa.column("value_eur") - predictions) ** 2
    return ((np.mean(errors)) ** 0.5)
[28]:
grader.check("q2_7")
[28]:
q2_7 results: All test cases passed!
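To see the RMSE computation on numbers small enough to verify by hand, here is a sketch with a hypothetical helper rmse_for that takes the data as arguments instead of reading the fifa table. The three data points are made up.

```python
import numpy as np

def rmse_for(x, y, slope, intercept):
    """Root mean squared error of the line slope * x + intercept."""
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2) ** 0.5

# Made-up data: the line y = 2x fits the first two points exactly
# and misses the third by 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

# Errors are 0, 0, 1, so the mean squared error is 1/3.
close_line_rmse = rmse_for(x, y, 2.0, 0.0)
# A flat line at 0 misses every point by the full y value.
flat_line_rmse = rmse_for(x, y, 0.0, 0.0)

print(close_line_rmse, flat_line_rmse)
```

The better-fitting line has the smaller RMSE, which is the quantity a minimizer would try to drive down.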
Question 2.8. Use the rmse function you defined along with minimize to find the least-squares regression parameters predicting player value from player age. Here’s an example of using the minimize function from the textbook. (10 points)
Then set lsq_slope and lsq_intercept to be the least-squares regression line slope and intercept, respectively.
Finally, create a scatter plot like you did in 2.3 with player age (“age”) along the x-axis and both real player value (“value_eur”) and predicted player value along the y-axis. Be sure to use your least-squares regression line to compute the predicted values. The color of the dots for the real player values should be different from the color for the predicted player values.
Note: Your solution should not make any calls to the slope or intercept functions defined earlier.
Hint: Your call to minimize will return an array of argument values that minimize the return value of the function passed to minimize.
[29]:
minimized_parameters = minimize(rmse)
lsq_slope = minimized_parameters[0]
lsq_intercept = minimized_parameters[1]
# This just prints your slope and intercept
print("Slope: {:g} | Intercept: {:g}".format(lsq_slope, lsq_intercept))
fifa_with_lsq_predictions = fifa.with_columns("preds", ((fifa.column("age") * lsq_slope) + lsq_intercept)).select("age", "value_eur", "preds")
fifa_with_lsq_predictions.scatter("age")
Slope: -6.41462e+06 | Intercept: 2.55525e+08
Question 2.9. The resulting line you found in 2.8 should appear very similar to the line you found in 2.3. Why were we able to minimize RMSE to find nearly the same slope and intercept from the previous formulas? (5 points)
Hint: Re-reading 15.3 might be helpful here.
There is only one “best fit” regression line, which is the line that minimizes mean squared error. For mathematical reasons, the line you get with calculus-based numerical optimization (the approach that created the line from 2.8) and the line you get by using the formulas related to the correlation coefficient (the approach that created the line from 2.3) both yield this best fit line, and thus the line you get from either approach will be roughly the same.
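The agreement between the two approaches can be illustrated without any optimizer library. The sketch below replaces minimize with a brute-force grid search, which is only a stand-in chosen for this example and not how minimize works internally; the data are made up.

```python
import numpy as np

# Made-up data for illustration; not taken from the fifa table.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

def rmse(slope, intercept):
    """Root mean squared error of the line slope * x + intercept."""
    return np.mean((y - (slope * x + intercept)) ** 2) ** 0.5

# Brute-force stand-in for a numerical minimizer: try every pair on a grid.
slopes = np.linspace(-2, 4, 61)      # step 0.1
intercepts = np.linspace(-3, 3, 61)  # step 0.1
grid_slope, grid_intercept = min(
    ((s, i) for s in slopes for i in intercepts),
    key=lambda params: rmse(*params),
)

# Slope and intercept from the correlation-based formulas.
r = np.corrcoef(x, y)[0, 1]
formula_slope = r * np.std(y) / np.std(x)
formula_intercept = np.mean(y) - formula_slope * np.mean(x)

print(grid_slope, grid_intercept)
print(formula_slope, formula_intercept)
```

The grid answer can only be as precise as the grid spacing, which mirrors why a numerical minimizer returns values that are nearly, but not exactly, equal to the formula values.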
Question 2.10. For which of the following error functions would we have obtained the same slope and intercept values in 2.8 instead of using RMSE? Assume error is assigned to the actual values minus the predicted values. (5 points)
1. np.sum(error) ** 0.5
2. np.sum(error ** 2)
3. np.mean(error) ** 0.5
4. np.mean(error ** 2)
Assign error_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign error_array to make_array(1, 3, 5).
Hint: What was the purpose of RMSE? Are there any alternatives, and if so, does minimizing them yield the same results as minimizing the RMSE?
[30]:
error_array = make_array(2, 4)
[31]:
grader.check("q2_10")
[31]:
q2_10 results: All test cases passed!
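One way to build intuition for Question 2.10 is to minimize several error functions over the same candidate lines. The sketch below uses made-up data and a coarse grid of candidate (slope, intercept) pairs as an assumed stand-in for minimize. It also shows why unsquared errors are a poor objective: a badly fitting line can still have a mean error of 0.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

def errors(params):
    """Actual values minus predicted values for a (slope, intercept) pair."""
    slope, intercept = params
    return y - (slope * x + intercept)

# Candidate lines on a coarse grid (step 0.1 in each parameter).
grid = [(s, i) for s in np.linspace(-2, 4, 61) for i in np.linspace(-3, 3, 61)]

# Squaring, averaging, summing, and taking a square root preserve the
# ordering of candidate lines, so each objective picks the same line.
best_rmse = min(grid, key=lambda p: np.mean(errors(p) ** 2) ** 0.5)
best_mse = min(grid, key=lambda p: np.mean(errors(p) ** 2))
best_sse = min(grid, key=lambda p: np.sum(errors(p) ** 2))

# A deliberately bad line through the point of averages: its raw errors
# cancel, so the mean error is 0 even though the fit is poor.
bad_slope = -5.0
bad_intercept = np.mean(y) - bad_slope * np.mean(x)
bad_errors = errors((bad_slope, bad_intercept))
bad_mean_error = np.mean(bad_errors)
bad_rmse = np.mean(bad_errors ** 2) ** 0.5

print(best_rmse, best_mse, best_sse)
print(bad_mean_error, bad_rmse)
```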
[32]:
# goalies don't have shooting in our dataset so we removed them before looking at the pace stat
no_goalies = fifa.where("shooting", are.above(0))
no_goalies
[32]:
short_name        | overall | value_eur | wage_eur | age | pace | shooting | passing | attacking_finishing
L. Messi          | 93      | 78000000  | 320000   | 34  | 85   | 92       | 91      | 95
R. Lewandowski    | 92      | 119500000 | 270000   | 32  | 78   | 92       | 79      | 95
Cristiano Ronaldo | 91      | 45000000  | 270000   | 36  | 87   | 94       | 80      | 95
Neymar Jr         | 91      | 129000000 | 270000   | 29  | 91   | 83       | 86      | 83
K. De Bruyne      | 91      | 125500000 | 350000   | 30  | 76   | 86       | 93      | 82
K. Mbappé         | 91      | 194000000 | 230000   | 22  | 97   | 88       | 80      | 93
H. Kane           | 90      | 129500000 | 240000   | 27  | 70   | 91       | 83      | 94
N. Kanté          | 90      | 100000000 | 230000   | 30  | 78   | 66       | 75      | 65
K. Benzema        | 89      | 66000000  | 350000   | 33  | 76   | 86       | 81      | 90
H. Son            | 89      | 104000000 | 220000   | 28  | 88   | 87       | 82      | 88
… (75 rows omitted)
[33]:
# Run this cell to generate a scatter plot for the next part.
no_goalies.scatter('shooting', 'attacking_finishing', fit_line=True)
Question 2.11. Above is a scatter plot showing the relationship between a player’s shooting ability (“shooting”) and their scoring ability (“attacking_finishing”).
There is clearly a strong positive correlation between the 2 variables, and we’d like to predict a player’s scoring ability from their shooting ability. Which of the following are true, assuming linear regression is a reasonable model? (5 points)
Hint: Re-reading 15.2 might be helpful here.
1. For a majority of players with a shooting attribute above 80 our model predicts they have a better scoring ability than shooting ability.
2. A randomly selected player’s predicted scoring ability in standard units will always be less than their shooting ability in standard units.
3. If we select a player whose shooting ability is 1.0 in standard units, their scoring ability, on average, will be less than 1.0 in standard units.
4. Goalies have attacking_finishing scores in our dataset but do not have shooting scores. We can still use our model to predict their attacking_finishing scores.
Assign scoring_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign scoring_array to make_array(1, 3, 5).
[34]:
scoring_array = make_array(1, 3)
[35]:
grader.check("q2_11")
[35]:
q2_11 results: All test cases passed!
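Statements 2 and 3 in Question 2.11 both concern regression toward the mean, which can be sketched numerically. The ratings below are made up for illustration and are not rows from the fifa table; the only fact used is that the standard-unit prediction is r times the standard-unit input.

```python
import numpy as np

# Hypothetical ratings (made-up values, not from the dataset).
shooting = np.array([60.0, 70.0, 75.0, 80.0, 90.0])
finishing = np.array([58.0, 72.0, 70.0, 83.0, 88.0])

r = np.corrcoef(shooting, finishing)[0, 1]

# In standard units the regression prediction is r * x_su.
pred_at_plus_one = r * 1.0    # player 1 SD above average shooting
pred_at_minus_one = r * -1.0  # player 1 SD below average shooting

print(r, pred_at_plus_one, pred_at_minus_one)
```

With 0 < r < 1, an above-average shooter is predicted to be closer to average in scoring (less than 1.0), while a below-average shooter's prediction is higher than their shooting value (greater than -1.0). That asymmetry is why "always less" does not hold.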
You’re done with Homework 10!
Important submission steps:
1. Run the tests and verify that they all pass.
2. Choose Save Notebook from the File menu, then run the final cell.
3. Click the link to download the zip file.
4. Go to Gradescope and submit the zip file to the corresponding assignment. The name of this assignment is “HW 10 Autograder”.
It is your responsibility to make sure your work is saved before running the last cell.
1.3 Pets of Data 8
Nico says congrats on finishing Homework 10! Only two more to go!
Pet of the week: Nico
1.4 Submission
Make sure you have run all cells in your notebook in order before running the cell below, so that
all images/graphs appear in the output. The cell below will generate a zip file for you to submit.
Please save before exporting!
[36]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)
Running your submission against local test cases…
Your submission received the following results when run against available test cases:
q1_1 results: All test cases passed!
q1_2 results: All test cases passed!
q1_3 results: All test cases passed!
q1_4 results: All test cases passed!
q1_5 results: All test cases passed!
q1_6 results: All test cases passed!
q1_7 results: All test cases passed!
q1_8 results: All test cases passed!
q1_9 results: All test cases passed!
q2_2 results: All test cases passed!
q2_7 results: All test cases passed!
q2_10 results: All test cases passed!
q2_11 results: All test cases passed!
<IPython.core.display.HTML object>