Data 102, Fall 2023
Homework 2
Due: 5:00 PM Friday, February 16, 2024
Submission Instructions
Homework assignments throughout the course will have a written portion and a code portion.
Please follow the directions below to properly submit both portions.
Written Portion:
• Every answer should contain a calculation or reasoning.
• You may write the written portions on paper or in LaTeX.
• If you type your written responses, please make sure to put them in a markdown cell instead of writing them as comments in a code cell.
• Please start each question on a new page.
• It is your responsibility to check that work on all the scanned pages is legible.
Code Portion:
• You should append any code you wrote to the PDF you submit. You can do so either by copying and pasting the code into a text file or by converting your Jupyter Notebook to PDF.
• Run your notebook and make sure you print out the outputs from running the code.
• It is your responsibility to check that your code and answers show up in the PDF file.
Submitting:
You will submit a PDF file to Gradescope containing all the work you want graded (including your math and code).
• When downloading your Jupyter Notebook, make sure you go to File → Save and Export Notebook As → PDF; do not just print the page from your web browser, because your code and written responses will be cut off.
• Combine the PDFs from the written and code portions into one PDF. Here is a useful tool for doing so. As a Berkeley student, you get free access to Adobe Acrobat, which you can use to merge as many PDFs as you want.
• Please see this guide for how to submit your PDF on Gradescope. In particular, for each question on the assignment, please make sure you understand how to select the corresponding page(s) that contain your solution (see item 2 on the last page).
Late assignments will count towards your slip days; it is your responsibility to ensure you
have enough time to submit your work.
Data science is a collaborative activity. While you may talk with others about the homework, please write up your solutions individually. If you discuss the homework with your peers, please include their names on your submission. Please make sure any handwritten answers are legible, as we may deduct points otherwise.
The One with all the Beetles
1. (9 points) Cindy has an inordinate fondness for beetles and for statistical modeling. She observes one beetle every day and keeps track of their lengths. From her studies she feels that the beetle lengths she sees are uniformly distributed. So she chooses a model in which the lengths of the beetles come from a uniform distribution on $[0, w]$: here $w$ is an unknown parameter corresponding to the size of the largest possible beetle. Since the maximum size $w$ is unknown to her, she would like to estimate it from the data. She observes the lengths of $n$ beetles, and calls them $x_1, \ldots, x_n$.
(a) (1 point) What is the likelihood function of the observations $x_1, \ldots, x_n$? Express your answer as a function of the parameter $w$.
Hint: Your answer should include the indicator function $\mathbb{1}(\max_i x_i \le w)$. To see why, consider what happens if $w = 3$ cm and $x_1 = 5$ cm.
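The hint's example can be checked numerically. The sketch below is only an illustration under the uniform model stated above: the function name `uniform_likelihood` is hypothetical, and it assumes the observations are independent draws from Uniform$[0, w]$, so each contributes density $1/w$ whenever it lies inside the interval.

```python
def uniform_likelihood(xs, w):
    """Likelihood of i.i.d. Uniform[0, w] observations xs, viewed as a function of w."""
    n = len(xs)
    # Indicator part: every observation must lie in [0, w], i.e. max(xs) <= w.
    if w <= 0 or min(xs) < 0 or max(xs) > w:
        return 0.0
    # Each observation inside the interval contributes a density of 1/w.
    return w ** (-n)


# The hint's example: a 5 cm beetle cannot occur if the largest possible beetle is 3 cm.
print(uniform_likelihood([5.0], 3.0))  # 0.0
# With w = 8 cm the same observation is possible and has density 1/8.
print(uniform_likelihood([5.0], 8.0))  # 0.125
```

Evaluating the function on a grid of $w$ values for a fixed data set is one way to see the shape of the likelihood before writing it down by hand.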
(b) (2 points) Use your answer from Part (a) to explain why the maximum likelihood estimate (MLE) for $w$ is the maximum of the observed lengths, that is,
$$\hat{w}_{\mathrm{MLE}} = \max\{x_1, x_2, \ldots, x_n\}$$
Hint: You don't need to use calculus.
(c) (2 points) Cindy decides to instead use a Bayesian approach. She has a prior belief that $w$ follows a Pareto distribution with parameters $\alpha, \beta > 0$. We can write:
$$w \sim \mathrm{Pareto}(\alpha, \beta)$$
Then the density function of $w$ is
$$p(w) = \frac{\alpha \beta^{\alpha}}{w^{\alpha + 1}} \, \mathbb{1}(w \ge \beta)$$
Show that the posterior distribution for $w$ is also a Pareto distribution, and compute the parameters as a function of $\alpha$, $\beta$, and the observations $x_1, \ldots, x_n$.
(d) (2 points) Provide a short description in plain English that explains what the parameters of the Pareto distribution mean, in the context of the Pareto-uniform conjugate pair.
Hint: For the Beta-Binomial conjugate pair that we explored in class, the answer would be that the Beta parameters act as pseudo-counts of observed positive and negative examples.
(e) (2 points) Cindy started with the initial prior with parameters $\alpha = 1$ and $\beta = 10$ on day 0. Using the starter code in beetledata.py, generate the data for the lengths of the beetles she sees, starting from Day 1 to Day 100. Use the data to make a graph of one curve for each of the days 1, 10, 50 and 100 (so four curves total), where each curve is the probability density function of Cindy's posterior for the respective day.
Note: For the Pareto distribution, code the density function by hand rather than relying on the distribution provided by scipy.
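A hand-coded density could look like the sketch below. It transcribes the Pareto density given in part (c) and evaluates the day-0 prior ($\alpha = 1$, $\beta = 10$) on a grid; the function name `pareto_pdf` and the grid range are illustrative assumptions, and the posterior parameters for later days are left to part (c).

```python
def pareto_pdf(w, alpha, beta):
    """Pareto(alpha, beta) density: alpha * beta**alpha / w**(alpha + 1) for w >= beta, else 0."""
    if w < beta:
        # The indicator 1(w >= beta) makes the density zero below beta.
        return 0.0
    return alpha * beta ** alpha / w ** (alpha + 1)


# Evaluate the day-0 prior on an evenly spaced grid (the range 5 to 50 is an arbitrary choice).
grid = [5 + 0.5 * k for k in range(91)]  # 5.0, 5.5, ..., 50.0
prior_density = [pareto_pdf(w, 1, 10) for w in grid]
```

The paired lists `grid` and `prior_density` can be passed to any plotting routine; repeating the evaluation with each day's posterior parameters gives the four curves the question asks for.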
(f) (0 points) (Optional) Use pymc to sample from the posterior for days 1, 10, 50 and 100 and plot a density function for each of the cases. Compare the results from the analytic and simulation-based computation of the densities.
Baseball Average Prediction
2. (8 points) The following historical dataset has been famous in the field of statistics ever since it was used by Brad Efron and Carl Morris to illustrate the James-Stein estimator and the Stein shrinkage phenomenon (see, for example, the Scientific American paper titled "Stein's Paradox in Statistics" by Efron and Morris, or Section 1.2 of Efron's book on Large Scale Inference).
In baseball, an at-bat (AB) is a hitter's turn batting against a pitcher. In each at-bat, the hitter can reach (or pass) first base on a hit (H). Batting average is used to measure a hitter's success and is calculated as the fraction $\mathrm{AVG} = \frac{H}{AB}$.
The baseball.csv dataset (shown in Table 1) contains 18 rows and 3 columns. Each row represents a baseball player and contains the following information:
• The player's name
• The player's number of hits (H) in the first 45 at-bats (AB)
• The player's End of Season Batting Average (EoSAverage), calculated as the proportion of hits over the total number of at-bats over the entire season
For example, the first row shows that Clemente had 18 hits in his first 45 at-bats and a .346 EoSAverage.
The goal is to use the players’ early season performance (as indicated by the second
column) to predict their end of season performance (as indicated by the third column).
Player Name     Number of Hits in the first 45 At-Bats     EoSAverage
Clemente        18                                         .346
F Robinson      17                                         .298
F Howard        16                                         .276
Johnstone       15                                         .222
Berry           14                                         .273
Spencer         14                                         .270
Kessinger       13                                         .263
L Alvarado      12                                         .210
Santo           11                                         .269
Swoboda         11                                         .230
Unser           10                                         .264
Williams        10                                         .256
Scott           10                                         .303
Petrocelli      10                                         .264
E Rodriguez     10                                         .226
Campaneris       9                                         .286
Munson           8                                         .316
Alvis            7                                         .200
Table 1: Some Statistics of 18 Baseball Players from the 1970 Season
(a) (1 point) For the $i$th player, we model their number of hits in the first 45 at-bats $H_i$ as
$$H_i \sim \mathrm{Bin}(45, \theta_i),$$
where $\theta_i$ is their EoSAverage. This model places the problem of predicting EoSAverages (based on hits in the first 45 at-bats) inside the framework of estimating probabilities in a Bernoulli/Binomial model. Is this a sensible model? Why or why not?
(b) (1 point) Calculate the mean squared error (MSE) of the naive proportion estimates of $\theta_i$ given by $\hat{\theta}_i = \frac{H_i}{45}$. Note that you are given values of $\theta_i$ in the last column of Table 1.
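As a hedged illustration of the MSE calculation, the sketch below defines a generic helper (the name `mean_squared_error` is hypothetical) and applies it to the first three rows of Table 1 only; extending the two lists to all 18 rows follows the same pattern.

```python
def mean_squared_error(estimates, targets):
    """Average of the squared differences between paired estimates and targets."""
    if len(estimates) != len(targets):
        raise ValueError("estimates and targets must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimates, targets)) / len(estimates)


# First three rows of Table 1 only (Clemente, F Robinson, F Howard).
hits = [18, 17, 16]
eos_average = [0.346, 0.298, 0.276]

# Naive estimate: hits divided by the 45 early-season at-bats.
naive = [h / 45 for h in hits]
mse_first_three = mean_squared_error(naive, eos_average)
print(round(mse_first_three, 4))  # 0.0052
```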
(c) (2 points) The goal now is to compare the naive estimate with Bayes estimates. To calculate Bayes estimates, we shall use a suitable Beta($a, b$) prior. To find the appropriate $a$ and $b$, use the following procedure:
• Ignore the top four players as well as the bottom four players in Table 1, as these players have either performed exceptionally well or exceptionally poorly in the first 45 at-bats, so their current averages may not be reflective of their EoSAverages.
• Calculate the mean $m$ and variance $v$ of the remaining 10 players.
• Find $a$ and $b$ such that the mean and variance of the Beta($a, b$) distribution match $m$ and $v$.
Report the values of $a$ and $b$, and plot the Beta($a, b$) density function.
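The moment-matching step can be sketched as follows. It relies on the standard Beta moments, mean $a/(a+b)$ and variance $ab / ((a+b)^2 (a+b+1))$, which give $a + b = m(1-m)/v - 1$. The function names and the inputs $m = 0.25$, $v = 0.001$ are placeholders chosen for illustration, not the values the procedure above produces.

```python
def beta_from_moments(m, v):
    """Return (a, b) such that Beta(a, b) has mean m and variance v."""
    if not 0 < m < 1:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0 < v < m * (1 - m):
        raise ValueError("variance must lie strictly between 0 and m * (1 - m)")
    total = m * (1 - m) / v - 1  # this quantity equals a + b
    return m * total, (1 - m) * total


def beta_mean_var(a, b):
    """Mean and variance of Beta(a, b), used here as a round-trip check."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Placeholder moments, not the ones computed from the 10 middle players.
a, b = beta_from_moments(0.25, 0.001)
print(a, b)                 # 46.625 139.875
print(beta_mean_var(a, b))  # approximately (0.25, 0.001)
```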
(d) (1 point) Calculate the Bayes estimates using the posterior mean for each $\theta_i$ using the Beta($a, b$) prior from the previous part.
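For reference, under a Beta($a, b$) prior and the Binomial model from part (a), the posterior after $H$ hits in 45 at-bats is Beta($a + H$, $b + 45 - H$), so the posterior mean is $(a + H)/(a + b + 45)$. The sketch below applies that formula with placeholder prior parameters $a = 2$, $b = 6$, which are assumptions for illustration and not the values from part (c); the function name is hypothetical.

```python
def beta_binomial_posterior_mean(hits, at_bats, a, b):
    """Posterior mean of theta under a Beta(a, b) prior and a Binomial(at_bats, theta) likelihood."""
    return (a + hits) / (a + b + at_bats)


# Placeholder prior, NOT the one derived in part (c).
a, b = 2.0, 6.0
prior_mean = a / (a + b)   # 0.25
naive = 18 / 45            # Clemente's early-season average, 0.4
bayes = beta_binomial_posterior_mean(18, 45, a, b)

# The Bayes estimate is pulled from the naive estimate toward the prior mean.
print(prior_mean < bayes < naive)  # True
```

The stronger the prior (the larger $a + b$ is relative to 45), the further each estimate moves toward the prior mean.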
(e) (1 point) Calculate the MSE of the Bayes estimates you calculated in part (d). The MSE of these estimates should be much smaller than the MSE of the naive proportions from part (b).
(f) (2 points) The naive estimates and the Bayes estimates differ in one crucial aspect. The naive estimate of the EoSAverage for a player only uses data on this player's current record. On the other hand, the Bayes estimate also uses data from other players' current records (because this data was used to calculate $a$ and $b$). Some people find it paradoxical that the EoSAverage prediction for a particular player should use data from other players, and find it hard to reconcile that these paradoxical estimates often significantly outperform the naive estimates in terms of accuracy. Provide a brief explanation of this paradox, which sometimes goes by the name "Stein's Paradox".
School District Funding Gaps
3. (15 points) In this question, you'll work with data on school funding provided by the School Finance Indicators Database. The dataset contains information on each school district in the US, including student demographics, district spending per student, test score outcomes, and more. You'll work with the following three columns:
• state_name
• fundinggap, the difference in how much the district should spend on each student and the amount it actually spends per student. Negative values indicate insufficient spending.
You can find more information about the data at the SFID website.
For ease of visualization, we'll limit our analysis to the following five states: California, the District of Columbia, Nevada, Oregon, and Texas. The file dcd.csv contains the dataset we will be using. You should use the provided q3.ipynb file to get started, which has cells with some useful variables already defined, and a hint about how to use fancy indexing. This notebook is not comprehensive: it only has some starter code and useful functions for this question.
(a) (1 point) Visualize the funding gap for all districts in the five states above. In two
sentences or less, describe any differences you see between the data from larger states
(California and Texas) and smaller ones (Nevada and DC).
(b) (3 points) We'll use a hierarchical model to help us understand state-level averages in the funding gap: each state will have a state-level mean $\mu_i$ with common mean $\alpha$, and for each district $j$ in state $i$, the funding gap $y_{ij}$ will be normally distributed with mean $\mu_i$. We'll assume the variances are known, so the model can be written as:
$$\mu_i \sim \mathrm{Normal}(\alpha, \sigma_0^2)$$
$$y_{ij} \sim \mathrm{Normal}(\mu_i, \sigma^2)$$
Draw a graphical model for this setup.
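One way to read the two lines of the model is as a recipe for generating data: first draw each state's mean, then draw each district's funding gap around its state's mean. The sketch below simulates that process with Python's standard library. The function name and the district counts are hypothetical and are not taken from dcd.csv.

```python
import random


def simulate_funding_gaps(districts_per_state, alpha, sigma0, sigma, seed=0):
    """Draw state means mu_i ~ Normal(alpha, sigma0^2), then gaps y_ij ~ Normal(mu_i, sigma^2)."""
    rng = random.Random(seed)
    state_means = {}
    gaps = {}
    for state, n_districts in districts_per_state.items():
        # Top level of the hierarchy: one mean per state.
        mu = rng.gauss(alpha, sigma0)
        state_means[state] = mu
        # Bottom level: one funding gap per district, centered on the state's mean.
        gaps[state] = [rng.gauss(mu, sigma) for _ in range(n_districts)]
    return state_means, gaps


# Hypothetical district counts, chosen only to contrast a large state with a small one.
counts = {"large_state": 500, "small_state": 2}
state_means, gaps = simulate_funding_gaps(counts, alpha=700, sigma0=4000, sigma=4000)
```

The order of the draws mirrors the direction of the arrows in the graphical model: $\alpha$ and $\sigma_0$ feed into each $\mu_i$, and $\mu_i$ and $\sigma$ feed into each $y_{ij}$.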
(c) (4 points) Implement the model from part (b) in PyMC, using $\alpha = \$700$, $\sigma_0 = \sigma = \$4000$. Using the plot_state_posterior_means function provided for you in the notebook, visualize the posterior distributions for the means of each of the five states. For which state(s)/district(s) is the posterior mean the most certain? For which state/district is it least certain? Explain why.
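Because the variances are treated as known, the posterior of each $\mu_i$ also has a closed form, which can serve as an analytic cross-check on sampled output. The sketch below is not the PyMC implementation the question asks for: it applies the standard normal-normal update, in which the posterior precision is $1/\sigma_0^2 + n_i/\sigma^2$ for a state with $n_i$ districts. The function name and the observed values are hypothetical.

```python
def normal_posterior(ys, alpha, sigma0, sigma):
    """Posterior mean and variance of mu given y_j ~ Normal(mu, sigma^2) and mu ~ Normal(alpha, sigma0^2)."""
    n = len(ys)
    # Precisions (inverse variances) add: one term from the prior, one per observation.
    precision = 1 / sigma0 ** 2 + n / sigma ** 2
    mean = (alpha / sigma0 ** 2 + sum(ys) / sigma ** 2) / precision
    return mean, 1 / precision


# Hypothetical observations: a state with a single district versus one with 99 districts.
mean_one, var_one = normal_posterior([-1000.0], alpha=700, sigma0=4000, sigma=4000)
mean_many, var_many = normal_posterior([-1000.0] * 99, alpha=700, sigma0=4000, sigma=4000)

# More districts give a larger precision, hence a smaller posterior variance.
print(var_many < var_one)  # True
```

With a single district, the posterior mean sits halfway between the prior mean and the observation because $\sigma_0 = \sigma$; with many districts, it sits close to the observed average.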
(d) (2 points) Re-run your model from the previous part, changing only one variable at a time as follows:
(i) $\alpha = \$700,\ \sigma_0 = \$4000,\ \sigma = \$400$
(ii) $\alpha = \$700,\ \sigma_0 = \$400,\ \sigma = \$4000$
(iii) $\alpha = -\$700,\ \sigma_0 = \$4000,\ \sigma = \$4000$
What changes in each of the three cases, and why?
Hint: you can answer this question by focusing on the changes in the mean for the
District of Columbia.
(e) (2 points) Suppose we had treated $\alpha$ as a normal random variable with mean $\gamma$ and standard deviation $\lambda$. Draw a graphical model for this new model.
Hint: the answer should only require a small change from your answer to part (b).
(f) (3 points) Implement the model from the previous part in PyMC, using $\gamma = 0$, $\sigma = 4000$, and $\lambda = 10000$. Using your samples, compute the posterior variance of the mean for the District of Columbia (DC), $\mathrm{var}(\mu_{\mathrm{DC}} \mid y)$, and the posterior variance of the mean for California, $\mathrm{var}(\mu_{\mathrm{CA}} \mid y)$.
(g) (0 points) (Optional) Use empirical Bayes and the data from all 50 states to determine the values of $\alpha$, $\sigma$, and $\sigma_0$ to use. Explain in three sentences or less how and why you chose the data to use when computing this value.
(h) (0 points) (Optional) Re-run your model from the previous part, changing only one variable at a time as follows:
(i) $\gamma = \$0,\ \lambda = \$10000,\ \sigma = \$400$
(ii) $\gamma = \$0,\ \lambda = \$100,\ \sigma = \$4000$
(iii) $\gamma = -\$0,\ \lambda = \$10000,\ \sigma = \$4000$
What changes in each of the three cases, and why? Explain any differences between your findings here and your findings from part (e).
(i) (0 points) (Optional) The histograms from parts (a) and (d) both have one color
per state, but the quantity being visualized in each graph is fundamentally different.
Explain this difference.
(j) (0 points) (Optional) Re-run your model from part (g) on all 50 states. How do the
results change?