HW2_Solutions
Data 102 (Statistics), University of California, Berkeley
Data 102 Homework 2 — Due: 11:59 PM Wednesday, September 28, 2022

Overview

Submit your writeup, including any code, as a PDF via Gradescope.¹ We recommend reading through the entire homework beforehand and carefully using functions for testing procedures, plotting, and running experiments. Taking the time to reuse code will help in the long run!

Data science is a collaborative activity. While you may talk with others about the homework, please write up your solutions individually. If you discuss the homework with your peers, please include their names on your submission. Please make sure any handwritten answers are legible, as we may deduct points otherwise.

The One with all the Beetles

1. (9 points) Cindy has an inordinate fondness for beetles and for statistical modeling. She observes one beetle every day and keeps track of their lengths. From her studies she feels that the beetle lengths she sees are uniformly distributed, so she chooses a model in which the lengths of the beetles come from a uniform distribution on [0, w]: here w is an unknown parameter corresponding to the size of the largest possible beetle. Since the maximum size w is unknown to her, she would like to estimate it from the data. She observes the lengths of n beetles, and calls them x_1, ..., x_n.

(a) (1 point) What is the likelihood function of the observations x_1, ..., x_n? Express your answer as a function of the largest-size parameter w.

Solution: The likelihood of a single sample is the density of the uniform distribution on [0, w],

    f_1(x; w) = (1/w) 1[0 ≤ x ≤ w].    (1)

Here the subscript indicates the number of samples. Since all the samples are independent, we get

    f_n(x_1, ..., x_n; w) = (1/w^n) ∏_i 1[x_i ≤ w]    (2)
                          = (1/w^n) 1[max_i x_i ≤ w].    (3)

The last equality follows by noting that each of the x_i is at most w if and only if the maximum is at most w.
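As a sanity check, the piecewise form in equations (2)–(3) is easy to evaluate numerically. The sketch below (the helper name `uniform_likelihood` is our own, not part of the assignment code) shows that the likelihood vanishes whenever w is below the sample maximum and strictly decreases in w above it — which is also the key fact needed for part (b):

```python
import numpy as np

def uniform_likelihood(w, xs):
    """f_n(x_1, ..., x_n; w) = w^{-n} * 1[max_i x_i <= w], for w > 0."""
    xs = np.asarray(xs, dtype=float)
    if w <= 0 or xs.max() > w:
        # Some observation exceeds w: impossible under Uniform[0, w].
        return 0.0
    return float(w) ** (-len(xs))

xs = [1.2, 0.7, 2.9]
print(uniform_likelihood(2.0, xs))  # 0.0, since w < max(xs)
print(uniform_likelihood(2.9, xs))  # maximized exactly at w = max(xs)
print(uniform_likelihood(3.5, xs))  # strictly smaller than at w = 2.9
```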
This makes intuitive sense, since the uniform distribution puts no probability on points outside its domain.

Hint: Your answer should include the indicator function 1[max_i x_i ≤ w]. To see why, consider what happens if w = 3 cm and x_1 = 5 cm.

(b) (2 points) Use your answer from part (a) to explain why the maximum likelihood estimate for w is max_i x_i.

¹ In Jupyter, you can download as PDF or print to save as PDF.
Solution: We know from the earlier calculation that the likelihood is given by

    f_n(x_1, ..., x_n; w) = (1/w^n) 1[max_i x_i ≤ w].    (4)

Now we consider this as a function of w for fixed x_1, ..., x_n. First note that it is zero for w < max_i x_i. Next, for w ≥ max_i x_i it is a strictly decreasing function of w. Thus the maximum is achieved at w = max_i x_i.

(c) (2 points) Cindy decides to instead use a Bayesian approach. She has a prior belief that w follows a Pareto distribution with parameters α, β > 0. We can write:

    w ~ Pareto(α, β),    p(w) = (α β^α / w^(α+1)) 1[w > β].

If she observes values x_1, ..., x_n, show that the posterior distribution for w is also a Pareto distribution, and compute the parameters as a function of α, β, and the observations x_1, ..., x_n.

Solution: Let us denote the prior distribution as p and the posterior as p_1. Recall from Bayes' rule that the posterior is given by

    p_1(w | x_1, ..., x_n) = f_n(x_1, ..., x_n; w) p(w) / ∫ f_n(x_1, ..., x_n; t) p(t) dt.

Here the denominator uses the law of total probability. Let us look at the numerator and denominator separately. For the numerator,

    f_n(x_1, ..., x_n; w) p(w) = (α β^α / w^(α+1)) (1/w^n) 1[max_i x_i ≤ w] 1[w > β]
                               = α β^α w^(−α−n−1) 1[max_i x_i ≤ w] 1[w > β]
                               = α β^α w^(−α−n−1) 1[w > max{β, max_i x_i}].

Let us denote max{β, max_i x_i} as β′. Now looking at the denominator, we get

    ∫₀^∞ f_n(x_1, ..., x_n; t) p(t) dt = ∫₀^∞ α β^α t^(−α−n−1) 1[t > β′] dt
                                       = α β^α ∫_{β′}^∞ t^(−α−n−1) dt
                                       = α β^α [−t^(−α−n) / (α + n)]_{β′}^∞
                                       = α β^α β′^(−α−n) / (α + n).
Let α′ = α + n. Then, taking the ratio, we get

    p_1(w | x_1, ..., x_n) = (α′ β′^(α′) / w^(α′+1)) 1[w > β′],

which is the density of a Pareto(α′, β′) distribution.

(d) (2 points) Provide a short description in plain English that explains what the parameters of the Pareto distribution mean, in the context of the Pareto-uniform conjugate pair. Hint: For the Beta-Binomial conjugate pair that we explored in class, the answer would be that the Beta parameters acted as pseudo-counts of observed positive and negative examples.

Solution: From the previous part, we saw that β′ = max{β, max_i x_i} and α′ = α + n. This indicates that α keeps track of the number of samples seen so far, while β keeps track of the maximum of the samples seen so far (together with the prior guess β).

(e) (2 points) Let us say Cindy started with the initial prior values α = 1 and β = 10 on day 0. Use the code in beetledata.py to generate the data for the lengths of the beetles she sees, from day one to day one hundred. Use the data to make a graph with one curve for each of days 1, 10, 50 and 100 (so four curves total), where each curve is the probability density function of Cindy's posterior for the respective day. As with other familiar distributions, you can find the Pareto distribution in scipy.

Solution: (Plot omitted.) Note that as the days progress, the curves shift to the right as we see larger and larger maximum values. Our confidence in the estimate also increases, which can be seen in the sharpening of the peaks.

(f) (0 points) (Optional) Use pymc3 to sample from the posterior for days 1, 10, 50 and 100 and plot a density function for each of the cases. Compare the results from the
analytic and simulation-based computation of the densities.

Bayesian Fidget Spinners

2. (10 points) Nat's company manufactures fidget spinners. The company uses two factories, which we'll call factory 0 and factory 1. Each fidget spinner from factory k is defective with probability q_k (k ∈ {0, 1}). Nat knows that factory 0 produces fewer defective fidget spinners than factory 1 (in other words, q_0 < q_1).

She receives n boxes full of fidget spinners, but the boxes aren't labeled (in other words, she doesn't know which box is from which factory). For each box, she starts randomly pulling out fidget spinners until she finds a defective one, and records how many fidget spinners she pulled out (including the defective one). She calls this number x_i for box i, for i = 1, ..., n. She wants to estimate the following pieces of information:

  • Which boxes came from factory 0, and which came from factory 1? She defines a binary random variable z_i for each box with the factory label (i.e., z_i = 0 if box i is from factory 0, and z_i = 1 if box i is from factory 1).
  • How reliable is each factory? In other words, what are q_0 and q_1?

Inspired by what she learned about Gaussian mixture models, she sets up the following probability model:

    z_i ~ Bernoulli(π)
    q_k ~ Beta(α_k, β_k)
    x_i | z_i, q_0, q_1 ~ Geometric(q_{z_i})

(a) (1 point) Draw a graphical model for the probability model described above if n = 2 (i.e., there are only two boxes of fidget spinners).

Nat decides to implement the model above with the following hyperparameters:

    π = 0.45,    q_0 ~ Beta(1, 5),    q_1 ~ Beta(5, 1)

(b) (2 points) The choices of the parameters in Nat's model represent her beliefs about the factories.

i. (1 point) From her choice for π, what can we infer about her beliefs about the two factories?

Solution: Note that Nat's choice of π < 1/2.
So we can infer that she thinks that factory 0 produces more boxes of fidget spinners.
ii. (1 point) Similarly, from her choices for α and β, what can we infer about her beliefs about the factories?

Solution: Since Beta(1, 5) is more concentrated on smaller values than Beta(5, 1), one can infer that Nat believes that factory 0 has a lower rate of defects than factory 1.

(c) (5 points) Use fidget_data.py to generate the data that Nat observes. Then, using PyMC3, fit the model outlined above, setting the hyperparameters to the values that Nat chose. Obtain 1000 samples from the posterior distribution p(q_0, q_1 | x_1, ..., x_n), and generate a scatterplot (one point per sample). You can use the code below (also provided to you in fidget_model.py) to help you get started.

import numpy as np
import pymc3 as pm

alphas = ...
betas = ...
pi = ...

with pm.Model() as model:
    z = pm.Bernoulli(
        # Define the Bernoulli model here
    )

    # Hint: you should use the shape= parameter here so that
    # q is a PyMC3 array with both q0 and q1.
    q = ...

    # Hint: it may be useful to use "fancy indexing" like we did in class.
    # See below for an example.
    X = pm.Geometric(
        # Define the Geometric model here
    )

    trace = ...

# Fancy indexing example:
my_binary_array = np.array([0, 0, 1, 1, 0, 1])
my_real_array = np.array([0.27, 0.34])
print(my_real_array[my_binary_array])  # [0.27 0.27 0.34 0.34 0.27 0.34]

i. Under the posterior, what is the probability that factory 0 produces more boxes than factory 1?

Solution: Under the posterior, the value for π is approximately 0.45 (i.e., the model should tend toward estimating the ground truth of the simulated data). Thus the probability that a box is from factory 0 is approximately 0.55, and that from factory 1 is approximately 0.45. Thus the probability that
factory 0 produces more boxes is

    Pr[Binom(100, 0.45) ≤ 50].

One can explicitly compute this as

    Σ_{i=0}^{50} C(100, i) (0.45)^i (0.55)^(100−i) ≈ 0.8654.

An acceptable way for this problem set to compute this is to take samples from the posterior: count the fraction of posterior samples in which Σ_i z_i ≤ 50, i.e.

    (1/m) Σ_j 1[Σ_i z_i^(j) ≤ 50],

where m is the number of posterior samples, and output this as the empirical estimate of the probability. Note that the model fitting isn't especially stable over various runs, and this empirical estimate can deviate substantially from the analytical estimate.

ii. What is your median estimate of factory 0's defect rate, based on the samples from the posterior?

Solution: This should be a number close to 0.05.

(d) (2 points) Nat's friend Yaro suggests using Gibbs sampling. What is the Gibbs sampling update for q_k? Your answer should be in the form of a well-known distribution, along with values for the parameter(s) of that distribution. Justify your answer. Hint: you can derive the update analytically, or you can use the fact that the Beta distribution is a conjugate prior for a Geometric likelihood.

Solution:

    q_0 ~ Beta(α_0 + n_0, β_0 − n_0 + Σ_{i: z_i=0} x_i)
    q_1 ~ Beta(α_1 + n_1, β_1 − n_1 + Σ_{i: z_i=1} x_i)

where n_0 = Σ_{i=1}^n 1[z_i = 0] and n_1 = Σ_{i=1}^n 1[z_i = 1]. This follows by assuming that the likelihood for a generic x coming from factory k ∈ {0, 1} is (1 − q_k)^(x−1) q_k (indeed, you get x_i = x if you sample x − 1 non-defective fidget spinners and then a defective one).

Note: when grading, we will also accept the less precise solution where you assume that the likelihood is (1 − q_k)^x q_k, which gives the following Gibbs sampling updates:

    q_0 ~ Beta(α_0 + n_0, β_0 + Σ_{i: z_i=0} x_i)
    q_1 ~ Beta(α_1 + n_1, β_1 + Σ_{i: z_i=1} x_i).
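The updates above drop straight into code. Below is a minimal sketch of one Gibbs update for (q_0, q_1) given the labels z, using the (1 − q)^(x−1) q likelihood; the helper name and the synthetic data are our own for illustration and do not come from fidget_data.py:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_update_q(x, z, alphas, betas, rng):
    """Sample q_0, q_1 from their conditional posteriors given labels z.

    x : array of geometric observations (number of pulls per box)
    z : array of 0/1 factory labels
    """
    q = np.empty(2)
    for k in (0, 1):
        mask = (z == k)
        n_k = mask.sum()
        # Beta(alpha_k + n_k, beta_k - n_k + sum_{i: z_i = k} x_i)
        q[k] = rng.beta(alphas[k] + n_k, betas[k] - n_k + x[mask].sum())
    return q

# Synthetic data: 50 boxes from each factory with q0 = 0.05, q1 = 0.5.
x = np.concatenate([rng.geometric(0.05, size=50), rng.geometric(0.5, size=50)])
z = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

q = gibbs_update_q(x, z, alphas=np.array([1.0, 5.0]),
                   betas=np.array([5.0, 1.0]), rng=rng)
print(q)  # q[0] should land near 0.05 and q[1] near 0.5
```

Note that the second Beta parameter β_k − n_k + Σ x_i is always positive, since each geometric observation satisfies x_i ≥ 1.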
Rejection Sampling

3. (6 points) Consider the function

    g(x) = cos²(12x) · |x³ + 6x²| · 1[x ∈ (−1, −0.25) ∪ (0, 1)].

In this problem, we use rejection sampling to generate random variables with pdf f(x) = c g(x).

(a) (2 points) Plot g over its domain. What is a uniform proposal distribution q that covers the support of f? What is the largest possible constant M such that the scaled target distribution p(x) = M g(x) satisfies p(x) ≤ q(x) for all x?

Solution: One proposal is q(x) = (1/2) 1[x ∈ (−1, 1)]. Since g(x) ≤ 8 on the range −1 ≤ x ≤ 1, we can take M = 1/16 so that p(x) ≤ q(x). Since max_x g(x) ≈ 7.21, we can really take any M ≤ 1/14.42 for this proposal distribution.

(b) (2 points) Suppose you run rejection sampling with target p and proposal q from part (a) until you generate n samples, and your sampler runs a total of N ≥ n times, including n acceptances and N − n rejections. Explain how you can use n, N and M to estimate c.

Hint: the ratio of acceptances n to total runs N is an approximation of the ratio between the area under the curve p(x) and the area under q(x).

Hint: remember what happens if you integrate a pdf over its entire support.

Solution: Below we plot the target p and the proposal q (plot omitted).
Rejection sampling is set up so that we accept a sample if it lies below the target curve p. In particular,

    n/N ≈ ∫_{−1}^{1} p(x) dx / ∫_{−1}^{1} q(x) dx = M ∫_{−1}^{1} g(x) dx = (M/c) ∫ f(x) dx = M/c.

Thus, we shall use ĉ = NM/n to estimate c.

(c) (2 points) Use rejection sampling to generate a sample of size 10³ from p(x). Since f(x) is a pdf and is proportional to p(x), we can display its estimate easily: plot a normalized histogram of your sample, and overlay a smooth kernel density estimate, which will provide more information on the shape of the estimated distribution. Repeat the previous steps, increasing the number of samples to 10⁶.

Solution: (Plots omitted.)
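A minimal sketch of the sampler and the estimator ĉ = NM/n from parts (a)–(c), leaving out the histogram and kernel-density plotting; the helper names are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    """Unnormalized target: cos^2(12x) * |x^3 + 6x^2| on (-1, -0.25) U (0, 1)."""
    support = ((x > -1) & (x < -0.25)) | ((x > 0) & (x < 1))
    return np.cos(12 * x) ** 2 * np.abs(x**3 + 6 * x**2) * support

M = 1 / 16  # valid since g(x) <= 8 on (-1, 1), so M * g(x) <= 1/2 = q(x)

def rejection_sample(n, rng):
    """Draw n samples from p(x) = M g(x) using the Uniform(-1, 1) proposal.

    Returns the accepted samples and the total number of proposals N.
    """
    samples = []
    N = 0
    while len(samples) < n:
        x = rng.uniform(-1, 1)  # proposal draw, q(x) = 1/2 on (-1, 1)
        u = rng.uniform(0, 1)
        N += 1
        if u < M * g(x) / 0.5:  # accept with probability p(x) / q(x)
            samples.append(x)
    return np.array(samples), N

samples, N = rejection_sample(1000, rng)
c_hat = N * M / 1000  # estimate of the normalizing constant c
print(c_hat)
```

Every accepted sample necessarily lies in the support of g, since points outside it have g(x) = 0 and are always rejected.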