lab01_solutions

October 4, 2022

1 Lab 1: Basics of Testing

Welcome to the first Data 102 lab! The goals of this lab are to get familiar with concepts in decision theory. We will learn more about testing, p-values, and FDR control.

The code you need to write is commented out with a message "TODO: fill...". There is additional documentation for each part as you go along.

1.1 Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names in the cell below.

1.2 Submission

To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run All), then print as a PDF (File > Download as > PDF) and submit it to Gradescope. For full credit, this assignment should be completed and submitted before Friday, September 9, 2022 at 11:59 PM PST.

1.3 Collaborators

Write the names of your collaborators in this cell.

<Collaborator Name> <Collaborator e-mail>

2 Setup

Let's begin by importing the libraries we will use. You can find the documentation for the libraries here:
  * matplotlib: https://matplotlib.org/3.1.1/contents.html
  * numpy: https://docs.scipy.org/doc/
  * pandas: https://pandas.pydata.org/pandas-docs/stable/
  * seaborn: https://seaborn.pydata.org/

[1]: import matplotlib.pyplot as plt
     import numpy as np
     import pandas as pd
     import seaborn as sns
     import scipy.stats
     from scipy.stats import norm
     import hashlib

     %matplotlib inline
     sns.set(style="dark")
     plt.style.use("ggplot")

     def get_hash(num):  # <- helper function for assessing correctness
         return hashlib.md5(str(num).encode()).hexdigest()

3 Question 1: Hypothesis testing, LRT, decision rules, P-values

The first question looks at the basics of testing. You will have to put yourself in the shoes of a detective who is trying to use 'evidence' to find the 'truth'. Given a piece of evidence $X$, your job will be to decide between two hypotheses. The two hypotheses you consider are:

The null hypothesis: $H_0: X \sim \mathcal{N}(0, 1)$

The alternative hypothesis: $H_1: X \sim \mathcal{N}(2, 1)$

Granted, you don't know the truth, but you have to make a decision that maximizes the True Positive Probability and minimizes the False Positive Probability.

In this exercise you will look at:
  - The intuitive relationship between the Likelihood Ratio Test and decisions based on thresholding $X$.
  - The performance of a level-$\alpha$ test.
  - The distribution of p-values for samples from the null distribution as well as samples from the alternative.

Let's start by plotting the distributions under the null and alternative hypotheses.

[2]: # NOTE: you just need to run this cell to plot the pdf; don't change this code.
     def null_pdf(x):
         return norm.pdf(x, 0, 1)

     def alt_pdf(x):
         return norm.pdf(x, 2, 1)

     # Plot the distribution under the null and alternative
     x_axis = np.arange(-4, 6, 0.001)

     plt.plot(x_axis, null_pdf(x_axis), label='$H_0$')  # <- likelihood under the null
     plt.fill_between(x_axis, null_pdf(x_axis), alpha=0.3)
     plt.plot(x_axis, alt_pdf(x_axis), label='$H_1$')   # <- likelihood under the alternative
     plt.fill_between(x_axis, alt_pdf(x_axis), alpha=0.3)
     plt.xlabel("X")
     plt.ylabel("Likelihood")
     plt.title("Comparison of null and alternative likelihoods");
     plt.legend()
     plt.tight_layout()
     plt.show()

By inspecting the image above we can see that if the data lies towards the right, then it seems more plausible that the alternative is true. For example, $X \ge 1.64$ seems much less likely to belong to the null pdf than the alternative pdf.

3.0.1 Likelihood Ratio Test

In class we said that the optimal test is the Likelihood Ratio Test (LRT), which is the result of the celebrated Neyman-Pearson Lemma. It says that the optimal level-$\alpha$ test is the one that rejects the null (aka makes a discovery, favors the alternative) whenever

$$LR(x) := \frac{f_1(x)}{f_0(x)} \ge \eta,$$

where $\eta$ is chosen such that the false positive rate is equal to $\alpha$.
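Nothing in the lab asks for this, but as a quick illustration of how $\eta$ and $\alpha$ are linked, one can estimate the false positive rate of the rule "reject when $LR(X) \ge \eta$" by simulation under the null and see which $\eta$ lands near a target $\alpha$. This is a minimal sketch, assuming the null_pdf and alt_pdf helpers from the plotting cell above; the seed and the grid of $\eta$ values are illustrative choices, not part of the lab.

[ ]: # Sketch: for a candidate cutoff eta, estimate the FPR of the rule "reject when LR(X) >= eta"
     # by Monte Carlo under the null, then look for an eta whose estimate is close to alpha.
     rng = np.random.RandomState(0)   # hypothetical seed, separate from the lab's generator
     null_draws = rng.randn(200000)   # X ~ N(0, 1)

     def lr(x):
         return alt_pdf(x) / null_pdf(x)

     for eta in [2.0, 3.0, 3.6, 5.0]:
         # fraction of null draws that the LR rule would (wrongly) reject
         print(eta, (lr(null_draws) >= eta).mean())
     # For this pair of Gaussians, eta around 3.6 gives an estimated FPR near 0.05.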
3.0.2 But how does this result fit with the intuition that we should set a decision threshold based on the value of $X$ directly?

This exercise will formalize that intuition. Let's start by computing the ratio of the likelihoods. The likelihood of $X \sim \mathcal{N}(\mu, \sigma)$ is:

$$f_{\sigma,\mu}(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Luckily scipy has a nifty function to compute the likelihood of Gaussians: scipy.stats.norm.pdf(x, mu, sigma)

3.1 Part 1.a: Calculate likelihood ratios

Complete the function below that computes the likelihood ratio for any value x.

[3]: # TODO: fill in the missing expression for the likelihood ratio in the function below
     def calculate_likelihood_ratio(x):
         """
         Computes the likelihood ratio between the alternative and null hypothesis.

         Inputs:
             x: value for which to compute the likelihood ratio

         Outputs:
             lr: the likelihood ratio at point x
         """
         L0 = null_pdf(x)
         L1 = alt_pdf(x)
         LR = L1 / L0  # TODO: fill in the likelihood ratio
         return LR

[4]: # Compute the likelihood ratio for X=1.64
     X = 1.64
     LR = calculate_likelihood_ratio(X)
     print(LR)

     assert(get_hash(LR) == 'f9983e1a6585502f3006cb6d1c1edec3')
     print("Test passed!")

3.59663972556928
Test passed!

Let's plot the likelihood ratios for different values of $X$:

[5]: # The code below plots the LR for different values of X
     # Once you've filled in `calculate_likelihood_ratio` run this cell and inspect the plot
     x_axis = np.arange(-1, 3, 0.001)
     plt.plot(x_axis, calculate_likelihood_ratio(x_axis))
     plt.vlines(X, 0, LR, linestyle="dotted", color='k')
     plt.hlines(LR, -1, X, linestyle="dotted", color='k')
     plt.scatter(X, LR, 30, color='k')
     plt.xlabel("X")
     plt.ylabel("Likelihood Ratio")
     plt.title("Likelihood ratio as a function of X");
     plt.tight_layout()
     plt.show()

The plot above illustrates that deciding based on the LRT with $\eta = 3.6$ (the dotted horizontal line) is equivalent to deciding in favor of the alternative whenever $X \ge 1.64$ (the dotted vertical line). The set $[1.64, +\infty)$ is called the rejection region of the test, because for all $X$ values in the rejection region the test rejects the null in favor of the alternative.

This illustrates that our intuition was correct. When thinking in terms of likelihood ratios it seems very tricky to compute the False Positive Rate (FPR); however, in this case we can bypass that by testing based on the value of $X$, as the short check below illustrates.
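Here is that check in code (a sketch, not part of the lab; it reuses calculate_likelihood_ratio from Part 1.a). Because $LR(x)$ is strictly increasing in $x$ for these two Gaussians, the rule "$LR(x) \ge \eta$" with $\eta = LR(1.64)$ and the rule "$x \ge 1.64$" make identical decisions:

[ ]: # Sketch: the LR rule with eta = LR(1.64) and the rule "X >= 1.64" agree everywhere on a grid,
     # because LR(x) is strictly increasing in x when the alternative mean exceeds the null mean.
     grid = np.arange(-4, 6, 0.01)
     eta = calculate_likelihood_ratio(1.64)
     lr_rule = calculate_likelihood_ratio(grid) >= eta
     x_rule = grid >= 1.64
     print(np.all(lr_rule == x_rule))  # expected: True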
The figure below illustrates pictorially the FPR when testing based on the threshold $X \ge 1.64$.

[6]: x_axis = np.arange(-4, 5, 0.001)
     plt.plot(x_axis, null_pdf(x_axis), label='$H_0$')  # <- likelihood under the null
     plt.plot(x_axis, alt_pdf(x_axis), label='$H_1$')   # <- likelihood under the alternative

     rejection_region = np.arange(X, 5, 0.001)  # <- truncate the true rejection region for plotting purposes
     plt.fill_between(rejection_region, null_pdf(rejection_region), alpha=0.3, label="FPR")

     plt.xlabel("X")
     plt.ylabel("Likelihood")
     plt.title("Comparison of null and alternative likelihoods");
     plt.legend()
     plt.tight_layout()
     plt.show()
Under the null hypothesis it is still possible to observe values in the tail of the distribution. The probability of that happening is exactly the FPR (and it is illustrated by the shaded area under the null curve).

Assume that we are using the test that rejects for all $X \ge \tau$ (in the example above, $\tau = 1.64$). The FPR can then be computed as:

$$FPR(\tau) = \mathbb{P}\{X > \tau \mid H_0 \text{ is true}\} = 1 - \mathbb{P}\{X < \tau \mid H_0 \text{ is true}\} = 1 - F(\tau),$$

where $F(\cdot)$ denotes the CDF of the null distribution, which in this case is the standard Gaussian.

3.2 Part 1.b: Calculate the probability of False Positives

In the cell below, calculate the FPR for this test. Hint: the CDF of a standard normal might come in handy for this function: scipy.stats.norm.cdf

[7]: # TODO: fill in the missing expression for FPR
     def calculate_fpr(tau):
         """
         Calculates the FPR for the test based on thresholding X >= tau.
         It assumes that the null distribution is the standard gaussian N(0,1).

         Inputs:
             tau: test threshold

         Outputs:
             fpr: false positive rate
         """
         fpr = 1 - norm.cdf(tau)  # TODO: fill in
         return fpr

[8]: # Once you've filled `calculate_fpr` you can run this to test for correctness
     thresholds = [0, 0.5, 1, 2, 3]
     fpr_vals = calculate_fpr(thresholds)

     hash_list = ['d310cb367d993fb6fb584b198a2fd72c',
                  '77d8304f0ac6b94895e8061eff588d52',
                  'df26eb4f07782680a6d98b89313aadc1',
                  'e575e59c20eeeda5d6a29f5ed3e1e2c7',
                  'f233728ef2007eef8f97f8432837feee']
     assert [get_hash(fpr) == hash_list[i] for (i, fpr) in enumerate(fpr_vals)]
     print("Test passed!!!")

Test passed!!!

Let's now compute the FPR for a test that rejects whenever $X \ge 1.64$.

[9]: # Calculate the false positive rate
     X = 1.64
     fpr = calculate_fpr(X)
     print(fpr)
0.050502583474103746

Looks like we got really lucky!!! The threshold we chose has an FPR of ~5%.

3.3 Part 1.c: Make a level-$\alpha$ decision rule

Complete the function below to create a decision rule that has $FPR = \alpha$. Given that we are working with Gaussians, this test is exactly the Z-score test that you might have learned in previous classes.

Hint: the inverse CDF of a standard normal might come in handy for this function: scipy.stats.norm.ppf

[10]: # TODO: complete this function
      def make_decision(X, alpha):
          """
          Makes a decision whether to reject the null hypothesis for point X at level alpha.

          Inputs:
              X: point at which to test
              alpha: desired FPR rate for the decision rule (also known as significance level)

          Outputs:
              decision: {0, 1} or {False, True}
                  1/True: reject the null
                  0/False: fail to reject the null
          """
          # TODO: compute the threshold for which the FPR of the test is equal to alpha (see Hint)
          threshold = norm.ppf(1 - alpha)
          # TODO: compute the decision; 1 stands for rejecting the null
          decision = X >= threshold
          return decision

[11]: # Once you've filled `make_decision` run this to perform this test
      # for a few values of X at different levels alpha
      X_vals = np.array([0, 0.5, 1, 2, 3])
      alphas = np.array([0.01, 0.05, 0.1, 0.2])

      for alpha in alphas:
          decisions = make_decision(X_vals, alpha)
          print(f'At FPR={alpha} the null hypothesis is rejected for these X values: {X_vals[decisions == 1]}')

At FPR=0.01 the null hypothesis is rejected for these X values: [3.]
At FPR=0.05 the null hypothesis is rejected for these X values: [2. 3.]
At FPR=0.1 the null hypothesis is rejected for these X values: [2. 3.]
At FPR=0.2 the null hypothesis is rejected for these X values: [1. 2. 3.]

[12]: # Running correctness tests: Do not modify
      hashes = ['cfcd208495d565ef66e7dff9f98764da', 'c4ca4238a0b923820dcc509a6f75849b']
      hashes_ids = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

      import itertools
      inputs = itertools.product(X_vals, alphas)
      outputs = [int(make_decision(*input)) for input in inputs]
      for (i, output) in enumerate(outputs):
          assert(get_hash(output) == hashes[hashes_ids[i]])
      print("Test passed!!!")

Test passed!!!

3.4 Part 1.d: Compute P-values

Let's take a step back and look at what we have accomplished. We came up with a decision rule that rejects the null hypothesis for a piece of evidence $X$. The test is parametrized at a level $\alpha$ chosen a priori to reflect our aversion to False Positives. However, testing returns a binary output: Reject or Fail to reject (1 or 0).

In the example above, at level $\alpha = 0.01$ we reject the null only for $X = 3$; however, at level $\alpha = 0.05$ we reject the null for $X = 2$ as well. We have already seen that increasing the FPR increases the rejection region of the test. However, you might wonder: for $X = 2$, what is the smallest $\alpha$ level such that the corresponding test rejects the null hypothesis in favor of the alternative?

P-values try to answer exactly that question: "Given a point $X$, and a family of tests parametrized by $\alpha$, what is the smallest $\alpha$ for which the test rejects the null?"

$$p(X) = \min \{\alpha : \text{Decision}_\alpha(X) = 1\}$$

Hence, P-values tell us something more than just a binary accept/reject answer. The P-value associated with the point $X$ quantifies the strength of the evidence in favor of rejecting the null. Small P-values suggest that the evidence is significant, while large P-values suggest that there is little evidence.

3.4.1 In the cell below write a function that computes the P-value for a point $X$.

Hint: You already wrote that function in one of the previous exercises, it just had a different name.

[13]: # TODO: complete this function
      def calculate_p_value(X):
          """
          Calculates the P-value for the point X.

          Inputs:
              X: data point
          Outputs:
              p_value: P(X)
          """
          p_value = 1 - norm.cdf(X)  # TODO: fill in
          return p_value

[14]: # Once you've filled `calculate_p_value`, run this to compute P-values for a few X samples.
      X_vals = np.array([0, 0.5, 1, 2, 3])
      for X in X_vals:
          print(f'X = {X}, P(X) = {calculate_p_value(X)}')

X = 0.0, P(X) = 0.5
X = 0.5, P(X) = 0.3085375387259869
X = 1.0, P(X) = 0.15865525393145707
X = 2.0, P(X) = 0.02275013194817921
X = 3.0, P(X) = 0.0013498980316301035

[15]: # Running correctness tests: Do not modify
      p_vals = calculate_p_value(X_vals)
      hash_list = ['d310cb367d993fb6fb584b198a2fd72c',
                   '77d8304f0ac6b94895e8061eff588d52',
                   'df26eb4f07782680a6d98b89313aadc1',
                   'e575e59c20eeeda5d6a29f5ed3e1e2c7',
                   'f233728ef2007eef8f97f8432837feee']
      assert [get_hash(pv) == hash_list[i] for (i, pv) in enumerate(p_vals)]
      print("Test passed!!!")

Test passed!!!

3.5 Part 1.e: Distribution of p-values

Now, we are going to imagine that we have a bunch of samples (each drawn either from the null distribution or the alternative distribution). We want to predict whether each sample was generated from $H_0$ or $H_1$ by looking at its p-value. As a reminder, the two hypotheses to consider are:

The null hypothesis: $H_0: X \sim \mathcal{N}(0, 1)$

The alternative hypothesis: $H_1: X \sim \mathcal{N}(2, 1)$

Assume there are $n = 10000$ draws, approximately 80% of which are nulls (Reality = 0).

[16]: # NOTE: you just need to run this cell to instantiate variables; don't change this code.
      rs = np.random.RandomState(0)
      n = 10000

      # roughly 80% of the data comes from the null distribution
      # true_values is an n-dimensional array of indicators, where "1" means that x is from the alternative
      true_values = rs.binomial(1, 0.2, n)

      # null distribution is N(0, 1) and alternative distribution is N(2, 1)
      x_obs = rs.randn(n) + 2 * true_values

      sns.histplot(x_obs[np.where(true_values == 0)],
                   label="samples from null $H_0$ distribution", kde=False, color='orange')
      sns.histplot(x_obs[np.where(true_values == 1)],
                   label="samples from alt. $H_1$ distribution", kde=False, color='blue')
      plt.xlabel("x")
      plt.ylabel("frequency")
      plt.legend(bbox_to_anchor=(1, 1));

[17]: # NOTE: you just need to run this cell and understand what it does; no code to modify or write here.
      # calculate the p-values for each individual hypothesis
      p_values = calculate_p_value(x_obs)

      bins = np.linspace(0, 1, num=20)
      sns.histplot(p_values[np.where(true_values == 0)],
                   label="samples from null $H_0$ distribution", kde=False, bins=bins, color='orange')
      sns.histplot(p_values[np.where(true_values == 1)],
                   label="samples from alt. $H_1$ distribution", kde=False, bins=bins)
      plt.legend(bbox_to_anchor=(1, 1))
      plt.xlabel("p-value")
      plt.ylabel("frequency");

3.5.1 What do you notice?

TODO: fill in (<=2 sentences) of your observations. Your answer should connect what you see in the graph with the ideas we've been talking about in class.

The p-value is uniformly distributed under the null hypothesis. Since the distribution of the p-values drawn from $H_1$ is heavily skewed towards zero, we can be fairly certain they are not drawn from $H_0$ even without seeing the labels.
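A quick way to see the first claim (this check is a sketch and not part of the lab): under $H_0$ we have $p = 1 - \Phi(X)$ with $X \sim \mathcal{N}(0, 1)$, so $\mathbb{P}(p \le t) = \mathbb{P}(X \ge \Phi^{-1}(1 - t)) = t$, i.e. the null p-values are Uniform(0, 1). Simulating directly, assuming only the numpy and scipy.stats.norm imports from the setup above:

[ ]: # Sketch: the empirical CDF of null p-values is close to the identity; alternative p-values pile up near 0.
     rng = np.random.RandomState(1)              # hypothetical seed, separate from the lab's generator
     null_p = 1 - norm.cdf(rng.randn(100000))        # p-values for draws from N(0, 1)
     alt_p = 1 - norm.cdf(rng.randn(100000) + 2)     # p-values for draws from N(2, 1)
     for t in [0.05, 0.25, 0.5, 0.9]:
         print(t, (null_p <= t).mean(), (alt_p <= t).mean())
     # Expect the second column to be roughly t (uniformity) and the third to be much larger for small t.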
4 Question 2: Multiple Testing - Procedures to control false discovery rate

In the previous example we looked primarily at controlling row-wise quantities. Specifically, we came up with a decision rule that controls the false positive rate at a desired level $\alpha$.

Now we are switching perspectives and are thinking about a column-wise quantity. Our goal is to control the probability of false discoveries in this decision-making process for multiple hypothesis testing.

We will use three methods for making discoveries:
  1. Naive thresholding (ignoring that multiple testing is happening)
  2. Using Bonferroni correction to account for multiple testing
  3. The Benjamini-Hochberg procedure for multiple testing

For each method, we will assess the decisions made on a simulated data set.

4.1 Part 2.a: Fill in the following functions regarding confusion matrices.

These functions will be important for reporting your results in a standardized way; later code assumes that you have implemented them, so start here.

[18]: # TODO: complete this function
      def report_results(predicted_discoveries, truth):
          """
          Produces a dictionary with counts for the true positives, true negatives,
          false negatives, and false positives from the input `predicted_discoveries`
          and `truth` arrays.

          Inputs:
              predicted_discoveries: array of 0/1 values where 1 indicates a "discovery".
              truth: array of 0/1 values where 1 indicates a draw from the alternative.

          Outputs:
              a dictionary of TN, TP, FN, and FP counts.
          """
          # populate the following dictionary with counts (NOT rates)
          # TODO: fill in each of these counts
          TP_count = sum(predicted_discoveries * truth)            # TODO: fill in
          TN_count = sum((1 - predicted_discoveries) * (1 - truth))  # TODO: fill in
          FP_count = sum(predicted_discoveries * (1 - truth))        # TODO: fill in
          FN_count = sum((1 - predicted_discoveries) * truth)        # TODO: fill in

          results_dictionary = {
              "TN_count": TN_count,
              "TP_count": TP_count,
              "FN_count": FN_count,
              "FP_count": FP_count,
          }

          # this function is defined for you below
          print_confusion_matrix(results_dictionary)
          return results_dictionary
      # TODO: complete this function
      def print_false_discovery_fraction(results_dictionary):
          # TODO: fill in
          total_predicted_discoveries = results_dictionary["FP_count"] + results_dictionary["TP_count"]
          false_predicted_discoveries = results_dictionary["FP_count"]

          # TODO: fill in - compute the false discovery fraction from the `results` dictionary
          false_discovery_frac = false_predicted_discoveries / total_predicted_discoveries

          print("total discoveries: {0}".format(total_predicted_discoveries))
          print("fraction of discoveries which were actually false: {0:.3f}".format(false_discovery_frac))
          return total_predicted_discoveries, false_discovery_frac

      def print_confusion_matrix(res_dict):
          # This is a helper function to print the confusion matrix. You don't need to modify this code.
          results_df = pd.DataFrame(data={"Decision = 0": [res_dict['TN_count'], res_dict['FN_count']],
                                          "Decision = 1": [res_dict['FP_count'], res_dict['TP_count']]},
                                    index=["Truth = 0", "Truth = 1"])
          print(results_df)

4.2 Part 2.b: Naive thresholding

Here we will investigate the result of using the threshold $\alpha = 0.05$ to test each hypothesis independently, ignoring that we are in a multiple testing scenario.

Fill in the code for the function below to test each hypothesis at significance level $\alpha$. Hint: this is very similar to the make_decision function you wrote in Problem 1. There the input to the test was the sample value $X$; here the input is the P-value $p(X)$.

[19]: # TODO: calculate decisions based on thresholding
      def naive_alpha_threshold(p_values, alpha):
          """
          Returns decisions on p-values using naive (uncorrected) thresholding.

          Inputs:
              p_values: array of p-values
              alpha: threshold (significance level)

          Returns:
              decisions: binary array of same length as p-values, where `decisions[i]` is 1
                  if `p_values[i]` is deemed significant at level `alpha`, and 0 otherwise
          """
          # TODO: fill in
          decisions = p_values <= alpha
          return decisions

[20]: # Once you've filled in `naive_alpha_threshold`, run this cell to print the results.
      # set alpha
      alpha = 0.05

      # Using the p-values from Part 1.e, we compute the decision according to the naive function
      naive_decisions = naive_alpha_threshold(p_values, alpha)
      results = report_results(naive_decisions, true_values)
      print()
      print_false_discovery_fraction(results)
      print()

           Decision = 0  Decision = 1
Truth = 0          7628           400
Truth = 1           679          1293

total discoveries: 1693
fraction of discoveries which were actually false: 0.236

4.3 Part 2.c: Bonferroni Correction

Here we will investigate the result of using Bonferroni-corrected p-values to declare discoveries.

First, implement the Bonferroni procedure in the function below. Recall that for testing $n$ hypotheses with family-wise error rate (FWER) $\le \alpha$, the resulting procedure is to test each hypothesis at significance level $\frac{\alpha}{n}$.

[21]: # TODO: calculate the decisions based on the Bonferroni correction procedure.
      def bonferroni(p_values, alpha_total):
          """
          Returns decisions on p-values using the Bonferroni correction.

          Inputs:
              p_values: array of p-values
              alpha_total: desired family-wise error rate (FWER = P(at least one false discovery))

          Returns:
              decisions: binary array of same length as p-values, where `decisions[i]` is 1
                  if `p_values[i]` is deemed significant, and 0 otherwise
          """
          # TODO: fill in
          n = len(p_values)
          decisions = p_values <= alpha_total / n
          return decisions

[22]: # Once you've filled in `bonferroni`, run this cell to print the results.
      bonferroni_decisions = bonferroni(p_values, alpha)
      results = report_results(bonferroni_decisions, true_values)
      print()
      print_false_discovery_fraction(results)
      print()

           Decision = 0  Decision = 1
Truth = 0          8028             0
Truth = 1          1957            15

total discoveries: 15
fraction of discoveries which were actually false: 0.000

4.4 Part 2.d: Benjamini-Hochberg

Now we will investigate the result of applying the Benjamini-Hochberg procedure for multiple hypothesis testing.

First, implement the Benjamini-Hochberg procedure in the function below. Recall that for testing $n$ hypotheses with false discovery rate (FDR) $\le \alpha$, the procedure is to find the largest $k$ such that the $k$-th smallest of the $n$ p-values is less than or equal to $\frac{k\alpha}{n}$:

$$P_{(k)} \le \frac{k\alpha}{n}$$

We then declare a discovery for all p-values less than or equal to this $k$-th smallest p-value.

[23]: # TODO: calculate decisions based on the Benjamini-Hochberg procedure
      def benjamini_hochberg(p_values, alpha):
          """
          Returns decisions on p-values using Benjamini-Hochberg.

          Inputs:
              p_values: array of p-values
              alpha: desired FDR (FDR = E[# false positives / # positives])

          Returns:
              decisions: binary array of same length as p-values, where `decisions[i]` is 1
                  if `p_values[i]` is deemed significant, and 0 otherwise
          """
          # TODO: fill in
          n = len(p_values)
          sorted_p = np.sort(p_values)
          # index of the largest sorted p-value that falls below its BH threshold
          # (assumes at least one p-value clears its threshold, which holds for this data)
          max_k = max([k for k in range(n) if sorted_p[k] <= (k + 1) * (alpha / n)])
          threshold = sorted_p[max_k]
          decisions = p_values <= threshold
          return decisions

Now, assess the result of applying the Benjamini-Hochberg procedure to the simulated data.

[24]: # Once you've filled in `benjamini_hochberg`, run this cell to print the results.
      bh_decisions = benjamini_hochberg(p_values, alpha)
      bh_results = report_results(bh_decisions, true_values)
      print()
      print_false_discovery_fraction(bh_results)

           Decision = 0  Decision = 1
Truth = 0          8011            17
Truth = 1          1581           391

total discoveries: 408
fraction of discoveries which were actually false: 0.042

[24]: (408, 0.041666666666666664)

4.5 Part 2.e: Conclusions

Finally, write a short (<= 4 sentences) summary comparing the three different methods from this problem.

TODO: fill in your comparison.

The naive version has a much larger than $\alpha$ false discovery rate due to multiple testing. Bonferroni is highly conservative for controlling FDR (because FDR is not what it controls), with only 15 discoveries but all of them correct; under family-wise error rate control at $\alpha = 0.05$, we have a ~95% chance that all discoveries are simultaneously correct, so that's not surprising. B-H comes close to our desired FDR (because that's directly what it controls).
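The comparison above is based on a single simulated data set. A rough way to see the different guarantees (a sketch, not part of the lab; it reuses calculate_p_value, naive_alpha_threshold, bonferroni, and benjamini_hochberg defined earlier, and the trial count and seed are illustrative choices) is to repeat the simulation and average the false discovery proportion of each method:

[ ]: # Sketch: estimate the average false discovery proportion (FDP) of each procedure
     # over repeated simulations of the same setup (80% nulls, alternative N(2, 1)).
     def average_fdp(method, n_trials=200, n=10000, alpha=0.05, seed=0):
         rng = np.random.RandomState(seed)
         fdps = []
         for _ in range(n_trials):
             truth = rng.binomial(1, 0.2, n)
             x = rng.randn(n) + 2 * truth
             decisions = method(calculate_p_value(x), alpha)
             n_discoveries = decisions.sum()
             false_discoveries = (decisions * (1 - truth)).sum()
             fdps.append(false_discoveries / max(n_discoveries, 1))  # FDP defined as 0 if no discoveries
         return np.mean(fdps)

     for name, method in [("naive", naive_alpha_threshold),
                          ("bonferroni", bonferroni),
                          ("benjamini-hochberg", benjamini_hochberg)]:
         print(name, average_fdp(method))
     # Expect roughly: naive well above 0.05, Bonferroni near 0, BH at or below 0.05.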
4.6 Final tests

If all the tests below pass you can assume you have successfully completed the testable parts of the lab. Don't worry about understanding the code below; just make sure no asserts fail.

[25]: def assert_discoveries(results, true_values,
                             true_positives_hash,
                             false_positives_hash,
                             true_negatives_hash,
                             false_negatives_hash,
                             false_discovery_frac_hash):
          def get_hash(num):
              return hashlib.md5(str(num).encode()).hexdigest()

          res_dict = report_results(results, true_values)
          assert(get_hash(res_dict['TP_count']) == true_positives_hash)
          assert(get_hash(res_dict['FP_count']) == false_positives_hash)
          assert(get_hash(res_dict['TN_count']) == true_negatives_hash)
          assert(get_hash(res_dict['FN_count']) == false_negatives_hash)
          _, false_discovery_frac = print_false_discovery_fraction(res_dict)
          print(get_hash(false_discovery_frac))
          assert(get_hash(false_discovery_frac) == false_discovery_frac_hash)
          print()

      assert_discoveries(naive_decisions, true_values,
                         true_positives_hash="7c82fab8c8f89124e2ce92984e04fb40",
                         false_positives_hash="18d8042386b79e2c279fd162df0205c8",
                         true_negatives_hash="7bb16972da003e87724f048d76b7e0e1",
                         false_negatives_hash="ca9c267dad0305d1a6308d2a0cf1c39c",
                         false_discovery_frac_hash="87600af3ad8560db2ef1ec43fc0e9877")

      assert_discoveries(bonferroni_decisions, true_values,
                         true_positives_hash="9bf31c7ff062936a96d3c8bd1f8f2ff3",
                         false_positives_hash="cfcd208495d565ef66e7dff9f98764da",
                         true_negatives_hash="f8e918489f1e0a81ff11312f4d0630c1",
                         false_negatives_hash="277a78fc05c8864a170e9a56ceeabc4c",
                         false_discovery_frac_hash="30565a8911a6bb487e3745c0ea3c8224")

      assert_discoveries(bh_decisions, true_values,
                         true_positives_hash="5a4b25aaed25c2ee1b74de72dc03c14e",
                         false_positives_hash="70efdf2ec9b086079795c442636b55fb",
                         true_negatives_hash="e1226495c14f1a62ae17aa76c1f0d457",
                         false_negatives_hash="88a199611ac2b85bd3f76e8ee7e55650",
                         false_discovery_frac_hash="eac9ed4c22d75c17b5211c4c2468bd52")

      print("All tests passed! You are awesome!!!")

           Decision = 0  Decision = 1
Truth = 0          7628           400
Truth = 1           679          1293
total discoveries: 1693
fraction of discoveries which were actually false: 0.236
87600af3ad8560db2ef1ec43fc0e9877

           Decision = 0  Decision = 1
Truth = 0          8028             0
Truth = 1          1957            15
total discoveries: 15
fraction of discoveries which were actually false: 0.000
30565a8911a6bb487e3745c0ea3c8224

           Decision = 0  Decision = 1
Truth = 0          8011            17
Truth = 1          1581           391
total discoveries: 408
fraction of discoveries which were actually false: 0.042
eac9ed4c22d75c17b5211c4c2468bd52

All tests passed! You are awesome!!!

[ ]: