hw07

School: University of North Georgia, Gainesville
Course: 1401
Subject: Statistics
Date: Apr 3, 2024
Type: pdf
Pages: 12
Uploaded by MinisterIron104132
March 19, 2024

1 Homework 7: Testing Hypotheses

Reading:
* Testing Hypotheses

Please complete this notebook by filling in the cells provided. HOWEVER, the autograder files have not been integrated online and running the cell above will cause an error! Ignore this cell and cells of the form grader.check("qx.k").

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

For all problems that ask you to write explanations and sentences, you must provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure not to re-assign variables throughout the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on.

[8]: # Don't change this cell; just run it.
     import numpy as np
     from datascience import *

     # These lines do some fancy plotting magic.
     import matplotlib
     %matplotlib inline
     import matplotlib.pyplot as plt
     plt.style.use('fivethirtyeight')

     import warnings
     warnings.simplefilter('ignore', FutureWarning)

1.1 1. Spam Calls

1.2 Part 1: 781 Fun

Yanay gets a lot of spam calls. An area code is defined to be a three-digit number from 200-999 inclusive. In reality, many of these area codes are not in use, but for this question we'll simplify things and assume they all are. Throughout these questions, you should assume that Yanay's area code is 781.

Question 1. Assuming each area code is just as likely as any other, what's the probability that the area codes of two back-to-back spam calls are both 781?
[9]: prob_781 = (1/800)**2
     prob_781

[9]: 1.5625e-06

Question 2. Rohan already knows that Yanay's area code is 781. Rohan randomly guesses the last 7 digits (0-9 inclusive) of his phone number. What's the probability that Rohan correctly guesses Yanay's number, assuming he's equally likely to choose any digit?

Note: A phone number contains an area code and 7 additional digits, i.e. xxx-xxx-xxxx

[10]: prob_yanay_num = 1/10**7
      prob_yanay_num

[10]: 1e-07

Yanay suspects that there's a higher chance that the spammers are using his area code (781) to trick him into thinking it's someone from his area calling him. Ashley thinks that this is not the case, and that spammers are just choosing area codes of the spam calls at random from all possible area codes (Remember, for this question we're assuming the possible area codes are 200-999, inclusive). Yanay wants to test his claim using the 50 spam calls he received in the past month.

Here's a dataset of the area codes of the 50 spam calls he received in the past month.

[11]: # Just run this cell
      spam = Table().read_table('spam.csv')
      spam

[11]: Area Code
      891
      924
      516
      512
      328
      613
      214
      781
      591
      950
      … (40 rows omitted)

Question 3. Define the null hypothesis and alternative hypothesis for this investigation.

Hint: Don't forget that your null hypothesis should fully describe a probability model that we can use for simulation later.

The null hypothesis for this investigation: spammers choose the area code of each spam call uniformly at random from 200-999, so area code 781 is exactly as likely to appear as any other area code (probability 1/800 per call).
The alternative hypothesis for this investigation: area code 781 is more likely to appear than it would be if area codes were chosen uniformly at random from 200-999, for instance because spammers use it to trick Yanay into thinking the caller is from his area.

Question 4. Which of the following test statistics would be a reasonable choice to help differentiate between the two hypotheses?

Hint: For a refresher on choosing test statistics, check out the textbook section on Test Statistics.

1. The proportion of area codes that are 781 in 50 random spam calls
2. The total variation distance (TVD) between the probability distribution of randomly chosen area codes and the observed distribution of area codes. (Remember the possible area codes are 200-999 inclusive)
3. The probability of getting an area code of 781 out of all the possible area codes.
4. The proportion of area codes that are 781 in 50 random spam calls divided by 2
5. The number of times you see the area code 781 in 50 random spam calls

Assign reasonable_test_statistics to an array of numbers corresponding to these test statistics.

[12]: reasonable_test_statistics = make_array(1, 5)
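Statistics 1 and 5 carry the same information, because the proportion is just the count divided by 50. A minimal sketch of that relationship, using only the Python standard library; the list `sample_codes` and the helper names are hypothetical and are not Yanay's data or part of the assignment:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 50 area codes drawn uniformly from 200-999 inclusive.
sample_codes = [random.randint(200, 999) for _ in range(50)]

def count_781(codes):
    """Statistic 5: number of calls whose area code is 781."""
    return sum(1 for code in codes if code == 781)

def proportion_781(codes):
    """Statistic 1: proportion of calls whose area code is 781."""
    return count_781(codes) / len(codes)

count = count_781(sample_codes)
proportion = proportion_781(sample_codes)
print(count, proportion)
```

Larger values of either statistic point toward the alternative, so both order possible samples in the same way.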
For the rest of this question, suppose you decide to use the number of times you see the area code 781 in 50 spam calls as your test statistic.

Question 5. Write a function called simulate that generates exactly one simulated value of your test statistic under the null hypothesis. It should take no arguments and simulate 50 area codes under the assumption that each area code is sampled from the range 200-999 inclusive with equal probability. Your function should return the number of times you saw the 781 area code in those 50 random spam calls.

[13]: possible_area_codes = np.arange(200, 1000)

      def simulate():
          codes = np.random.choice(possible_area_codes, 50)
          return sum(codes == 781)

      # Call your function to make sure it works
      simulate()

[13]: 0

Question 6. Generate 20,000 simulated values of the number of times you see the area code 781 in 50 random spam calls. Assign test_statistics_under_null to an array that stores the result of each of these trials.

Hint: Use the function you defined in Question 5.

[14]: test_statistics_under_null = make_array()
      repetitions = 20000

      for i in np.arange(repetitions):
          test_statistics_under_null = np.append(test_statistics_under_null, simulate())

      test_statistics_under_null

[14]: array([0., 0., 0., …, 0., 0., 0.])

Question 7. Using the results from Question 6, generate a histogram of the empirical distribution of the number of times you saw the area code 781 in your simulation.

NOTE: Use the provided bins when making the histogram

[15]: bins = np.arange(0, 5, 1) # Use these provided bins
      Table().with_columns("781", test_statistics_under_null).hist(bins = bins)
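Because the 50 simulated calls are drawn independently and each is 781 with probability 1/800, the simulated count follows a Binomial(50, 1/800) distribution. As a rough cross-check on the empirical histogram, the exact probabilities can be computed directly. This is a sketch using only the standard library; `binomial_pmf`, `N_CALLS`, and `P_781` are names introduced here for illustration and are not part of the assignment:

```python
from math import comb

N_CALLS = 50      # spam calls per simulated month
P_781 = 1 / 800   # chance that a uniformly chosen code in 200-999 is 781

def binomial_pmf(k, n=N_CALLS, p=P_781):
    """Exact probability of seeing exactly k calls from 781 out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exact probabilities for the same range of counts as the provided bins.
exact_pmf = [binomial_pmf(k) for k in range(5)]
for k, prob in enumerate(exact_pmf):
    print(f"P(count = {k}) = {prob:.6f}")
```

Roughly 94% of the probability sits at a count of 0 and about 6% at a count of 1, which is why the simulated array is almost entirely zeros.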
Question 8. Compute an empirical P-value for this test.

[16]: # First calculate the observed value of the test statistic from the `spam` table.
      observed_val = spam.where("Area Code", 781).num_rows
      p_value = sum(test_statistics_under_null >= observed_val) / repetitions
      print([observed_val, p_value])

[2, 0.00205]

Question 9. Suppose you use a P-value cutoff of 1%. What do you conclude from the hypothesis test? Why?

With a P-value cutoff of 1%, we reject the null hypothesis: the empirical P-value (0.00205, about 0.2%) is below the cutoff, so the data are more consistent with the alternative that Yanay is more likely to get spam calls from his own area code.

1.3 Part 2: Multiple Spammers

Instead of checking if the area code is equal to his own, Yanay decides to check if the area code matches the area code of one of the 8 places he's been to recently, and wants to test if it's more likely to receive a spam call with an area code from any of those 8 places. These are the area codes of the places he's been to recently: 781, 617, 509, 510, 212, 858, 339, 626.

Question 10. Define the null hypothesis and alternative hypothesis for this investigation.

Reminder: Don't forget that your null hypothesis should fully describe a probability model that we can use for simulation later.

The null hypothesis for this investigation: he is not more likely to get calls from these area codes; each call's area code is chosen uniformly at random from 200-999, so all area codes are equally likely.

The alternative hypothesis for this investigation: he is more likely to get calls from one of these area codes than he would be if all area codes were equally likely.
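Under the null model of Question 10, each of the 50 calls matches one of the 8 visited codes with probability 8/800 = 0.01, so the number of matches follows a Binomial(50, 0.01) distribution. The sketch below computes exact tail probabilities for that distribution as a cross-check on simulated P-values. It uses only the standard library, and `pmf`, `tail_at_least`, and `tail_more_than` are helper names introduced here, not part of the assignment:

```python
from math import comb

N_CALLS = 50
P_VISITED = 8 / 800   # 8 visited codes out of 800 equally likely codes

def pmf(k, n=N_CALLS, p=P_VISITED):
    """Exact probability of exactly k matches out of n calls."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail_at_least(k):
    """P(count >= k): values at least as large as k."""
    return sum(pmf(j) for j in range(k, N_CALLS + 1))

def tail_more_than(k):
    """P(count > k): values strictly larger than k."""
    return tail_at_least(k + 1)

for k in range(6):
    print(k, round(tail_at_least(k), 6), round(tail_more_than(k), 6))
```

The two tails can differ by an order of magnitude this far out. At a count of 4, for instance, they are roughly 0.0016 and 0.00015, which fall on opposite sides of a 0.05% cutoff, so the choice between `>=` and `>` in an empirical P-value can change the conclusion.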
Suppose you decide to use the number of times you see any of the area codes of the places Yanay has been to in 50 spam calls as your test statistic.

Question 11. Write a function called simulate_visited_area_codes that generates exactly one simulated value of your test statistic under the null hypothesis. It should take no arguments and simulate 50 area codes under the assumption that each area code is sampled from the range 200-999 inclusive with equal probability. Your function should return the number of times you saw any of the area codes of the places Yanay has been to in those 50 spam calls.

Hint: You may find the textbook section on the sample_proportions function to be useful.

[17]: def simulate_visited_area_codes():
          codes = np.random.choice(possible_area_codes, 50)
          return Table().with_column("codes", codes).where("codes", are.contained_in(make_array(781, 617, 509, 510, 212, 858, 339, 626))).num_rows

      # Call your function to make sure it works
      simulate_visited_area_codes()

[17]: 0

Question 12. Generate 20,000 simulated values of the number of times you see any of the area codes of the places Yanay has been to in 50 random spam calls. Assign visited_test_statistics_under_null to an array that stores the result of each of these trials.

Hint: Use the function you defined in Question 11.

[18]: visited_test_statistics_under_null = make_array()
      repetitions = 20000

      for i in range(repetitions):
          visited_test_statistics_under_null = np.append(
              visited_test_statistics_under_null, simulate_visited_area_codes())

      visited_test_statistics_under_null

[18]: array([1., 0., 1., …, 0., 0., 0.])

Question 13. Using the results from Question 12, generate a histogram of the empirical distribution of the number of times you saw any of the area codes of the places Yanay has been to in your simulation.

NOTE: Use the provided bins when making the histogram

[19]: bins_visited = np.arange(0, 6, 1) # Use these provided bins
      Table().with_columns("Visited Codes", visited_test_statistics_under_null).hist(bins = bins_visited)
Question 14. Compute an empirical P-value for this test.

[20]: visited_area_codes = make_array(781, 617, 509, 510, 212, 858, 339, 626)
      # First calculate the observed value of the test statistic from the `spam` table.
      visited_observed_value = spam.where("Area Code", are.contained_in(visited_area_codes)).num_rows
      # Note: the strict `>` counts only simulated values larger than the observed one;
      # the "at least as extreme" definition of a P-value would use `>=`.
      p_value = sum(visited_test_statistics_under_null > visited_observed_value) / repetitions
      (visited_observed_value, p_value)

[20]: (4, 5e-05)

Question 15. Suppose you use a P-value cutoff of 0.05% (Note: that's 0.05%, not our usual cutoff of 5%). What do you conclude from the hypothesis test? Why?

Since the computed P-value (5e-05, or 0.005%) is below the 0.05% cutoff, we reject the null hypothesis and conclude that he is more likely to receive a call from one of the visited area codes.

Question 16. Is p_value:
(a) the probability that the spam calls favored the visited area codes,
(b) the probability that they didn't favor, or
(c) neither
If you chose (c), explain what it is instead.

The correct answer is (c), because it's the probability of receiving 4 or more calls from a visited area code, assuming all area codes are equally likely.

Question 17. Is 0.05% (the P-value cutoff):
(a) the probability that the spam calls favored the visited area codes,
(b) the probability that they didn't favor, or
(c) neither
If you chose (c), explain what it is instead.

The correct answer is (c), because it is the probability of a Type 1 error, which is the probability of rejecting the null hypothesis when it is actually true.

Question 18. Suppose you run this test for 4000 different people after observing each person's last 50 spam calls. When you reject the null hypothesis for a person, you accuse the spam callers of favoring the area codes that person has visited. If the spam callers were not actually favoring area codes that people have visited, can we compute how many times we will incorrectly accuse the spam callers of favoring area codes that people have visited? If so, what is the number? Explain your answer. Assume a 0.05% P-value cutoff.
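For Question 18, the quantity that can be computed is an expected count, not the exact number of false accusations, which varies from one batch of 4000 tests to another. The sketch below works through that arithmetic under two assumptions that are not stated in the assignment: the 4000 tests are independent, and each test with a true null is rejected with probability equal to the cutoff (for a discrete test statistic the true rate can be somewhat lower). The variable names are introduced here for illustration:

```python
N_PEOPLE = 4000
CUTOFF = 0.0005   # the 0.05% P-value cutoff written as a proportion

# Expected number of people for whom a true null is rejected.
expected_false_accusations = N_PEOPLE * CUTOFF

# Chance that at least one of the 4000 tests rejects a true null,
# assuming the tests are independent.
prob_at_least_one = 1 - (1 - CUTOFF) ** N_PEOPLE

print(expected_false_accusations, round(prob_at_least_one, 3))
```

So about 2 false accusations are expected on average, and under these assumptions at least one is quite likely.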
[21]: 4000*0.0005

[21]: 2.0

1.4 Part 3: Practice with A/B Tests

Yanay collects information about this month's spam calls. The table with_labels is a sampled table, where the Area Code Visited column contains either "Yes" or "No", which represents whether or not Yanay has visited the location of the area code. The Picked Up column is 1 if Yanay picked up and 0 if he did not pick up.

[22]: # Just run this cell
      with_labels = Table().read_table("spam_picked_up.csv")
      with_labels

[22]: Area Code Visited | Picked Up
      No                | 0
      No                | 1
      No                | 1
      Yes               | 0
      No                | 0
      No                | 0
      Yes               | 0
      No                | 1
      No                | 1
      No                | 1
      … (40 rows omitted)

Yanay is going to perform an A/B Test to see whether or not he is more likely to pick up a call from an area code he has visited. Specifically, his null hypothesis is that there is no difference in the distribution of calls he picked up between visited and not visited area codes, with any difference due to chance. His alternative hypothesis is that there is a difference between the two categories, specifically that he is more likely to pick up if he has visited the area code.

We are going to perform a permutation test to test this. Our test statistic will be the difference in proportion of calls picked up between the area codes Yanay visited and the area codes he did not visit.

Question 19. Complete the difference_in_proportion function to have it calculate this test statistic, and use it to find the observed value. The function takes in a sampled table which can be any table that has the same columns as with_labels. We'll call difference_in_proportion with the sampled table with_labels in order to find the observed difference in proportion.

[25]: def difference_in_proportion(sample):
          # Take a look at the code for `proportion_visited` and use that as a
          # hint of what `proportions` should be assigned to
          proportions = sample.group("Area Code Visited", np.mean)
          proportion_visited = proportions.where("Area Code Visited", "Yes").column("Picked Up mean").item(0)
          proportion_not_visited = proportions.where("Area Code Visited", "No").column("Picked Up mean").item(0)
          # abs() drops the sign, so this measures the size of the
          # difference in either direction
          return abs(proportion_not_visited - proportion_visited)

      observed_diff_proportion = difference_in_proportion(with_labels)
      observed_diff_proportion

[25]: 0.21904761904761905

Question 20. To perform a permutation test we shuffle the labels, because our null hypothesis is that the labels don't matter: the calls he picked up from visited and not visited area codes come from the same underlying distribution. The labels in this case are the "Area Code Visited" column containing "Yes" and "No".

Write a function to shuffle the table and return a test statistic using the function you defined in Question 19.

Hint: To shuffle labels, we sample without replacement and then replace the appropriate column with the new shuffled column.

[27]: def simulate_one_stat():
          shuffled = with_labels.sample(with_replacement = False).column("Area Code Visited")
          original_with_shuffled_labels = with_labels.select(1).with_column("Area Code Visited", shuffled)
          return difference_in_proportion(original_with_shuffled_labels)

      one_simulated_test_stat = simulate_one_stat()
      one_simulated_test_stat

[27]: 0.2571428571428571

Question 21. Generate 1,000 simulated test statistic values. Assign test_stats to an array that stores the result of each of these trials.

Hint: Use the function you defined in Question 20.

We also provided code that'll generate a histogram for you after generating 1,000 simulated test statistic values.

[30]: trials = 1000
      test_stats = make_array()
      for i in range(trials):
          test_stats = np.append(test_stats, simulate_one_stat())

      # here's code to generate a histogram of values and the red dot is the observed value
      Table().with_column("Simulated Proportion Difference", test_stats).hist("Simulated Proportion Difference");
      plt.plot(observed_diff_proportion, 0, 'ro', markersize=15);

Question 22. Compute the empirical p-value for this test, and assign it to p_value_ab.

[31]: p_value_ab = np.count_nonzero(test_stats >= observed_diff_proportion) / len(test_stats)
      p_value_ab

[31]: 0.208

For p_value_ab, you should be getting a value around 10-15%. If our p-value cutoff is 5%, the data are more consistent with the null hypothesis: that there is no difference in the distribution of calls Yanay picked up between visited and not visited area codes.
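The notebook's statistic takes an absolute value, which counts large differences in either direction, while the stated alternative is one-directional (more likely to pick up for visited codes). The sketch below shows a permutation test with the signed difference (visited minus not visited) instead. This is a swapped-in variant for illustration, not the notebook's method; it uses only the standard library, and `toy_labels`, `toy_picked_up`, and the function names are hypothetical stand-ins, not the contents of spam_picked_up.csv:

```python
import random

def proportion_picked_up(labels, picked_up, label):
    """Proportion of calls picked up among rows carrying the given label."""
    values = [p for l, p in zip(labels, picked_up) if l == label]
    return sum(values) / len(values)

def signed_difference(labels, picked_up):
    """Pick-up proportion for visited codes minus that for not visited codes."""
    return (proportion_picked_up(labels, picked_up, "Yes")
            - proportion_picked_up(labels, picked_up, "No"))

def permutation_p_value(labels, picked_up, trials=1000, seed=0):
    """One-sided empirical P-value: share of shuffles with a difference
    at least as large as the observed one."""
    rng = random.Random(seed)
    observed = signed_difference(labels, picked_up)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # reassign labels at random, keeping group sizes
        if signed_difference(shuffled, picked_up) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / trials

# Hypothetical data: 5 calls from visited codes, 15 from not visited codes.
toy_labels = ["Yes"] * 5 + ["No"] * 15
toy_picked_up = [1, 1, 1, 0, 1] + [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]

observed, p_value = permutation_p_value(toy_labels, toy_picked_up)
print(observed, p_value)
```

With a signed statistic, only shuffles that favor visited codes at least as strongly as the data count toward the P-value, which matches a one-directional alternative.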