hw08

pdf

School

University of California, Berkeley

Course

C8

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf

Pages

10

Uploaded by AdmiralAtom103517

hw08

November 30, 2023

[1]: # Initialize Otter
     import otter
     grader = otter.Notebook("hw08.ipynb")

1 Homework 8: Confidence Intervals

Helpful Resource:

• Python Reference: Cheat sheet of helpful array & table methods used in Data 8!

Recommended Reading:

• Estimation

Please complete this notebook by filling in the cells provided. Before you begin, execute the cell below to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that ask you to write explanations and sentences, you must provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure not to re-assign variables throughout the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

Deadline: This assignment is due Wednesday, 10/25 at 11:00pm PT. Turn it in by Tuesday, 10/24 at 11:00pm PT for 5 extra credit points. Late work will not be accepted as per the policies page.

Note: This homework has hidden tests on it. That means even though tests may say 100% passed, it doesn’t mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you’re stuck. Office hours are held Monday through Friday in Warren Hall 101B. The office hours schedule appears here.

[2]: # Don't change this cell; just run it.
     import numpy as np
     from datascience import *

     # These lines do some fancy plotting magic.
     import matplotlib
     %matplotlib inline
     import matplotlib.pyplot as plt
     plt.style.use('fivethirtyeight')
     import warnings
     warnings.simplefilter('ignore', FutureWarning)

1.1 1. Thai Restaurants in Berkeley

Jessica and Ciara are trying to see what the best Thai restaurant in Berkeley is. They survey 1,500 UC Berkeley students selected uniformly at random and ask each student which Thai restaurant is the best. (Note: This data is fabricated for the purposes of this homework.) The choices of Thai restaurants are Lucky House, Imm Thai, Thai Temple, and Thai Basil. After compiling the results, Jessica and Ciara release the following percentages of votes that each restaurant received, from their sample:

Thai Restaurant    Percentage
Lucky House        8%
Imm Thai           53%
Thai Temple        25%
Thai Basil         14%

These percentages represent a uniform random sample of the population of UC Berkeley students. We will attempt to estimate the corresponding parameters, or the percentage of the votes that each restaurant will receive from the population (i.e. all UC Berkeley students). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.

The table votes contains the results of Jessica and Ciara’s survey.

[3]: # Just run this cell
     votes = Table.read_table('votes.csv')
     votes

[3]: Vote
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     Lucky House
     … (1490 rows omitted)

Question 1.1. Complete the function one_resampled_percentage below. It should return Imm Thai’s percentage of votes after taking the original table (tbl) and performing one bootstrap sample of it. Remember that a percentage is between 0 and 100. (8 Points)

Note 1: tbl will always be in the same format as votes.

Note 2: This function should be completed without .group or .pivot. Using these functions will cause your code to time out.

Hint: Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? Be sure to use percentages, not proportions, for this question!

[12]: def one_resampled_percentage(tbl):
          # tbl.sample() draws tbl.num_rows rows with replacement (one bootstrap sample)
          return 100 * tbl.sample().where("Vote", "Imm Thai").num_rows / tbl.num_rows

      one_resampled_percentage(votes)

[12]: 52.86666666666667

[13]: grader.check("q1_1")

[13]: q1_1 results: All test cases passed!

Question 1.2. Complete the percentages_in_resamples function such that it simulates and returns an array of 2023 elements, where each element represents a bootstrapped estimate of the percentage of voters who will vote for Imm Thai. You should use the one_resampled_percentage function you wrote above. (8 Points)

Note: We perform our simulation with only 2023 trials in this problem to reduce the runtime, but we should generally use more repetitions.

[17]: def percentages_in_resamples():
          percentage_imm = make_array()
          # Collect one bootstrapped percentage per repetition
          for i in range(2023):
              percentage_imm = np.append(percentage_imm, one_resampled_percentage(votes))
          return percentage_imm

[18]: grader.check("q1_2")

[18]: q1_2 results: All test cases passed!

In the following cell, we run the function you just defined, percentages_in_resamples, and create a histogram of the calculated statistic for the 2023 bootstrap estimates of the percentage of voters who voted for Imm Thai.

Note: This might take a few seconds to run.
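As a rough cross-check of Questions 1.1 and 1.2, the same bootstrap can be sketched with NumPy alone. This is a sketch under stated assumptions, not the assignment's intended solution: votes.csv is not reproduced here, so votes_array is a hypothetical stand-in rebuilt from the reported percentages (8%, 53%, 25%, and 14% of 1,500 votes), and the _np function names, the rng argument, and the seed are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for votes.csv, rebuilt from the reported percentages
# (an assumption; the real file is not available here).
counts = {"Lucky House": 120, "Imm Thai": 795, "Thai Temple": 375, "Thai Basil": 210}
votes_array = np.repeat(list(counts.keys()), list(counts.values()))

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def one_resampled_percentage_np(votes_array, rng):
    """Percentage of Imm Thai votes in one bootstrap resample."""
    # Draw as many votes as the original sample has, with replacement.
    resample = rng.choice(votes_array, size=len(votes_array), replace=True)
    # The mean of a boolean array is the proportion of True values.
    return 100 * np.mean(resample == "Imm Thai")


def percentages_in_resamples_np(votes_array, rng, repetitions=2023):
    """Array of bootstrapped Imm Thai percentages, one per repetition."""
    return np.array(
        [one_resampled_percentage_np(votes_array, rng) for _ in range(repetitions)]
    )


resampled = percentages_in_resamples_np(votes_array, rng)
print(len(resampled), resampled.min(), resampled.max())
```

Comparing each resampled vote with "Imm Thai" and averaging the result avoids any grouping step, which is in the spirit of Note 2 in Question 1.1.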
[19]: resampled_percentages = percentages_in_resamples()
      Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

Question 1.3. Using the array resampled_percentages, find the values at the two edges of the middle 95% of the bootstrapped percentage estimates. Compute the lower and upper ends of the interval, named imm_lower_bound and imm_upper_bound respectively. (8 Points)

Hint: If you are stuck on this question, try looking over Chapter 13.1 of the textbook.

[20]: # The middle 95% runs from the 2.5th to the 97.5th percentile
      imm_lower_bound = percentile(2.5, resampled_percentages)
      imm_upper_bound = percentile(97.5, resampled_percentages)
      print(f"Bootstrapped 95% confidence interval for the percentage of Imm Thai voters in the population: [{imm_lower_bound:.2f}, {imm_upper_bound:.2f}]")

Bootstrapped 95% confidence interval for the percentage of Imm Thai voters in the population: [50.47, 55.60]

[21]: grader.check("q1_3")

[21]: q1_3 results: All test cases passed!

Question 1.4. The survey results seem to indicate that Imm Thai is beating all the other Thai restaurants among the voters. We would like to use confidence intervals to determine a range of likely values for Imm Thai’s true lead over all the other restaurants combined. The calculation for Imm Thai’s lead over Lucky House, Thai Temple, and Thai Basil combined is:

    Imm Thai’s percent of vote − (100 percent − Imm Thai’s percent of vote)

Define the function one_resampled_difference that returns exactly one value of Imm Thai’s percentage lead over Lucky House, Thai Temple, and Thai Basil combined from one bootstrap sample of tbl. (8 Points)

Hint 1: Imm Thai’s lead can be negative.

Hint 2: Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? Be sure to use percentages, not proportions, for this question!

Note: If the skeleton code provided within the function is not helpful for you, feel free to approach the question using your own variables.

[22]: def one_resampled_difference(tbl):
          bootstrap = tbl.sample()
          imm_percentage = 100 * (bootstrap.where("Vote", "Imm Thai").num_rows / bootstrap.num_rows)
          # Every vote not for Imm Thai goes to one of the other three restaurants
          return imm_percentage - (100 - imm_percentage)

[23]: grader.check("q1_4")

[23]: q1_4 results: All test cases passed!

Question 1.5. Write a function called leads_in_resamples that returns an array of 2023 elements representing the bootstrapped estimates (the result of calling one_resampled_difference) of Imm Thai’s lead over Lucky House, Thai Temple, and Thai Basil combined. Afterwards, run the cell to plot a histogram of the resulting samples. (8 Points)

Hint: If you see an error involving NoneType, consider what components a function needs to have!

[26]: def leads_in_resamples():
          arr = make_array()
          # Collect one bootstrapped lead per repetition
          for i in range(2023):
              arr = np.append(arr, one_resampled_difference(votes))
          return arr

      sampled_leads = leads_in_resamples()
      Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")
Question 1.6. Use the simulated data in sampled_leads from Question 1.5 to compute an approximate 95% confidence interval for Imm Thai’s true lead over Lucky House, Thai Temple, and Thai Basil combined. (10 Points)

[27]: diff_lower_bound = percentile(2.5, sampled_leads)
      diff_upper_bound = percentile(97.5, sampled_leads)
      print("Bootstrapped 95% confidence interval for Imm Thai's true lead over Lucky House, Thai Temple, and Thai Basil combined: [{:f}%, {:f}%]".format(diff_lower_bound, diff_upper_bound))

Bootstrapped 95% confidence interval for Imm Thai's true lead over Lucky House, Thai Temple, and Thai Basil combined: [1.066667%, 11.066667%]

[28]: grader.check("q1_6")

[28]: q1_6 results: All test cases passed!

1.2 2. Interpreting Confidence Intervals

The staff computed the following 95% confidence interval for the percentage of Imm Thai voters:

    [50.53, 55.53]

(Your answer from 1.3 may have been a bit different due to randomness; that doesn’t mean it was wrong!)
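To illustrate the remark about randomness, and how the interval for the lead relates to the interval for the percentage, here is a small NumPy sketch. It rests on assumptions that are not part of the homework: the sample is taken to contain exactly 795 Imm Thai votes (53% of 1,500), the bootstrap_percentages helper and the seeds are hypothetical, and np.percentile interpolates between values by default, so its endpoints can differ slightly from those of the percentile function used in the notebook.

```python
import numpy as np

N_VOTES = 1500        # survey size stated in the homework
IMM_THAI_VOTES = 795  # assumption: 53% of 1,500, from the reported percentages


def bootstrap_percentages(seed, repetitions=2023):
    """Simulate Imm Thai's percentage in `repetitions` bootstrap resamples."""
    rng = np.random.default_rng(seed)
    # Drawing 1,500 votes with replacement from the sample makes the number of
    # Imm Thai votes Binomial(1500, 795/1500), so the resampled tables do not
    # need to be built explicitly.
    counts = rng.binomial(N_VOTES, IMM_THAI_VOTES / N_VOTES, size=repetitions)
    return 100 * counts / N_VOTES


for seed in (0, 1, 2):
    percentages = bootstrap_percentages(seed)
    leads = percentages - (100 - percentages)  # same formula as Question 1.4
    pct_interval = np.percentile(percentages, [2.5, 97.5])
    lead_interval = np.percentile(leads, [2.5, 97.5])
    # The endpoints shift a little from seed to seed.
    print(seed, np.round(pct_interval, 2), np.round(lead_interval, 2))
```

Because the lead equals 2 × percentage − 100, each lead interval computed from the same resamples is that transformation of the corresponding percentage interval. Intervals from separate bootstrap runs, such as the ones in Questions 1.3 and 1.6, will only match approximately.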
Question 2.1. The staff also created 70%, 90%, and 99% confidence intervals from the same sample, but we forgot to label which confidence interval represented which percentages! First, match each confidence level (70%, 90%, 99%) with its corresponding interval in the cell below (e.g. __ % CI: [52.1, 54]; replace the blank with one of the three confidence levels). Then, explain your thought process and how you came up with your answers. (10 Points)

The intervals are below:

• [50.03, 55.94]
• [52.1, 54]
• [50.97, 54.99]

Hint: If you are stuck on this question, try looking over Chapters 13.3 and 13.4 of the textbook.

99% CI: [50.03, 55.94]
90% CI: [50.97, 54.99]
70% CI: [52.1, 54]

Higher confidence levels have wider intervals (we can be more confident that we captured the parameter’s true value if our interval contains a greater number of values).

Question 2.2. Suppose we produced 6,000 new samples (each one a new/distinct uniform random sample of 1,500 students) from the population and created a 95% confidence interval from each one. Roughly how many of those 6,000 intervals do you expect will actually contain the true percentage of the population? (10 Points)

Assign your answer to true_percentage_intervals.

[31]: true_percentage_intervals = 0.95 * 6000

[32]: grader.check("q2_2")

[32]: q2_2 results: All test cases passed!

Recall the second bootstrap confidence interval you created, which estimated Imm Thai’s lead over Lucky House, Thai Temple, and Thai Basil combined. Among voters in the sample, Imm Thai’s lead was 6%. The staff’s 95% confidence interval for the true lead (in the population of all voters) was:

    [1.2, 11.2]

Suppose we are interested in testing a simple yes-or-no question:

“Is the percentage of votes for Imm Thai equal to the percentage of votes for Lucky House, Thai Temple, and Thai Basil combined?”

Our null hypothesis is that the percentages are equal, or equivalently, that Imm Thai’s lead is exactly 0.
Our alternative hypothesis is that Imm Thai’s lead is not equal to 0. In the questions below, don’t compute any confidence interval yourself; use only the staff’s 95% confidence interval.

Question 2.3. Say we use a 5% p-value cutoff. Do we reject the null, fail to reject the null, or are we unable to tell using the staff’s confidence interval? (10 Points)

Assign cutoff_five_percent to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

Hint: Consider the relationship between the p-value cutoff and confidence. If you’re confused, take a look at this chapter of the textbook.

[44]: cutoff_five_percent = 1

[45]: grader.check("q2_3")

[45]: q2_3 results: All test cases passed!

Question 2.4. What if, instead, we use a p-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using our staff confidence interval? (10 Points)

Assign cutoff_one_percent to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

[46]: cutoff_one_percent = 3

[47]: grader.check("q2_4")

[47]: q2_4 results: All test cases passed!

Question 2.5. What if we use a p-value cutoff of 10%? Do we reject, fail to reject, or are we unable to tell using our confidence interval? (10 Points)

Assign cutoff_ten_percent to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

[48]: cutoff_ten_percent = 1

[49]: grader.check("q2_5")

[49]: q2_5 results: All test cases passed!

2 3. Midsemester Feedback Form

Fill out this form to complete the homework. Please use your Berkeley email to access the form. At the end of the form, there will be a secret word that you should input into the box below. Remember to put the secret word in quotes when inputting it (i.e. “hello”). The quotation marks indicate that it is a String type!
Note: This is the same form as you filled out in lab. If you have completed Lab 07, you should have already filled out the form. If so, please feel free to copy your answer from the Lab!

[51]: secret_word = "bing chilling"

[52]: grader.check("q3")

[52]: q3 results: All test cases passed!

You’re done with Homework 8!

Important submission steps:

1. Run the tests and verify that they all pass.
2. Choose Save Notebook from the File menu, then run the final cell.
3. Click the link to download the zip file.
4. Go to Gradescope and submit the zip file to the corresponding assignment. The name of this assignment is “HW 08 Autograder”.

It is your responsibility to make sure your work is saved before running the last cell.

2.1 Pets of Data 8

Chipper says congrats on finishing Homework 8 (only 3 more to go)!

Pet of the week: Chipper

2.2 Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!

[53]: # Save your notebook first, then run this cell to export your submission.
      grader.export(pdf=False, run_tests=True)

Running your submission against local test cases…

Your submission received the following results when run against available test cases:

    q1_1 results: All test cases passed!
    q1_2 results: All test cases passed!
    q1_3 results: All test cases passed!
    q1_4 results: All test cases passed!
    q1_6 results: All test cases passed!
    q2_2 results: All test cases passed!
    q2_3 results: All test cases passed!
    q2_4 results: All test cases passed!
    q2_5 results: All test cases passed!
    q3 results: All test cases passed!

<IPython.core.display.HTML object>