dis09_solutions

School

University of California, Berkeley *

Course

8

Subject

Statistics

Date

Feb 20, 2024

Data 8 Fall 2023
The Bootstrap
Lab 07 October 2023

1. Mid-semester Check In

a. What has been your favorite topic/assignment/lecture/anything so far, now that the first half of the class is done?

If you have any concerns about your performance in the class so far, feel free to bring them up with your lab TA.

2. Facts About the Bootstrap

Suppose we are trying to estimate a population parameter. Whenever we take a random sample and calculate a statistic to estimate the parameter, we know that the statistic could have come out differently if the sample had come out differently by random chance. We want to understand the variability of the statistic in order to better estimate the parameter. However, we don’t have the resources to collect multiple random samples. To solve this problem, we use a technique called bootstrapping.

a. When we conduct a bootstrap resample, what size resample should we draw from our sample? Why?

The resample should have the same size as our original sample. Our original estimate of the parameter is based on that sample size, and the variability of a statistic depends on the sample size. If we changed the size of the resample, the distribution of the estimate would change too.

b. Why do we need to resample from our sample with replacement?

Because the resample has the same size as the sample, drawing without replacement would just return the original sample in a different order. We would get exactly the same sample, and therefore the same statistic, every time.

c. When we conduct a bootstrap resample, what is the underlying assumption/reasoning for resampling from our sample? Why does it work?

The underlying assumption is that our sample looks similar to our population; that is, the sample is representative of the population. The validity of the bootstrap rests on this assumption: if the sample is unrepresentative of the population, the resamples will not give a good picture of the range of values our estimate could take.
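The answers to parts (a) and (b) can be checked with a small sketch. It uses only Python's standard library rather than the course's `datascience` package, and the eight sample values are made up for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical sample of 8 observations (made-up values).
sample = [2.1, 3.4, 1.8, 2.9, 4.0, 2.5, 3.1, 1.6]
n = len(sample)

# Same size, WITHOUT replacement: only a reshuffle of the original sample,
# so any statistic computed from it equals the original statistic.
without_replacement = random.sample(sample, k=n)

# Same size, WITH replacement: some values can repeat and others can be
# left out, so the statistic can change from one resample to the next.
with_replacement = [random.choices(sample, k=n) for _ in range(5)]

print("Same values as the sample:", sorted(without_replacement) == sorted(sample))
print("Original mean:", round(mean(sample), 3))
print("Resample means:", [round(mean(r), 3) for r in with_replacement])
```

The resample means spread around the original mean, and that spread is what the bootstrap uses to approximate the variability of the statistic.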
3. Thirsty

Warmup: What is the difference between a parameter and a statistic? Which of the two is random?

A parameter is a property of the population, so it is fixed and doesn’t change. A statistic is calculated from a sample, and samples are usually drawn at random. Typically, we use statistics to estimate population parameters. Therefore, a statistic is random and a parameter is not.

You are interested in investigating the liters of water consumed every day by UC Berkeley students. In particular, you want to study the proportion of students drinking less than 3 liters of water per day. You contact 150 random students from the directory and obtain the amount of water each one of them drinks, storing the results in the table water. The table has 1 column, amount, which stores the number of liters of water drunk by each student.

a. What is the parameter and what is the statistic in this scenario?

Population parameter: The proportion of UC Berkeley students who drink less than 3 liters of water per day.
Statistic: The proportion of students in your sample who drink less than 3 liters of water per day.

b. Write a line of code to calculate the proportion of students in your sample who drank less than 3 liters of water per day.

    np.mean(water.column('amount') < 3)

c. Write a line of code to perform a single bootstrap resample of the data stored in the water table.

    water.sample(water.num_rows, with_replacement=True)

d. Fill in the following blanks to conduct 10000 bootstrap resamples of your data, calculating the proportion of students in each resample that drink less than 3 liters of water per day, then plotting the distribution of those proportions using an appropriate visualization.

    proportions = ______
    for i in np.arange(10000):
        resampled_table = ______
        resampled_statistic = ______
        proportions = ______
    proportions_table = Table().with_column('Resampled proportion', proportions)
    proportions_table.______
    proportions = make_array()
    for i in np.arange(10000):
        resampled_table = water.sample(water.num_rows, with_replacement=True)
        resampled_statistic = np.mean(resampled_table.column('amount') < 3)
        proportions = np.append(proportions, resampled_statistic)
    proportions_table = Table().with_column('Resampled proportion', proportions)
    proportions_table.hist('Resampled proportion')
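For readers without the `datascience` package, the same loop can be sketched with the standard library alone. The `amounts` list below is simulated stand-in data, not the real survey, and `proportion_under` is a hypothetical helper name; the sketch also uses 2000 repetitions instead of 10000 to keep it quick.

```python
import random

random.seed(8)  # fixed seed so the run is repeatable

# Made-up stand-in for the 150 surveyed amounts (liters per day).
amounts = [random.uniform(0.5, 5.0) for _ in range(150)]

def proportion_under(values, cutoff=3):
    """Proportion of values strictly below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)

# The statistic computed from the original sample.
sample_proportion = proportion_under(amounts)

# Bootstrap: resample 150 values with replacement, recompute, repeat.
proportions = []
for _ in range(2000):
    resample = random.choices(amounts, k=len(amounts))
    proportions.append(proportion_under(resample))

bootstrap_center = sum(proportions) / len(proportions)
print("Sample proportion:", round(sample_proportion, 3))
print("Mean of resampled proportions:", round(bootstrap_center, 3))
print("Range of resampled proportions:", min(proportions), max(proportions))
```

The resampled proportions cluster around the proportion in the original sample, which is why the bootstrap distribution describes the variability of the statistic and not the location of the parameter.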
4. Tennis Time

Ciara is interested in the heights of female tennis players. She’s collected a sample of 100 heights of professional women’s tennis players. She wants to use this sample to estimate the true interquartile range (IQR) of all heights of professional women’s tennis players.

Hint: We defined the interquartile range (IQR) to be: 75th percentile - 25th percentile

a. In order to construct a 99% confidence interval for the IQR, what should our upper and lower percentile endpoints be?

Our lower endpoint should be the 0.5th percentile and our upper endpoint should be the 99.5th percentile. This leaves 0.5% in each tail and the middle 99% of the bootstrapped statistics in between.

b. Define a function ci_iqr that constructs a 99% confidence interval for the IQR as follows. The function takes the following arguments:

tbl: A one-column table consisting of a random sample from the population; you can assume this sample is large
reps: The number of bootstrap repetitions

Hint: To find the 25th and 75th percentile of an array, you can use the percentile function

    def ci_iqr(tbl, reps):
        stats = ______
        for ______:
            resample_col = ______
            new_iqr = ______
            stats = ______
        left_end = ______
        right_end = ______
        return make_array(left_end, right_end)

    def ci_iqr(tbl, reps):
        stats = make_array()
        for i in np.arange(reps):
            # sample() with no arguments draws tbl.num_rows rows with replacement
            resample_col = tbl.sample().column(0)
            new_iqr = percentile(75, resample_col) - percentile(25, resample_col)
            stats = np.append(stats, new_iqr)
        left_end = percentile(0.5, stats)
        right_end = percentile(99.5, stats)
        return make_array(left_end, right_end)
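The endpoint arithmetic in part (a) and the function in part (b) can be sketched without the `datascience` package. Several things here are assumptions for illustration: the local `percentile` helper uses the "smallest value at least as large as p% of the values" definition, `ci_endpoints` and the `level` argument are hypothetical additions, and the heights are simulated, not real measurements.

```python
import math
import random

def percentile(p, values):
    """Smallest value that is at least as large as p% of the values
    (assumed definition; a stand-in for the course's percentile function)."""
    ordered = sorted(values)
    k = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[k - 1]

def ci_endpoints(level):
    """Percentile endpoints that leave equal tails around the middle level%."""
    tail = (100 - level) / 2
    return tail, 100 - tail

def ci_iqr(heights, reps, level=99):
    """Bootstrap percentile confidence interval for the IQR of a list of heights."""
    lower_pct, upper_pct = ci_endpoints(level)
    stats = []
    for _ in range(reps):
        # Resample with replacement, same size as the original sample.
        resample = random.choices(heights, k=len(heights))
        stats.append(percentile(75, resample) - percentile(25, resample))
    return percentile(lower_pct, stats), percentile(upper_pct, stats)

random.seed(4)  # fixed seed so the run is repeatable
# Simulated sample of 100 heights in cm (made-up data).
heights = [random.gauss(175, 7) for _ in range(100)]

left, right = ci_iqr(heights, reps=1000)
print("Endpoints for a 99% interval:", ci_endpoints(99))
print("99% CI for the IQR:", round(left, 2), "to", round(right, 2))
```

Passing a different `level`, such as 95, changes only the two percentile endpoints; the resampling loop stays the same.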
c. Say Ciara recruited 500 of her friends to perform the same bootstrapping process she did. In other words, each of her friends drew a large, random sample of 100 heights from the population of professional women’s tennis players and constructed their own 99% confidence interval. Approximately how many of these CIs do we expect to contain the actual IQR of the heights of professional women’s tennis players?

A 99% confidence level describes the process used to construct the interval: about 99% of the time we use this process, we expect it to produce an interval that contains the true population parameter. Since we have 500 CIs, each at a 99% confidence level, we expect about 500 × 0.99 = 495 of them to contain the actual IQR of heights.
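The count of 495 is an expectation, so the actual number of intervals that contain the IQR would vary a little from one group of 500 friends to another. A minimal sketch of that idea follows, under the idealized assumption that each interval independently contains the true IQR with probability exactly 0.99; it does not rerun the bootstrap itself.

```python
import random

random.seed(2023)  # fixed seed so the run is repeatable

num_intervals = 500
confidence = 0.99

# Expected number of intervals that contain the true IQR.
expected_hits = num_intervals * confidence

# Idealized simulation (assumption): each interval independently contains
# the true IQR with probability 0.99.
simulated_hits = sum(random.random() < confidence for _ in range(num_intervals))

print("Expected:", expected_hits)
print("One simulated group of 500:", simulated_hits)
```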