Data C8 Final Exam Solutions for Summer 2023 Exam

Data C8, Final Exam Summer 2023

Name:
Email: @berkeley.edu
Student ID:
Name of the student to your left:
Name of the student to your right:

Instructions: Do not open the examination until instructed to do so. This exam consists of 80 points spread out over 4 questions on 14 pages and must be completed in the 110 minute time period on August 11, 2023, from 10:10 AM to 12:00 PM unless you have pre-approved accommodations otherwise.

Note that some questions have circular bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply. Please shade in the box/circle to mark your answer.

There is space to write your student ID number (SID) in the upper right-hand corner of each page of the exam. Make sure to write your SID on each page to ensure that your exam is graded.

Honor Code [1 pt]: As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam, and I completed this exam in accordance with the Honor Code.

Signature:

Data C8 Final Exam, Page 2 of 16 SID:

1 Barbenheimer Returns [18 Points]

Rotten Tomatoes, a movie review website, is measuring which of the two movies, Oppenheimer or Barbie, has higher reviews among Berkeley students. They believe that Berkeley students will give higher reviews to the Oppenheimer movie.

Researchers at Rotten Tomatoes randomly sample 1000 Berkeley students and show each student both movies under identical viewing conditions. Immediately after watching each movie, every student is asked to rate that movie on an integer scale from 1 (worst) up to and including 10 (best). The reviews are collected in a table named reviews; shown below are the first few rows.

(a) [2 Pts] Which of the following is a correct null hypothesis that Rotten Tomatoes should use to assess their claim? Select one.

○ The Oppenheimer movie has a different distribution of reviews than the Barbie movie among the given sample of Berkeley students.
○ The Oppenheimer movie has the same distribution of reviews as the Barbie movie among the given sample of Berkeley students.
○ The Oppenheimer movie has a different distribution of reviews than the Barbie movie among Berkeley students.
○ The Oppenheimer movie has the same distribution of reviews as the Barbie movie among Berkeley students.

(b) [2 Pts] Please state a clear and complete alternative hypothesis that Rotten Tomatoes should use to assess their claim.

Solution: The Oppenheimer movie has higher reviews than the Barbie movie among Berkeley students.

(c) [3 Pts] Rotten Tomatoes uses the difference of means as their test statistic. Complete the function below so that it returns the difference of mean reviews between the two movies. Larger values of the test statistic should favor the alternative hypothesis.

Note: Assume that the reviews table argument resembles the reviews table above.

Hint: The group function will return a table that is sorted alphabetically based on the values in the column used for grouping.

    def test_statistic(reviews_table):
        means_col = ___________________(A)________________________
        return ________________________(B)________________________

(i) Fill in the blank (A)

Solution: reviews_table.group(0, np.mean).column(1)

(ii) Which of the following options is most appropriate for blank (B)

○ means_col.item(0) - means_col.item(1)
○ means_col.item(1) - means_col.item(0)

(d) [3 Pts] Which of the following may be used to create simulations under the null hypothesis? Select all that apply.

□ Shuffle the values of only the movie column.
□ Shuffle the values of only the review column.
□ Shuffle the values of the movie column, then shuffle the values of the review column.
□ Randomly sample all of the rows of the reviews table with replacement.
□ Randomly sample all of the rows of the reviews table without replacement.
□ None of the above.

(e) [2 Pts] Suppose we simulate 10,000 values of the test statistic under the null hypothesis. Which of the following will our distribution of simulated test statistics most closely resemble?

○ Graph 1
○ Graph 2
○ Graph 3
