data8-sp23-final

University of California, Berkeley

DATA 8 Foundations of Data Science
Spring 2023 Final Exam

INSTRUCTIONS

You have 2 hours and 50 minutes to complete the exam.

• The exam is closed book, closed notes, closed computer/calculator, except the provided final reference sheet.
• Mark your answers on the exam itself in the spaces provided. We will not grade answers written on scratch paper or outside the designated answer spaces.
• If you need to use the restroom, bring your phone and exam to the front of the room.

For questions with circular bubbles, you should fill in exactly one choice.
○ You must choose either this option
○ Or this one, but not both!

For questions with square checkboxes, you may fill in multiple choices.
□ You could select this choice.
□ You could select this one too!

**Important**: Please fill in circles and squares to indicate answers and cross out or erase mistakes.

Preliminaries

You can complete these questions before the exam starts.

(a) What is your full name?
(b) What is your student ID number?
(c) Who is sitting to your left? (Write no one if no one is next to you.)
(d) Who is sitting to your right? (Write no one if no one is next to you.)

1. (28.0 points) True or False

(a) (2.0 pt) The height of each bar in a histogram represents the proportion of data within the corresponding bin.
○ True ○ False

(b) (2.0 pt) According to the Case Study lecture, the pixel of an image is typically considered a categorical variable when building machine learning models.
○ True ○ False

(c) (2.0 pt) A classifier is considered to be overfitting if it performs very well on the test set.
○ True ○ False

(d) (2.0 pt) If you are a subject in an experiment, knowing whether you are in the treatment or control group can be considered a confounding variable.
○ True ○ False

(e) (2.0 pt) For any distribution, the percentage of data that lies beyond two standard deviations on either side of the mean is less than 30%.
○ True ○ False

(f) (2.0 pt) For any regression line, the SD of the residuals is equal to the root mean squared error.
○ True ○ False

(g) (2.0 pt) We sample a table with replacement to shuffle labels for A/B testing.
○ True ○ False

(h) (2.0 pt) According to the Central Limit Theorem, if a sample is drawn at random from the population with replacement, then the probability distribution of the sample average is normal, regardless of the sample size.
○ True ○ False

(i) (2.0 pt) DJ Patil covered climate change in depth during his guest lecture.
○ True ○ False

(j) (2.0 pt) The median of a set of 15 integers will always be an integer.
○ True ○ False

(k) (2.0 pt) Suppose a hypothesis test is proposed and we already know that the null hypothesis is true. If 100 researchers each independently collect a large random sample of the same size to carry out an experiment and they all use 5% as their p-value cutoff, we should expect around 95% of them to fail to reject the null.
○ True ○ False

(l) (2.0 pt) Suppose you fit a regression line to two data sets: (A) the original data set; and (B) the data set with a few outliers (with respect to the y-axis) removed. The line for (A) will have a larger slope (in absolute value) than the line for (B). Assume both data sets are standardized just prior to fitting the lines.
○ True ○ False

(m) (2.0 pt) If we use linear regression to predict y-values based on our x-values, the median of our residuals will always be zero.
○ True ○ False

(n) (2.0 pt) Given a function error(a, b) which computes some error based on its input arguments, a valid output from the call minimize(error) could be an array containing two elements: 4 and 20.
○ True ○ False
