Midterm_Exam - Solutions

School

Pennsylvania State University

Course

200

Subject

Computer Science

Date

Dec 6, 2023

DS200: Introduction to Data Sciences
Midterm Exam & Solutions

Problem 1: Parameters [3 points]

Consider the following histogram:

Match the following parameters: (a) mean, (b) median, (c) 10th percentile, (d) 3rd quartile, (e) standard deviation, (f) variance to the following values:

  12.15: standard deviation
  87.12: 10th percentile
  104.12: mean
  106.50: median (i.e., 50th percentile)
  113.61: 3rd quartile (i.e., 75th percentile)
  147.62: variance

(see Lecture 5 / “Percentile and Quartiles” and “Mean vs. Median”, and Lecture 6 / “Example Distributions”)

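Two facts narrow down the matching in Problem 1: the variance is the square of the standard deviation, and the 10th percentile, median, and 3rd quartile must appear in increasing order. The sketch below checks both on a small made-up data set. The `data` values and the `percentile` helper are assumptions for illustration (the helper uses the "smallest value at least as large as p% of the values" definition, which may differ from the one used in the course).

```python
import math
import statistics


def percentile(p, values):
    """Smallest value that is at least as large as p% of the values.

    Assumed definition for this sketch; other conventions interpolate.
    """
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[max(k, 1) - 1]


# Hypothetical data, not the values behind the exam's histogram.
data = [87, 95, 101, 104, 106, 107, 110, 113, 115, 120]

mean = statistics.fmean(data)
median = statistics.median(data)
p10 = percentile(10, data)
q3 = percentile(75, data)
sd = statistics.pstdev(data)        # population standard deviation
var = statistics.pvariance(data)    # population variance

print("mean:", mean, "median:", median)
print("10th percentile:", p10, "3rd quartile:", q3)
print("sd:", sd, "variance:", var, "sd squared:", sd ** 2)

# The exam's own numbers pass the same check: 12.15 squared is about 147.62.
print("12.15 ** 2 =", round(12.15 ** 2, 2))
```

Since 12.15 squared is approximately 147.62, those two values must be the standard deviation and the variance, which leaves only the ordering of the remaining four values to settle.
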
Problem 2: Hypothesis Testing [6 points]

Suppose that we conduct an experiment to determine if eating avocados can improve a person’s cholesterol level. We randomly assign volunteers to treatment and control groups, measure their cholesterol levels before and after the experiment, and ensure that the treatment group eats a lot of avocados during the test, while the control group does not. Our null hypothesis is that eating avocados has no impact on a person’s cholesterol level. We will reject this null hypothesis if the result of the experiment is highly statistically significant, that is, we use a 1% significance level as the threshold for rejecting the null hypothesis.

What is the alternative hypothesis?

  alternative hypothesis: eating avocados has some impact on a person’s cholesterol level
  (anything sensible can be accepted as an answer, e.g., alternative hypothesis: control and treatment group cholesterol levels are not from the same distribution)

What do Type 1 and Type 2 errors mean in this experiment? Explain them in terms of eating avocados, cholesterol levels, and test conclusions.

  Type 1 error: the conclusion of the test is that eating avocados has some impact on cholesterol level, but in reality, eating avocados has no impact on cholesterol.
  Type 2 error: the conclusion of the test is that eating avocados has no impact on cholesterol, but in reality, it does have an impact.

Estimate either the probability of Type 1 error or the probability of Type 2 error!

  the probability of a Type 1 error is the cut-off p-value, which is 1% in this problem

(see Lecture 4 / “Hypothesis Testing”, “Types of Errors”, and “Probability of Type 1 Error”)

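The claim that the Type 1 error probability equals the cut-off can be checked by simulation: run many experiments in which the null hypothesis is true and count how often the test rejects it anyway. The sketch below is only an illustration under stated assumptions. It swaps in a two-sided z-test with a known standard deviation (not necessarily the test used in the course), and the group size, number of trials, and normally distributed cholesterol changes are all hypothetical.

```python
import math
import random


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_type1_rate(trials=5000, n=30, sigma=1.0, alpha=0.01, seed=0):
    """Fraction of null-is-true experiments that the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same distribution: avocados do nothing.
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treatment = [rng.gauss(0.0, sigma) for _ in range(n)]
        diff = sum(treatment) / n - sum(control) / n
        # Standard deviation of the difference of two independent group means.
        z = diff / (sigma * math.sqrt(2 / n))
        if two_sided_p_value(z) < alpha:
            rejections += 1  # a Type 1 error: rejecting a true null
    return rejections / trials


type1_rate = simulate_type1_rate()
print("Observed Type 1 error rate:", type1_rate)
```

With a 1% cut-off, the observed rate should land near 0.01, with some run-to-run variation because the number of trials is finite.
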
Problem 3: Estimation [8 points]

Consider the following histogram of a population of values:

The mean of this population is 200.17, and its standard deviation is 81.21. Now suppose that we draw simple random samples from this population, each sample having 10,000 elements, and calculate the mean of each sample.

1. If we plot a histogram of the sample means, what shape do we expect to see? Which theorem says that we should expect to see this shape?

  normal curve (bell shaped is also acceptable as an answer) because of the Central Limit Theorem

2. What would you estimate the standard deviation of the sample means to be?

  std. dev. of sample means = std. dev. of population / sqrt(sample size) = 81.21 / sqrt(10,000) = 81.21 / 100 = 0.8121
  (just plugging the numbers into the formula is acceptable, simplifying to one final number is not required)

3. Can you specify an interval such that the mean of a random sample falls into this interval with at least 75% probability? Which theorem did you use for this estimation?

  according to Chebychev’s inequality (or Chebychev’s bounds), at least 75% of sample means fall into the range
  [mean - 2 * std. dev. of sample means, mean + 2 * std. dev. of sample means] = [200.17 - 2 * 0.8121, 200.17 + 2 * 0.8121] = [198.55, 201.79]
  (no need to simplify to the final numbers, any correct and formal specification of the interval is acceptable)

(see Lecture 6 / “Central Limit Theorem”, “Sample Size vs. Variability”, and “Chebychev's Bounds”)

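The arithmetic in parts 2 and 3 can be reproduced in a few lines. The sketch below uses only the numbers given in the problem; the variable names are chosen for illustration. Chebychev's inequality guarantees that at least 1 - 1/k^2 of the values lie within k standard deviations of the mean, so k = 2 gives the required 75%.

```python
import math

population_mean = 200.17
population_sd = 81.21
sample_size = 10_000

# Standard deviation of the sample means (part 2).
sd_of_sample_means = population_sd / math.sqrt(sample_size)

# Chebychev's bound with k = 2 standard deviations (part 3).
k = 2
lower = population_mean - k * sd_of_sample_means
upper = population_mean + k * sd_of_sample_means
min_probability = 1 - 1 / k ** 2

print("SD of sample means:", round(sd_of_sample_means, 4))
print("Interval:", [round(lower, 2), round(upper, 2)])
print("Guaranteed probability (at least):", min_probability)
```

Because the sample means are roughly normally distributed here, the actual share inside this interval is close to 95%; Chebychev's bound is only a guaranteed minimum that holds for any distribution.
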
Problem 4: Correlation [4 points]

Consider the following scatter plots:

(a) (b) (c) (d)

What would you estimate the correlation coefficient between the two variables to be in each plot? Briefly explain your answer for each plot! Note that you need to provide only a very rough, qualitative estimate, not an exact value.

  (a): close to 0 (or 0) because no association can be observed
  (b): close to 1 because of strong positive association
  (c): close to 0 (or 0) because no linear association can be observed
  (d): close to -1 because of strong negative association
  (as usual, equivalent statements are acceptable)

(see Lecture 7 / “Properties of Correlation” and “Limitations of Correlation”)

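The difference between "no association" and "no linear association" is easy to see numerically. The sketch below computes the correlation coefficient as the mean of the products of the two variables in standard units. The data sets are hypothetical stand-ins, not the points from the exam's plots: an increasing line, a decreasing line, and a symmetric parabola in which y depends entirely on x but not linearly.

```python
import statistics


def correlation(xs, ys):
    """Correlation coefficient: mean of the products in standard units."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return statistics.fmean(products)


# Hypothetical data for illustration only.
xs = list(range(-5, 6))
r_line_up = correlation(xs, [2 * x + 1 for x in xs])   # perfect positive line
r_line_down = correlation(xs, [-3 * x for x in xs])    # perfect negative line
r_parabola = correlation(xs, [x * x for x in xs])      # strong but nonlinear

print("increasing line:", r_line_up)
print("decreasing line:", r_line_down)
print("parabola:", r_parabola)
```

The parabola case shows why a correlation near 0 rules out only a linear relationship, which is the limitation the answer to (c) points at.
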
Problem 5: Python Programming [4 points]

1. Match each of the following Python code segments with its corresponding type of sampling method:

(a)
    start_row = np.random.choice(np.arange(10))
    selected_rows = np.arange(start_row, cars.num_rows, 10)
    cars.take(selected_rows)

(b)
    cars = Table.read_table('cars.csv')
    selected_rows = np.arange(0, cars.num_rows, 10)
    cars.take(selected_rows)

(c)
    all_rows = np.arange(cars.num_rows)
    selected_num_rows = round(cars.num_rows / 10)
    selected_rows = np.random.choice(all_rows, selected_num_rows)
    cars.take(selected_rows)

Sampling methods:

  deterministic sampling: (b)
  systematic random sampling: (a)
  simple random sampling: (c)

(see Lecture 3 / “Sampling Terminology” and “Probability Samples”)

2. The bootstrap generates new random samples by a method called resampling: new samples are drawn at random from the original sample. Since the original sample resembles the population, sampling the sample resembles sampling the population. Fill in the blanks in the code below, which simulates the bootstrap, to get the empirical distribution of sample means.

    original_sample = cars.sample(1000, with_replacement=False)
    bootstrap = Table().with_columns('Iteration', make_array(), 'Statistic (Mean)', make_array())
    for i in np.arange(2000):
        resample = original_sample.sample(1000, with_replacement=True)
        statistic = np.mean(resample.column('Price'))
        bootstrap.append((i, statistic))
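
The same bootstrap procedure can be sketched without the `datascience` Table API, using only the Python standard library. Everything about the data is an assumption here: the `population` list stands in for the 'Price' column of `cars`, and its values are randomly generated rather than real prices. The key point is the same as in the exam code: the original sample is drawn without replacement, while each resample is drawn with replacement and has the same size as the original sample.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical stand-in for the 'Price' column of the cars table.
population = [rng.uniform(5_000, 60_000) for _ in range(5_000)]

# Original sample: drawn WITHOUT replacement from the population.
original_sample = rng.sample(population, 1000)

bootstrap_means = []
for _ in range(2000):
    # Resample: drawn WITH replacement from the original sample, same size.
    resample = rng.choices(original_sample, k=len(original_sample))
    bootstrap_means.append(statistics.fmean(resample))

center = statistics.fmean(bootstrap_means)
spread = statistics.pstdev(bootstrap_means)
predicted_spread = statistics.pstdev(original_sample) / math.sqrt(len(original_sample))

print("mean of original sample:", statistics.fmean(original_sample))
print("center of bootstrap means:", center)
print("spread of bootstrap means:", spread)
print("sample SD / sqrt(sample size):", predicted_spread)
```

The bootstrap means cluster around the mean of the original sample, and their spread should be close to the sample standard deviation divided by the square root of the sample size, which ties this problem back to Problem 3.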