Homework_Assignment_1 - Solutions

Homework Assignment 1
DS200: Introduction to Data Sciences, Fall 2022

Please complete this assignment by writing your answers in this document. You can submit this Word document or a PDF export on Canvas.

Problem 1: Sampling [0.5 points]

Suppose that you work at a hospital, and you have to recruit participants for a medical study to test the efficacy of a new heart disease medication. Match the three examples below to the sampling approaches listed.

Examples:
1. You try to recruit the 100 patients who have the highest blood pressure.
2. You sort all eligible patients by age and try to recruit every 500th patient, starting with a randomly chosen patient from the first 500.
3. You store patient identifiers in an array called patients, apply numpy.random.choice(patients, 100), and try to recruit the patients returned by this function.

Sampling approaches:
  • deterministic
  • systematic random
  • simple random

Answer:
1: deterministic
2: systematic random
3: simple random
(see Lecture 3 / “Sampling Terminology” and “Probability Samples”)
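
The three approaches in Problem 1 can be sketched in NumPy as follows. This is only an illustration on made-up data: the patient count, ages, and blood pressure values are assumptions, not part of the assignment. One detail is worth noting: numpy.random.choice(patients, 100) samples with replacement by default, so the sketch passes replace=False to guarantee 100 distinct patients.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible

# Hypothetical data: 50,000 patient identifiers with made-up ages and blood pressures.
n_patients = 50_000
patients = np.arange(n_patients)
ages = rng.integers(18, 90, size=n_patients)
blood_pressure = rng.normal(120, 15, size=n_patients)

# 1. Deterministic: the 100 patients with the highest blood pressure (no chance involved).
deterministic = patients[np.argsort(blood_pressure)[-100:]]

# 2. Systematic random: sort by age, pick a random start among the first 500,
#    then take every 500th patient from there.
by_age = patients[np.argsort(ages, kind="stable")]
start = rng.integers(0, 500)
systematic = by_age[start::500]

# 3. Simple random: 100 distinct patients chosen uniformly at random.
simple_random = rng.choice(patients, size=100, replace=False)

print(len(deterministic), len(systematic), len(simple_random))
```

Only the starting point is random in the systematic sample; once it is fixed, every other selection is determined by the sort order.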

Problem 2: Distribution [1 point]

Consider the following distribution of values:

Identify the following:
  • mean
  • median
  • outlier
  • 1st quartile
  • 95th percentile

Answer:
a: 1st quartile
b: median
c: mean
d: 95th percentile
e: outlier
(see Lecture 5 / “Percentiles and Quartiles” and “Mean vs. Median”; see Lab Assignment 2 / “Visualization” for outlier)
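
The quantities in Problem 2 can be computed directly with NumPy. The sketch below uses hypothetical right-skewed data with one extreme value appended (an assumption for illustration; it is not the distribution from the assignment). In such data the long right tail pulls the mean above the median, which matches the order of the labels in the answer.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical right-skewed values plus one extreme value acting as an outlier.
values = np.append(rng.exponential(scale=10.0, size=1_000), 500.0)

q1 = np.percentile(values, 25)   # 1st quartile = 25th percentile
median = np.median(values)       # 50th percentile
mean = np.mean(values)           # pulled to the right by the long tail
p95 = np.percentile(values, 95)  # 95th percentile
largest = values.max()           # the appended extreme value

print(q1, median, mean, p95, largest)
```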

Problem 3: Association [1 point]

Consider the following three scatter plots:

What kind of association can you observe between X and Y in each figure? Explain your answer.

Answer:
First figure: negative
Second figure: positive
Third figure: no association can be observed
(see Lab Assignment 2 / “Visualization” or Lecture 7 / “Associations”)
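
One way to back up a visual reading of a scatter plot is the correlation coefficient. The sketch below generates three hypothetical data sets (the slopes and noise levels are assumptions, not the assignment's figures) and computes r with np.corrcoef. Keep in mind that r only captures linear association, so a value near zero does not rule out a nonlinear pattern.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 500
x = rng.normal(0, 1, size=n)

# Hypothetical Y values for the three cases.
y_negative = -2 * x + rng.normal(0, 1, size=n)  # Y tends to fall as X rises
y_positive = 2 * x + rng.normal(0, 1, size=n)   # Y tends to rise as X rises
y_none = rng.normal(0, 1, size=n)               # Y generated independently of X

def correlation(a, b):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(a, b)[0, 1]

r_negative = correlation(x, y_negative)
r_positive = correlation(x, y_positive)
r_none = correlation(x, y_none)

print(r_negative, r_positive, r_none)
```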

Problem 4: Causality [1 point]

Suppose that you observed a positive association between the following three variables: number of tooth cavities, ounces of sugary drinks consumed, weight in pounds. Which of these three variables might be a confounding variable? What spurious association may it cause? Explain your answer.

Answer: Consuming sugary drinks likely causes both tooth cavities and weight increase; hence, “consuming sugary drinks” is a confounding variable, which creates a spurious relationship between “tooth cavities” and “weight in pounds” (there is no causal relationship between these two variables; their association is due to their common causal relationship with the confounding variable). (see Lecture 2)
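
The effect of a confounder can be illustrated with a small simulation. In the sketch below, all numbers are invented: sugary drink consumption drives both cavities and weight, and the two outcomes have no direct link. They still correlate strongly, and the correlation nearly vanishes once the linear effect of drinks is removed from each (a simple residual-based adjustment, chosen here only for illustration).

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 2_000

# Hypothetical model: drinks cause both outcomes; the outcomes do not affect each other.
drinks = rng.uniform(0, 60, size=n)                      # ounces of sugary drinks
cavities = 0.1 * drinks + rng.normal(0, 1, size=n)       # driven by drinks only
weight = 150 + 1.0 * drinks + rng.normal(0, 10, size=n)  # driven by drinks only

def correlation(x, y):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

def residuals(y, x):
    """What is left of y after subtracting its least-squares line on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw = correlation(cavities, weight)
adjusted = correlation(residuals(cavities, drinks), residuals(weight, drinks))

print(raw, adjusted)  # raw is strongly positive; adjusted is close to zero
```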

Problem 5: Hypothesis Testing [2 points]

Problem 5.1: Setting Up the Test [1 point]

Suppose that you have to conduct a randomized controlled experiment to determine if a new medication for high blood pressure works in practice. You divide the participants, who volunteered to be part of the experiment, into two groups at random: treatment group and control group. The experiment will run for 6 months, after which you will measure the blood pressure of all participants.

Questions:
  • What should we give to the control group?
  • What would you propose as the null and alternative hypotheses for testing whether the new medication reduces blood pressure?
  • What test statistic would you use to determine if we should reject the null hypothesis?
  • Suppose that we store the treatment and control group participants’ blood pressure values (measured after 6 months) in arrays treatment and control, respectively. How would you implement calculating the test statistic in Python?

Answer:
  • Control group: placebo (see Lecture 2 / “Randomized Controlled Experiment with Placebo”)
  • Hypotheses: the null hypothesis should be that there is no difference between the blood pressures of the control and treatment groups, i.e., the blood pressure values of the two groups are drawn from the same population; the alternative hypothesis is that the null hypothesis is not true (other, similar formulations are also acceptable)
  • Test statistic: absolute value of the difference between the average blood pressures of the control and treatment groups (other, sensible formulations are also acceptable)
  • Python expression: abs(np.mean(treatment) - np.mean(control)) (any correct implementation of the proposed test statistic is acceptable)
(see Lecture 4 and Lab Assignment 4 / Parts 2 or 3)
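
To show how the test statistic from Problem 5.1 might be used, here is a minimal sketch of a permutation test. It assumes the null distribution is simulated by shuffling the group labels, which may or may not match the exact procedure in Lab Assignment 4. The function names and the sample data are hypothetical.

```python
import numpy as np

def abs_mean_difference(treatment, control):
    """Test statistic from the answer: |mean(treatment) - mean(control)|."""
    return abs(np.mean(treatment) - np.mean(control))

def permutation_p_value(treatment, control, n_shuffles=5_000, seed=0):
    """Simulate the statistic under the null by shuffling group membership."""
    rng = np.random.default_rng(seed)
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    observed = abs_mean_difference(treatment, control)

    pooled = np.concatenate([treatment, control])
    n_treat = len(treatment)
    simulated = np.empty(n_shuffles)
    for i in range(n_shuffles):
        shuffled = rng.permutation(pooled)
        simulated[i] = abs_mean_difference(shuffled[:n_treat], shuffled[n_treat:])

    # p-value: share of simulated statistics at least as large as the observed one.
    return observed, np.mean(simulated >= observed)

# Hypothetical measurements after 6 months (made-up numbers).
rng = np.random.default_rng(seed=4)
treatment = rng.normal(120, 10, size=50)
control = rng.normal(140, 10, size=50)

observed, p_value = permutation_p_value(treatment, control)
print(observed, p_value)
```

Because the statistic is an absolute difference, this test is two-sided: it detects a difference in either direction, not specifically a reduction.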

Problem 5.2: Interpreting Test Results [1 point]

Suppose that the following figure shows the distribution of the test statistic under the null hypothesis, and the red line marks the test statistic value that you observed in the experiment:

Does the new medication work? Would you reject the null hypothesis or not?

Suppose that to obtain more reliable results, researchers repeat the entire experiment 40 times (with 40 different control and 40 different treatment groups), and test with a 5% significance threshold. Out of the 40 experiments, 2 experiments provide positive results, showing that the medication works. Since there are multiple statistically significant results, the researchers are confident that the medication works, publish these results, and try to make the treatment available to the public. What methodological mistake are the researchers making?

Answers:
  • Test result: there is no reason to reject the null hypothesis since the p-value is very high (consider the area under the curve to the right of the observed value); if the null hypothesis is that there is no difference between the blood pressures of the treatment and control groups, then the test result shows that the medication does not work
  • Mistake in repeated experiments: p-hacking (with a 5% significance threshold and 40 experiments, we expect a Type 1 error to occur 2 times; hence, 2 positive results may be purely due to chance)
(see Lecture 4 / “p-Value” and “Data Snooping”)
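
The arithmetic behind the p-hacking answer can be checked in a few lines. The sketch assumes that the 40 experiments are independent and that the null hypothesis is true in every one of them, so each has a 5% chance of a false positive.

```python
n_experiments = 40
alpha = 0.05  # significance threshold = probability of a Type 1 error per experiment

# Expected number of false positives when the medication has no effect.
expected_false_positives = n_experiments * alpha  # 2

# Probability of seeing no false positive, or exactly one, across all experiments.
p_none = (1 - alpha) ** n_experiments
p_exactly_one = n_experiments * alpha * (1 - alpha) ** (n_experiments - 1)

p_at_least_one = 1 - p_none                  # about 0.87
p_at_least_two = 1 - p_none - p_exactly_one  # about 0.60

print(expected_false_positives, p_at_least_one, p_at_least_two)
```

Under these assumptions, two or more "significant" results out of 40 occur about 60% of the time even for a medication with no effect, so they are weak evidence on their own.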

Problem 6: Bootstrap [0.5 points]

Suppose that you are trying to estimate the median blood pressure of a population based on a random sample. To determine how reliable your estimate is, you applied the bootstrap method and obtained the following bootstrap empirical distribution for your statistic (i.e., the median):

Percentile          Value
5th percentile      103.76
10th percentile     107.04
15th percentile     109.47
20th percentile     111.40
25th percentile     113.01
30th percentile     114.54
35th percentile     115.93
40th percentile     117.31
45th percentile     118.56
50th percentile     119.87
55th percentile     121.09
60th percentile     122.32
65th percentile     123.58
70th percentile     125.09
75th percentile     126.60
80th percentile     128.23
85th percentile     130.22
90th percentile     132.84
95th percentile     136.55

What is the 80% confidence interval for your statistic (i.e., the median)?

Answer: 107.04 – 132.84 (between the 10th and 90th percentile since this interval contains 90% - 10% = 80% of the outcomes)
(see Lecture 5 / “Percentiles and Quartiles” and “Confidence Intervals”)
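
A table like the one in Problem 6 could be produced by the bootstrap percentile method, sketched below. The sample is hypothetical and the function name is an assumption; the interval is simply the range between the 10th and 90th percentiles of the resampled medians, as in the answer.

```python
import numpy as np

def bootstrap_median_interval(sample, confidence=0.80, n_resamples=5_000, seed=0):
    """Percentile-method confidence interval for the median."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)

    medians = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = rng.choice(sample, size=len(sample), replace=True)
        medians[i] = np.median(resample)

    # For 80% confidence, cut 10% from each tail: 10th and 90th percentiles.
    tail = (1 - confidence) / 2 * 100
    return np.percentile(medians, tail), np.percentile(medians, 100 - tail)

# Hypothetical random sample of blood pressure values (made-up numbers).
rng = np.random.default_rng(seed=5)
sample = rng.normal(120, 20, size=200)

lower, upper = bootstrap_median_interval(sample)
print(lower, upper)
```

A wider confidence level uses percentiles further out: a 90% interval would run from the 5th to the 95th percentile of the same bootstrap distribution.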