Homework_Assignment_1 - Solutions

School: Pennsylvania State University
Course: 200
Subject: Computer Science
Date: Dec 6, 2023
Type: pdf

Homework Assignment 1
DS200: Introduction to Data Sciences, Fall 2022

Please complete this assignment by writing your answers in this document. You can submit this Word document or a PDF export on Canvas.

Problem 1: Sampling [0.5 points]

Suppose that you work at a hospital, and you have to recruit participants for a medical study to test the efficacy of a new heart disease medication. Match the three examples below to the three sampling approaches.

Examples:
1. You try to recruit the 100 patients who have the highest blood pressure.
2. You sort all eligible patients by age and try to recruit every 500th patient, starting with a randomly chosen patient from the first 500.
3. You store patient identifiers in an array called patients, apply numpy.random.choice(patients, 100), and try to recruit the patients returned by this function.

Sampling approaches:
• deterministic
• systematic random
• simple random

Answer:
1: deterministic
2: systematic random
3: simple random
(see Lecture 3 / “Sampling Terminology” and “Probability Samples”)
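The three approaches in Problem 1 can be sketched in NumPy as follows. This is an illustration only, not part of the official solution: the patient identifiers and blood pressure readings are made up, and the identifiers are assumed to already be in age-sorted order. Note that numpy.random.choice samples with replacement by default, so the sketch passes replace=False to avoid picking the same patient twice; that is a choice made here, not something the assignment specifies.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible

# Hypothetical data: 50,000 patient identifiers (assumed sorted by age)
# and made-up blood pressure readings for each patient.
patients = np.arange(50_000)
blood_pressure = rng.normal(loc=120, scale=15, size=patients.size)

# 1. Deterministic: the 100 patients with the highest blood pressure.
#    No chance is involved in who gets selected.
deterministic = patients[np.argsort(blood_pressure)[-100:]]

# 2. Systematic random: a random start among the first 500 patients,
#    then every 500th patient after that.
start = rng.integers(0, 500)
systematic = patients[start::500]

# 3. Simple random: 100 patients drawn uniformly at random.
simple_random = rng.choice(patients, size=100, replace=False)
```

Only the starting point is random in the systematic sample; once it is chosen, every other selection is fixed.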

Problem 2: Distribution [1 point]

Consider the following distribution of values:

Identify the following:
• mean
• median
• outlier
• 1st quartile
• 95th percentile

Answer:
a: 1st quartile
b: median
c: mean
d: 95th percentile
e: outlier
(see Lecture 5 / “Percentiles and Quartiles” and “Mean vs. Median”; see Lab Assignment 2 / “Visualization” for outlier)
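As a rough illustration of the quantities named in Problem 2, the sketch below computes them for a small, made-up, right-skewed data set (the values are not taken from the assignment's figure). The outlier rule used here, values beyond 1.5 times the interquartile range above the third quartile, is one common convention and is assumed for this example; the course may define outliers differently.

```python
import numpy as np

# Hypothetical right-skewed values with one extreme value.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 40])

q1 = np.percentile(values, 25)   # 1st quartile
median = np.median(values)       # 50th percentile
mean = np.mean(values)           # pulled upward by the extreme value
p95 = np.percentile(values, 95)  # 95th percentile

# Assumed rule of thumb for outliers: above Q3 + 1.5 * IQR.
q3 = np.percentile(values, 75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = values[values > upper_fence]

print(q1, median, mean, p95, outliers)
```

In a right-skewed distribution like this one, the mean sits above the median because the extreme value pulls it upward, which gives the ordering 1st quartile, median, mean, 95th percentile, outlier.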

Problem 3: Association [1 point]

Consider the following three scatter plots:

What kind of association can you observe between X and Y in each figure? Explain your answer.

Answer:
First figure: negative
Second figure: positive
Third figure: no association can be observed
(see Lab Assignment 2 / “Visualization” or Lecture 7 / “Associations”)
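One way to check the direction of a linear association numerically is the sign of the correlation coefficient. The sketch below uses synthetic data, not the points from the assignment's scatter plots, and the slopes and noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seeded only for reproducibility
x = rng.normal(size=500)
noise = rng.normal(size=500)

# Synthetic Y variables with a known relationship to X.
y_negative = -2 * x + noise           # Y tends to fall as X rises
y_positive = 2 * x + noise            # Y tends to rise as X rises
y_unrelated = rng.normal(size=500)    # Y generated independently of X

def correlation(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r_negative = correlation(x, y_negative)
r_positive = correlation(x, y_positive)
r_unrelated = correlation(x, y_unrelated)
```

A coefficient near zero only rules out a linear association; a scatter plot is still needed to spot nonlinear patterns.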

Problem 4: Causality [1 point]

Suppose that you observed a positive association between the following three variables:
• number of tooth cavities,
• ounces of sugary drinks consumed,
• weight in pounds.

Which of these three variables might be a confounding variable? What spurious association may it cause? Explain your answer.

Answer:
Consuming sugary drinks likely causes both tooth cavities and weight increase; hence, “consuming sugary drinks” is a confounding variable, which creates a spurious relationship between “tooth cavities” and “weight in pounds”. There is no causal relationship between these two variables; their association is due to their common causal relationship with the confounding variable.
(see Lecture 2)
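A small simulation can make the confounding argument concrete. In the sketch below, all numbers (ranges, slopes, noise levels) are invented for illustration: sugary drink consumption drives both cavities and weight, while cavities and weight have no direct link. The two outcomes still end up correlated, and the correlation largely disappears once the linear effect of the confounder is removed from each.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # seeded only for reproducibility
n = 5_000

# Hypothetical causal model: drinks -> cavities and drinks -> weight.
drinks = rng.uniform(0, 60, size=n)                     # ounces consumed
cavities = 0.1 * drinks + rng.normal(0, 1, size=n)      # no dependence on weight
weight = 150 + 0.8 * drinks + rng.normal(0, 10, size=n) # no dependence on cavities

def residuals(y, x):
    """Remove the best-fit linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw association between the two outcomes (spurious).
r_raw = np.corrcoef(cavities, weight)[0, 1]

# Association after accounting for the confounder.
r_adjusted = np.corrcoef(residuals(cavities, drinks),
                         residuals(weight, drinks))[0, 1]
```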

Problem 5: Hypothesis Testing [2 points]

Problem 5.1: Setting Up the Test [1 point]

Suppose that you have to conduct a randomized controlled experiment to determine if a new medication for high blood pressure works in practice. You divide the participants, who volunteered to be part of the experiment, into two groups at random: treatment group and control group. The experiment will run for 6 months, after which you will measure the blood pressure of all participants.

Questions:
• What should we give to the control group?
• What would you propose as the null and alternative hypotheses for testing whether the new medication reduces blood pressure?
• What test statistic would you use to determine if we should reject the null hypothesis?
• Suppose that we store the treatment and control group participants’ blood pressure values (measured after 6 months) in arrays treatment and control, respectively. How would you implement calculating the test statistic in Python?

Answer:
Control group: placebo (see Lecture 2 / “Randomized Controlled Experiment with Placebo”)
Hypotheses: the null hypothesis should be that there is no difference between the blood pressures of the control and treatment groups, i.e., the blood pressure values of the two groups are drawn from the same population; the alternative hypothesis is that the null hypothesis is not true (other, similar formulations are also acceptable)
Test statistic: absolute value of the difference between the average blood pressures of the control and treatment groups (other, sensible formulations are also acceptable)
Python expression: abs(np.mean(treatment) - np.mean(control)) (any correct implementation of the proposed test statistic is acceptable)
(see Lecture 4 and Lab Assignment 4 / Parts 2 or 3)
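The test statistic from the answer can be wrapped in a function, and its distribution under the null hypothesis can be simulated by shuffling group labels. The permutation approach below is one common way to do this and is an assumption of this sketch; the assignment itself does not say which simulation method to use. The function names and the default number of permutations are hypothetical.

```python
import numpy as np

def abs_mean_difference(treatment, control):
    """Test statistic: absolute difference between the group means."""
    return abs(np.mean(treatment) - np.mean(control))

def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Estimate the p-value by repeatedly shuffling the pooled values.

    Under the null hypothesis both groups come from the same population,
    so any split of the pooled values into two groups is equally likely.
    """
    rng = np.random.default_rng(seed)
    observed = abs_mean_difference(treatment, control)
    pooled = np.concatenate([treatment, control])
    n_treatment = len(treatment)

    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        simulated = abs_mean_difference(shuffled[:n_treatment],
                                        shuffled[n_treatment:])
        if simulated >= observed:
            count += 1
    # Fraction of simulated statistics at least as extreme as the observed one.
    return count / n_permutations
```

Calling permutation_p_value(treatment, control) with the two blood pressure arrays returns a value between 0 and 1; a value below the chosen significance threshold would lead to rejecting the null hypothesis.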

Problem 5.2: Interpreting Test Results [1 point]

Suppose that the following figure shows the distribution of the test statistic under the null hypothesis, and the red line marks the test statistic value that you observed in the experiment:

Does the new medication work? Would you reject the null hypothesis or not?

Suppose that to obtain more reliable results, researchers repeat the entire experiment 40 times (with 40 different control and 40 different treatment groups), and test with a 5% significance threshold. Out of the 40 experiments, 2 experiments provide positive results, showing that the medication works. Since there are multiple statistically significant results, the researchers are confident that the medication works, publish these results, and try to make the treatment available to the public. What methodological mistake are the researchers making?

Answers:
Test result: there is no reason to reject the null hypothesis since the p-value is very high (consider the area under the curve, right of the observed value); if the null hypothesis is that there is no difference between the blood pressures of the treatment and control groups, then the test result provides no evidence that the medication works
Mistake in repeated experiments: p-hacking (with a 5% significance threshold and 40 experiments, we expect about 0.05 × 40 = 2 Type I errors even if the medication has no effect; hence, 2 positive results may be purely due to chance)
(see Lecture 4 / “p-Value” and “Data Snooping”)
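The arithmetic behind the repeated-experiments answer can be checked directly. The sketch below assumes that the 40 experiments are independent and that the null hypothesis is true in every one of them, so each has a 5% chance of a false positive; under those assumptions the number of false positives follows a binomial distribution.

```python
from math import comb

alpha = 0.05        # significance threshold
n_experiments = 40  # number of repeated experiments

# Expected number of false positives if the medication has no effect.
expected_false_positives = alpha * n_experiments

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

p_at_least_one = prob_at_least(1, n_experiments, alpha)
p_at_least_two = prob_at_least(2, n_experiments, alpha)

print(expected_false_positives)   # about 2
print(round(p_at_least_one, 2))   # about 0.87
print(round(p_at_least_two, 2))   # about 0.60
```

Under these assumptions, seeing two or more "significant" results out of 40 is more likely than not even for a medication that does nothing.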

Problem 6: Bootstrap [0.5 points]

Suppose that you are trying to estimate the median blood pressure of a population based on a random sample. To determine how reliable your estimate is, you applied the bootstrap method and obtained the following bootstrap empirical distribution for your statistic (i.e., the median):

Percentile          Value
5th percentile      103.76
10th percentile     107.04
15th percentile     109.47
20th percentile     111.40
25th percentile     113.01
30th percentile     114.54
35th percentile     115.93
40th percentile     117.31
45th percentile     118.56
50th percentile     119.87
55th percentile     121.09
60th percentile     122.32
65th percentile     123.58
70th percentile     125.09
75th percentile     126.60
80th percentile     128.23
85th percentile     130.22
90th percentile     132.84
95th percentile     136.55

What is the 80% confidence interval for your statistic (i.e., the median)?

Answer:
107.04 to 132.84 (between the 10th and 90th percentile, since this interval contains 90% - 10% = 80% of the outcomes)
(see Lecture 5 / “Percentiles and Quartiles” and “Confidence Intervals”)
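A table like the one in Problem 6 could be produced by a bootstrap along the following lines. This is a minimal sketch, assuming the percentile method described in the answer; the function name, the number of resamples, and the default seed are hypothetical and not taken from the course materials.

```python
import numpy as np

def bootstrap_median_interval(sample, confidence=0.80, n_resamples=5_000, seed=0):
    """Percentile-method bootstrap confidence interval for the median."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)

    medians = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = rng.choice(sample, size=sample.size, replace=True)
        medians[i] = np.median(resample)

    # For 80% confidence, cut 10% from each tail: 10th and 90th percentiles.
    tail = (1 - confidence) / 2 * 100
    lower, upper = np.percentile(medians, [tail, 100 - tail])
    return lower, upper
```

With confidence=0.80 the function returns the 10th and 90th percentiles of the bootstrap medians, which corresponds to reading 107.04 and 132.84 off the table above.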
