Midterm_Exam - Solutions

School

Pennsylvania State University

Course

200

Subject

Computer Science

Date

Dec 6, 2023

DS200: Introduction to Data Sciences
Midterm Exam & Solutions

Problem 1: Parameters [3 points]

Consider the following histogram:

Match the following parameters: (a) mean, (b) median, (c) 10th percentile, (d) 3rd quartile, (e) standard deviation, (f) variance to the following values:

● 12.15: standard deviation
● 87.12: 10th percentile
● 104.12: mean
● 106.50: median (i.e., 50th percentile)
● 113.61: 3rd quartile (i.e., 75th percentile)
● 147.62: variance

(see Lecture 5 / “Percentile and Quartiles” and “Mean vs. Median”, and Lecture 6 / “Example Distributions”)
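One way to sanity-check a matching like the one in Problem 1 is to compute the same parameters on a small data set. The sketch below uses only Python's standard library and a made-up list of values (an assumption, since the exam's histogram data is not available). Note that `statistics.quantiles` interpolates between data points, which may differ slightly from the percentile definition used in the lectures. The last lines check the one relationship that holds for any data: the variance is the square of the standard deviation, which is why 147.62 pairs with 12.15.

```python
import statistics

# Hypothetical values for illustration only; NOT the data behind the exam's histogram.
values = [87, 95, 101, 104, 106, 107, 110, 113, 114, 120]

mean = statistics.fmean(values)
median = statistics.median(values)
std_dev = statistics.pstdev(values)      # population standard deviation
variance = statistics.pvariance(values)  # population variance

# Interpolated cut points: the first decile is the 10th percentile,
# the third quartile is the 75th percentile.
tenth_percentile = statistics.quantiles(values, n=10, method="inclusive")[0]
third_quartile = statistics.quantiles(values, n=4, method="inclusive")[2]

print(mean, median, tenth_percentile, third_quartile, std_dev, variance)

# Variance is the square of the standard deviation, for this data ...
variance_matches = abs(variance - std_dev ** 2) < 1e-9
# ... and for the values given in the exam solution (12.15 and 147.62).
exam_values_match = abs(12.15 ** 2 - 147.62) < 0.01
print(variance_matches, exam_values_match)
```

The ordering 10th percentile ≤ median ≤ 3rd quartile also holds for any data set, which narrows down the remaining matches.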

Problem 2: Hypothesis Testing [6 points]

Suppose that we conduct an experiment to determine if eating avocados can improve a person’s cholesterol level. We randomly assign volunteers to treatment and control groups, measure their cholesterol levels before and after the experiment, and ensure that the treatment group eats a lot of avocados during the test, while the control group does not. Our null hypothesis is that eating avocados has no impact on a person’s cholesterol level. We will reject this null hypothesis only if the result of the experiment is highly statistically significant; that is, we use a 1% significance level as the threshold for rejecting the null hypothesis.

● What is the alternative hypothesis?
Alternative hypothesis: eating avocados has some impact on a person’s cholesterol level. (Anything sensible can be accepted as an answer, e.g., alternative hypothesis: control and treatment group cholesterol levels are not from the same distribution.)

● What do Type 1 and Type 2 errors mean in this experiment? Explain them in terms of eating avocados, cholesterol levels, and test conclusions.
Type 1 error: the test concludes that eating avocados has some impact on cholesterol level, but in reality, eating avocados has no impact on cholesterol.
Type 2 error: the test concludes that eating avocados has no impact on cholesterol, but in reality, it does have an impact.

● Estimate either the probability of Type 1 error or the probability of Type 2 error.
If the null hypothesis is true, the probability of a Type 1 error is the cut-off p-value, which is 1% in this problem.

(see Lecture 4 / “Hypothesis Testing”, “Types of Errors”, and “Probability of Type 1 Error”)
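The claim that the Type 1 error probability equals the significance level can be checked by simulation. The sketch below is an illustration under stated assumptions, not the exam's method: it invents cholesterol changes drawn from the same normal distribution for both groups (so the null hypothesis is true by construction), runs a permutation test on the difference of group means, and counts how often the test rejects at the 1% level. The function names, group size, and distribution parameters are all hypothetical.

```python
import random

def permutation_p_value(treatment, control, n_permutations, rng):
    """Two-sided permutation p-value for the difference in group means."""
    n_t = len(treatment)
    n_c = len(control)
    observed = abs(sum(treatment) / n_t - sum(control) / n_c)
    pooled = treatment + control  # new list; the inputs are not modified
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of treatment/control
        diff = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

def type1_error_rate(n_experiments=200, group_size=20, n_permutations=200,
                     alpha=0.01, seed=0):
    """Fraction of experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Hypothetical cholesterol changes; both groups share one distribution,
        # so any rejection here is a Type 1 error.
        treatment = [rng.gauss(0, 10) for _ in range(group_size)]
        control = [rng.gauss(0, 10) for _ in range(group_size)]
        if permutation_p_value(treatment, control, n_permutations, rng) <= alpha:
            rejections += 1
    return rejections / n_experiments

rate = type1_error_rate()
print(rate)  # expected to be roughly alpha, up to simulation noise
```

With only 200 simulated experiments and 200 permutations each, the estimate is coarse, so it should be read as "on the order of 1%" rather than as an exact value.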

Problem 3: Estimation [8 points]

Consider the following histogram of a population of values:

The mean of this population is 200.17, and its standard deviation is 81.21. Now suppose that we draw simple random samples from this population, each sample having 10,000 elements, and calculate the mean of each sample.

1. If we plot a histogram of the sample means, what shape do we expect to see? Which theorem says that we should expect to see this shape?
A normal curve (bell shaped is also acceptable as an answer), because of the Central Limit Theorem.

2. What would you estimate the standard deviation of the sample means to be?
std. dev. of sample means = std. dev. of population / sqrt(sample size) = 81.21 / sqrt(10,000) = 81.21 / 100 = 0.8121
(Just plugging the numbers into the formula is acceptable; simplifying to one final number is not required.)

3. Can you specify an interval such that the mean of a random sample falls into this interval with at least 75% probability? Which theorem did you use for this estimation?
According to Chebychev’s inequality (or Chebychev’s bounds), at least 1 - 1/2^2 = 75% of sample means fall within 2 standard deviations of the mean, that is, into the range
[mean - 2 * std. dev. of sample means, mean + 2 * std. dev. of sample means] = [200.17 - 2 * 0.8121, 200.17 + 2 * 0.8121] = [198.55, 201.79]
(No need to simplify to the final numbers; any correct and formal specification of the interval is acceptable.)

(see Lecture 6 / “Central Limit Theorem”, “Sample Size vs. Variability”, and “Chebychev's Bounds”)
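The arithmetic in Problem 3 can be reproduced in a few lines. The sketch below first plugs the stated population values into the two formulas, then runs a small simulation as a plausibility check. The simulation uses a hypothetical skewed population (exponential with mean 1 and SD 1) and a smaller sample size of 100, both chosen only to keep the run fast; they are assumptions, not part of the exam.

```python
import math
import random
import statistics

# Values stated in the problem.
population_mean = 200.17
population_sd = 81.21
sample_size = 10_000

# SD of the sample means = population SD / sqrt(sample size).
sd_of_sample_means = population_sd / math.sqrt(sample_size)

def chebyshev_interval(center, sd, k):
    """Return center -/+ k*sd and Chebyshev's lower bound 1 - 1/k^2."""
    return center - k * sd, center + k * sd, 1 - 1 / k ** 2

low, high, min_probability = chebyshev_interval(
    population_mean, sd_of_sample_means, 2)
print(round(sd_of_sample_means, 4), round(low, 2), round(high, 2),
      min_probability)

# Plausibility check on a hypothetical skewed population:
# exponential with rate 1 has mean 1 and SD 1.
rng = random.Random(0)
sim_sample_size = 100
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sim_sample_size))
    for _ in range(2000)
]
predicted_sd = 1.0 / math.sqrt(sim_sample_size)  # 0.1
observed_sd = statistics.stdev(sample_means)
fraction_within_2_sd = sum(
    abs(m - 1.0) <= 2 * predicted_sd for m in sample_means
) / len(sample_means)
print(predicted_sd, observed_sd, fraction_within_2_sd)
```

Chebychev's bound holds for any distribution, so it is conservative here: because the sample means are approximately normal, the fraction within 2 standard deviations is typically close to 95%, well above the guaranteed 75%.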

Problem 4: Correlation [4 points]

Consider the following scatter plots: (a), (b), (c), (d)

What would you estimate the correlation coefficient between the two variables to be in each plot? Briefly explain your answer for each plot! Note that you need to provide only a very rough, qualitative estimate, not an exact value.

● (a): close to 0 (or 0) because no association can be observed
● (b): close to 1 because of strong positive association
● (c): close to 0 (or 0) because no linear association can be observed
● (d): close to -1 because of strong negative association

(As usual, equivalent statements are acceptable.)

(see Lecture 7 / “Properties of Correlation” and “Limitations of Correlation”)
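The distinction between "no association" and "no linear association" in Problem 4 can be illustrated numerically. The sketch below computes the correlation coefficient as the mean of the products of the two variables in standard units, on made-up data sets (assumptions, since the exam's scatter plots are not available): two perfect lines and a parabola. The parabola is a strong association, yet its correlation is 0 because the relationship is not linear.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient: mean of products of values in standard units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ) / n

# Hypothetical data for illustration only.
xs = list(range(-10, 11))
increasing_line = [2 * x + 1 for x in xs]   # perfect positive linear association
decreasing_line = [-3 * x for x in xs]      # perfect negative linear association
parabola = [x ** 2 for x in xs]             # strong but non-linear association

r_increasing = correlation(xs, increasing_line)
r_decreasing = correlation(xs, decreasing_line)
r_parabola = correlation(xs, parabola)
print(r_increasing, r_decreasing, r_parabola)
```

Real scatter plots contain noise, so the values would be close to 1, -1, and 0 rather than exact.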

Problem 5: Python Programming [4 points]

1. Match each of the following Python code segments with its corresponding type of sampling method:

(a)
    # random starting row in 0..9, then every 10th row from there
    start_row = np.random.choice(np.arange(10))
    selected_rows = np.arange(start_row, cars.num_rows, 10)
    cars.take(selected_rows)

(b)
    # always starts at row 0, then every 10th row
    cars = Table.read_table('cars.csv')
    selected_rows = np.arange(0, cars.num_rows, 10)
    cars.take(selected_rows)

(c)
    # about one tenth of the rows, each chosen at random
    all_rows = np.arange(cars.num_rows)
    selected_num_rows = round(cars.num_rows / 10)
    selected_rows = np.random.choice(all_rows, selected_num_rows)
    cars.take(selected_rows)

Sampling methods:
● deterministic sampling: (b)
● systematic random sampling: (a)
● simple random sampling: (c)

(see Lecture 3 / “Sampling Terminology” and “Probability Samples”)

2. The bootstrap generates new random samples by a method called resampling: new samples are drawn at random from the original sample. Since the original sample resembles the population, sampling the sample resembles sampling the population. Fill in the blanks in the code below, which simulates the bootstrap, to get the empirical distribution of sample means.

    # filled-in blank: False (the original sample is drawn without replacement)
    original_sample = cars.sample(1000, with_replacement=False)
    bootstrap = Table().with_columns('Iteration', make_array(), 'Statistic (Mean)', make_array())
    for i in np.arange(2000):
        # filled-in blank: True (each resample is drawn with replacement)
        resample = original_sample.sample(1000, with_replacement=True)
        statistic = np.mean(resample.column('Price'))
        bootstrap.append((i, statistic))
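The sampling and bootstrap ideas in Problem 5 can also be sketched without the `datascience` library. The version below uses only Python's standard library and a made-up list of prices standing in for the `'Price'` column (an assumption; `cars.csv` is not available here). It draws a systematic random sample, a simple random sample without replacement, and then bootstraps the sample mean. The sample sizes are smaller than in the exam code to keep the run short. As a side note on segment (c), `np.random.choice` draws with replacement unless `replace=False` is passed.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical stand-in for the 'Price' column of the cars table.
prices = [rng.uniform(5_000, 60_000) for _ in range(5_000)]

# Systematic random sample: random start in 0..9, then every 10th element.
start = rng.randrange(10)
systematic_sample = prices[start::10]

# Simple random sample, drawn without replacement.
original_sample = rng.sample(prices, 500)

def bootstrap_means(sample, n_resamples, rng):
    """Means of resamples drawn with replacement, each the size of the sample."""
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=len(sample))  # with replacement
        means.append(statistics.fmean(resample))
    return means

means = bootstrap_means(original_sample, 1000, rng)

# Middle 95% of the bootstrap means, by sorting and cutting off 2.5% per tail.
ordered = sorted(means)
interval = (ordered[25], ordered[974])
print(statistics.fmean(original_sample), interval)
```

The resample has to be drawn with replacement: a resample of the same size drawn without replacement would just be a reordering of the original sample, so every bootstrap mean would be identical.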
