
Prof. Garcia SDS 201: Lecture notes March 28, 2018

Agenda

1. Confidence intervals for tests of proportions
2. Goodness of fit

Confidence Intervals for tests of proportions

Confidence intervals for test statistics that are normally distributed are of the form:

$$\text{point estimate} \pm z^{*}_{\alpha/2} \cdot SE$$

Computing the point estimate is usually easy. Once you've chosen a confidence level, finding $z^{*}_{\alpha/2}$ is trivial (use qnorm()). The difficult part is usually computing the SE, since that depends on the sampling distribution of the test statistic!

1. In 2009, a national vital statistics report indicated that about 3% of all births produced twins. Is the rate of twin births the same among very young mothers? Data from a large city hospital found that only 7 sets of twins were born to 469 teenage girls. Calculate a confidence interval for the true proportion and use it to test an appropriate hypothesis.

2. Researchers at the National Cancer Institute released the results of a study of 827 dogs from homes where an herbicide was used on a regular basis, diagnosing malignant lymphoma in 473 of them. Of the 130 dogs from homes where no herbicides were used, only 19 were found to have lymphoma. Is there a different rate of cancer in dogs between homes that use the herbicide and homes that do not? Calculate a confidence interval for the true difference in proportions and use it to test an appropriate hypothesis.
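As a rough sketch of how the interval formula above could be applied to the two exercises, the following base R code uses the usual large-sample (Wald) standard errors for one proportion and for a difference in proportions. The choice of these SE formulas, the 95% confidence level, and all variable names are assumptions made for illustration; with only 7 twin births, the normal approximation in the first exercise is also questionable.

```r
# Wald-style intervals: point estimate +/- z* x SE.
# The SE formulas below are an assumption; the notes leave them as an exercise.
z_star <- qnorm(0.975)  # critical value for a 95% confidence level

# Exercise 1: one proportion (7 sets of twins among 469 births)
p_hat  <- 7 / 469
se_one <- sqrt(p_hat * (1 - p_hat) / 469)
ci_one <- p_hat + c(-1, 1) * z_star * se_one

# Exercise 2: difference in proportions (herbicide vs. no herbicide)
p_herb  <- 473 / 827
p_none  <- 19 / 130
se_diff <- sqrt(p_herb * (1 - p_herb) / 827 + p_none * (1 - p_none) / 130)
ci_diff <- (p_herb - p_none) + c(-1, 1) * z_star * se_diff

print(round(ci_one, 4))
print(round(ci_diff, 4))
```

To turn either interval into a test, check whether the null value (0.03 for the twin rate, 0 for the difference) falls inside it. A null value outside the interval is inconsistent with the data at the matching significance level.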

Goodness of Fit

Previously, we considered inference for a single proportion. That proportion was the fraction of the outcomes of a binary response variable that had a certain value. For example, respondents could either say that they preferred Coke, or that they preferred Pepsi. But what if the variable can have more than two outcomes? Can we still test the hypothesis that the sample was drawn from a known population?

The US Census Bureau reports that in 2000, among the population 15 years and older:

• 54.3% are married
• 27.1% have never been married
• 9.7% are divorced
• 6.6% are widowed
• 2.2% are separated

We can encode these percentages as a vector in R:

    us <- c("Divorced" = 0.097, "Married" = 0.543, "Never married/single" = 0.271,
            "Separated" = 0.022, "Widowed" = 0.066)
    # normalize to make sure proportions sum to 1
    us <- us / sum(us)

The openintro package contains a sample of 500 Americans collected in the 2000 Census. In this sample, the percentages are different:

    library(openintro)
    library(mosaic)
    marital_summary <- census %>%
      mutate(maritalStatus = forcats::fct_recode(maritalStatus,
        Married = "Married/spouse absent",
        Married = "Married/spouse present")) %>%
      group_by(maritalStatus) %>%
      summarize(status_obs = n()) %>%
      mutate(marital_status_pct = status_obs / nrow(census),
             marital_status_us = us)
    marital_summary$marital_status_pct

    ## [1] 0.076 0.412 0.444 0.006 0.062

Is it reasonable to conclude that the sample from 2000 reflects the overall US population? In the previous case, the test statistic was the observed sample proportion $\hat{p}$. In this case, we have more than two outcomes, so there is nothing quite analogous to $\hat{p}$.
The test statistic that we will use will be labelled $X^2$, and its formula is:

$$X^2 = \sum_{i=1}^{k} Z_i^2 = \sum_{i=1}^{k} \left( \frac{\text{observed}_i - \text{expected}_i}{\sqrt{\text{expected}_i}} \right)^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i},$$

where $k$ is the number of different outcomes (which in this case is 5). As always, our goal is to put $X^2$ in context by determining where it lies in the null distribution.

First, let's compute the test statistic:

    n <- nrow(census)
    k <- nrow(marital_summary)
    marital_summary <- marital_summary %>%
      mutate(status_exp = marital_status_us * n)
    X2_hat <- marital_summary %>%
      summarize(X2 = sum((status_obs - status_exp)^2 / status_exp)) %>%
      unlist()
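To make the formula concrete, here is a minimal base R sketch that lays out the calculation as a table, one row per outcome. It assumes the observed counts can be recovered by multiplying the sample proportions printed in these notes by 500. The names `observed`, `expected`, and `calc` are illustrative and not part of the original code.

```r
# Census 2000 proportions from the notes, normalized to sum to 1
us <- c("Divorced" = 0.097, "Married" = 0.543, "Never married/single" = 0.271,
        "Separated" = 0.022, "Widowed" = 0.066)
us <- us / sum(us)

n <- 500
# Assumption: observed counts implied by the sample proportions in the notes
observed <- round(c(0.076, 0.412, 0.444, 0.006, 0.062) * n)
expected <- us * n

# One row per outcome: counts, standardized deviation Z_i, and Z_i^2
calc <- data.frame(
  observed  = observed,
  expected  = expected,
  z         = (observed - expected) / sqrt(expected),
  row.names = names(us)
)
calc$z_squared <- calc$z^2  # each outcome's contribution to X^2

X2 <- sum(calc$z_squared)
print(round(calc, 3))
print(X2)
```

The sum of the `z_squared` column should agree with the X-squared value of 79.154 that chisq.test() reports in these notes. The "Never married/single" row contributes by far the largest share.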

1. Write out the full calculation for $X^2$ using a table.

We want to test the null hypothesis that our sample came from the population, whose marital status breakdown is known. If the observed counts matched the expected counts exactly, the test statistic would be $\hat{X}^2 = 0$. Our observed value of $\hat{X}^2$ is very different from 0, but in order to understand how different, we need to know what the null distribution of $\hat{X}^2$ is. In this case, it is not normal!

Just as before, there are different ways to construct the sampling distribution of $\hat{X}^2$:

1. Simulation: The procedure is the same as it has been: sample from the hypothesized distribution and compute the test statistic many thousands of times.

    sim <- do(1000) * marital_summary %>%
      sample_n(size = n, replace = TRUE, weight = marital_status_us) %>%
      group_by(maritalStatus) %>%
      summarize(status_obs = n(), status_exp = first(status_exp)) %>%
      mutate(X2_i = (status_obs - status_exp)^2 / status_exp) %>%
      summarize(X2 = sum(X2_i))
    qplot(data = sim, x = X2)

   The p-value can be obtained using the pdata function, since the sampling distribution comes from simulated data in our workspace. Note also that since the distribution is non-negative, our test is one-sided.

    pdata(~X2, X2_hat, data = sim, lower.tail = FALSE)

    ## X2
    ##  0

2. Chi-Squared Test: Statisticians have constructed a parametric approximation to the sampling distribution of $\hat{X}^2$. It follows from probability theory that as long as the expected count of each outcome is at least 5, the test statistic follows a distribution that is closely approximated by a $\chi^2$-distribution on $k - 1$ degrees of freedom.

    plotDist("chisq", params = list(df = k - 1), lwd = 3)

   The p-value can be obtained using the pchisq function, since the sampling distribution follows a $\chi^2$-distribution.

    pchisq(X2_hat, df = k - 1, lower.tail = FALSE)

    ##          X2
    ## 2.63096e-16
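The two approaches can also be compared side by side without the mosaic helpers. The sketch below swaps in base R's rmultinom() to draw category counts directly under the null hypothesis, which is a different technique from the sample_n() pipeline in these notes. The observed counts are assumed from the sample proportions reported earlier, and the seed and variable names are arbitrary choices for illustration.

```r
set.seed(201)  # arbitrary seed so the sketch is reproducible

# Census 2000 proportions from the notes, normalized to sum to 1
us <- c(0.097, 0.543, 0.271, 0.022, 0.066)
us <- us / sum(us)
n <- 500
k <- length(us)
expected <- us * n

# Assumption: observed counts implied by the sample proportions in the notes
observed <- c(38, 206, 222, 3, 31)
X2_hat <- sum((observed - expected)^2 / expected)

# Each column of `draws` holds the k category counts of one simulated sample
draws <- rmultinom(1000, size = n, prob = us)
X2_sim <- colSums((draws - expected)^2 / expected)

# One-sided p-values: simulated tail proportion vs. chi-squared tail area
p_sim <- mean(X2_sim >= X2_hat)
p_chisq <- pchisq(X2_hat, df = k - 1, lower.tail = FALSE)

print(p_sim)
print(p_chisq)
```

Drawing counts with rmultinom() keeps categories that happen to receive zero observations, so every simulated statistic sums over all k outcomes.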


Notice that the p-value is a one-tailed area in this case, since the distribution is non-negative.

There is also a built-in function in R that will perform a $\chi^2$-test.

    with(marital_summary, chisq.test(status_obs, p = marital_status_us))

    ##
    ##  Chi-squared test for given probabilities
    ##
    ## data:  status_obs
    ## X-squared = 79.154, df = 4, p-value = 2.631e-16

What Can Go Wrong?

The condition that the expected count for each category is at least 5 is important, because if that condition is not met, the $\chi^2$-distribution may not be a sufficiently good approximation. The approximation relies on each standardized deviation $Z_i$ being approximately normal, so a small expected count in any single outcome can cause it to fail. For example, with a sample of only n = 35:

    n <- 35
    sim <- do(1000) * marital_summary %>%
      mutate(status_exp = marital_status_us * n) %>%
      sample_n(size = n, replace = TRUE, weight = marital_status_us) %>%
      group_by(maritalStatus) %>%
      summarize(status_obs = n(), status_exp = first(status_exp)) %>%
      mutate(X2_i = (status_obs - status_exp)^2 / status_exp) %>%
      summarize(X2 = sum(X2_i))
    qplot(data = sim, x = X2, geom = "density") +
      stat_function(fun = dchisq, args = list(df = k - 1), color = "purple")

[Figure: density of the simulated X2 values (x-axis: X2, y-axis: density), with the $\chi^2$ density on $k - 1$ degrees of freedom overlaid in purple.]
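A quick way to see why n = 35 is a problem is to compute the expected counts directly and compare them against the threshold of 5. This is a minimal sketch using the census proportions from these notes. The names `expected_500`, `expected_35`, and `too_small` are illustrative.

```r
# Census 2000 proportions from the notes, normalized to sum to 1
us <- c("Divorced" = 0.097, "Married" = 0.543, "Never married/single" = 0.271,
        "Separated" = 0.022, "Widowed" = 0.066)
us <- us / sum(us)

# Expected counts under the null for the two sample sizes used in the notes
expected_500 <- us * 500
expected_35  <- us * 35
print(round(rbind(n_500 = expected_500, n_35 = expected_35), 2))

# Outcomes that violate the "expected count of at least 5" condition at n = 35
too_small <- names(expected_35)[expected_35 < 5]
print(too_small)
```

With n = 500 every expected count clears the threshold, while with n = 35 several outcomes fall below it.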

In-Class Exercise, OI, 3.40

Evolution vs. creationism. A Gallup Poll released in December 2010 asked 1019 adults living in the Continental U.S. about their belief in the origin of humans. These results, along with results from a more comprehensive poll from 2001 (that we will assume to be exactly accurate), are summarized in the table below:

                                                         Year
    Response                                             2010    2001
    Humans evolved, with God guiding (1)                  38%     37%
    Humans evolved, but God had no part in process (2)    16%     12%
    God created humans in present form (3)                40%     45%
    Other / No opinion (4)                                 6%      6%

1. Calculate the actual number of respondents in 2010 that fall in each response category.
2. State hypotheses for the following research question: have beliefs on the origin of human life changed since 2001?
3. Calculate the expected number of respondents in each category under the condition that the null hypothesis is true.
4. Conduct a chi-square test and state your conclusion. (Reminder: verify conditions.)
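One possible way to set up the computational parts of this exercise in base R is sketched below. It assumes the 2010 counts can be recovered by rounding 1019 times each reported percentage, and it treats the 2001 percentages as the null proportions, as the exercise suggests. The variable names are illustrative, and the interpretation is left to the reader.

```r
n <- 1019
p_2010 <- c(0.38, 0.16, 0.40, 0.06)  # sample proportions, responses (1)-(4)
p_2001 <- c(0.37, 0.12, 0.45, 0.06)  # null proportions, assumed exactly accurate

# Step 1 (assumption): recover the 2010 counts by rounding n * proportion
observed <- round(n * p_2010)
stopifnot(sum(observed) == n)  # the rounded counts still add up to 1019

# Step 3: expected counts if beliefs have not changed since 2001
expected <- n * p_2001

# Step 4: check the expected-count condition, then run the test
print(all(expected >= 5))
result <- chisq.test(observed, p = p_2001)
print(result)
```

The degrees of freedom reported by chisq.test() should equal the number of response categories minus one.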
