Prof. Garcia
SDS 201: Lecture notes
March 28, 2018
Agenda
1. Confidence intervals for tests of proportions
2. Goodness of fit
Confidence Intervals for tests of proportions
Confidence intervals for test statistics that are normally distributed are of the form:

$$\text{point estimate} \pm z^*_{\alpha/2} \cdot SE$$

Computing the point estimate is usually easy. Once you’ve chosen a confidence level, finding $z^*_{\alpha/2}$ is trivial (use qnorm()). The difficult part is usually computing the $SE$, since that depends on the sampling distribution of the test statistic!
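As a minimal sketch of the recipe above, the following computes a confidence interval for a single proportion using the counts from the twin-births exercise in these notes (7 sets of twins among 469 births). The 95% confidence level and the variable names are assumptions made for illustration, and the standard error is the usual formula for a sample proportion. With only 7 successes, the normal approximation is rough, so treat the result as illustrative.

```r
# One-proportion confidence interval: point estimate +/- z* x SE.
x <- 7             # sets of twins (from the exercise)
n <- 469           # births to teenage mothers (from the exercise)
conf_level <- 0.95 # assumed confidence level

p_hat  <- x / n                               # point estimate
se     <- sqrt(p_hat * (1 - p_hat) / n)       # SE of a sample proportion
z_star <- qnorm(1 - (1 - conf_level) / 2)     # critical value z*_{alpha/2}

ci <- p_hat + c(-1, 1) * z_star * se
ci
```

If the hypothesized value (here the national rate of 0.03) falls outside the interval, the corresponding two-sided test rejects the null hypothesis at the matching significance level.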
1. In 2009, a national vital statistics report indicated that about 3% of all births produced twins.
Is the rate of twin births the same among very young mothers? Data from a large city hospital
found that only 7 sets of twins were born to 469 teenage girls. Calculate a confidence interval
for the true proportion and use it to test an appropriate hypothesis.
2. Researchers at the National Cancer Institute released the results of a study that investigated
the effects of herbicide use on dogs. They examined 827 dogs from homes where an herbicide was used on a regular basis, diagnosing
malignant lymphoma in 473 of them. Of the 130 dogs from homes where no herbicides were
used, only 19 were found to have lymphoma. Is there a different rate of cancer in dogs between
homes that use the herbicide and homes that do not? Calculate a confidence interval for the
true difference in proportions and use it to test an appropriate hypothesis.
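The second exercise can be sketched the same way. This is one possible approach, assuming a 95% confidence level and the unpooled standard error for a difference of two independent proportions; the variable names are illustrative.

```r
# Confidence interval for a difference in proportions (unpooled SE).
x1 <- 473; n1 <- 827   # lymphoma cases / dogs in herbicide homes
x2 <- 19;  n2 <- 130   # lymphoma cases / dogs in no-herbicide homes

p1 <- x1 / n1
p2 <- x2 / n2
diff_hat <- p1 - p2                                        # point estimate
se_diff  <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE of the difference
z_star   <- qnorm(0.975)                                   # assumed 95% level

ci_diff <- diff_hat + c(-1, 1) * z_star * se_diff
ci_diff
```

An interval that excludes 0 is evidence against the null hypothesis that the two rates are equal.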
Goodness of Fit
Previously, we considered inference for a single proportion. That proportion
was the fraction of the outcomes of a binary response variable that had a certain value. For example,
respondents could either say that they preferred Coke, or that they preferred Pepsi. But what if the
variable can have more than two outcomes? Can we still test the hypothesis that the sample was
drawn from a known population?
The US Census Bureau reports that in 2000, among the population 15 years and older:
• 54.3% are married
• 27.1% have never been married
• 9.7% are divorced
• 6.6% are widowed
• 2.2% are separated
We can encode these percentages as a vector in R:

    us <- c("Divorced" = 0.097, "Married" = 0.543,
            "Never married/single" = 0.271,
            "Separated" = 0.022, "Widowed" = 0.066)
    # normalize to make sure proportions sum to 1
    us <- us / sum(us)
The openintro package contains a sample of 500 Americans collected in the 2000 Census. In
this sample, the percentages are different:

    library(openintro)
    library(mosaic)
    marital_summary <- census %>%
      mutate(maritalStatus = forcats::fct_recode(maritalStatus,
        Married = "Married/spouse absent",
        Married = "Married/spouse present")) %>%
      group_by(maritalStatus) %>%
      summarize(status_obs = n()) %>%
      mutate(marital_status_pct = status_obs / nrow(census),
             marital_status_us = us)
    marital_summary$marital_status_pct
    ## [1] 0.076 0.412 0.444 0.006 0.062
Is it reasonable to conclude that the sample from 2000 reflects the overall US population?
In the previous case, the test statistic was the observed sample proportion $\hat{p}$. In this case, we
have more than two outcomes, so there is nothing quite analogous to $\hat{p}$. The test statistic that we
will use will be labelled $X^2$, and its formula is:

$$X^2 = \sum_{i=1}^{k} Z_i^2 = \sum_{i=1}^{k} \left( \frac{\text{observed}_i - \text{expected}_i}{\sqrt{\text{expected}_i}} \right)^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i},$$

where $k$ is the number of different outcomes (which in this case is 5). As always, our goal is to put
$X^2$ in context by determining where it lies in the null distribution. First, let’s compute the test
statistic:
    n <- nrow(census)
    k <- nrow(marital_summary)
    marital_summary <- marital_summary %>%
      mutate(status_exp = marital_status_us * n)
    X2_hat <- marital_summary %>%
      summarize(X2 = sum((status_obs - status_exp)^2 / status_exp)) %>%
      unlist()
1. Write out the full calculation for $X^2$ using a table
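One way to lay out that table in base R is sketched below. It does not need the census data: the observed counts are recovered by multiplying the sample proportions printed earlier in these notes by the sample size of 500, which is an assumption about how the counts round. The column names are illustrative.

```r
# Table of observed counts, expected counts, and per-category contributions.
status <- c("Divorced", "Married", "Never married/single",
            "Separated", "Widowed")
us <- c(0.097, 0.543, 0.271, 0.022, 0.066)
us <- us / sum(us)                      # normalize so proportions sum to 1

n <- 500
# Assumption: counts recovered from the printed sample proportions.
observed <- round(c(0.076, 0.412, 0.444, 0.006, 0.062) * n)
expected <- us * n

x2_table <- data.frame(
  status       = status,
  observed     = observed,
  expected     = round(expected, 2),
  contribution = round((observed - expected)^2 / expected, 3)
)
x2_table

# The test statistic is the sum of the contribution column.
X2 <- sum((observed - expected)^2 / expected)
X2
```

The largest contribution comes from the "Never married/single" category, where the observed count is far above the expected count.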
We want to test the null hypothesis that our sample came from the population, whose marital
status breakdown is known. If the observed counts matched the expected counts exactly, the test
statistic would be $\hat{X}^2 = 0$. Our observed value of $\hat{X}^2$ is very
different from 0, but in order to understand how different, we need to know what the null distribution
of $\hat{X}^2$ is. In this case, it is not normal!
Just as before, there are different ways to construct the sampling distribution of $\hat{X}^2$:
1. Simulation: The procedure is the same as it has been: sample from the hypothesized distribution
and compute the test statistic many thousands of times.
    sim <- do(1000) * marital_summary %>%
      sample_n(size = n, replace = TRUE, weight = marital_status_us) %>%
      group_by(maritalStatus) %>%
      summarize(status_obs = n(), status_exp = first(status_exp)) %>%
      mutate(X2_i = (status_obs - status_exp)^2 / status_exp) %>%
      summarize(X2 = sum(X2_i))
    qplot(data = sim, x = X2)
The p-value can be obtained using the pdata function, since the sampling distribution comes
from simulated data in our workspace. Note also that since the distribution is non-negative,
our test is one-sided.

    pdata(~X2, X2_hat, data = sim, lower.tail = FALSE)
    ## X2
    ##  0
2. Chi-Squared Test: Statisticians have constructed a parametric approximation to the sampling
distribution of $\hat{X}^2$. It follows from probability theory that as long as the expected count of
each outcome is at least 5, the test statistic follows a distribution that is closely approximated
by a $\chi^2$-distribution on $k - 1$ degrees of freedom.
    plotDist("chisq", params = list(df = k - 1), lwd = 3)
The p-value can be obtained using the pchisq function, since the sampling distribution follows
a $\chi^2$-distribution.

    pchisq(X2_hat, df = k - 1, lower.tail = FALSE)
    ##          X2
    ## 2.63096e-16
Notice that the p-value is a one-tailed area in this case, since the distribution is non-negative.
There is also a built-in function in R that will perform a $\chi^2$-test.
    with(marital_summary, chisq.test(status_obs, p = marital_status_us))
    ##
    ##  Chi-squared test for given probabilities
    ##
    ## data:  status_obs
    ## X-squared = 79.154, df = 4, p-value = 2.631e-16
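For readers without the mosaic and openintro packages, the simulation and the parametric approximation can both be sketched in base R. This is an alternative to the pipeline in these notes, not a reproduction of it: it draws whole samples with rmultinom() instead of resampling rows, the seed and the number of simulations are arbitrary choices, and the observed counts are assumed to be the printed sample proportions times 500.

```r
set.seed(201)  # arbitrary seed, for reproducibility only

us <- c("Divorced" = 0.097, "Married" = 0.543,
        "Never married/single" = 0.271,
        "Separated" = 0.022, "Widowed" = 0.066)
us <- us / sum(us)

# Assumption: counts recovered from the printed sample proportions.
observed <- c(38, 206, 222, 3, 31)
n <- sum(observed)
expected <- us * n
X2_hat <- sum((observed - expected)^2 / expected)

# Each column of sims is one sample of n people drawn under the null.
sims <- rmultinom(10000, size = n, prob = us)
# expected is recycled down each column, so row i is compared to expected[i].
X2_sim <- colSums((sims - expected)^2 / expected)

# One-sided p-values: simulated and chi-squared approximation.
p_sim   <- mean(X2_sim >= X2_hat)
p_chisq <- pchisq(X2_hat, df = length(us) - 1, lower.tail = FALSE)
c(simulated = p_sim, chi_squared = p_chisq)
```

Because rmultinom() returns a count for every category, including counts of zero, each simulated statistic includes all five terms of the sum.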
What Can Go Wrong?
The condition that the expected count for each category is at least 5 is
important, because if that condition is not met, the $\chi^2$-distribution may not be a sufficiently good
approximation. Note that the deviation in each count is approximately normal, so the approximation
can fail for any of the outcomes.
    n <- 35
    sim <- do(1000) * marital_summary %>%
      mutate(status_exp = marital_status_us * n) %>%
      sample_n(size = n, replace = TRUE, weight = marital_status_us) %>%
      group_by(maritalStatus) %>%
      summarize(status_obs = n(), status_exp = first(status_exp)) %>%
      mutate(X2_i = (status_obs - status_exp)^2 / status_exp) %>%
      summarize(X2 = sum(X2_i))
    qplot(data = sim, x = X2, geom = "density") +
      stat_function(fun = dchisq, args = list(df = k - 1), color = "purple")
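[Figure: density of the simulated X2 values (x-axis: X2, y-axis: density), with the chi-squared density on k - 1 degrees of freedom overlaid in purple.]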
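The expected-count condition can be checked directly before running a test. The sketch below compares the expected counts for the original sample size of 500 with those for the small sample of 35 used in the simulation; the variable names are illustrative.

```r
us <- c("Divorced" = 0.097, "Married" = 0.543,
        "Never married/single" = 0.271,
        "Separated" = 0.022, "Widowed" = 0.066)
us <- us / sum(us)

expected_500 <- us * 500
expected_35  <- us * 35

round(expected_500, 2)
round(expected_35, 2)

# Which categories meet the "at least 5" condition in each case?
expected_500 >= 5
expected_35 >= 5

# Chance that a category does not appear at all in a sample of 35.
p_zero <- (1 - us)^35
round(p_zero, 3)
```

With n = 35, three of the five categories have expected counts below 5. The rarest category is also missing from a sizeable share of samples, and in a grouped summary such a category may produce no row at all, so its contribution would be left out of the simulated statistic.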
In-Class Exercise, OI, 3.40
Evolution vs. creationism. A Gallup Poll released in December
2010 asked 1019 adults living in the Continental U.S. about their belief in the origin of humans.
These results, along with results from a more comprehensive poll from 2001 (that we will assume to
be exactly accurate), are summarized in the table below:
                                                         Year
Response                                              2010   2001
Humans evolved, with God guiding (1)                  38%    37%
Humans evolved, but God had no part in process (2)    16%    12%
God created humans in present form (3)                40%    45%
Other / No opinion (4)                                6%     6%
1. Calculate the actual number of respondents in 2010 that fall in each response category.
2. State hypotheses for the following research question: have beliefs on the origin of human life
changed since 2001?
3. Calculate the expected number of respondents in each category under the condition that the
null hypothesis is true.
4. Conduct a chi-square test and state your conclusion. (Reminder: verify conditions.)
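The four steps above can be sketched in base R as follows. This is one possible setup, not an official solution: the observed counts are obtained by rounding the 2010 percentages times 1019, and the 2001 percentages are treated as the null proportions, as the exercise instructs.

```r
n <- 1019
pct_2010 <- c(0.38, 0.16, 0.40, 0.06)   # responses (1) to (4) in 2010
pct_2001 <- c(0.37, 0.12, 0.45, 0.06)   # null proportions from 2001

# Step 1: observed counts (rounded to whole respondents).
observed <- round(n * pct_2010)

# Step 3: expected counts under H0 (beliefs follow the 2001 distribution).
expected <- n * pct_2001

# Condition check: every expected count should be at least 5.
all(expected >= 5)

# Step 4: chi-squared goodness-of-fit test on k - 1 = 3 degrees of freedom.
result <- chisq.test(observed, p = pct_2001)
result
```

A small p-value here would suggest that the 2010 distribution of beliefs differs from the 2001 distribution.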