Chapter 10
Comparison of Two Independent Samples
Biologists are often interested in the comparison of groups. Consider the following examples. Do two different species of swallow produce similar eggs on average? Does a type of fertilizer produce larger plants on average, compared to another type of fertilizer? In this chapter, we introduce methods to compare two independent groups. We discuss how interval estimation and hypothesis testing can be used to infer whether there are differences between the two populations. We first discuss techniques to compare means, and end the chapter with techniques to compare proportions.
10.1 Study/Experimental Design
When analyzing data, it is important to consider the design of the study or experiment. This is especially true when comparing groups. The design of the study often dictates the probability model that will be used to describe the data collection process from the populations of interest. Only when the probability model is appropriate can we generalize our results from the samples to the populations.
Scientists often want to compare groups that are outcomes from a controlled experiment which is run under different experimental conditions. For example, a simple experiment might be designed to test a claim that a particular type of fertilizer produces taller plants compared to another type of fertilizer. The response variable in this instance is the height of the plants. The primary factor for this experiment is the fertilizer. The levels of the factor are called treatments. So the treatments in this case are the types of fertilizer. In a controlled experiment we assign the treatments to the experimental units, which could be plots with one seedling in this case. This assignment determines the treatment groups.
Balan, R., & Lamothe, G. (2017). Expect the unexpected : A first course in biostatistics (second edition). World
Scientific Publishing Company.
Created from ottawa on 2023-09-29 20:15:50.
Copyright © 2017. World Scientific Publishing Company. All rights reserved.
It is possible that there are uncontrolled factors that might affect the response variable. These are called nuisance factors. For example, the genetic predisposition of a seedling to produce a tall plant might be a nuisance factor. Randomization is used to average the effects of the nuisance factors over the different groups. We should randomly assign the types of fertilizer to the seedlings.
The purpose of a controlled experiment is to determine if there is a cause-and-effect relationship. In our case, this means determining whether the use of the new fertilizer produces taller plants on average. If the controlled experiment is randomized and the treatment groups are statistically significantly different, then we can be confident that there is indeed a cause-and-effect relationship.
One of the simplest experimental designs is called a completely randomized design. For completely randomized designs, the levels of the primary factor are randomly assigned to the experimental units. Our fertilizer experiment has such a design. The tools introduced in this chapter apply to experiments with a completely randomized design.
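The random assignment behind a completely randomized design can be sketched in a few lines of Python. This is only an illustration: the plot labels, the treatment names, the seed and the helper name `completely_randomized_assignment` are assumptions introduced here, not part of the fertilizer experiment described in the text.

```python
import random

def completely_randomized_assignment(units, treatments, seed=None):
    """Randomly split `units` into equal-sized groups, one per treatment.

    Hypothetical helper for illustration; assumes len(units) is a
    multiple of len(treatments).
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # every ordering of the units is equally likely
    group_size = len(shuffled) // len(treatments)
    return {
        treatment: sorted(shuffled[k * group_size:(k + 1) * group_size])
        for k, treatment in enumerate(treatments)
    }

plots = list(range(1, 17))  # sixteen plots, labelled 1..16 (assumed labels)
groups = completely_randomized_assignment(plots, ["old", "new"], seed=1)
for treatment, assigned in groups.items():
    print(treatment, assigned)
```

Shuffling the units and then cutting the shuffled list into equal pieces gives every unit the same chance of receiving each treatment, which is what spreads the nuisance factors evenly across the groups on average.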
In some circumstances, the distribution of the response variable can be highly spread-out. This variability might be due to nuisance factors. For example, females and males might react differently to a particular drug. This noise can be prohibitive, in the sense that we would need very large samples in order to identify significant treatment effects. To reduce this noise we can construct homogeneous subgroups, called blocks. The variance within each block should be smaller than the variance of the entire sample. So the estimates within the blocks should be more precise. As we combine the estimates across blocks, we should obtain an estimate of the treatment effect that is more precise than without blocking.
If we randomly assign all of the treatments to the experimental units within each block, then we say that the experiment has a randomized complete block design. As an example, if we want to compare a drug to a placebo and we believe that gender also has an effect on the response, we divide the subjects into blocks according to their gender. If we have ten subjects of each gender, we randomly assign the drug to five subjects of each gender. The remainder of the subjects are given the placebo. We do not discuss the analysis of block designs in this chapter.
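The drug/placebo example above can be sketched as follows. The subject labels, the seed and the helper name `randomized_block_assignment` are hypothetical; the only facts taken from the text are the two blocks of ten subjects and the five subjects per block who receive the drug.

```python
import random

def randomized_block_assignment(blocks, n_treated, seed=None):
    """Within each block, give the drug to `n_treated` randomly chosen
    subjects and the placebo to the rest (illustrative helper)."""
    rng = random.Random(seed)
    assignment = {}
    for subjects in blocks.values():
        treated = set(rng.sample(subjects, n_treated))
        for subject in subjects:
            assignment[subject] = "drug" if subject in treated else "placebo"
    return assignment

# Ten subjects of each gender; the labels F1..F10 and M1..M10 are assumptions.
blocks = {
    "female": [f"F{k}" for k in range(1, 11)],
    "male": [f"M{k}" for k in range(1, 11)],
}
assignment = randomized_block_assignment(blocks, n_treated=5, seed=7)
for block_name, subjects in blocks.items():
    on_drug = [s for s in subjects if assignment[s] == "drug"]
    print(block_name, "drug:", on_drug)
```

The randomization is carried out separately inside each block, so both treatments appear equally often within each gender.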
The techniques presented in this chapter do not apply only to completely
randomized experiments. They are also applicable in a non-experimental
setting. Consider the study [64], where the authors compare the breeding
biology of the Welcome Swallow in Australia and New Zealand. The factor
(in this case, the location) is not assigned to the unit of study (the bird). Such a study is called an observational study.
An observational study can identify associations, but not causality. We are not randomly assigning the treatments to the units of study. So there is a danger that any association that we find between the response and the factor may be due to some third variable, called a lurking variable, which is not evenly distributed among the groups. Maybe it is access to food that caused the difference in breeding biology, and not the location. So we should not say that it is the observational factor that caused the significant result. However, we can say that there is an association.
The techniques in this chapter can be used to compare samples from an
observational study as long as it is reasonable to assume that observations
within the samples are independent, and that there is independence between
the two samples.
10.2 Confidence Intervals and Tests for Means: Large Samples
In this section, we discuss techniques to compare the means of two independent populations, when both sample sizes are large. We use $X_1$ and $X_2$ to denote the random measurements from population 1 and population 2, respectively. Their means are denoted by $\mu_1 = E(X_1)$ and $\mu_2 = E(X_2)$, and their variances are denoted by $\sigma_1^2 = \mathrm{Var}(X_1)$ and $\sigma_2^2 = \mathrm{Var}(X_2)$.
We assume that we have a random sample of size $n_1 \geq 40$ from population 1, whose mean and variance are denoted by $\bar{X}_1$ and $S_1^2$, respectively. Similarly, we have a random sample of size $n_2 \geq 40$ from population 2, whose mean and variance are denoted by $\bar{X}_2$ and $S_2^2$, respectively. From Example 7.8, we know that
$$E(\bar{X}_1) = \mu_1, \quad \mathrm{Var}(\bar{X}_1) = \frac{\sigma_1^2}{n_1}, \quad E(\bar{X}_2) = \mu_2, \quad \mathrm{Var}(\bar{X}_2) = \frac{\sigma_2^2}{n_2}.$$
To compare the two means, we examine the difference in means $\mu_1 - \mu_2$. We begin the discussion with point estimation. A natural estimator of $\mu_1 - \mu_2$ is the difference in sample means $\bar{X}_1 - \bar{X}_2$. This estimator is unbiased since its expected value is
$$E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2.$$
The variance of the estimator is
$$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
(see Theorem 7.1). Similar to the estimation of the mean in the one-sample case, the larger the sample sizes, the more precise the estimate. Furthermore, as we standardize $\bar{X}_1 - \bar{X}_2$, we obtain that
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \quad \text{has approximately an } N(0, 1) \text{ distribution.}$$
When both sample sizes are large (i.e. $n_1 \geq 40$ and $n_2 \geq 40$), we can use the sample variances instead of the population variances. More precisely,
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \quad \text{has approximately an } N(0, 1) \text{ distribution.} \tag{10.1}$$
This approximation can be used even if the populations are not normally distributed. To justify it, recall that by Theorem 8.1, $\bar{X}_1$ has approximately an $N(\mu_1, S_1^2/n_1)$ distribution, and $\bar{X}_2$ has approximately an $N(\mu_2, S_2^2/n_2)$ distribution. Moreover, $\bar{X}_1$ and $\bar{X}_2$ are independent random variables, since the two populations are independent. By Theorem 7.2, it follows that $\bar{X}_1 - \bar{X}_2$ also has approximately a normal distribution, with mean $\mu_1 - \mu_2$ and variance $S_1^2/n_1 + S_2^2/n_2$. Relation (10.1) follows by standardization.
The theory that we present in this section is based upon the standardization (10.1). We emphasize that this standardization should be used only when both sample sizes are greater than or equal to 40.
We consider the inference concerning the difference $\mu_1 - \mu_2$. The null hypothesis is of the form $H_0: \mu_1 - \mu_2 = \delta_0$, where $\delta_0$ is a given numeric value. Note that when $\delta_0 = 0$, the null hypothesis becomes $H_0: \mu_1 - \mu_2 = 0$, or equivalently $H_0: \mu_1 = \mu_2$.
We use the following test statistic:
$$Z_0 = \frac{\bar{X}_1 - \bar{X}_2 - \delta_0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}. \tag{10.2}$$
If $H_0$ holds, then the sampling distribution of $Z_0$ is approximately standard normal. Hence we can use Table 18.3 to compute the corresponding $p$-value. Recall that the $p$-value is the probability of observing a value at least as extreme as the current observed value, under the assumption that the null hypothesis holds. Since our definition of an extreme value depends on the alternative hypothesis, the computation of the $p$-value depends on the alternative hypothesis.
Table 10.1 gives the $p$-value for testing the null hypothesis $H_0: \mu_1 - \mu_2 = \delta_0$ against one of the alternative hypotheses $H_1$. In this table, $Z$ has a standard normal distribution and
$$z_0 = \frac{\bar{x}_1 - \bar{x}_2 - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
is the observed value of the test statistic $Z_0$ given by (10.2). This will produce a large sample test for comparing the means.
Table 10.1  The $p$-value for comparison of two means: large samples

Alternative Hypothesis                  p-value
$H_1: \mu_1 - \mu_2 > \delta_0$         $P(Z > z_0)$
$H_1: \mu_1 - \mu_2 < \delta_0$         $P(Z < z_0)$
$H_1: \mu_1 - \mu_2 \neq \delta_0$      $2P(Z > |z_0|)$
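The three rows of Table 10.1 can be sketched as a small Python function. It uses `statistics.NormalDist` from the standard library in place of a printed normal table; the function name `z_test_p_value` and the labels `"greater"`, `"less"` and `"two-sided"` are naming assumptions made for this illustration.

```python
from statistics import NormalDist

def z_test_p_value(z0, alternative):
    """p-value of the large-sample test for an observed statistic z0.

    alternative: "greater"   for H1: mu1 - mu2 > delta0
                 "less"      for H1: mu1 - mu2 < delta0
                 "two-sided" for H1: mu1 - mu2 != delta0
    """
    Z = NormalDist()  # standard normal distribution
    if alternative == "greater":
        return 1 - Z.cdf(z0)             # P(Z > z0)
    if alternative == "less":
        return Z.cdf(z0)                 # P(Z < z0)
    if alternative == "two-sided":
        return 2 * (1 - Z.cdf(abs(z0)))  # 2 P(Z > |z0|)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

# An observed value of -2.08 with a two-sided alternative
print(z_test_p_value(-2.08, "two-sided"))
```

The software value can differ from a table-based value in the last digit, because the table rounds the cumulative probability to four decimals.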
The $p$-value is a measure of how much evidence we have against the null hypothesis. The smaller the $p$-value, the greater the inconsistency between the data and the null hypothesis. Actually, the $p$-value is the smallest level of significance at which the null hypothesis can be rejected with the given data. We will use the same rule as in Section 9.1:
if $p$-value $< \alpha$, then we reject $H_0$;
if $p$-value $\geq \alpha$, then we fail to reject $H_0$.
This rule ensures that the probability of type I error is approximately equal to $\alpha$. When the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ is rejected, it is often said that the difference between $\mu_1$ and $\mu_2$ is statistically significant.
The $p$-value is a valuable statistic that measures the risk associated with rejecting the null hypothesis. However, it does not give us the whole picture. Think of the hypothesis test as a diagnostic tool. We must assess its specificity and its sensitivity (often called power in the context of hypothesis testing). We can control its specificity (our chances of failing to reject $H_0$ when $H_0$ is true) with the use of a significance level. We can use a confidence interval to assess the sensitivity (our chances of rejecting $H_0$ when $H_1$ is true).
A confidence interval is also useful as a stand-alone tool if the goal is simply to estimate the difference in means. An (approximate) confidence interval for $\mu_1 - \mu_2$ at a level of confidence of $(1 - \alpha)100\%$ is
$$\bar{x}_1 - \bar{x}_2 \pm z \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},$$
where $z$ is a value such that $P(-z < Z < z) = 1 - \alpha$ and $Z$ follows a standard normal distribution. This means that $P(Z > z) = \alpha/2$, i.e. $z = z_{\alpha/2}$.
Regardless of whether the difference is found to be statistically significant or not, it is important to assess the sensitivity of the hypothesis test.
This will be demonstrated through the use of examples. To assess the sensitivity of the test we must first determine practical (biological or clinical) significance. As an example, consider the comparison of mean triglyceride levels for two groups. The researcher might decide that a difference in means of 5 mg/dl is not biologically important, but a difference of 20 mg/dl is important. Researchers determine practical importance using their good judgment and experience.
Suppose that we found a statistically significant difference in the mean triglyceride levels. The researcher produces a 95% confidence interval for the difference in means and finds that the difference in means is between 2.3 mg/dl and 4.7 mg/dl. The researcher concludes that the means are statistically different, but the difference is not biologically (or clinically) important. In this instance, the test is highly sensitive since it can detect differences in means which have no practical significance.
Now consider a scenario where the $p$-value is large, so we fail to reject the null hypothesis that the means are equal. The researcher produces a 95% confidence interval for the difference in means and finds that the difference in means is between $-2.5$ mg/dl and 24.1 mg/dl. The maximum error of the estimate is very large. Perhaps the failure to reject the null hypothesis was caused by an inadequate sample size. The test is not sensitive enough (that is, not powerful enough) to detect a difference of biological importance.
A large $p$-value should not automatically be interpreted as evidence in support of the null hypothesis, and a small $p$-value should not automatically be interpreted as evidence in support of practical significance. All biologists should be ultimately interested in biological importance, which may be assessed using confidence intervals.
Example 10.1. We want to compare the lipid content (% of weight) of the lake whitefish Coregonus clupeaformis in two large neighboring lakes. The focus of the study was on medium-sized fish, from 600 grams to 1,000 grams. We collected $n_1 = 175$ fish from lake 1 and $n_2 = 225$ fish from lake 2. The observed sample means and standard deviations are $\bar{x}_1 = 7.18$, $\bar{x}_2 = 7.31$, $s_1 = 0.55$ and $s_2 = 0.70$. We test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$. The observed value of the test statistic for this large sample test is
$$z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} = \frac{7.18 - 7.31}{\sqrt{(0.55)^2/175 + (0.70)^2/225}} = -2.08.$$
The $p$-value is (approximately) equal to $2P(Z > |z_0|) = 2P(Z > 2.08) = 2(1 - 0.9812) = 0.0376$. At a level of significance of $\alpha = 0.05$, we can reject
the hypothesis that the lake whitefish have equal mean lipid content in both
lakes.
Using their good judgment and experience, the researchers had determined beforehand that the absolute difference $|\mu_1 - \mu_2|$ would have to be at least 1 to be of biological importance. The biological significance cannot be determined from the $p$-value. We must analyze the error of the estimate of $\mu_1 - \mu_2$.
A point estimate for $\mu_1 - \mu_2$ is $\bar{x}_1 - \bar{x}_2 = -0.13$ and its estimated standard error is $\sqrt{s_1^2/n_1 + s_2^2/n_2} = 0.0625$. A 95% (approximate) confidence interval for $\mu_1 - \mu_2$ is
$$\bar{x}_1 - \bar{x}_2 \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = -0.13 \pm 0.1225 = [-0.25, -0.01].$$
We are 95% confident that $|\mu_1 - \mu_2| < 1$. The statistically significant difference between the means has no biological importance.
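The arithmetic of Example 10.1 can be checked with a short Python sketch that starts from the summary statistics quoted in the example. The variable names are assumptions, and the critical value comes from `statistics.NormalDist` rather than from a table, so the last digits may differ slightly from the rounded values in the text.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from Example 10.1 (lipid content, % of weight)
n1, n2 = 175, 225
xbar1, xbar2 = 7.18, 7.31
s1, s2 = 0.55, 0.70
delta0 = 0.0  # hypothesized difference under H0

# Estimated standard error of the difference in sample means
se = sqrt(s1**2 / n1 + s2**2 / n2)

# Observed value of the large-sample test statistic (10.2)
z0 = (xbar1 - xbar2 - delta0) / se

# Approximate 95% confidence interval for mu1 - mu2
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
lower = (xbar1 - xbar2) - z_crit * se
upper = (xbar1 - xbar2) + z_crit * se

print(f"se = {se:.4f}, z0 = {z0:.2f}")
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

Both endpoints are far inside the interval from $-1$ to 1, which is the basis for the conclusion that the difference has no biological importance.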
10.3 Confidence Intervals and Tests for Means: Small Samples
In this section, we consider the same problem of comparison of the means $\mu_1$ and $\mu_2$ of two independent populations, in the case of small samples. We use the same notation as in Section 10.2. In addition, we suppose that both $X_1$ and $X_2$ are normally distributed. Under this assumption, by Theorem 7.3, we know that $\bar{X}_1$ has an $N(\mu_1, \sigma_1^2/n_1)$ distribution, and $\bar{X}_2$ has an $N(\mu_2, \sigma_2^2/n_2)$ distribution. Therefore,
$$\bar{X}_1 - \bar{X}_2 \quad \text{has an } N\!\left(\mu_1 - \mu_2, \ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right) \text{ distribution.}$$
We consider two cases: (1) the population variances are equal; (2) the
population variances are not equal.
Case (1). Normal Populations with Equal Variances
In this case, the underlying assumptions of our model are independent normal populations with equal variances: $\sigma_1^2 = \sigma_2^2$. In addition, the sample sizes could be small. We denote the common variance by $\sigma^2$. With the added assumption of homogeneity of the variance, the standardization of the estimator $\bar{X}_1 - \bar{X}_2$ becomes
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sigma \sqrt{1/n_1 + 1/n_2}} \quad \text{has an } N(0, 1) \text{ distribution.}$$
Since $\sigma^2$ is unknown, we cannot base our inference on this statistic. Let $S_i^2$ denote the sample variance from population $i$, for $i = 1, 2$. Since $E(S_i^2) = \sigma_i^2 = \sigma^2$, both $S_1^2$ and $S_2^2$ are unbiased estimators of the common variance $\sigma^2$. We combine them to obtain a better estimator of $\sigma^2$. One possible combination is to take a weighted average of the variances with weights based on their respective degrees of freedom. This gives us an unbiased estimator of $\sigma^2$, known as the pooled sample variance:
$$S_p^2 = \frac{\nu_1}{\nu_1 + \nu_2} S_1^2 + \frac{\nu_2}{\nu_1 + \nu_2} S_2^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2},$$
where $\nu_i = n_i - 1$, for $i = 1, 2$. The pooled sample standard deviation is $S_p = \sqrt{S_p^2}$. As we replace $\sigma$ by $S_p$ in the standardization of $\bar{X}_1 - \bar{X}_2$, we get the following studentization:
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{1/n_1 + 1/n_2}} \quad \text{has a } T(n_1 + n_2 - 2) \text{ distribution.} \tag{10.3}$$
For testing $H_0: \mu_1 - \mu_2 = \delta_0$, we use the test statistic:
$$T_0 = \frac{\bar{X}_1 - \bar{X}_2 - \delta_0}{S_p \sqrt{1/n_1 + 1/n_2}}.$$
If $H_0$ is true, $T_0$ has a $T(n_1 + n_2 - 2)$ distribution. A hypothesis test based on this test statistic is called Student’s two-sample $t$-test. The $p$-value is given in Table 10.2, where
$$t_0 = \frac{\bar{x}_1 - \bar{x}_2 - \delta_0}{s_p \sqrt{1/n_1 + 1/n_2}}$$
is the observed value of the test statistic $T_0$, and $T$ has a $T(n_1 + n_2 - 2)$ distribution.
Table 10.2  The $p$-value for comparison of two means: $\sigma_1^2 = \sigma_2^2$

Alternative Hypothesis                  p-value
$H_1: \mu_1 - \mu_2 > \delta_0$         $P(T > t_0)$
$H_1: \mu_1 - \mu_2 < \delta_0$         $P(T < t_0)$
$H_1: \mu_1 - \mu_2 \neq \delta_0$      $2P(T > |t_0|)$
A $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is
$$\bar{x}_1 - \bar{x}_2 \pm t \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$
where $t$ is a value such that $P(-t < T < t) = 1 - \alpha$, and $T$ has a $T(n_1 + n_2 - 2)$ distribution. This means that $P(T > t) = \alpha/2$, i.e. $t = t_{\alpha/2, \, n_1 + n_2 - 2}$.
Example 10.2. An agriculture researcher wants to test the claim that on average, a new fertilizer yields taller plants at maturity. A completely randomized design is used to generate the data. Sixteen similar plots with one seedling (the experimental units) are randomly assigned to the treatments, which in this case are the new and the old fertilizer. A balanced design is used, i.e. both treatment groups are of equal size. The plants are measured at maturity (in cm). Here are the data:
Old Fertilizer    New Fertilizer
46.1              49.8
37.7              51.5
54.2              50.7
44.7              50.7
30.9              41.9
38.5              36.4
38.0              59.4
55.0              41.9

Summary Data
Group             Size         Mean                  Variance
Old fertilizer    $n_1 = 8$    $\bar{x}_1 = 43.14$   $s_1^2 = 71.65$
New fertilizer    $n_2 = 8$    $\bar{x}_2 = 47.79$   $s_2^2 = 52.66$
The researcher wants to test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 < 0$ using Student’s two-sample $t$-test.
Figure 10.1 gives an overlay of the normal probability plots for the two
samples.
There are no systematic tendencies away from the lines, hence
we do not have strong evidence against normality. Furthermore, the slopes
of the lines are similar. So it appears that the equal variance assumption
holds.
To further assess this underlying assumption, we can also do a
comparative box plot analysis (see Figure 10.1).
The first sample (old
fertilizer) appears to be slightly more spread out, but this difference in
variability is not striking.
We do not have strong evidence against the
equal variance assumption. It is reasonable to assume that the populations
are normal with equal variances.
The pooled sample variance and standard deviation are
$$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} = 62.155 \quad \text{and} \quad s_p = \sqrt{62.155} = 7.8838.$$
The observed test statistic is
$$t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}} = \frac{43.14 - 47.79}{7.8838 \sqrt{1/8 + 1/8}} = -1.18.$$
Fig. 10.1  Normal probability plots and comparative box plots for the plant heights
The $p$-value is $P(T < t_0) = P(T < -1.18) = P(T > 1.18)$, where $T$ has a $T(n_1 + n_2 - 2) = T(14)$ distribution. Referring to row $\nu = 14$ in Table 18.4, 1.18 falls between 0.692 and 1.345, which have areas to the right of 0.25 and 0.10. Thus, $0.10 < p\text{-value} < 0.25$. Using a statistical package, we see that $p$-value $= 0.129$.
At a significance level of $\alpha = 0.05$, we cannot reject $H_0$. The data do not appear to support the hypothesis that the use of the new fertilizer produces taller plants.
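For readers without access to a statistical package, the tail area $P(T > 1.18)$ for a $T(14)$ distribution can be approximated by numerically integrating the Student $t$ density. The sketch below uses Simpson's rule and only the Python standard library; the function names `t_pdf` and `t_upper_tail` and the number of integration steps are choices made for this illustration, and a statistical package would normally be used instead.

```python
import math

def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_upper_tail(t, nu, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    `steps` must be even. By symmetry P(T > 0) = 0.5, so the upper tail
    is 0.5 minus the area under the density between 0 and t.
    """
    h = t / steps
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 - total * h / 3

# One-sided p-value of Example 10.2: P(T < -1.18) = P(T > 1.18), T ~ T(14)
p_value = t_upper_tail(1.18, 14)
print(f"p-value is approximately {p_value:.3f}")
```

The same function can be used to check the table bounds quoted in the example: the areas to the right of 0.692 and 1.345 for 14 degrees of freedom come out close to 0.25 and 0.10.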
A 95% confidence interval for $\mu_1 - \mu_2$ is
$$\bar{x}_1 - \bar{x}_2 \pm t \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = -4.65 \pm 8.4554 = [-13.11, 3.81],$$
where $t = t_{0.025, 14} = 2.145$. We are 95% confident that the difference in means is between $-13.11$ cm and 3.81 cm. We are highly confident that the absolute difference in means is not larger than 14 cm. However, we cannot say the same about 5 cm, since $-5$ lies in the confidence interval.
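The computations of Example 10.2 can be reproduced from the raw data with the Python standard library. This is a sketch rather than the authors' procedure: the variable names are assumptions, and the critical value 2.145 is copied from the text instead of being computed.

```python
from math import sqrt
from statistics import mean, variance

# Plant heights at maturity (cm), from the data table of Example 10.2
old = [46.1, 37.7, 54.2, 44.7, 30.9, 38.5, 38.0, 55.0]
new = [49.8, 51.5, 50.7, 50.7, 41.9, 36.4, 59.4, 41.9]

n1, n2 = len(old), len(new)
s1_sq, s2_sq = variance(old), variance(new)  # sample variances (divisor n - 1)

# Pooled sample variance: weights are the degrees of freedom n_i - 1
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
se = sqrt(sp_sq) * sqrt(1 / n1 + 1 / n2)

# Observed value of Student's two-sample t statistic (delta0 = 0)
diff = mean(old) - mean(new)
t0 = diff / se

# 95% confidence interval, with t_{0.025,14} = 2.145 taken from the text
t_crit = 2.145
lower, upper = diff - t_crit * se, diff + t_crit * se

print(f"s_p^2 = {sp_sq:.3f}, t0 = {t0:.2f}")
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

Working from the unrounded means and variances gives values that agree with the text up to rounding.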
In the next example, we see that we can sometimes use a log-transformation to satisfy the underlying conditions to use Student’s two-sample $t$-test.
Example 10.3. Dichloromethane is a volatile liquid that is widely used as a solvent. A chemical engineer wants to compare the dichloromethane concentration at two water treatment plants near industrial facilities. She
suspects that the distributions of the dichloromethane concentration are
skewed to the right due to occasional higher discharges from the industrial
facilities. She verifies her hunch with histograms (see Figure 10.2).
She decides to apply a log transformation, that is, the new measurements are read in ln(μg/L). The normal probability plots for the data in the original scale and the log scale are given in Figure 10.3. It is evident from the normal probability plots that the data in the original scale are not normal, and furthermore, it appears that the variances are not equal. However, the log data appear to be normal and the variances appear to be equal since the lines in the probability plots are nearly parallel. It is safe to assume that the log concentrations from the two plants follow normal distributions with equal variances.

Fig. 10.2  Histograms for the dichloromethane concentrations from plants 1 and 2
Fig. 10.3  Normal probability plots for the concentrations and log-concentrations
To compare the dichloromethane concentration at the two plants, the chemical engineer tests $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$, where $\mu_i$ is the mean of the log concentrations from plant $i$, for $i = 1, 2$.
The summary data for the log concentrations are

Plant    $n$    $\bar{x}$    $s^2$
1        25     2.934        1.162
2        25     2.664        1.209
The pooled sample variance is
$$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} = 1.1855.$$
The observed value of Student’s two-sample $t$-test statistic is
$$t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}} = 0.88.$$
The $p$-value is $2P(T > |t_0|) = 2P(T > 0.88)$, where $T$ follows a $T(n_1 + n_2 - 2) = T(48)$ distribution. We cannot find the range of the $p$-value using Table 18.4, since this table does not include the row $\nu = 48$. We can approximate the $p$-value using the row $\nu = \infty$. The approximate interval is $0.2 < p\text{-value} < 0.5$. Using statistical software, the chemical engineer computed $p$-value $= 0.385$. Since the $p$-value is large, we should not reject the hypothesis that the mean log-concentrations are the same. The data provide no evidence that the means of the log concentrations differ.
In Example 10.3, we transformed the data using a logarithm. We did this because Student’s two-sample $t$-test requires that the populations follow a normal distribution. After inspecting the transformed data, the samples appeared to come from normal populations with equal variances, thus we could safely compare the means of the transformed data with Student’s two-sample $t$-test.
Note that, when comparing the means of the log transformed data, we
are actually comparing the geometric means of the data on the original
scale. To clarify the distinction between the mean and the geometric mean
(for the population or the sample), we introduce the following definition.
Definition 10.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from a population represented by the random variable $X$. The geometric mean of the population is $G = e^{\mu}$, where $\mu = E(\ln X)$. An estimate for $G$ is the (sample) geometric mean defined by
$$g = e^{(1/n) \sum_{i=1}^{n} \ln x_i} = \left( \prod_{i=1}^{n} x_i \right)^{1/n},$$
where $x_1, \ldots, x_n$ are the observed values of the random sample $X_1, \ldots, X_n$.
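The two equivalent expressions for the sample geometric mean in Definition 10.1 can be compared numerically. The sketch below uses a small made-up sample (not data from the book) and hypothetical function names; `statistics.geometric_mean` from the Python standard library is shown as a cross-check.

```python
from math import exp, log, prod, isclose
from statistics import geometric_mean

def geometric_mean_via_logs(xs):
    """g = exp of the average of the log observations."""
    return exp(sum(log(x) for x in xs) / len(xs))

def geometric_mean_via_product(xs):
    """g = n-th root of the product of the observations."""
    return prod(xs) ** (1 / len(xs))

sample = [1.0, 4.0, 16.0]  # illustrative positive values, not from the text
g_logs = geometric_mean_via_logs(sample)
g_prod = geometric_mean_via_product(sample)

print(g_logs, g_prod, geometric_mean(sample))
print(isclose(g_logs, g_prod))
```

The log form is usually preferred in practice, since the product of many observations can overflow or underflow in floating-point arithmetic.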
Example 10.3 (continued). We construct a 95% confidence interval for the difference in means of the log-concentrations for the data from Example 10.3. Since it is reasonable to assume that the two populations of log-concentrations are independent and normally distributed with equal variances, a 95% confidence interval for $\mu_1 - \mu_2$ is
$$\bar{x}_1 - \bar{x}_2 \pm t \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 0.270 \pm 0.6193 = [-0.349, 0.889],$$
where $t = 2.011$ satisfies $P(-t < T < t) = 0.95$ and $T$ follows a $T(48)$ distribution. (The value of $t = 2.011$ was obtained using a statistical package.) We are 95% confident that $\mu_1 - \mu_2$ is between $-0.349$ and 0.889 (in ln(μg/L)). Since 0 lies within the confidence interval, the means of the log-concentrations do not appear to be different.
We denote by $G_i$ the geometric mean of the population $i$, consisting of the dichloromethane concentrations (in μg/L) from plant $i$, for $i = 1, 2$. Note that $G_i = e^{\mu_i}$, where $\mu_i$ is the mean of the log-concentration from plant $i$, for $i = 1, 2$. Exponentiating the difference in means gives us the ratio of the geometric means, that is $e^{\mu_1 - \mu_2} = e^{\mu_1}/e^{\mu_2} = G_1/G_2$. Since we are 95% confident that $-0.349 < \mu_1 - \mu_2 < 0.889$, then we are also 95% confident that $0.71 = e^{-0.349} < G_1/G_2 < e^{0.889} = 2.43$. Since 1 lies within the interval, there appears to be no difference between the geometric means of the concentrations.
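The back-transformation from the log scale to a ratio of geometric means can be sketched in Python from the summary data of Example 10.3. The variable names are assumptions, and the critical value $t = 2.011$ is copied from the text rather than computed.

```python
from math import exp, sqrt

# Summary data for the log concentrations (Example 10.3)
n1 = n2 = 25
xbar1, xbar2 = 2.934, 2.664
s1_sq, s2_sq = 1.162, 1.209
t_crit = 2.011  # t_{0.025,48}, value quoted in the text

# Pooled variance and standard error of the difference in means
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
se = sqrt(sp_sq * (1 / n1 + 1 / n2))

# 95% confidence interval for mu1 - mu2 on the log scale
diff = xbar1 - xbar2
log_ci = (diff - t_crit * se, diff + t_crit * se)

# Exponentiating the endpoints gives an interval for G1 / G2
ratio_ci = (exp(log_ci[0]), exp(log_ci[1]))

print(f"log scale: [{log_ci[0]:.3f}, {log_ci[1]:.3f}]")
print(f"ratio of geometric means: [{ratio_ci[0]:.2f}, {ratio_ci[1]:.2f}]")
```

Because the exponential function is increasing, exponentiating both endpoints preserves the confidence level; the reference value for "no difference" moves from 0 on the log scale to 1 on the ratio scale.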
Case (2). Normal Populations with Unequal Variances
The assumption of equality of the two variances is sometimes not reasonable. So we should try to adapt our techniques to the case of unequal variances: $\sigma_1^2 \neq \sigma_2^2$. This is known as the Behrens-Fisher problem. There are exact solutions to the Behrens-Fisher problem (see [21]). These solutions are beyond the scope of this book. We present an approximate solution.
In 1938, Welch [71] proposed an approximate solution to the Behrens-Fisher problem. Welch argued that the inference concerning $\mu_1 - \mu_2$ for
two independent normal populations can be based on
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}, \tag{10.4}$$
which follows approximately a $T$ distribution with $\nu$ degrees of freedom, where
$$\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1)}. \tag{10.5}$$
$\nu$ is called Welch’s number of degrees of freedom.
It follows that we can construct the following approximate $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$:
$$\bar{x}_1 - \bar{x}_2 \pm t \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},$$
where $P(-t \leq T \leq t) = 1 - \alpha$ and $T$ has a $T(\nu)$ distribution. Note that $t = t_{\alpha/2, \nu}$ since $P(T > t) = \alpha/2$.
To use the tables of the $T$ distribution, the number of degrees of freedom
must be an integer, so we round $\nu$ down to the nearest integer. Rounding
down is the conservative choice.
For instance, if $\nu = 7.8$, we need to decide between $\nu = 7$ and
$\nu = 8$. Since the value $t_{0.025}(7) = 2.365$ is greater than
$t_{0.025}(8) = 2.306$, the 95% confidence interval based on the $T$
distribution with $\nu = 7$ degrees of freedom will be wider than the 95%
confidence interval based on the $T$ distribution with $\nu = 8$ degrees of
freedom. Hence, the narrower interval (based on $\nu = 8$) may not actually
contain the value $\mu_1 - \mu_2$. By working with a wider interval, we
reduce the risk that the interval does not contain the value
$\mu_1 - \mu_2$.
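The effect of rounding $\nu$ down can be illustrated numerically in R. This is a sketch under stated assumptions: the helper name welch_df and the summary statistics passed to it are hypothetical, chosen only to illustrate (10.5); the critical values for 7 and 8 degrees of freedom are the ones quoted in the text.

```r
# Welch's number of degrees of freedom, equation (10.5).
# welch_df is a hypothetical helper name; s1sq and s2sq are sample variances.
welch_df <- function(s1sq, n1, s2sq, n2) {
  v1 <- s1sq / n1
  v2 <- s2sq / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Hypothetical summary statistics, for illustration only.
nu <- welch_df(s1sq = 4.0, n1 = 10, s2sq = 9.0, n2 = 8)
nu_down <- floor(nu)   # conservative choice: round down

# Critical values t_{0.025} for the rounded-down df and for one more df.
t_down <- qt(0.975, df = nu_down)
t_up   <- qt(0.975, df = nu_down + 1)

# Critical values discussed in the text for 7 and 8 degrees of freedom.
t7 <- qt(0.975, df = 7)
t8 <- qt(0.975, df = 8)
print(round(c(t7 = t7, t8 = t8), 3))
```

Fewer degrees of freedom always give a larger critical value, so the interval built with the rounded-down value is at least as wide.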
To test $H_0: \mu_1 - \mu_2 = \delta_0$, we use the test statistic
$$T_0 = \frac{\bar{X}_1 - \bar{X}_2 - \delta_0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}.$$
A test based on this test statistic is often called Welch’s approximate
two-sample $t$-test. This test is sometimes also called the
Welch–Satterthwaite $t$-test or the Satterthwaite $t$-test. The $p$-value of
this test is given in Table 10.3, where $t_0$ is the observed value of
$T_0$, $T$ has a $T(\nu)$ distribution, and $\nu$ is given in (10.5).
Welch’s method is not exact, but is generally a good approximation.
However, if the population variances are equal, or if the sample sizes are
rather small and the population variances can be assumed to be approximately
equal, it is more accurate to use Student’s two-sample $t$-test.
Furthermore, when the population variances are equal, Student’s two-sample
$t$-test is more powerful.
Table 10.3 The $p$-value for comparison of two means: $\sigma_1^2 \neq \sigma_2^2$
Alternative Hypothesis                 $p$-value
$H_1: \mu_1 - \mu_2 > \delta_0$        $P(T > t_0)$
$H_1: \mu_1 - \mu_2 < \delta_0$        $P(T < t_0)$
$H_1: \mu_1 - \mu_2 \neq \delta_0$     $2P(T > |t_0|)$
Example 10.4. An ornithologist wants to compare the breeding biology of
two different species of swallows. In particular, she wants to compare the
average egg mass (in grams). Here are the summary data:
             Sample Size   Mean    Var.
Species 1    18            1.872   0.264
Species 2    12            2.783   2.060

             Min.    Q1      Median   Q3      Max.
Species 1    0.900   1.400   1.900    2.300   2.800
Species 2    0.400   1.250   3.300    3.800   4.700
She wants to test $H_0: \mu_1 - \mu_2 = 0$ against
$H_1: \mu_1 - \mu_2 \neq 0$, where $\mu_i$ is the mean egg mass (in grams)
for species $i$, for $i = 1, 2$, with a two-sample $t$ test. To verify the
underlying assumptions of the test, she produced normal probability plots
and comparative box plots (see Figure 10.4).
There are no systematic tendencies away from the normal probability plot
lines, hence we do not have strong evidence against normality. However, the
slopes of the lines are different. So it appears that the equal variance
assumption may not hold. To further assess the underlying assumptions, we
look at the comparative box plots. The egg masses for the second species
are more spread out. It might not be sensible to assume that the population
variances are equal.
She decides to use Welch’s approximate two-sample $t$-test. The observed
value of the test statistic is
$$t_0 = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} = -2.11.$$
The $p$-value is $2P(T > |t_0|) = 2P(T > 2.11)$, where $T$ has an
approximate $T(\nu)$ distribution with the following number of degrees of
freedom:
$$\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1)} = 12.89.$$
We round down the number of degrees of freedom to the nearest integer,
that is $\nu = 12$. Referring to row $\nu = 12$ in Table 18.4, 2.11 falls
between 1.782 and 2.179, which have areas to the right of 0.05 and 0.025,
respectively. Thus, $0.05 < p\text{-value} < 0.10$. The $p$-value computed
with a statistical package is 0.056. At a level of significance of
$\alpha = 0.10$, we reject the null hypothesis and conclude that the mean
egg masses of the two species are different.
Fig. 10.4
Normal probability plots and comparative box plots for the egg masses
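The computations of Example 10.4 can be retraced in R from the summary statistics alone. The sketch below is ours, not the book's: the variable names are assumptions, and because it starts from the rounded summary values, the last digits may differ slightly from those printed in the example.

```r
# Summary statistics from Example 10.4 (egg mass in grams).
n1 <- 18; xbar1 <- 1.872; s1sq <- 0.264   # species 1
n2 <- 12; xbar2 <- 2.783; s2sq <- 2.060   # species 2

# Estimated variances of the two sample means.
v1 <- s1sq / n1
v2 <- s2sq / n2

# Observed value of Welch's test statistic (delta0 = 0).
t0 <- (xbar1 - xbar2) / sqrt(v1 + v2)

# Welch's number of degrees of freedom (10.5), rounded down as in the text.
nu <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
nu_down <- floor(nu)

# Two-sided p-value, 2 P(T > |t0|), using the rounded-down df.
p_value <- 2 * pt(abs(t0), df = nu_down, lower.tail = FALSE)

print(round(c(t0 = t0, nu = nu, p_value = p_value), 3))
```

Statistical packages typically keep the fractional value of $\nu$ instead of rounding it down, so the $p$-value they report can differ slightly from the one based on $\nu = 12$.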
Technology Component using R: Assume that the data for the two
populations are saved in the numerical vectors x1 and x2, respectively.
• To produce the overlayed QQ-plots for x1 (in blue) and x2 (in red),
together with the fitted lines, we use:
lmts <- range(x1, x2)                    # common y-axis limits for both samples
qqnorm(x1, ylim = lmts, col = "blue")    # QQ-plot for x1
abline(mean(x1), sd(x1), col = "blue")   # fitted line y = mean + sd * z
par(new = TRUE)                          # draw the next plot on top of this one
qqnorm(x2, ylim = lmts, col = "red")     # QQ-plot for x2
abline(mean(x2), sd(x2), col = "red")    # fitted line for x2
par(new = FALSE)                         # restore the default behaviour
Remark: The above procedure gives the plot of the pairs $(z_i, y_i)$ with
the fitted line of equation $y = \hat{\mu} + \hat{\sigma} z$ with
$\hat{\mu} = \bar{x}$ and $\hat{\sigma} = s$, for each of the two variables.
This procedure is used for verifying the assumption that the two
populations are normally distributed with equal variances. We say that the
two populations are normally distributed if both plots seem to be linear.
We say that the two populations have equal variances if the two lines seem
to be parallel.
• To produce side-by-side boxplots, we use:
boxplot(x1,x2)
Remark: If you assigned the data to a dataframe (for example, using the
function read.table()), refer to the last item of the Technology component
at the end of Section 7.3 to see how to produce side-by-side boxplots and
overlayed normal probability plots in the same graphics window.
• To test the hypothesis $H_0: \mu_1 = \mu_2$ against
$\mu_1 \neq \mu_2$ and calculate a 95% confidence interval for
$\mu_1 - \mu_2$ when the two populations are normally distributed with
equal variances, we use:
t.test(x1,x2,var.equal=TRUE)
Remark: In the case of normal populations with unequal variances, we use
the same command as above, but without including var.equal=TRUE. To change
the confidence level to 98% (or any other value), we use:
t.test(x1,x2,conf.level=0.98,var.equal=TRUE)
• To test the hypothesis $H_0: \mu_1 = \mu_2$ against $\mu_1 > \mu_2$
when the two populations are normally distributed with equal variances, we
use:
t.test(x1,x2,alternative="greater",var.equal=TRUE)
Remark: This procedure also produces a one-sided confidence interval, which
is not discussed in this book. In the case of normal populations with
unequal variances, we use the same command as above, but without including
var.equal=TRUE.
• To test the hypothesis $H_0: \mu_1 = \mu_2$ against $\mu_1 < \mu_2$
when the two populations are normally distributed with equal variances, we
use:
t.test(x1,x2,alternative="less",var.equal=TRUE)
Remark: This procedure also produces a one-sided confidence interval, which
is not discussed in this book. In the case of normal populations with
unequal variances, we use the same command as above, but without including
var.equal=TRUE.
• If you assigned the data to a dataframe (for example with the function
read.table()), we use:
t.test(y~x, data=data)
where in the dataframe data, we have a numerical vector y, and a
categorical vector x identifying the two groups. You should also use the
arguments var.equal and alternative as above.
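The commands listed above can be tried on simulated data. This is a minimal sketch, not part of the book: the samples are generated with rnorm() purely for illustration, and the names dat, welch, student and welch_formula are ours.

```r
# Simulated data, for illustration only (not from the book).
set.seed(1)
x1 <- rnorm(18, mean = 2, sd = 0.5)
x2 <- rnorm(12, mean = 3, sd = 1.5)

# Welch's test (the default) and Student's pooled-variance test.
welch   <- t.test(x1, x2)
student <- t.test(x1, x2, var.equal = TRUE)

# Same comparison with the data stored in a data frame:
# y holds the measurements, x identifies the two groups.
dat <- data.frame(
  y = c(x1, x2),
  x = factor(rep(c("g1", "g2"), times = c(length(x1), length(x2))))
)
welch_formula <- t.test(y ~ x, data = dat)

# Degrees of freedom used by each test.
print(c(welch = unname(welch$parameter), student = unname(student$parameter)))
```

Student's test always uses $n_1 + n_2 - 2$ degrees of freedom, whereas Welch's test uses the (generally fractional) value from (10.5).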
10.4
Confidence Intervals and Tests for Proportions
To compare two proportions $p_1$ and $p_2$ from two independent
populations, we discuss inferences concerning the difference $p_1 - p_2$.
We begin by discussing the point estimation of the difference in
proportions. We follow the discussion with interval estimation and
hypothesis testing.
Consider two independent binomial experiments. The probability of success
for the $i$-th experiment is $p_i$ and the number of successes is a random
measurement denoted by $Y_i$, for $i = 1, 2$. The numbers of observations
per experiment are $n_1$ and $n_2$, respectively. The respective sample
proportions are $\hat{p}_1 = Y_1/n_1$ and $\hat{p}_2 = Y_2/n_2$. A natural
estimator for $p_1 - p_2$ is $\hat{p}_1 - \hat{p}_2$. The estimator is
unbiased since the expected value of the estimator is
$$E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2.$$
The variance of this estimator is equal to
$$\mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2) = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}.$$
Similar to the estimation of the difference in means, the larger the
samples, the more precise the estimate. Assuming that both samples are
large, as we standardize $\hat{p}_1 - \hat{p}_2$, we obtain that
(approximately)
$$\frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2}} \text{ has an } N(0, 1) \text{ distribution.} \qquad (10.6)$$
As in the one sample case, the latter standardization cannot be used
directly since the variance is unknown (it involves the true proportions
$p_1$ and $p_2$). However if we use the estimated variance, it can be shown
that (approximately)
$$\frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}} \text{ has an } N(0, 1) \text{ distribution,} \qquad (10.7)$$
when $n_1$ and $n_2$ are large. What are large sample sizes? This is not
an easy question to answer. A common rule of thumb is to not use the
latter normal approximation when the observed number of successes or the
observed number of failures in either one of the groups is less than 5.
Using (10.7), we construct the (approximate) confidence interval for
$p_1 - p_2$ at a level of confidence of $(1 - \alpha)\,100\%$. This
interval is:
$$\hat{p}_1 - \hat{p}_2 \pm z \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}},$$
where $z$ is the value such that $P(-z < Z < z) = 1 - \alpha$, and $Z$
follows a standard normal distribution.
In practice, we usually want to compare our data against a model with
equal proportions. In other words, we would like to test the null hypothesis
$H_0: p_1 - p_2 = 0$ (or equivalently $H_0: p_1 = p_2$) against an
appropriate alternative hypothesis. Assuming that $H_0$ holds, then the
probability of success is the same for all trials in both experiments. This
common probability is $p = p_1 = p_2$. If this is the case, we can consider
the $n = n_1 + n_2$ observations as a sample from a binomial distribution
with $n$ trials and probability $p$ of success. The corresponding sample
proportion (called the pooled sample proportion) is
$$\hat{p} = \frac{Y_1 + Y_2}{n} = \frac{n_1}{n}\,\hat{p}_1 + \frac{n_2}{n}\,\hat{p}_2.$$
Note that the pooled sample proportion is a weighted average of the
respective sample proportions, where the weights are the relative sample
sizes.
Assuming that $H_0$ is true (i.e. $p_1 = p_2$) the standardization in
(10.6) becomes
$$\frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{p(1 - p)}\,\sqrt{1/n_1 + 1/n_2}}.$$
Using $\hat{p}$ instead of $p$, we get the following test statistic:
$$Z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})}\,\sqrt{1/n_1 + 1/n_2}}.$$
Since the $p$-value is the probability of observing a value as extreme as
$z_0$ (which is the observed value of the test statistic) in the direction
of the alternative hypothesis, this hypothesis must be taken into
consideration when computing the $p$-value. We usually want to test the
null hypothesis of equality against one of the following three alternative
forms. Table 10.4 gives the corresponding $p$-value in the three cases.
Here $Z$ has approximately a standard normal distribution. The test is a
large sample test.
Table 10.4 The $p$-value for the comparison of two proportions
Alternative Hypothesis      $p$-value
$H_1: p_1 - p_2 > 0$        $P(Z > z_0)$
$H_1: p_1 - p_2 < 0$        $P(Z < z_0)$
$H_1: p_1 - p_2 \neq 0$     $2P(Z > |z_0|)$
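The test statistic $Z_0$ and the three $p$-values of Table 10.4 can be sketched in R as follows. This is an illustration under stated assumptions: two_prop_z is a hypothetical helper name (not a base R function), and the counts used in the calls are made up for the example.

```r
# Large-sample z test for H0: p1 = p2, following Table 10.4.
# two_prop_z is a hypothetical helper name, not a base R function.
two_prop_z <- function(y1, n1, y2, n2,
                       alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  # Rule of thumb from the text: at least 5 successes and 5 failures per group.
  if (min(y1, n1 - y1, y2, n2 - y2) < 5) {
    warning("fewer than 5 successes or failures in a group")
  }
  p1 <- y1 / n1
  p2 <- y2 / n2
  p_pool <- (y1 + y2) / (n1 + n2)   # pooled sample proportion
  z0 <- (p1 - p2) / (sqrt(p_pool * (1 - p_pool)) * sqrt(1 / n1 + 1 / n2))
  p_value <- switch(alternative,
    greater   = pnorm(z0, lower.tail = FALSE),          # P(Z > z0)
    less      = pnorm(z0),                              # P(Z < z0)
    two.sided = 2 * pnorm(abs(z0), lower.tail = FALSE)  # 2 P(Z > |z0|)
  )
  list(z0 = z0, p_value = p_value)
}

# Hypothetical counts, for illustration only.
res_two  <- two_prop_z(30, 100, 45, 100)
res_less <- two_prop_z(30, 100, 45, 100, alternative = "less")
print(res_two)
```

When $z_0$ is negative, as here, the two-sided $p$-value is exactly twice the left-tailed one.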
Example 10.5. Refer to Example 3.7. We denote by $p_1$ and $p_2$ the
proportions of recaptured moths in the light-colored population and the
dark-colored population, respectively. Among the $n_1 = 137$ light-colored
moths, $y_1 = 18$ were recaptured, whereas among the $n_2 = 493$
dark-colored moths, $y_2 = 131$ were recaptured. The proportions of
recaptured moths are: $\hat{p}_1 = 0.131$ for the light-colored moths, and
$\hat{p}_2 = 0.266$ for the dark-colored moths. Is there a statistically
significant difference between the proportions of recaptured moths, at a
level of significance of $\alpha = 0.05$? If so, we wish to investigate the
biological (practical) significance.
We assume that the samples are independent. Since both sample sizes are
large and the observed number of successes and of failures are not too
small, we can safely perform a large sample test. To test
$H_0: p_1 - p_2 = 0$ against $H_1: p_1 - p_2 \neq 0$, we compute the test
statistic:
$$z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})}\,\sqrt{1/n_1 + 1/n_2}} = \frac{0.131 - 0.266}{\sqrt{(0.2365)(1 - 0.2365)}\,\sqrt{1/137 + 1/493}} = -3.29,$$
where the pooled sample proportion is
$$\hat{p} = \frac{y_1 + y_2}{n_1 + n_2} = \frac{18 + 131}{137 + 493} = 0.2365.$$
The $p$-value is the probability of observing a difference in proportions
as extreme as $\hat{p}_1 - \hat{p}_2 = -0.135$, under the assumption that
both proportions are equal. This is approximately equal to
$2P(Z > |z_0|) = 2P(Z > 3.29)$, where $Z$ follows approximately a standard
normal distribution. Using Table 18.3, we can argue that the $p$-value is
0.001. Since the $p$-value is less than $\alpha = 0.05$, there is a
statistically significant difference between the proportions.
To investigate the biological significance, we construct a 95% confidence
interval for $p_1 - p_2$:
$$\hat{p}_1 - \hat{p}_2 \pm 1.96 \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = -0.135 \pm 0.0687 = [-0.204, -0.066].$$
We are 95% confident that the difference in proportions $p_2 - p_1$ is
between 6.6% and 20.4%. Recall that biological significance cannot be
determined by a test. Only by using their good judgement and experience can
scientists determine what is biologically significant. In this instance, if
we assume that an absolute difference in proportions of at least 5% is
biologically significant, then our findings are significant.
Kettlewell hypothesized that a larger proportion of the dark-colored moths
will be recaptured. We compute the corresponding $p$-value to test his
claim. We want to test $H_0: p_1 - p_2 = 0$ against $H_1: p_1 - p_2 < 0$.
The observed value of the test statistic is $z_0 = -3.29$. The $p$-value
for this left-tailed test is approximately equal to
$P(Z < z_0) = P(Z < -3.29)$, where $Z$ has a standard normal distribution.
Using Table 18.2, we can argue that the $p$-value is 0.0005.
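The calculations of Example 10.5 can be retraced in R from the counts. This is a sketch with variable names of our own choosing. Because it keeps the unrounded sample proportions, it gives a test statistic of about $-3.27$ rather than the $-3.29$ obtained above from the rounded values 0.131 and 0.266; the conclusions are unchanged. The call to prop.test() with correct = FALSE is included only as a cross-check, since its chi-squared statistic equals $z_0^2$.

```r
# Counts from Example 10.5 (recaptured moths).
y <- c(light = 18, dark = 131)    # numbers recaptured
n <- c(light = 137, dark = 493)   # sample sizes

p_hat  <- y / n                   # sample proportions
p_pool <- sum(y) / sum(n)         # pooled sample proportion

# Large-sample test statistic based on the pooled proportion.
z0 <- unname((p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n)))

p_two_sided <- 2 * pnorm(abs(z0), lower.tail = FALSE)  # H1: p1 != p2
p_left      <- pnorm(z0)                               # H1: p1 <  p2

# 95% confidence interval for p1 - p2 (unpooled standard error).
margin <- qnorm(0.975) * sqrt(sum(p_hat * (1 - p_hat) / n))
ci <- unname(p_hat[1] - p_hat[2]) + c(-1, 1) * margin

# Cross-check: without continuity correction, prop.test reports z0^2.
chk <- prop.test(y, n, correct = FALSE)

print(round(c(z0 = z0, p_two_sided = p_two_sided, p_left = p_left), 4))
print(round(ci, 3))
```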
10.5
Problems
Problem 10.1.
It is believed that nutritional deprivation affects various
components of the immune system, such as the tuberculin skin reactivity.
In the study [58], a sample of 8 Sprague-Dawley male rats were fed with a
normal diet consisting of 18% protein. A state of malnutrition was induced
in another sample of 8 rats, which were fed with a diet consisting of only
5% protein. After 4 weeks, the rats were given an intradermal injection of
25 μg of purified protein derivative of tuberculin. The following table gives
the skin reactivity diameter of erythema and induration (in mm) for the
two groups of rats.
18% Protein Diet   5% Protein Diet
13.3               5.1
16.3               8.7
9.9                8.7
9.3                8.5
16.1               8.1
9.7                6.9
9.7                6.9
14.1               12.3
(a) Using statistical software, verify the assumption that the two
populations are normal with equal variances.
(b) Test the hypothesis $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 > \mu_2$,
where $\mu_1$ is the average level of tuberculin reactivity in the rats
with a normal diet, and $\mu_2$ is the average level of tuberculin
reactivity in the malnourished rats. State your conclusion.
(c) Construct a 95% confidence interval for $\mu_1 - \mu_2$. Can we say
that the skin reactivity diameter in the malnourished rats is at least 7 mm
smaller than in the control group?
Problem 10.2.
A study was conducted to see if vitamin D and calcium
supplementation has any effect on the risk of breast cancer (see [14]). In
this study, 36,282 women were randomly assigned to two groups. The first
group consisting of 18,176 women took a supplement of 1,000 mg of calcium
with 400 IU of vitamin D daily.
The second group consisting of 18,106
women was the placebo group. Both groups were followed-up for a period
of 7 years.
At the end of this period, it was found that 528 patients in
the first group and 546 patients in the second group have developed breast
cancer.
Find a 90% confidence interval for the difference $p_1 - p_2$, where $p_1$
denotes the proportion of women with breast cancer among those who take a
daily calcium-vitamin D supplement and $p_2$ is the proportion of women
with breast cancer in the general population. Using this interval, can we
say that calcium and vitamin D supplementation decreases or increases the
risk of breast cancer?
Problem 10.3.
It is claimed that the supplementation with Coenzyme
Q10 (CoQ10) during pregnancy reduces the rate of pre-eclampsia, or preg-
nancy induced hypertension (see [65]). 235 pregnant women at risk of pre-
eclampsia were randomly divided into two groups. The first group of 118
women received 200 mg of CoQ10 daily from the 20th week of pregnancy
until delivery. The other group of 117 women received a placebo. 17 women
in the CoQ10 group developed pre-eclampsia, compared with 30 women in
the placebo group.
Can we conclude that supplementation with CoQ10 reduces the risk of
developing pre-eclampsia? Justify your conclusion using a test of
hypothesis at significance level $\alpha = 0.05$, and a 95% confidence
interval.
Problem 10.4.
We continue with the situation in Problem 8.8. Assume that the two sample
sizes are $n_1 = 19$ and $n_2 = 12$ and the two sample variances are
$s_1^2 = 0.81$ and $s_2^2 = 0.49$. Is there enough evidence that families
from culled populations have a lower bunching intensity than families from
non-culled populations? Use a test of hypothesis at level $\alpha = 0.005$.
Suppose that the two populations are normally distributed with equal
variances.
Problem 10.5.
Rhodamine 6G (R6G) is a fluorochrome mitochondrial dye with potential use
for cancer treatment. One of the objectives of the study [24] was to show
that the administration of R6G during a period of hypoglycemia reduces the
growth rate of the Walker 256 tumor. A group of $n_1 = 7$ rats underwent
implantation of 100 mg of viable fragments of Walker 256 carcinosarcoma,
and after 48 hours they were administered R6G. The animals were fasted for
24 hours prior to the drug administration and 8 hours after. After a week,
the tumors were weighed yielding a sample average and a sample standard
deviation $\bar{x}_1 = 3.6$ g and $s_1 = 0.3$ g. A control group of
$n_2 = 7$ rats which received the same tumor transplant had the sample
average and sample standard deviation $\bar{x}_2 = 7.1$ g and
$s_2 = 0.7$ g. Can we conclude that the administration of R6G reduces the
tumor growth rate? Justify your answer using a test of hypothesis and a
95% confidence interval. Assume that the two populations have normal
distributions with equal variances.
Problem 10.6.
Nurses interested in the effect of prenatal care divided 18 expectant
mothers into two groups of size 9. Group 1 received prenatal consultations,
while those in group 2 received no prenatal consultations. The summary
statistics on birth weight are $\bar{x}_1 = 99.6$ ounces and $s_1 = 6.82$
ounces for group 1, and $\bar{x}_2 = 85.3$ ounces and $s_2 = 16.75$ ounces
for group 2. Construct a 95% confidence interval for $\mu_1 - \mu_2$, where
$\mu_1$ denotes the average birth weight for babies whose mothers received
prenatal consultations, and $\mu_2$ denotes the average birth weight for
babies whose mothers received no prenatal consultations. Using this
interval, can we conclude that babies whose mothers did not receive
prenatal consultations have a smaller weight at birth? Assume that the two
populations are normal with unequal variances.
Problem 10.7.
Recent studies have shown that exercise is one of the most
efficient ways of increasing the release of the growth hormone in children
and teenagers.
However, when exercise is combined with L-arginine sup-
plementation, children seem to grow less. The height increase (in cm) in
one year was recorded for two samples of 14-year old boys. The boys in the
first group participated in a physical activity for at least 3 hours a week.
The boys in the second group participated in the same activities, and had
a supplementation of L-arginine included in their diet. The following table
gives the summary of the data:
Group                     Size         Mean                 Standard Deviation
Exercise                  $n_1 = 50$   $\bar{x}_1 = 23.5$   $s_1 = 5.6$
Exercise and L-arginine   $n_2 = 60$   $\bar{x}_2 = 21.4$   $s_2 = 6.9$
Use a large sample test to check if there is enough evidence that the
L-arginine supplementation slows down the release of the growth hormone,
when compared to exercise alone. Use the level $\alpha = 0.05$.
Problem 10.8.
Measles is among the world’s most contagious diseases,
which can cause severe complications and even death. This disease is easily
preventable through vaccination.
Herd immunity occurs when the vacci-
nation of a significant portion of the population provides protection even
to the non-vaccinated individuals.
For measles, it is estimated that this
portion should be at least 83%. During a measles outbreak in sub-Saharan
Africa, in a sample of 989 children from a country in which the measles
vaccination rate was higher than 83%, 43 became infected with measles,
while in a sample of 845 children from a neighboring country in which the
measles vaccination rate was lower than 83%, 148 became infected with
measles. Using this data, can we conclude that measles vaccination is ef-
fective for lowering the infection rate?
Use a test of hypotheses of level $\alpha = 0.005$.
Hint: Compare the proportion $p_1$ of infected children in the country in
which the vaccination rate was higher than 83% with the proportion $p_2$ of
infected children in the country in which the vaccination rate was lower
than 83%.
Problem 10.9.
A pH level of the soil between 5.3 and 6.5 is optimal for
strawberries. To measure the pH level, a field is divided into two lots. In
each lot, we randomly select 20 samples of soil. The data are below.
Lot 1   5.66 5.73 5.68 5.77 5.73 5.71 5.68 5.58 6.11 5.37
        5.67 5.53 5.59 5.94 5.84 5.53 5.64 5.73 5.30 5.65
Lot 2   5.25 6.73 6.25 5.21 5.63 6.41 5.89 6.76 5.13 5.64
        5.94 6.16 5.64 6.54 5.79 5.91 6.17 6.90 5.76 6.07
(a) Using statistical software, verify the assumption that the two
populations are normally distributed.
(b) Using statistical software, assess the assumption that the two
populations have equal variances.
(c) Test the hypothesis $H_0: \mu_1 = \mu_2$ versus
$H_1: \mu_1 \neq \mu_2$, where $\mu_1$ is the mean pH level of the soil in
lot 1, and $\mu_2$ is the mean pH level of the soil in lot 2. State your
conclusion. Use the level $\alpha = 0.05$.
Problem 10.10.
The table below gives the size of human groups involved
in bear-human interactions at a particular park.
The interactions were
classified according to the behavior of the bear.
Behavior             Inquisitive         Avoidance
Mean                 $\bar{x}_1 = 3.5$   $\bar{x}_2 = 2.4$
Standard Deviation   $s_1 = 5.2$         $s_2 = 2.3$
Sample Size          $n_1 = 65$          $n_2 = 55$
Can we conclude that the mean sizes of the human groups involved in bear
interactions are different according to the behavior of the bear? Use the
level $\alpha = 0.05$. Which test did you use to compare the two means?
Problem 10.11.
In a particular park it is believed that the type of behavior observed
during human-bear interactions depends on the type of location. In the
front country, among 109 human-bear interactions, 35 involved a neutral or
an avoidance behavior. In the back country, among 83 human-bear
interactions, 69 involved a neutral or an avoidance behavior. Can we
conclude that the proportion of human-bear interactions that are classified
as a neutral or an avoidance behavior is larger in the back country
compared to the front country? Use the level $\alpha = 0.05$.
Problem 10.12.
A botanist is testing a new tomato fertilizer. He grew two different groups
of 8 plants each, using the standard fertilizer for the first group, and
the new fertilizer for the second group. After 70 days, he measured the
tomato yield (in kg) for each plant. The data are given in the table below:
                     Standard                          New
Plant                Fertilizer   Plant                Fertilizer
1                    4.76         1                    5.60
2                    4.25         2                    4.98
3                    3.98         3                    5.12
4                    3.44         4                    3.86
5                    3.87         5                    4.56
6                    4.78         6                    5.76
7                    3.99         7                    4.21
8                    3.21         8                    4.05
Mean                 4.035        Mean                 4.767
Standard Deviation   0.56         Standard Deviation   0.71
Find a 95% confidence interval for $\mu_1 - \mu_2$, where $\mu_1$ is the
average tomato yield per plant using the standard fertilizer, and $\mu_2$
is the average tomato yield per plant using the new fertilizer. Interpret
the result.
Problem 10.13.
The Nobel Prize in Chemistry in 1937 was divided between Norman Haworth
for his work on carbohydrates and vitamin C, and Paul Karrer for his work
on carotenoids, flavins and vitamins A and B2. Vitamin C is an ascorbic
acid with antioxidant properties. A study is undertaken to compare the
amount of ascorbic acid (in mg) in two popular brands of vitamin C (labeled
as 100 mg). The summary of the data follows:
                     Brand 1             Brand 2
Mean                 $\bar{x}_1 = 118$   $\bar{x}_2 = 122$
Standard Deviation   $s_1 = 1.2$         $s_2 = 1.75$
Number of Tablets    $n_1 = 15$          $n_2 = 15$
Assume that the amount of ascorbic acid in a tablet is normally distributed,
and the variance of this amount is the same for the two brands.
(a) Compute the pooled standard deviation for the two samples.
(b) Give the range of the $p$-value of Student’s two-sample $t$-test to
compare the mean amount of ascorbic acid per tablet for the two brands.
What can we conclude? (Use a two sided test of level $\alpha = 0.01$.)
(c) Construct a 95% confidence interval for $\mu_1 - \mu_2$, where $\mu_1$
is the mean amount of ascorbic acid per tablet for brand 1, and $\mu_2$ is
the mean amount of ascorbic acid per tablet for brand 2.
Problem 10.14.
We want to compare the density of organisms (in number of organisms per
square meter) at two different locations along a river. Below are the
descriptive statistics for two samples of size 12 each, taken from the two
locations.
             Mean       Standard Deviation
Location 1   9,168.75   3,700.57
Location 2   2,168.33   815.26
Assume that the samples are selected from independent normal populations
with unequal variances. Can we conclude that the mean densities of
organisms at the two locations are different? Use the level
$\alpha = 0.05$.
Problem 10.15.
We want to compare the germination rate of a new strain
of a plant against an old strain of the same plant. Below are the data.
             Germinated   Did Not Germinate   Total
Old Strain   125          15                  140
New Strain   152          8                   160
(a) Can we conclude that the germination rates differ? Use the level
$\alpha = 0.05$.
(b) Construct a 98% confidence interval for the difference between the
germination rates.
Problem 10.16.
Consider a study comparing two medications for severe bladder infections.
The variable $x$ is the length of time (in days) to recovery. For the
$n_1 = 15$ patients who were given medication 1, we observed a mean
recovery time of $\bar{x}_1 = 16.87$ days. The mean recovery time was
$\bar{x}_2 = 19.09$ days for the $n_2 = 18$ patients who were given
medication 2.
(a) Here are overlayed quantile-quantile plots for the two samples of
recovery times. Is it reasonable to assume that both populations of
recovery times are normally distributed with equal variances?
(b) Based on the following R output, compute the value of the pooled
standard deviation sₚ.

> t.test(x1,x2,var.equal=TRUE)

        Two Sample t-test

data:  x1 and x2
t = -5.174, df = 31, p-value = 1.304e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.105940 -1.349615
sample estimates:
mean of x mean of y
 16.86667  19.09444
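For part (b), one way to recover the pooled standard deviation from the printed output is to work backwards from the t statistic. In the equal-variance test, t equals the difference in sample means divided by sₚ times the square root of 1/n₁ + 1/n₂, so sₚ can be solved for directly. The sketch below is an illustration in Python (the book's output is from R); the function name is hypothetical, and the result carries the rounding of the printed values.

```python
import math


def pooled_sd_from_t(mean1, mean2, t_stat, n1, n2):
    """Solve t = (mean1 - mean2) / (s_p * sqrt(1/n1 + 1/n2)) for s_p."""
    se = (mean1 - mean2) / t_stat  # standard error of the difference in means
    return se / math.sqrt(1 / n1 + 1 / n2)


# Values read from the R output; n1 = 15 and n2 = 18 come from the problem statement.
s_p = pooled_sd_from_t(16.86667, 19.09444, -5.174, 15, 18)
print(f"s_p = {s_p:.3f}")
```

The sample sizes agree with the reported degrees of freedom, since 15 + 18 - 2 = 31.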
(c) Based on the R output in (b), give a 95% confidence interval for the
difference between the mean recovery time on medication 1 and the mean
recovery time on medication 2.
(d) Based on the confidence interval from (c), which medication is better?
Did you know? In 1923, the Nobel Prize committee credited the
practical extraction of insulin to a team at the University of Toronto, and
awarded the Nobel Prize in Physiology/Medicine to Frederick Banting and
John James Rickard Macleod for the discovery of insulin. Banting shared
his prize with his assistant Charles Best, who was chosen on a flip of a coin
to help him carry out the lab work in the summer of 1921. Macleod shared
his prize with the biochemist James Collip, who helped to purify the extracts
from ox pancreas. The first injection of insulin was given at the Toronto
General Hospital to a 14-year-old dying diabetic patient in January 1922.
The patent for insulin was sold to the University of Toronto for one dollar.