Chapter 13
Regression and Correlation
Biologists are often interested in the relationship between two variables. We
learn in this chapter to describe the relationship between two quantitative
variables with a correlation analysis. We also learn to describe one of the
variables as a linear function of the other variable. This is called a regression
analysis.
13.1 Sample Covariance and Correlation
In this section, we introduce some techniques that describe the association
between two quantitative variables. We consider two examples. In
Example 13.1, we describe the association between the heights of mothers and
daughters. This is an example of a positive linear association, where the
heights of the daughters tend to increase as the heights of the mothers
increase. In Example 13.2, we examine the relationship between the number
of colds and vitamin C. This is an example of a negative linear association:
as the dosage of vitamin C increases, the number of colds tends to decrease
on average.
Consider $n$ paired observations $(x_i, y_i)$, for $i = 1, \ldots, n$, from a
pair $(X, Y)$ of random variables. We can use a scatter plot to describe the
association between $x$ and $y$. In Figure 13.1, we have an illustration of
linear associations. For each scatter plot, we display a horizontal line at
$\bar{y}$ and a vertical line at $\bar{x}$. These lines define four quadrants.
If there is a positive linear association between $X$ and $Y$, then most of
the points are going to lie in quadrants I and III, where
$(x_i - \bar{x})(y_i - \bar{y})$ is positive. For a negative association, most
of the points are going to lie in quadrants II and IV, where
$(x_i - \bar{x})(y_i - \bar{y})$ is negative.
Balan, R., & Lamothe, G. (2017). Expect the unexpected : A first course in biostatistics (second edition). World
Scientific Publishing Company.
Created from ottawa on 2023-12-03 18:33:31.
Copyright © 2017. World Scientific Publishing Company. All rights reserved.
To describe the linear association between the two variables, we can use
the sample covariance
\[
\widehat{\mathrm{cov}}_{xy}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
  = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - (1/n)\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n - 1}.
\]
It will be positive for positive linear associations and it will be negative
for negative linear associations. So the covariance captures the sign (also
called the direction) of a linear association.
Fig. 13.1 An illustration of linear associations
We now define a statistic which is based on the covariance. The sample
correlation is
\[
r_{xy} = \frac{\widehat{\mathrm{cov}}_{xy}}{s_x s_y},
\]
where $s_x$ and $s_y$ are the respective sample standard deviations. The
sample correlation is also called Pearson’s correlation, or the product-moment
correlation. The sample correlation satisfies the following properties, which
justify its suitability as a descriptive measure of the intensity of the linear
association:
• It is invariant to linear scaling. In other words, the correlation remains
the same regardless of whether we measure height in millimeters, centimeters
or meters.
• It has the same sign as the covariance, so it is negative for negative
linear associations and positive for positive linear associations.
• A correlation is always between $-1$ and $1$. It is equal to $1$ or $-1$
if and only if the points $(x_1, y_1), \ldots, (x_n, y_n)$ fall exactly on a
line. Furthermore, if there is no association between $X$ and $Y$, then the
correlation should be near 0.
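The properties listed above can be checked numerically. The following is a minimal R sketch, assuming a small set of made-up heights (the values in `x_cm` and `y_cm` are hypothetical and do not come from the text). It compares the correlation across units of measurement and evaluates it for points that lie exactly on a decreasing line.

```r
# Hypothetical paired heights in centimetres (illustrative values only).
x_cm <- c(150, 155, 160, 165, 170, 175)
y_cm <- c(152, 151, 163, 160, 172, 171)

r_cm <- cor(x_cm, y_cm)

# The same measurements expressed in millimetres and in metres.
r_mm <- cor(10 * x_cm, 10 * y_cm)
r_m  <- cor(x_cm / 100, y_cm / 100)

# Points lying exactly on a line with negative slope.
y_line <- 300 - 0.8 * x_cm
r_line <- cor(x_cm, y_line)

print(c(cm = r_cm, mm = r_mm, m = r_m, exact_line = r_line))
```

The first three printed values should agree up to rounding, and the last should equal -1 up to rounding, in line with the first and third properties.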
If the relationship between $X$ and $Y$ is linear, then we interpret this
relationship as being stronger as $r$ approaches $1$ or $-1$ and as being
weaker as $r$ approaches 0. If the relationship between $X$ and $Y$ is not
linear, then the sample correlation is more difficult to interpret. In
Example 13.3, we have two variables that are strongly related, but the
correlation is near zero.
Example 13.1. The data below give the heights (in cm) for a sample of
$n = 12$ pairs of mother and daughter.

Height (in cm)
Daughter   160   165   156   169   152   156
Mother     163   165   162   161   161   160
Daughter   162   156   161   160   164   162
Mother     164   159   164   161   163   168

Figure 13.2 gives the scatter plot of the height $Y$ of the daughter against
the height $X$ of the mother. There appears to be a positive linear
association between the two variables. The sample covariance is
$\widehat{\mathrm{cov}}_{xy} = 4.9318$ and the respective standard deviations
are $s_x = 2.4664$ and $s_y = 4.6928$. The sample correlation between the
heights of the daughters and the mothers is equal to
$r_{xy} = \widehat{\mathrm{cov}}_{xy} / (s_x s_y) = 0.426$. The intensity of
the linear association between the heights of the mother and the daughter is
moderately weak.
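The values reported in Example 13.1 can be reproduced in R. The sketch below assumes that each mother is paired with the daughter listed directly above her in the table; the vector names `mother` and `daughter` are chosen here for illustration.

```r
# Heights (in cm) from Example 13.1, paired column by column.
mother   <- c(163, 165, 162, 161, 161, 160, 164, 159, 164, 161, 163, 168)
daughter <- c(160, 165, 156, 169, 152, 156, 162, 156, 161, 160, 164, 162)

cov_xy <- cov(mother, daughter)   # sample covariance (divides by n - 1)
s_x    <- sd(mother)              # sample standard deviation of x
s_y    <- sd(daughter)            # sample standard deviation of y
r_xy   <- cov_xy / (s_x * s_y)    # correlation from its definition

print(round(c(cov_xy = cov_xy, s_x = s_x, s_y = s_y, r_xy = r_xy), 4))
```

Computing `cor(mother, daughter)` directly should give the same value as `r_xy`.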
Fig. 13.2 Scatter plot for pairs of mother and daughter
Example 13.2. Consider an experiment where different daily dosages of
vitamin C (in mg) were randomly assigned to subjects. For each subject,
we count the number of times that the person contracted the common cold
over a period of three years. Here are the data:

Dosage (in mg)   Number of Colds
 0               12   10   10   15   14
15               14    7    8   11    9
30               10   12    9    8   11
50                7   10    8    4    6

Figure 13.3 gives the scatter plot of the number $Y$ of colds against the daily
dosage $X$ of vitamin C. There appears to be a negative linear association
between $X$ and $Y$. The sample covariance is
$\widehat{\mathrm{cov}}_{xy} = -34.0132$ and the respective standard deviations
are $s_x = 18.9789$ and $s_y = 2.8074$. The sample correlation between the two
variables is equal to $r_{xy} = -0.638$. The intensity of the linear
association between the number of colds and the dosage of vitamin C is
moderately strong.
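As a check on Example 13.2, the following R sketch recomputes the summary statistics. It assumes that the data consist of four dosage groups (0, 15, 30 and 50 mg) with five subjects each, which is how the table is read here; the vector names are illustrative.

```r
# Example 13.2 data, assuming four dosage groups of five subjects each.
dosage <- rep(c(0, 15, 30, 50), each = 5)
colds  <- c(12, 10, 10, 15, 14,   # 0 mg
            14,  7,  8, 11,  9,   # 15 mg
            10, 12,  9,  8, 11,   # 30 mg
             7, 10,  8,  4,  6)   # 50 mg

cov_xy <- cov(dosage, colds)
s_x    <- sd(dosage)
s_y    <- sd(colds)
r_xy   <- cor(dosage, colds)

print(round(c(cov_xy = cov_xy, s_x = s_x, s_y = s_y, r_xy = r_xy), 4))
```

Under this reading of the table, the output should agree with the covariance, standard deviations and correlation quoted in the example.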
Fig. 13.3 Scatter plot: number of colds against vitamin C
It is recommended that you always produce a scatter plot. The scatter plot
is a useful diagnostic tool. It allows us to verify the underlying
assumption of linearity between $y$ and $x$, as seen in the following example.
Example 13.3. To investigate the effect of a particular stimulant on
reaction times, the researchers randomly assigned a dosage of the stimulant
to the subjects. The treatment groups are 0 mg, 10 mg, 20 mg, 30 mg, 40 mg
and 50 mg. Each group contains 10 subjects. The response $y$ is the
reaction time (in seconds) and the predictor $x$ is the assigned dosage (in
mg). The correlation is $r_{xy} = -0.1673$. Since the slope is negative, it
appears that on average an increase in the dosage will decrease the reaction
time. Furthermore, if we assume that the association is linear, then the
association is weak (since $r_{xy}$ is near zero).
The investigators were prudent and were not ready to conclude that the
stimulant has little to no effect on the reaction times. They produced the
scatter plot of the pairs $(x_i, y_i)$ (see Figure 13.4), and noticed that the
relationship between $y$ and $x$ does not appear to be linear. They assessed
that the correlation does not adequately measure the strength of the
association in this case.
In fact, using techniques that are outside the scope of this book, it can
be shown that there is a strong association between the reaction times and
the dosage of the stimulant. It is just that this association is not linear.
For a fixed dosage of the stimulant, the reaction time does not vary a lot.
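The phenomenon in Example 13.3 can be illustrated with a small R sketch. The data of the example are not listed in the text, so the code below uses a hypothetical, noise-free U-shaped curve over the same six dosage groups; the curve and its coefficients are assumptions made purely for illustration.

```r
# Six dosage groups of 10 subjects each, as in Example 13.3.
dose <- rep(c(0, 10, 20, 30, 40, 50), each = 10)

# Hypothetical reaction times: an exact U-shaped function of the dosage,
# symmetric about 25 mg (assumed for illustration, not the book's data).
time <- 0.5 + 0.0004 * (dose - 25)^2

# The response is completely determined by the dosage, yet the linear
# correlation is zero up to rounding error.
r_linear <- cor(dose, time)

# Correlating with the squared distance from 25 mg recovers the relationship.
r_curved <- cor((dose - 25)^2, time)

print(c(r_linear = r_linear, r_curved = r_curved))
```

Here the association is as strong as it can be, but the correlation between `dose` and `time` does not detect it because the association is not linear.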
Fig. 13.4 The least squares line of reaction time
We end this section with a short discussion on causation. Scientists
generally want to establish a causal relationship, i.e. a relationship in
which the response (or effect) is a consequence of a cause (or causal factor).
The scientific method can be used to establish a cause-and-effect
relationship. The method involves performing experiments in which we can
control the cause and the possible causal factors, and observe a significant
effect. However, in biology and medicine, it is often unethical to assign a
factor to a unit. For example, it is unethical to ask someone to smoke two
cigarettes a day. Nevertheless, there are acceptable methods that can be used
to distinguish causal from noncausal associations. Refer to Chapter 2 in [57]
for a discussion on establishing causality in the context of epidemiology.
An aspect that is important for a causal association is the strength of
association. If there is a causal relationship, then there is an association
between the cause and effect. Therefore, a strong correlation between two
variables can hint at the existence of a causal relationship. But a large
correlation alone is not proof of causation.
Let us consider an example. Say we select a few communities at random,
and we measure the correlation between the number of bananas consumed
in a month per capita and the prevalence of a disease. Say the confidence
interval for the correlation is $[-0.9; -0.85]$. We have observed a strong
correlation between the two variables. Does this mean that eating more
bananas causes the risk of developing the disease to decrease? It is doubtful.
In this case, it is likely that there are lurking variables (such as a healthy
lifestyle) that are causes of both eating more bananas and the decreased
risk of disease.
A significant correlation between two variables is not sufficient evidence
for a cause-and-effect relationship; however, it does hint at the possibility
of the existence of such a relationship. A significant correlation between two
variables is evidence of an association between the two variables.
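The role of a lurking variable in the banana example can be mimicked with simulated data. The R sketch below is purely hypothetical: the variable `lifestyle`, the coefficients and the noise levels are all assumptions, and neither `bananas` nor `disease` has any direct effect on the other in the simulation.

```r
set.seed(1)
n <- 2000

# Hypothetical lurking variable: a "healthy lifestyle" score.
lifestyle <- rnorm(n)

# Both variables depend on lifestyle only; neither depends on the other.
bananas <- 2 + 0.8 * lifestyle + rnorm(n, sd = 0.5)
disease <- 5 - 0.8 * lifestyle + rnorm(n, sd = 0.5)

# Raw correlation: strongly negative, although there is no causal link.
r_raw <- cor(bananas, disease)

# Correlation left over once the linear effect of lifestyle is removed
# from both variables (correlation between the two sets of residuals).
r_adj <- cor(resid(lm(bananas ~ lifestyle)), resid(lm(disease ~ lifestyle)))

print(c(r_raw = r_raw, r_adj = r_adj))
```

In this simulation the raw correlation is clearly negative, while the correlation that remains after accounting for the lurking variable is close to zero.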
Technology Component using R: Suppose that x and y are numerical
vectors of equal length.
• To compute the covariance between the two variables, we use
cov(x,y)
• To compute the correlation between the two variables, we use
cor(x,y)
13.2 Least Squares Line
In this section, we begin by describing the association between a variable
$y$ (also called the response) and a variable $x$ (also called the predictor)
with a line of best fit. We assume that we have a random sample of paired
observations $(x_i, y_i)$ for $i = 1, \ldots, n$.
Example 13.4 (Part 1). Consider the data from Example 13.2. The predictor
variable $x$ is the dosage of vitamin C and the response variable $y$ is the
number of colds. For these data, the line of best fit is
$\hat{y} = 12.0 - 0.0944x$, which is overlaid on the scatter plot in
Figure 13.5.
We can use the line to estimate the mean number of colds in three years
for a dosage of 35 mg:
\[
\hat{\mu}_{Y|x=35} = 12.0 - 0.0944\,(35) = 8.696.
\]
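The fitted line and the estimate at 35 mg in Example 13.4 can be obtained with `lm` and `predict`. This is a sketch under the assumption that the Example 13.2 data form four dosage groups (0, 15, 30 and 50 mg) of five subjects each; the object names are illustrative.

```r
# Example 13.2 data, assuming four dosage groups of five subjects each.
dosage <- rep(c(0, 15, 30, 50), each = 5)
colds  <- c(12, 10, 10, 15, 14,
            14,  7,  8, 11,  9,
            10, 12,  9,  8, 11,
             7, 10,  8,  4,  6)

# Least squares line of the number of colds against the dosage.
fit <- lm(colds ~ dosage)
alpha_hat <- coef(fit)[["(Intercept)"]]
beta_hat  <- coef(fit)[["dosage"]]

# Estimated mean number of colds for a dosage of 35 mg.
mu_35 <- as.numeric(predict(fit, newdata = data.frame(dosage = 35)))

print(round(c(intercept = alpha_hat, slope = beta_hat, mu_35 = mu_35), 4))
```

Because `predict` uses the unrounded coefficients, it returns about 8.69, which differs slightly from the value 8.696 obtained with the rounded coefficients 12.0 and 0.0944.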
Fig. 13.5 Least squares line for the number of colds against dosage of vitamin C
To find the line of best fit, denoted by $\hat{y} = \hat{\alpha} + \hat{\beta} x$,
we will define what we mean by “best”. Consider the $i$-th case $(x_i, y_i)$.
The corresponding fitted value $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ is
the evaluation of the estimated line at $x = x_i$. The difference between the
$i$-th observed response and the $i$-th fitted value is called the $i$-th
residual $e_i = y_i - \hat{y}_i$. A residual is sometimes called an
observed error. The sum of the squared residuals,
\[
L = \sum_{i=1}^{n} \left[ y_i - (\hat{\alpha} + \hat{\beta} x_i) \right]^2,
\]
is used as a measure of fit. In some sense, $L$ represents a distance between
the observed responses and the estimated line. We say that the line of best
fit is the line that minimizes $L$. This criterion of fit was independently
proposed around the turn of the 19th century by the German mathematician
Carl Friedrich Gauss and by the French mathematician Adrien-Marie Legendre.
It is known as the method of least-squares.
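To make the criterion $L$ concrete, the R sketch below evaluates it for the mother-daughter data of Example 13.1, first at the least squares estimates and then at two slightly different lines. The function name `L` mirrors the notation of the text; the perturbation sizes are arbitrary choices for illustration.

```r
# Heights (in cm) from Example 13.1.
mother   <- c(163, 165, 162, 161, 161, 160, 164, 159, 164, 161, 163, 168)
daughter <- c(160, 165, 156, 169, 152, 156, 162, 156, 161, 160, 164, 162)

# Sum of squared residuals for a candidate intercept a and slope b.
L <- function(a, b) sum((daughter - (a + b * mother))^2)

# Least squares estimates computed by lm().
fit   <- lm(daughter ~ mother)
a_hat <- coef(fit)[["(Intercept)"]]
b_hat <- coef(fit)[["mother"]]

L_min   <- L(a_hat, b_hat)          # criterion at the least squares line
L_shift <- L(a_hat + 1, b_hat)      # intercept moved by 1 cm
L_tilt  <- L(a_hat, b_hat + 0.01)   # slope changed by 0.01

print(c(L_min = L_min, L_shift = L_shift, L_tilt = L_tilt))
```

Any line other than the least squares line gives a larger value of the criterion, which is what the two perturbed lines show.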
The minimum of the least-squares criterion $L$ can be found by
differentiating it with respect to $\hat{\alpha}$ and $\hat{\beta}$ and by
setting these partial derivatives equal to zero. We obtain a system of two
equations in $\hat{\alpha}$ and $\hat{\beta}$ that we need to solve. After
some simplification, these equations can be shown to be
\[
\sum_{i=1}^{n} y_i = n \hat{\alpha} + \hat{\beta} \sum_{i=1}^{n} x_i
\quad \text{and} \quad
\sum_{i=1}^{n} x_i y_i = \hat{\alpha} \sum_{i=1}^{n} x_i + \hat{\beta} \sum_{i=1}^{n} x_i^2.
\tag{13.1}
\]
These equations are called the normal equations. As we isolate $\hat{\alpha}$
in the first equation and substitute it in the second equation to obtain
$\hat{\beta}$, we get the least-squares estimates of the intercept
\[
\hat{\alpha} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}\, \frac{\sum_{i=1}^{n} x_i}{n},
\tag{13.2}
\]
and of the slope
\[
\hat{\beta}
  = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - (1/n)\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
         {\left(\sum_{i=1}^{n} x_i^2\right) - (1/n)\left(\sum_{i=1}^{n} x_i\right)^2}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\tag{13.3}
\]
All the quantities involved in this solution should seem familiar. Actually,
the slope of the least-squares line has a few other useful representations,
such as
\[
\hat{\beta}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}
  = \frac{\widehat{\mathrm{cov}}_{xy}}{s_x^2}
  = r_{xy}\, \frac{s_y}{s_x},
\]
where $\bar{x}$ and $\bar{y}$ are respectively the sample means of the
predictors and the responses, $s_x^2$ is the sample variance of the predictors,
$\widehat{\mathrm{cov}}_{xy}$ is the sample covariance between $x$ and $y$, and
$r_{xy}$ is the sample correlation between $x$ and $y$. Note that the slope of
the least-squares line will always have the same sign as the sample
correlation between the response and the predictor.
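The equivalent representations of the slope can be verified numerically. The following R sketch uses the mother-daughter data of Example 13.1, with `x` the height of the mother and `y` the height of the daughter; the object names `b_dev`, `b_cov`, `b_cor` and `b_lm` are illustrative.

```r
# Heights (in cm) from Example 13.1: x = mother, y = daughter.
x <- c(163, 165, 162, 161, 161, 160, 164, 159, 164, 161, 163, 168)
y <- c(160, 165, 156, 169, 152, 156, 162, 156, 161, 160, 164, 162)

# Slope from the sums of centred products and squares.
b_dev <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Slope as covariance over the variance of the predictor.
b_cov <- cov(x, y) / var(x)

# Slope as correlation times the ratio of standard deviations.
b_cor <- cor(x, y) * sd(y) / sd(x)

# Slope reported by lm(), for comparison.
b_lm <- coef(lm(y ~ x))[["x"]]

print(c(b_dev = b_dev, b_cov = b_cov, b_cor = b_cor, b_lm = b_lm))
```

All four values should coincide up to rounding, and their common sign matches the sign of `cor(x, y)`.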
Example 13.5. Refer to the mother-daughter sample of size $n = 12$ from
Example 13.1. The response variable $y$ is the height of the daughter
and the predictor variable $x$ is the height of the mother. We summarize
the data with the following sums: $\sum x_i = 1{,}951.0$,
$\sum y_i = 1{,}923.0$, $\sum x_i^2 = 317{,}267.0$,
$\sum x_i y_i = 312{,}702.0$ and $\sum y_i^2 = 308{,}403$. Using (13.2)
and (13.3) to compute the least squares estimates, we get the following
estimated line $\hat{y} = 0.8107x + 28.4421$. Figure 13.6 gives the scatter
plot of the pairs $(x_i, y_i)$ and the estimated regression line.
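The computation in Example 13.5 only needs the summary sums. The R sketch below applies (13.3) and then (13.2) to those sums; the variable names are chosen for illustration.

```r
# Summary sums from Example 13.5 (x = mother, y = daughter).
n      <- 12
sum_x  <- 1951
sum_y  <- 1923
sum_x2 <- 317267
sum_xy <- 312702

# Slope from (13.3), then intercept from (13.2).
beta_hat  <- (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x^2 / n)
alpha_hat <- sum_y / n - beta_hat * sum_x / n

print(round(c(slope = beta_hat, intercept = alpha_hat), 4))

# The estimates satisfy both normal equations (13.1).
eq1 <- n * alpha_hat + beta_hat * sum_x           # should equal sum_y
eq2 <- alpha_hat * sum_x + beta_hat * sum_x2      # should equal sum_xy
print(c(eq1 = eq1, eq2 = eq2))
```

The rounded estimates should match the line reported in the example, and the two checks should return the sums of $y_i$ and of $x_i y_i$.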
Fig. 13.6 Least squares line for the heights of mothers and daughters
The least squares line describes the central tendency of the response $y$
as a function of the predictor $x$. We should also describe the dispersion
about the line, since not all of the observations will fall on the line. We
can measure the variability about the least squares line with the residual
standard deviation, which is defined as
\[
s_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}}
    = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}.
\]
Note that $\sum_{i=1}^{n} e_i^2 / (n - 2)$ is approximately the average squared
deviation of the $n$ responses away from the least squares line. However,
instead of dividing by $n$, we divide by $n - 2$, which corresponds to the
number of degrees of freedom in this case. To describe the variability about
the center, we need to first estimate the center by estimating the slope and
the intercept. This leads to a loss of 2 degrees of freedom.
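The definition of the residual standard deviation can be applied directly in R. The sketch below uses the mother-daughter data of Example 13.1 and compares the hand computation with the value stored by `summary()` for a fitted `lm` object; the object names are illustrative.

```r
# Heights (in cm) from Example 13.1.
mother   <- c(163, 165, 162, 161, 161, 160, 164, 159, 164, 161, 163, 168)
daughter <- c(160, 165, 156, 169, 152, 156, 162, 156, 161, 160, 164, 162)

fit <- lm(daughter ~ mother)

# Residuals e_i = y_i - yhat_i and the n - 2 degrees of freedom.
e  <- resid(fit)
n  <- length(daughter)
df <- n - 2

# Residual standard deviation from its definition.
s_e <- sqrt(sum(e^2) / df)

# The same quantity as reported by summary().
s_e_lm <- summary(fit)$sigma

print(c(s_e = s_e, s_e_lm = s_e_lm, df = df))
```

The two values of the residual standard deviation should agree, and `df` matches `fit$df.residual`.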
Example 13.4 (Part 2). Consider the number of colds data from
Example 13.2. To describe the precision of least squares estimation, we compute
the residual standard deviation. Using R, we get $s_e = 2.220$ colds.
So typically, the number of colds in three years is about 2.22 colds away
from the least squares line.
Technology Component using R: Suppose that x and y are numerical
vectors of equal length.
• We use
lm(y~x)
to compute the line of least squares.
• We assign the estimated linear model to mod with the command
mod=lm(y~x)
The commands mod$residuals and mod$df.residual give the vector of the
residuals and the corresponding degrees of freedom, respectively. The
following command will give the residual standard deviation
sqrt(sum((mod$residuals)^2)/mod$df.residual)
• To produce a scatter plot of y against x, we use
plot(x,y)
To overlay the least squares line onto the plot, we use
abline(lm(y~x))
13.3 Problems
Problem 13.1. The height of a child as an adult can be predicted using
the child’s height at the age of 2. The following table gives the heights of
20 women (in cm), as adults and at the age of 2:

Adult        Height at      Adult        Height at
Height (y)   Age of 2 (x)   Height (y)   Age of 2 (x)
164.6        86.4           158.3        83.1
166.1        87.6           159.8        84.5
167.4        88.9           160.6        85.2
163.8        85.7           162.5        84.3
162.9        84.1           173.5        93.9
168.1        89.0           171.9        92.7
169.3        90.1           165.3        85.2
167.4        87.2           164.1        84.2
168.5        88.3           167.5        86.3
165.9        86.3           175.3        95.2

For these data, we have:
\[
\sum_{i=1}^{20} y_i = 3{,}322.8, \quad
\sum_{i=1}^{20} x_i = 1{,}748.2, \quad
\sum_{i=1}^{20} y_i^2 = 552{,}414.5, \quad
\sum_{i=1}^{20} x_i^2 = 153{,}028, \quad
\sum_{i=1}^{20} x_i y_i = 290{,}710.1.
\]
(a) Calculate the estimated least squares line.
(b) Find the sample correlation between the height as an adult and the
height at the age of 2.
(c) Estimate the mean height of a girl as an adult, whose height at age 2 is
87.2 cm.
(d) Predict the height of a girl as an adult, whose height at age 2 is 84 cm.
Problem 13.2. Melanoma is a type of skin cancer which forms from
melanocytes (pigment-producing cells). Melanoma is considered the
most dangerous form of skin cancer. It is not the most common of the skin
cancers in North America, but it does cause the most deaths. Melanoma is
caused mainly by exposure to ultraviolet radiation (either from the sun or
tanning beds). The authors of article [23] studied the association between
melanoma mortality rates and the geographical latitude. The data are in
the file SkinCancer.txt. The latitude ($x$) of the largest city in each state
or province was used as an estimate of the geographical center of population.
The mortality rate ($y$) for the male population is the number of deaths per
year per 100,000 individuals. (The mortality rates are age-standardized to
account for populations of different ages.) Here is a scatter plot of melanoma
mortality rates for the male population against the latitude of the state or
province.
(a) Here are a few summary statistics: $\bar{x} = 40.3762$; $\bar{y} = 1.3506$;
$s_x = 5.6851$; $s_y = 0.4036$ and
\[
\widehat{\mathrm{cov}}_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -1.8003.
\]
Compute the correlation between the melanoma mortality rate and the
latitude. Based on the above scatter plot and the value of the correlation,
describe the association between the melanoma mortality rate and the
latitude.
(b) Using the statistics from part (a), compute the least squares line to
describe the melanoma mortality rates for the male population as a function
of the latitude of the province or state. Give an interpretation to the slope
of this line.
(c) Consider the female population. Using statistical software, compute the
least squares line to describe the melanoma mortality rates as a function of
the latitude of the province or state. Give an interpretation to the slope of
this line. Furthermore, construct a scatter plot of the melanoma mortality
rates against the latitude.
(d) Consider the scatter plot from part (c). There is a state/province with
a much lower than expected mortality rate. Identify this state or province.
Problem 13.3. Systolic arterial blood pressure (SBP) and diastolic
arterial blood pressure (DBP) frequently display a linear relationship
characterized by the systolic-versus-diastolic slope and the sample correlation
(see [30]). The following table gives the SBP and the DBP for 15 men aged
between 40 and 65:

SBP (y)   DBP (x)   SBP (y)   DBP (x)
112       63        156       100
120       69        124       82
135       70         99       56
142       82        105       65
132       76        124       73
115       67        144       89
119       71        134       76
128       73

(a) Calculate the mean SBP and the mean DBP for this sample.
(b) Calculate the sample covariance $\widehat{\mathrm{cov}}_{xy}$, the sample
variances $s_x^2$, $s_y^2$, and the sample correlation $r_{xy}$.
(c) Give the line of the best fit which expresses the SBP as a function of
the DBP.
(d) Give the point estimate for the SBP of a man of age between 40 and
65, whose DBP is equal to 75.
Problem 13.4. Pulmonary vascular resistance (PVR) occurs when the
pulmonary artery creates resistance against the blood flowing into it from
the right ventricle. An elevated PVR is frequently observed in patients with
advanced heart failure. The researchers in [46] hypothesized that inhalation
of nitric oxide would decrease PVR in such patients. To test this hypothesis,
they studied the hemodynamic effects of inhalation of nitric oxide (80 ppm)
for 10 minutes in 19 patients with heart failure associated with left
ventricular dysfunction. Here is a scatter plot that displays the change in
PVR (in percentage) against the PVR at baseline.
(a) Denote the change in PVR (in percentage) as $y$ and the PVR at baseline
as $x$. Here are the covariance between $x$ and $y$ and also the respective
standard deviations:
\[
\widehat{\mathrm{cov}}_{xy} = -2783.822; \quad s_y = 29.6938; \quad s_x = 136.4879.
\]
Compute the correlation between these two variables.
(b) Describe the association between the change in PVR (in percentage)
and PVR at baseline.
Problem 13.5. Since Confederation, the Canadian population has been
growing steadily. The following table gives the population of Canada (in
millions) since 1951. The data are taken from the Statistics Canada website.
We denote by $y$ the Canadian population and $x$ the year.

Year   Population   Year   Population   Year   Population
1951   14.009       1961   18.239       1971   21.963
1953   14.845       1963   18.931       1973   22.494
1955   15.698       1965   19.644       1975   23.143
1957   16.610       1967   20.500       1977   23.727
1959   17.483       1969   21.001       1979   24.203

Year   Population   Year   Population   Year   Population
1981   24.821       1991   27.945       2001   31.012
1983   25.367       1993   28.682       2003   31.676
1985   25.843       1995   29.303       2005   32.359
1987   26.449       1997   29.965       2007   33.115
1989   27.056       1999   30.404       2009   33.894

We have:
\[
\sum_{i=1}^{30} x_i = 59{,}400, \quad
\sum_{i=1}^{30} y_i = 730.381, \quad
\sum_{i=1}^{30} x_i y_i = 1{,}449{,}110, \quad
\sum_{i=1}^{30} x_i^2 = 117{,}620{,}990, \quad
\sum_{i=1}^{30} y_i^2 = 18{,}756.71.
\]
(a) Construct a scatter plot of the data. Give the estimated regression line
of the population as a function of the year.
(b) Calculate the sample correlation $r_{xy}$. Interpret the result.
(c) Compute the residual standard deviation $s_e$.
Problem 13.6. We would like to describe the relationship between the
mean adult female body mass (in kg) of grizzly bears ($y$) and the percentage
of meat in the diet ($x$). Below are the data for $n = 12$ different regions.

x     y     x     y
 5    120   42    169
 6    122   42    171
 7    117   60    201
11    129   76    210
12    132   77    225
26    139   79    220

(a) Calculate the mean and standard deviation for the mean adult female
body mass and for the percentage of meat in the diet.
(b) Draw a scatter plot of the mean adult female body mass against the
percentage of meat in the diet.
(c) Calculate the sample covariance and the sample correlation between the
percentage of meat in the diet and the mean adult female body mass.
Problem 13.7. A large study was conducted to test the hypothesis that
the skeletal muscle mass of women reduces with age. All women involved in
the study had a body mass index of at most 35. For each of the 125 women
participating in this study, the researchers recorded their total skeletal
muscle mass (in kg) and their age (in years). The data are found in the file
SkeletalMass.txt. The first column gives the skeletal muscle mass and the
second column gives the age.
(a) Construct a scatter plot of the data. Give the estimated regression line
of the skeletal muscle mass as a function of age.
(b) Calculate the sample correlation $r_{xy}$. Interpret the result.
(c) Compute the residual standard deviation.
Problem 13.8. Bears play a role in the transfer of marine isotopes, in
particular those taken from salmon, into the terrestrial ecosystem (see [36]).
The values of the nitrogen isotope signature $\delta^{15}N$ (in per mil)
measured from a certain foliage are modeled as a function of the distance from
the river (in metres). Below are the data from a river with few bears and
little to no salmon.

Distance        50      100     150     200     250     300     350     400
$\delta^{15}N$  -3.48   -4.02   -3.00   -3.24   -3.96   -3.80   -3.14   -3.80

(a) Produce a scatter plot and compute the least squares line describing the
value of the nitrogen isotope signature as a function of the distance from
the river.
(b) Compute the residual standard deviation.
(c) Calculate the sample correlation $r_{xy}$. Can we conclude that the value
of the nitrogen isotope signature and the distance from the river are
correlated?
Problem 13.9. Continue with the situation in Problem 13.8. Consider
now the following data from a river with few bears and an abundant salmon
population.

Distance        50      75      100     125     150     200     225
$\delta^{15}N$  0.18    -0.97   -1.74   -1.96   -2.13   -2.31   -2.65

Distance        250     300     325     350     375     400
$\delta^{15}N$  -2.53   -2.52   -2.55   -2.59   -2.71   -2.87

(a) Produce a scatter plot and compute the least squares line describing the
value of the nitrogen isotope signature as a function of the distance from
the river. Does the association appear to be linear?
(b) Because there is an abundant salmon population, but few bears for the
nitrogen transfer, it is hypothesized that the value of the nitrogen isotope
signature is correlated with the inverse distance from the river. We
transform the data by defining the predictor $x = 1/\text{distance}$. Produce a
scatter plot and compute the least squares line describing the values of the
nitrogen isotope signature as a function of $x$. What are your findings?
(c) Compute the correlation between the values of the nitrogen isotope
signature $y$ and $x = 1/\text{distance}$.
Problem 13.10. With an increase in human activity in bear habitats,
there are more human-bear interactions (see [1]). The following data were
collected over a few years in the back country of a particular park. They
represent the number of human-bear interactions and the number of people
using a shuttle bus during a two-week period.

Number of   Human-Bear     Number of   Human-Bear
Bus Users   Interactions   Bus Users   Interactions
 1,750       1             14,000      16
 2,000       1             14,025      10
 5,880       2             14,035       8
 6,000       2             14,250      12
 7,775       2             15,004      10
10,002       4             15,250      12
10,025       5             15,300       9
10,035       3             15,750      11
11,050       5             15,750      20
12,004       9             16,000      12

(a) Produce a scatter plot and compute the least squares line describing
the number of human-bear interactions as a function of the number of bus
users. Does the association appear to be linear?
(b) Apply a logarithm transformation to the response by defining a new
response variable $y = \ln(\text{number of interactions})$. Produce a scatter
plot and compute the least squares line describing $y$ as a function of the
number of bus users. Does the association appear to be linear?
(c) Use the residuals from part (b) to produce a normal probability plot of
the residuals and a residual plot. Use these plots to perform diagnostics
of the underlying assumptions of the simple linear model. What are your
findings?
(d) Using the least squares line from part (b), predict the number of
human-bear interactions for a two-week period in which there are 8,000 shuttle
bus users. Construct the corresponding 95% prediction interval and interpret
the result.
(e) Using the least squares line from part (b), estimate the mean number
of human-bear interactions for a two-week period in which there are 8,000
shuttle bus users. Construct the corresponding 95% confidence interval and
interpret the result.
Did you know? More than two thirds of the world’s plant species
are found in the tropical rainforests, which are renowned for their massive
biodiversity. Rainforests once covered 14% of the earth’s land surface; they
now cover only 6%. Nearly half of the world’s species of plants, animals
and microorganisms will be destroyed or severely threatened over the next
25 years, due to rainforest deforestation. Experts estimate that the last
remaining rainforests could be consumed in less than 40 years. The Tropical
Plants Database is an international project dedicated to providing accurate
and factual information on the plants of the Amazon Rainforest,
created by the joint efforts of botanists, ethnobotanists, health professionals
and phytochemists. More information about this project can be found at
http://www.rain-tree.com/plants.htm.