School: University of British Columbia
Course: DSCI100
Subject: Statistics
Date: Feb 20, 2024
Worksheet 11 - Introduction to Statistical Inference
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
- Describe real-world examples of questions that can be answered with
  statistical inference methods.
- Name common population parameters (e.g., mean, proportion, median,
  variance, standard deviation) that are often estimated using sample data, and
  use computation to estimate these.
- Define the following statistical sampling terms (population, sample,
  population parameter, point estimate, sampling distribution).
- Explain the difference between a population parameter and a sample point
  estimate.
- Use computation to draw random samples from a finite population.
- Use computation to create a sampling distribution from a finite population.
- Describe how sample size influences the sampling distribution.
This worksheet covers parts of the Inference chapter of the online textbook. You
should read this chapter before attempting the worksheet.
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
options(repr.matrix.max.rows = 6)
source('tests.R')
source('cleanup.R')
Question 1.1
Matching:
{points: 1}
Read the mixed-up table below and assign each variable in the code cell below the
number of the definition that matches its term. Do not put quotation marks
around the number or include words in the answer; we expect the assigned
values to be numbers.
| Terms | Definitions |
|---|---|
| point estimate | 1. the entire set of entities/objects of interest |
| population | 2. selecting a subset of observations from a population where each observation is equally likely to be selected at any point during the selection process |
| random sampling | 3. a numerical summary value about the population |
| representative sampling | 4. a distribution of point estimates, where each point estimate was calculated from a different random sample from the same population |
| population parameter | 5. a collection of observations from a population |
| sample | 6. a single number calculated from a random sample that estimates an unknown population parameter of interest |
| observation | 7. selecting a subset of observations from a population where the sample's characteristics are a good representation of the population's characteristics |
| sampling distribution | 8. a quantity or a quality (or set of these) we collect from a given entity/object |
point_estimate <- NULL
population <- NULL
random_sampling <- NULL
representative_sampling <- NULL
population_parameter <- NULL
sample <- NULL
observation <- NULL
sampling_distribution <- NULL
### BEGIN SOLUTION
point_estimate <- 6
population <- 1
random_sampling <- 2
representative_sampling <- 7
population_parameter <- 3
sample <- 5
observation <- 8
sampling_distribution <- 4
### END SOLUTION
test_1.1()
Virtual sampling simulation
In real life, we rarely, if ever, have measurements for our entire population. Here,
however, we will pretend that we were somehow able to ask every single Canadian
senior what their age is. We will do this so that we can experiment to learn about
sampling and how it relates to estimation.
Here we make a simulated dataset of ages for our population (all Canadian
seniors) bounded by realistic values (65 and 117):
# run this cell to simulate a finite population
set.seed(4321) # DO NOT CHANGE
can_seniors <- tibble(age = (rexp(2000000, rate = 0.1)^2) + 65) |>
    filter(age <= 117, age >= 65)
can_seniors
Question 1.2
{points: 1}
A distribution defines all the possible values (or intervals) of the data and how
often they occur. Visualize the distribution of the population (can_seniors) that
was just created by plotting a histogram, passing binwidth = 1 as an argument to
geom_histogram. Name the plot pop_dist and give the x-axis a descriptive label.
options(repr.plot.width = 8, repr.plot.height = 7)
# ... <- ggplot(..., ...) +
#    geom_...(...) +
#    ... +
#    ggtitle("Population distribution")
### BEGIN SOLUTION
pop_dist <- ggplot(can_seniors, aes(age)) +
    geom_histogram(binwidth = 1) +
    xlab("Age (years)") +
    ggtitle("Population distribution") +
    theme(text = element_text(size = 20))
### END SOLUTION
pop_dist
test_1.2()
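The idea that a distribution records "which values occur and how often" can be sketched in base R by counting observations per one-year bin. This is only an illustration: toy_ages is a hypothetical vector, not the worksheet data, and rounding down with floor is a simplification of what a histogram does (geom_histogram may place its bin edges differently).

```r
# Hypothetical ages, used only to illustrate binning
toy_ages <- c(65.2, 65.9, 66.1, 66.4, 66.8, 68.3, 70.0, 70.5)

# Assign each age to a one-year-wide bin by rounding down, then count
# how many observations fall into each bin
bin_counts <- table(floor(toy_ages))
bin_counts
```

The heights of the bars in a histogram correspond to counts like these.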
Question 1.3
{points: 1}
Distributions are complicated to communicate, so we often want to represent
them by a single value or a small number of values. Common values used for this
include the mean, median, and standard deviation.
Use summarize to calculate the following population parameters from the
can_seniors population:
- mean (use the mean function)
- median (use the median function)
- standard deviation (use the sd function)
Name this data frame pop_parameters, with the column names pop_mean,
pop_med and pop_sd.
### BEGIN SOLUTION
pop_parameters <- can_seniors |>
    summarize(pop_mean = mean(age),
              pop_med = median(age),
              pop_sd = sd(age))
### END SOLUTION
pop_parameters
test_1.3()
Question 1.4
{points: 1}
In real life, we usually are able to only collect a single sample from the population.
We use that sample to try to infer what the population looks like.
Take a single random sample of 40 observations from the Canadian seniors
population (can_seniors). Name it sample_1. Use 4321 as your seed.
set.seed(4321) # DO NOT CHANGE!
# ... <- ... |>
#    rep_sample_n(...)
### BEGIN SOLUTION
sample_1 <- can_seniors |>
    rep_sample_n(40)
### END SOLUTION
sample_1
test_1.4()
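The relationship between a population parameter and a point estimate from a single sample can also be sketched in base R. This is a minimal illustration, not the worksheet's method: toy_population is a hypothetical simulated vector (it is not can_seniors), and base R's sample is used here in place of rep_sample_n.

```r
# Toy population (hypothetical values, not the worksheet's can_seniors data)
set.seed(1)
toy_population <- 65 + rexp(10000, rate = 0.1)

# Population parameter: computed from every value in the population
toy_pop_mean <- mean(toy_population)

# One random sample of 40 observations, drawn without replacement
toy_sample <- sample(toy_population, size = 40)

# Point estimate: the same summary, computed from the sample only
toy_sample_mean <- mean(toy_sample)

c(parameter = toy_pop_mean, estimate = toy_sample_mean)
```

In practice only the estimate would be available; the parameter is computable here only because the whole toy population was simulated.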
Question 1.5
{points: 1}
Visualize the distribution of the random sample you just took (sample_1) by
plotting a histogram, passing binwidth = 1 as an argument to geom_histogram.
Name the plot sample_1_dist and give the plot the title "Sample 1 Distribution"
(using ggtitle) and the x-axis a descriptive label.
options(repr.plot.width = 8, repr.plot.height = 7)
### BEGIN SOLUTION
sample_1_dist <- ggplot(sample_1, aes(age)) +
    geom_histogram(binwidth = 1) +
    xlab("Age (years)") +
    ggtitle("Sample 1 Distribution") +
    theme(text = element_text(size = 20))
### END SOLUTION
sample_1_dist
test_1.5()
Question 1.6
{points: 1}
Use summarize to calculate the following point estimates from the random
sample you just took (sample_1):
- mean
- median
- standard deviation
Name this data frame sample_1_estimates, with the column names
sample_1_mean, sample_1_med and sample_1_sd.
### BEGIN SOLUTION
sample_1_estimates <- sample_1 |>
    summarize(sample_1_mean = mean(age),
              sample_1_med = median(age),
              sample_1_sd = sd(age))
### END SOLUTION
sample_1_estimates
test_1.6()
Let's now compare our random sample to the population from which it was
drawn. In ggplot, it is possible to display multiple charts together by using the
function plot_grid from a separate package called cowplot. We can use the
ncol parameter to control how many columns of plots the grid contains. Since
we want to compare the distributions' shape and position on the x-axis, it is most
effective to concatenate these charts vertically in a single column.
# run this code cell
options(repr.plot.width = 7, repr.plot.height = 7)
plot_grid(pop_dist, sample_1_dist, ncol = 1)
And now let's compare the point estimates (mean, median and standard
deviation) with the true population parameters we were trying to estimate:
# run this cell
pop_parameters
sample_1_estimates |>
    select(-replicate)
Question 1.7
Multiple Choice
{points: 1}
After comparing the population and sample distributions above, and the true
population parameters and the sample point estimates, which statement below is
not correct:
A. The sample point estimates are close to the values for the true population
parameters we are trying to estimate
B. The sample distribution is of a similar shape to the population distribution
C. The sample point estimates are identical to the values for the true population
parameters we are trying to estimate
Assign your answer to an object called answer1.7. Your answer should be a
single character surrounded by quotes.
### BEGIN SOLUTION
answer1.7 <- "C"
### END SOLUTION
test_1.7()
Question 1.8.0
{points: 1}
What if we took another sample? What would we expect? Let's try!
Take another random sample of size 40 from the population (using a different
random seed this time so that you get a different sample), visualize its
distribution (give the plot the title "Sample 2 Distribution" using ggtitle), and
calculate the point estimates for the sample mean, median and standard
deviation. Name your random sample of data sample_2, name your visualization
sample_2_dist, and finally name your estimates sample_2_estimates, with
the column names sample_2_mean, sample_2_med and sample_2_sd.
set.seed(2020) # DO NOT CHANGE!
### BEGIN SOLUTION
sample_2 <- can_seniors |>
    rep_sample_n(40)
options(repr.plot.width = 8, repr.plot.height = 7)
sample_2_dist <- ggplot(sample_2, aes(age)) +
    geom_histogram(binwidth = 1) +
    xlab("Age (years)") +
    ggtitle("Sample 2 Distribution")
sample_2_estimates <- sample_2 |>
    summarise(sample_2_mean = mean(age),
              sample_2_med = median(age),
              sample_2_sd = sd(age))
sample_2_dist
sample_2_estimates
### END SOLUTION
test_1.8.0()
Question 1.8.1
{points: 1}
After comparing the distribution and point estimates of this second random
sample from the population with those of the first random sample and the
population, which of the following statements is not correct:
A. The sample distributions from different random samples are of a similar shape
to the population distribution, but they vary a bit depending on which values are
captured in the sample
B. The sample point estimates from different random samples are close to the
values for the true population parameters we are trying to estimate, but they vary
a bit depending on which values are captured in the sample
C. Every random sample from the same population should have an identical set of
values and yield identical point estimates.
Assign your answer to an object called answer1.8.1. Your answer should be a
single character surrounded by quotes.
### BEGIN SOLUTION
answer1.8.1 <- "C"
### END SOLUTION
test_1.8.1()
Exploring the sampling distribution of an estimate
Just how much should we expect the point estimates of our random samples to
vary? To build an intuition for this, let's experiment a little more with our
population of Canadian seniors. To do this we will take 1500 random samples, and
then calculate the point estimate we are interested in (let's choose the mean for
this example) for each sample. Finally, we will visualize the distribution of the
sample point estimates. This distribution will tell us how much we would expect
the point estimates of our random samples to vary for this population for samples
of size 40 (the size of our samples).
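The procedure just described (draw many samples, compute one point estimate per sample, look at the distribution of those estimates) can be sketched in base R as follows. This is a rough illustration under assumptions: toy_population is a hypothetical simulated vector rather than can_seniors, and replicate with sample stands in for the infer functions used in the worksheet.

```r
# Toy population (hypothetical values, not the worksheet's can_seniors data)
set.seed(1)
toy_population <- 65 + rexp(10000, rate = 0.1)

# Draw 1500 samples of size 40 and keep the mean of each one
toy_sample_means <- replicate(1500, mean(sample(toy_population, size = 40)))

# The 1500 means together approximate the sampling distribution of the mean
length(toy_sample_means)
summary(toy_sample_means)
hist(toy_sample_means,
     xlab = "Sample mean",
     main = "Sampling distribution (toy population, n = 40)")
```

Each element of toy_sample_means is the point estimate from one sample, so the spread of this vector shows how much the estimate varies from sample to sample.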
Question 1.9
{points: 1}
Draw 1500 random samples from our population of Canadian seniors
(can_seniors). Each sample should have 40 observations. Name the data frame
samples and use the seed 4321. Here we use the functions head(), tail()
and dim() to view the first few rows, the last few rows and the dimensions of the
data set respectively.
set.seed(4321) # DO NOT CHANGE!
# ... <- rep_sample_n(..., size = ..., reps = ...)
### BEGIN SOLUTION
samples <- rep_sample_n(can_seniors, size = 40, reps = 1500)
### END SOLUTION
head(samples)
tail(samples)
dim(samples)
test_1.9()
Question 2.0
{points: 1}
Group by the sample replicate number, and then for each sample, calculate the
mean as the point estimate. Name the data frame sample_estimates. The data
frame should have the column names replicate and mean_age.
### BEGIN SOLUTION
sample_estimates <- samples |>
    group_by(replicate) |>
    summarize(mean_age = mean(age))
### END SOLUTION
head(sample_estimates)
tail(sample_estimates)
test_2.0()
Question 2.1
{points: 1}
Visualize the distribution of the sample estimates (sample_estimates) you just
calculated by plotting a histogram, passing binwidth = 1 as an argument to
geom_histogram. Name the plot sampling_distribution. Give the plot the title
"Sampling Distribution of the Sample Means" using ggtitle, and give the x-axis
a descriptive label.
options(repr.plot.width = 8, repr.plot.height = 7)
### BEGIN SOLUTION
sampling_distribution <- ggplot(sample_estimates, aes(x = mean_age)) +
    geom_histogram(binwidth = 1) +
    xlab("Sample mean (age in years)") +
    ggtitle("Sampling Distribution of the Sample Means") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution
test_2.1()
Question 2.2
{points: 1}
Let's refresh our memories: what is the mean age of the whole population (we
calculated this above)? Assign your answer to an object called answer2.2. Your
answer should be a single number reported to two decimal places.
### BEGIN SOLUTION
answer2.2 <- 79.30
### END SOLUTION
answer2.2
test_2.2()
Question 2.3
Multiple Choice
{points: 1}
Considering the true value for the population mean, and the sampling
distribution you created and visualized in question 2.1, which of the following
statements is not correct:
A. The sampling distribution is centered at the true population mean
B. All the sample means are the same value as the true population mean
C. Most sample means are at or very near the true population mean
D. A few sample means are far away from the true population mean
Assign your answer to an object called answer2.3. Your answer should be a
single character surrounded by quotes.
### BEGIN SOLUTION
answer2.3 <- "B"
### END SOLUTION
answer2.3
test_2.3()
Question 2.4
True/False
{points: 1}
Taking a random sample and calculating a point estimate is a good way to get a
"best guess" of the population parameter you are interested in. True or False?
Assign your answer to an object called answer2.4. Your answer should be either
"True" or "False", surrounded by quotes.
### BEGIN SOLUTION
answer2.4 <- "True"
### END SOLUTION
answer2.4
test_2.4()
The influence of sample size on the sampling distribution
What happens to our point estimate when we change the sample size? Let's
answer this question by experimenting! We will create 3 different sampling
distributions of sample means, each using a different sample size. As we did
above, we will draw samples from our Canadian seniors population. We will
visualize these sampling distributions and see if we can see a pattern when we
vary the sample size.
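As a rough preview of this experiment, the spread of the sample means can be measured directly for several sample sizes. The sketch below is only an illustration under assumptions: toy_population and the helper spread_of_means are hypothetical names, and the data are simulated rather than taken from can_seniors. For comparison it also computes the population standard deviation divided by the square root of the sample size, which is approximately the standard deviation of the sample mean when the sample is small relative to the population.

```r
# Toy population (hypothetical values, not the worksheet's can_seniors data)
set.seed(1)
toy_population <- 65 + rexp(100000, rate = 0.1)

# Hypothetical helper: standard deviation of `reps` sample means,
# each computed from a random sample of size n
spread_of_means <- function(n, reps = 1500) {
  sd(replicate(reps, mean(sample(toy_population, size = n))))
}

sample_sizes <- c(20, 40, 100)
observed_spread <- sapply(sample_sizes, spread_of_means)

# Approximate theoretical comparison: population sd divided by sqrt(n)
expected_spread <- sd(toy_population) / sqrt(sample_sizes)

data.frame(n = sample_sizes, observed_spread, expected_spread)
```

The observed spread shrinks as n grows, which is the pattern to look for in the histograms produced in the questions that follow.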
Question 2.5
{points: 1}
Using the same strategy as you did above, draw 1500 random samples from the
Canadian seniors population (can_seniors), each of size 20. For each sample,
calculate the mean age and assign this data to a column called mean_age.
Then, visualize the distribution of the sample estimates (means) you just
calculated by plotting a histogram, passing binwidth = 1 as an argument to
geom_histogram. Name the plot sampling_distribution_20. Give the plot the title
"Sampling Distribution (n=20)" using ggtitle, and give the x-axis a descriptive
label. Also specify the x-axis limits to be 65 and 95 using xlim(c(65, 95)).
Set the seed as 4321 when you collect your samples.
set.seed(4321) # DO NOT CHANGE THIS!
options(repr.plot.width = 8, repr.plot.height = 7)
### BEGIN SOLUTION
sample_estimates_20 <- rep_sample_n(can_seniors, size = 20, reps = 1500) |>
    group_by(replicate) |>
    summarise(mean_age = mean(age))
sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = mean_age)) +
    geom_histogram(binwidth = 1) +
    xlab("Sample mean (age in years)") +
    xlim(c(65, 95)) +
    ggtitle("Sampling Distribution (n=20)") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution_20
test_2.5()
Question 2.6
{points: 1}
Using the same strategy as you did above, draw 1500 random samples from the
Canadian seniors population (can_seniors), each of size 100. For each sample,
calculate the mean age and assign this data to a column called mean_age.
Then, visualize the distribution of the sample estimates (means) you just
calculated by plotting a histogram, passing binwidth = 1 as an argument to
geom_histogram. Name the plot sampling_distribution_100. Give the plot the title
"Sampling Distribution (n=100)" using ggtitle, and give the x-axis a descriptive
label. Also specify the x-axis limits to be 65 and 95 using xlim(c(65, 95)).
Set the seed as 4321 when you collect your samples.
set.seed(4321) # DO NOT CHANGE THIS!
options(repr.plot.width = 8, repr.plot.height = 7)
### BEGIN SOLUTION
sample_estimates_100 <- rep_sample_n(can_seniors, size = 100, reps = 1500) |>
    group_by(replicate) |>
    summarise(mean_age = mean(age))
sampling_distribution_100 <- ggplot(sample_estimates_100, aes(x = mean_age)) +
    geom_histogram(binwidth = 1) +
    xlab("Sample mean (age in years)") +
    xlim(c(65, 95)) +
    ggtitle("Sampling Distribution (n=100)") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution_100
test_2.6()
# run this cell to change the sampling distribution plot created
# earlier in the notebook so that the x-axis has the same limits
# as the other two plots you just made, and so that the title is
# "Sampling Distribution (n=40)"
sampling_distribution <- sampling_distribution +
    xlim(c(65, 95))
sampling_distribution$labels$title <- "Sampling Distribution (n=40)"
Question 2.7
{points: 1}
Fill in the blanks in the code below to use plot_grid to concatenate the three
sampling distributions vertically. Order them from the smallest sample size on
the top to the largest sample size on the bottom. Name the final panel figure
sampling_distribution_panel.
options(repr.plot.width = 6)
# sampling_distribution_panel <- plot_grid(
#    ...,
#    ...,
#    ...,
#    ncol = 1
# )
### BEGIN SOLUTION
sampling_distribution_panel <- plot_grid(
    sampling_distribution_20,
    sampling_distribution,
    sampling_distribution_100,
    ncol = 1
)
sampling_distribution_panel
### END SOLUTION
test_2.7()
Question 2.8
Multiple Choice
{points: 1}
Considering the panel figure you created above in question 2.7, which of the
following statements is not correct:
A. As the sample size increases, the sampling distribution of the point estimate
becomes narrower.
B. As the sample size increases, more sample point estimates are closer to the
true population mean.
C. As the sample size decreases, the sample point estimates become more
variable (spread out).
D. As the sample size increases, the sample point estimates become more variable
(spread out).
Assign your answer to an object called answer2.8. Your answer should be a
single character surrounded by quotes.
### BEGIN SOLUTION
answer2.8 <- "D"
### END SOLUTION
answer2.8
test_2.8()
Question 2.9
True/False
{points: 1}
Given what you observed above, and considering the real-life scenario where you
will only have one sample, answer the True/False question below:
The smaller your random sample, the better your sample point estimate reflects
the true population parameter you are trying to estimate. True or False?
Assign your answer to an object called answer2.9. Your answer should be either
"true" or "false", surrounded by quotes.
### BEGIN SOLUTION
answer2.9 <- "false"
### END SOLUTION
answer2.9
test_2.9()
source('cleanup.R')