School: University of British Columbia
Course: DSCI100
Subject: Statistics
Date: Feb 20, 2024
Tutorial 11 - Introduction to Statistical Inference
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe real-world examples of questions that can be answered with statistical inference methods.
Name common population parameters (e.g., mean, proportion, median, variance, standard deviation) that are often estimated using sample data, and use computation to estimate these.
Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
Explain the difference between a population parameter and a sample point estimate.
Use computation to draw random samples from a finite population.
Use computation to create a sampling distribution from a finite population.
Describe how sample size influences the sampling distribution.
This worksheet covers parts of the Inference chapter of the online textbook. You
should read this chapter before attempting the worksheet.
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
options(repr.matrix.max.rows = 6)
source('tests.R')
source('cleanup.R')
Virtual sampling simulation
In this tutorial you will study samples and sample means generated from different
distributions. In real life, we rarely, if ever, have measurements for our entire
population. Here, however, we will make simulated datasets so we can understand
the behaviour of sample means.
Suppose we had the data science final grades for a large population of students.
# run this cell to simulate a finite population
set.seed(20201) # DO NOT CHANGE
students_pop <- tibble(grade = (rnorm(mean = 70, sd = 8, n = 10000)))
students_pop
Question 1.0
{points: 1}
Visualize the distribution of the population (students_pop) that was just created
by plotting a histogram, passing binwidth = 1 as an argument to geom_histogram.
Name the plot pop_dist and give the x-axis a descriptive label.
options(repr.plot.width = 8, repr.plot.height = 6)
# ... <- ggplot(..., ...) +
#     geom_...(...) +
#     ... +
#     ggtitle("Population distribution")
### BEGIN SOLUTION
pop_dist <- ggplot(students_pop, aes(grade)) +
    geom_histogram(binwidth = 1) +
    xlab("Grades") +
    ggtitle("Population distribution") +
    theme(text = element_text(size = 20))
### END SOLUTION
pop_dist
test_1.0()
Question 1.1
{points: 3}
Describe in words the distribution above, comment on the shape, center and how
spread out the distribution is.
BEGIN SOLUTION
The distribution is bell-shaped and symmetric, with one large peak in the middle
centered at about 70%. Students' scores ranged from just over 40% to just
under 100%, but most students scored between about 60% and 80%.
END SOLUTION
Question 1.2
{points: 1}
Use summarise to calculate the following population parameters from the
students_pop population:
mean (use the mean function)
median (use the median function)
standard deviation (use the sd function)
Name this data frame pop_parameters, which has the column names pop_mean,
pop_med and pop_sd.
### BEGIN SOLUTION
pop_parameters <- students_pop |>
    summarise(pop_mean = mean(grade),
              pop_med = median(grade),
              pop_sd = sd(grade))
### END SOLUTION
pop_parameters
test_1.2()
Question 1.2.1
{points: 1}
Draw one random sample of 5 students from our population of students
(students_pop). Use summarize to calculate the mean, median, and standard
deviation for these 5 students.
Name this data frame ests_5, which should have the column names mean_5,
med_5 and sd_5. Use the seed 4321.
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
ests_5 <- students_pop |>
    rep_sample_n(5) |>
    summarize(mean_5 = mean(grade),
              med_5 = median(grade),
              sd_5 = sd(grade))
### END SOLUTION
ests_5
test_1.2.1()
Question 1.2.2
Multiple Choice:
{points: 1}
Which of the following is the point estimate for the average final grade for the
population of data science students (rounded to two decimal places)?
A. 70.03
B. 69.76
C. 73.52
D. 8.05
Assign your answer to an object called answer1.2.2. Your answer should be a
single character surrounded by quotes.
### BEGIN SOLUTION
answer1.2.2 <- "B"
### END SOLUTION
test_1.2.2()
Question 1.2.3
{points: 1}
Draw one random sample of 100 students from our population of students
(students_pop). Use summarize to calculate the mean, median and standard
deviation for these 100 students.
Name this data frame ests_100, which has the column names mean_100,
med_100 and sd_100. Use the seed 4321.
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
ests_100 <- students_pop |>
    rep_sample_n(100) |>
    summarize(mean_100 = mean(grade),
              med_100 = median(grade),
              sd_100 = sd(grade))
### END SOLUTION
ests_100
test_1.2.3()
Exploring the sampling distribution of the sample mean for different populations
We will create the sampling distribution of the sample mean by taking 1500
random samples of size 5 from this population and visualize the distribution of
the sample means.
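As a rough illustration of the procedure just described, the sketch below builds the same kind of sampling distribution in base R instead of with infer's rep_sample_n. It is not the worksheet's solution: the names pop_grades and sample_means_5 are made up for this example and the seed is arbitrary, so the numbers will not match the graded answers.

```r
# Hypothetical base-R sketch; names and seed are assumptions, not the worksheet's.
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Simulate a finite population of 10,000 grades, shaped like students_pop
pop_grades <- rnorm(n = 10000, mean = 70, sd = 8)

# Draw 1500 samples of size 5 (without replacement) and keep each sample's mean
sample_means_5 <- replicate(1500, mean(sample(pop_grades, size = 5)))

# One mean per sample; the means should cluster around the population mean
length(sample_means_5)
mean(pop_grades)
mean(sample_means_5)
```

Each element of sample_means_5 is one point estimate, and the collection of all 1500 approximates the sampling distribution of the sample mean for n = 5.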
Question 1.3
{points: 1}
Draw 1500 random samples from our population of students (students_pop).
Each sample should have 5 observations. Name the data frame samples and use
the seed 4321.
# ... <- rep_sample_n(..., size = ..., reps = ...)
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
samples <- rep_sample_n(students_pop, size = 5, reps = 1500)
### END SOLUTION
head(samples)
tail(samples)
dim(samples)
test_1.3()
Question 1.4
{points: 1}
Group by the sample replicate number, and then for each sample, calculate the
mean. Name the data frame sample_estimates. The data frame should have the
column names replicate and mean_grade.
### BEGIN SOLUTION
sample_estimates <- samples |>
    group_by(replicate) |>
    summarise(mean_grade = mean(grade))
### END SOLUTION
head(sample_estimates)
tail(sample_estimates)
test_1.4()
Question 1.5
{points: 1}
Visualize the distribution of the sample estimates (sample_estimates) you just
calculated by plotting a histogram, passing binwidth = 1 as an argument to
geom_histogram. Name the plot sampling_distribution_5 and give the plot a title
(using ggtitle) and the x-axis a descriptive label.
options(repr.plot.width = 8, repr.plot.height = 6)
### BEGIN SOLUTION
sampling_distribution_5 <- ggplot(sample_estimates, aes(x = mean_grade)) +
    geom_histogram(binwidth = 1) +
    xlab("Sample means \n(mean grade)") +
    ggtitle("Sampling distribution of the sample means \n for n = 5") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution_5
test_1.5()
Question 1.6
{points: 3}
Describe in words the distribution above, comment on the shape, center and how
spread out the distribution is. Compare this sampling distribution to the
population distribution of students' grades above.
BEGIN SOLUTION
The distribution is bell-shaped, with one large peak in the middle centered at the
population mean. The sample means range from about 60% to 80%, but most samples
had a mean between about 65% and 75%. The shape of the sampling distribution is
the same as that of the population distribution (bell-shaped, one peak,
symmetric), but its spread is smaller.
END SOLUTION
Question 1.6.1
{points: 3}
Repeat Q1.3 - 1.5, but now for 100 observations:
1. Draw 1500 random samples from our population of students
(students_pop). Each sample should have 100 observations. Use the seed 4321.
2. Group by the sample replicate number, and then for each sample, calculate
the mean (call this column mean_grade_100).
3. Visualize the distribution of the sample estimates you calculated by plotting a
histogram, passing binwidth = 0.5 as an argument to geom_histogram. Name
the plot sampling_distribution_100 and give the plot a title (using
ggtitle) and the x-axis a descriptive label.
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
sample_means <- rep_sample_n(students_pop, size = 100, reps = 1500) |>
    group_by(replicate) |>
    summarise(mean_grade_100 = mean(grade))
sampling_distribution_100 <- ggplot(sample_means, aes(x = mean_grade_100)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Sample means \n(mean grade)") +
    ggtitle("Sampling distribution of the sample means \n for n = 100") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution_100
set.seed(4321) # DO NOT CHANGE!
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
test_that('Did not create objects named sampling_distribution_100', {
    expect_true(exists("sampling_distribution_100"))
})
### BEGIN HIDDEN TESTS
properties <- c(sampling_distribution_100$layers[[1]]$mapping, sampling_distribu
labels <- sampling_distribution_100$labels
test_that('mean_grade_100 should be on the x-axis.', {
    expect_true("mean_grade_100" == rlang::get_expr(properties$x))
})
test_that('sampling_distribution_100 should be a histogram.', {
    expect_true("GeomBar" %in% class(sampling_distribution_100$layers[[1]]$geom))
})
test_that('sampling_distribution data should be used to create the histogram', {
    expect_equal(int_round(nrow(sampling_distribution_100$data), 0), 1500)
    expect_equal(digest(int_round(sum(sampling_distribution_100$data), 2)), '487
})
test_that('Labels on the x axis should be descriptive. The plot should have a de
    expect_false((labels$x) == 'mean_grade_100')
    expect_false(is.null(labels$title))
})
print("Success!")
### END HIDDEN TESTS
Question 1.6.2
{points: 3}
Suppose we do not know the parameter value for the population of data science
students (as is usually the case in real life). Compare your point estimates for the
population mean from Q1.2.1 and 1.2.3 above. Which of the two point estimates
is more likely to be closer to the actual value of the average final grade of the
population of data science students? Briefly explain. (Hint: look at the sampling
distributions for your samples of size 5 and size 100 to help you answer this
question).
BEGIN SOLUTION
The point estimate from the sample of size 100. We can see from the sampling
distributions above that the sampling distribution for samples of size 100 has less
variability/spread. So the larger the sample the estimate is based on, the more
likely it will be close to the parameter it estimates.
END SOLUTION
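The reasoning in this answer can be checked numerically. For independent draws, the standard deviation of the sample mean is roughly the population standard deviation divided by the square root of the sample size, so means of 100 observations should vary much less than means of 5. The base-R sketch below compares the two spreads on a simulated population; spread_of_means and pop_grades are hypothetical names introduced only for this illustration, and the seed is arbitrary.

```r
# Hypothetical base-R sketch; not part of the worksheet's solutions.
set.seed(1)  # arbitrary seed
pop_grades <- rnorm(n = 10000, mean = 70, sd = 8)

# Helper (an assumption for this example): standard deviation of the
# sample means across `reps` samples of size `n`
spread_of_means <- function(n, reps = 1500) {
  sd(replicate(reps, mean(sample(pop_grades, size = n))))
}

spread_5 <- spread_of_means(5)
spread_100 <- spread_of_means(100)

# Compare each empirical spread with the rough value sd(population) / sqrt(n)
c(empirical = spread_5, approx = sd(pop_grades) / sqrt(5))
c(empirical = spread_100, approx = sd(pop_grades) / sqrt(100))
```

A smaller spread means that a single estimate drawn from that sampling distribution is more likely to land near the population mean.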
Question 1.7
{points: 1}
Let's create a simulated dataset of the number of cups of coffee drunk per week
for our population of students. Describe in words the distribution, comment on
the shape, center and how spread out the distribution is.
# run this cell to simulate a finite population
set.seed(2020) # DO NOT REMOVE
coffee_data <- tibble(cups = rexp(n = 2000, rate = 0.34))
coffee_dist <- ggplot(coffee_data, aes(cups)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Cups of coffee per week") +
    ggtitle("Population distribution") +
    theme(text = element_text(size = 20))
coffee_dist
BEGIN SOLUTION
The distribution is not symmetric (specifically, it is right-skewed), with one large
peak on the left side of the distribution. Most students drink a small number
(0, 1, or 2) of cups of coffee per week, though there is a long tail on the right
side, with some students drinking as many as 20 cups per week. The range of the
distribution is between 0 and 20.
END SOLUTION
Question 1.8
{points: 1}
Draw 1500 random samples from coffee_data. Each sample should have 5
observations. Assign this data frame to an object called coffee_samples_5.
Group by the sample replicate number, and then for each sample, calculate the
mean. Name the data frame coffee_sample_estimates_5. The data frame
should have the column names replicate and coffee_mean_cups_5.
Finally, create a plot of the sampling distribution called
coffee_sampling_distribution_5.
Hint: a binwidth of 1 is a little too big for this data; try a binwidth of
0.5 instead.
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
coffee_samples_5 <- rep_sample_n(coffee_data, size = 5, reps = 1500)
coffee_sample_estimates_5 <- coffee_samples_5 |>
    group_by(replicate) |>
    summarise(coffee_mean_cups_5 = mean(cups))
coffee_sampling_distribution_5 <- ggplot(coffee_sample_estimates_5, aes(x = cof
    geom_histogram(binwidth = 0.5) +
    xlab("Sample means \n(mean cups of coffee per week)") +
    ggtitle("Sampling distribution of the \n mean cups of coffee per week") +
    theme(text = element_text(size = 20)) +
    xlim(c(0, 8))
### END SOLUTION
coffee_sampling_distribution_5
test_1.8()
Question 1.9
{points: 3}
Describe in words the distribution above, comment on the shape, center and how
spread out the distribution is. Compare this sampling distribution to the
population distribution above.
BEGIN SOLUTION
The distribution is not symmetric. It is right-skewed, but not as highly skewed as
the population, with one large peak around 2.5 cups. The range of the distribution
is between 0 and 10.
END SOLUTION
Question 2.0
{points: 1}
Draw 1500 random samples from coffee_data. Each sample should have 30
observations. Assign this data frame to an object called coffee_samples_30.
Group by the sample replicate number, and then for each sample, calculate the
mean. Name the data frame coffee_sample_estimates_30. The data frame
should have the column names replicate and coffee_mean_cups_30.
Finally, create a plot of the sampling distribution called
coffee_sampling_distribution_30.
Hint: use xlim to control the x-axis limits so that they are similar to
those in the histogram above. This will make it easier to compare
this histogram with that one.
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
coffee_samples_30 <- rep_sample_n(coffee_data, size = 30, reps = 1500)
coffee_sample_estimates_30 <- coffee_samples_30 |>
    group_by(replicate) |>
    summarise(coffee_mean_cups_30 = mean(cups))
coffee_sampling_distribution_30 <- ggplot(coffee_sample_estimates_30, aes(x = c
    geom_histogram(binwidth = 0.5) +
    xlab("Sample means \n(mean cups of coffee per week)") +
    ggtitle("Sampling distribution of the \n mean cups of coffee per week") +
    theme(text = element_text(size = 20)) +
    xlim(c(0, 8))
### END SOLUTION
coffee_sampling_distribution_30
test_2.0()
Question 2.1
{points: 3}
Describe in words the distribution above, comment on the shape, center and how
spread out the distribution is. Compare this sampling distribution with samples of
size 30 to the sampling distribution with samples of size 5.
BEGIN SOLUTION
The distribution is bell-shaped, with one large peak in the middle centered
around 2.5 cups per week. The spread is smaller than that of the sampling
distribution for size 5.
END SOLUTION
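The change in shape described in the last two answers can also be put into numbers. The base-R sketch below draws from a simulated right-skewed population and reports a simple skewness measure for the population and for the sample means at n = 5 and n = 30. Everything here is an assumption made for illustration: the skewness helper, the names pop_cups, means_5 and means_30, and the arbitrary seed are not part of the worksheet, and the values will differ from the graded answers.

```r
# Hypothetical base-R sketch; not part of the worksheet's solutions.
set.seed(1)  # arbitrary seed
pop_cups <- rexp(n = 2000, rate = 0.34)  # right-skewed population, like coffee_data

# Helper (an assumption for this example): third standardized moment,
# which is near 0 for symmetric data and positive for right-skewed data
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# 1500 sample means for each sample size, sampling without replacement
means_5 <- replicate(1500, mean(sample(pop_cups, size = 5)))
means_30 <- replicate(1500, mean(sample(pop_cups, size = 30)))

# Skewness and spread of the population versus the two sampling distributions
c(population = skewness(pop_cups), n5 = skewness(means_5), n30 = skewness(means_30))
c(sd_n5 = sd(means_5), sd_n30 = sd(means_30))
```

With a larger sample size, the skewness of the sample means would be expected to move toward 0 and their spread to shrink, which matches the more bell-shaped and narrower histogram for n = 30.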
source('cleanup.R')