School: University of Toronto
Course: 247
Subject: Statistics
R Lab #4
Kenji Tan
December 6, 2022
Submission Instructions
1. At the end of this R Lab, submit BOTH your .rmd AND .pdf files
2. This R Lab will be marked based on completion out of 3 points:
• Were the exercises covered during your R Lab section completed in your file? Yes/Partially/No (1/0.5/0)
• Did the student answer all lab observation questions? Yes/Partially/No (1/0.5/0)
• Did the student remove eval = F from completed code chunks so that output is properly displayed in the .pdf file? Yes/No (0.5/0)
• Were BOTH completed .rmd AND corresponding .pdf files submitted? Yes/No (0.5/0)
In this R Lab, we will rely on simulation to:
1) Examine differences between the population distribution (i.e. the distribution of a random variable) and the sampling distribution (the distribution of a sample statistic computed from samples drawn from that population)
2) Observe the effect of sample size on the convergence of the sampling distribution of the sample mean to the normal distribution
3) Explore other sampling distributions where the CLT cannot be applied
Tools and Skills Covered in this R Lab
• Simulate sampling distributions in R by simulating a large number of random samples of size n from a population
• Use simulation to examine and compute probabilities from sampling distributions where the CLT cannot be applied
Loading Necessary Packages
library(tidyverse)
library(latex2exp)
Recap: Central limit theorem (CLT)
The central limit theorem states that for a large enough sample size, the sample mean from random samples
of the population will have a distribution that is approximately normal. This is true even if the distribution
of the population we are sampling from is not normal. From the central limit theorem, we can define the
following properties:
• The expected value (long-run average) of the sample mean, $\mu_{\bar{X}}$, is the same as the expected value (mean) of the population we are sampling from, $\mu$. That is, $E(\bar{X}) = E(X) = \mu$, where $X$ denotes the random variable with the population distribution.
• The variance of the sample mean is smaller than the variance of the population by a factor of $n$. That is, $\sigma^2_{\bar{X}} = \sigma^2 / n$. Be sure to differentiate the two: $\sigma^2$ describes the variability between single outcomes in the population (the variance of $X$ from outcome to outcome), while $\sigma^2_{\bar{X}}$ describes the variability from one sample mean to the next (i.e. how much sample means differ from each other).
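The two properties above follow from linearity of expectation and from the independence of the observations. A short derivation, assuming $X_1, \dots, X_n$ are independent draws from a population with mean $\mu$ and finite variance $\sigma^2$:

```latex
\begin{align*}
E(\bar{X}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
            = \frac{n\mu}{n} = \mu, \\
\sigma^2_{\bar{X}} = V(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n} V(X_i)
            = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
\end{align*}
```

Both identities hold for every sample size $n$; only the approximately normal shape requires $n$ to be large.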
The following example demonstrates how to apply the central limit theorem in R.
Applying the Central Limit Theorem in R
Suppose the time between queries coming in can be modeled by an exponential distribution with an average of 10 hours between queries. The exponential density function then describes the possible times we can expect to wait between queries, along with a glimpse of which times are likelier than others. Let $X$ denote the random time between queries, measured in hours. Let's plot the distribution of $X$. We can visualize this distribution in two ways:
1) Graph the curve of the exponential probability density function: $f(x) = \frac{1}{10} e^{-x/10}$, $x > 0$ (an average of 10 hours per query corresponds to a rate of 1/10 of a query per hour).
2) Simulate a large enough data set from the exponential distribution. The law of large numbers tells us that as we increase the number of simulations, the relative frequencies of outcomes converge to their probabilities. This means a large enough simulation of data points from the population will show us the approximate distribution of individual wait times.
# Make this example reproducible
set.seed(1206)

# To plot the exact PDF, create a table of values (x, f(x)) and plot the points
# using geom_line().
# To plot the histogram, use a simulation of B = 100,000 data points and plot the
# density histogram that shows how the individual wait times are distributed.
# Note: all columns in a tibble must be the same size
B <- 100000
sim_data <- tibble(
  x = seq(0, 50, length.out = B),   # grid of x values for the PDF curve
  fx = 0.1 * exp(-x / 10),          # exact PDF f(x) evaluated on the grid
  time = rexp(B, rate = 1 / 10)     # rate = lambda = 1/10 per hr, so mean wait = 10 hrs
)

# Create density histogram to visualize distribution of the time between queries
# and overlay with the probability density function.
ggplot(data = sim_data) +
  geom_histogram(aes(x = time, y = after_stat(density)),       # density scale, not counts
                 fill = "blue", colour = "black", bins = 70) + # colours and number of bins
  geom_line(aes(x = x, y = fx), linewidth = 1.25, colour = "red") +
  labs(title = "Density Histogram and PDF of Exponentially Distributed Time",
       x = "Time (hrs)", y = "Density") +
  theme_light()
[Figure: Density Histogram and PDF of Exponentially Distributed Time. x-axis: Time (hrs); y-axis: Density.]
Revisiting Past Observations:
Does the data that was generated appear to follow the exponential distribution that it was drawn from?
Yes. It is important to note here that the data we are sampling are not normally distributed. When we have a large sample of observations from the population (in this case, we generated individual observations $X_1, X_2, \dots, X_{100{,}000}$), plotting the data allows us to approximate the distribution of the individual wait times.
We’ve used this strategy to plot simulated distributions all term: larger simulations = more data to paint a
more accurate picture of the distribution and/or probabilities!
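The claim that larger simulations give more accurate probabilities can be checked directly. The sketch below is a minimal illustration in base R, assuming the same exponential population with mean 10; the event "wait longer than 20 hours" and the object names are chosen here for illustration only.

```r
# Sketch: relative frequencies approach probabilities as the simulation grows.
# Assumption (illustrative): the event is "wait longer than 20 hours".
set.seed(1206)

exact_p <- exp(-20 / 10)  # P(X > 20) = e^(-20/10) for an exponential with mean 10

sim_sizes <- c(100, 1000, 10000, 100000)
est_p <- sapply(sim_sizes, function(B) {
  times <- rexp(B, rate = 1 / 10)  # B simulated wait times
  mean(times > 20)                 # relative frequency of the event
})

# Compare each estimate with the exact probability
data.frame(B = sim_sizes, estimate = est_p, abs_error = abs(est_p - exact_p))
```

The error is not guaranteed to shrink at every step, but it should tend to get smaller as B grows.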
Let’s apply this same concept to simulate sampling distributions:
Simulated Distributions of $\bar{X}_n$ for Small and Large $n$
1) To generate an approximate sampling distribution, we need a large enough set of observations of the sample statistic (more observations = the closer the histogram will be to the density of the sampling distribution).
2) To do this, we first decide on the sample size to be used for the sample statistic, and on the simulation size:
• Sample size (n): the number of observations that comprise a single sample from the population. The sample statistic $T$ is computed from this random sample: each $T_i = T(X_1, X_2, \dots, X_n)$.
• Simulation size (B): the number of times we repeat this random sampling. This is the number of sample statistics we will create. We will end up with $B$ observations of the sample statistic, $T_1, T_2, \dots, T_B$.
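The roles of n and B can be sketched generically before working through the lab's own loop. The helper below, simulate_statistic(), is hypothetical (it is not part of the lab code) and assumes the exponential population with mean 10 as its default; any function of a sample can be passed in as the statistic.

```r
# Sketch: B repetitions of "draw a sample of size n, compute the statistic T".
# simulate_statistic() is a hypothetical helper, not part of the lab code.
simulate_statistic <- function(B, n, statistic,
                               draw = function(n) rexp(n, rate = 1 / 10)) {
  replicate(B, statistic(draw(n)))  # returns the vector T_1, ..., T_B
}

set.seed(1206)
t_mean <- simulate_statistic(B = 10000, n = 5, statistic = mean)    # sample means
t_med  <- simulate_statistic(B = 10000, n = 5, statistic = median)  # sample medians

length(t_mean)   # one value of the statistic per simulated sample
summary(t_mean)  # numerical summary of the simulated sampling distribution
```

Using replicate() is one alternative to an explicit for loop; both produce one value of the statistic per repetition.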
# Setting up initial objects
B <- 10000              # Simulation size
n1 <- 5                 # Size of first sample
n2 <- 50                # Size of second sample
n5.mean <- numeric(B)   # Vector to store means from samples of size 5
n50.mean <- numeric(B)  # Vector to store means from samples of size 50

# For each simulation from 1 to B, take a random sample of size 5 and 50 from the population
# For each random sample, compute the sample statistic and store it
# Generate a histogram of the sample statistics to visualize how the sample statistic
# values are distributed (approximately)
for (i in 1:B) {                  # Repeat the process B times
  sample1 <- rexp(n1, 1 / 10)     # Sample 5 wait times
  n5.mean[i] <- mean(sample1)     # Store the mean of these values
  sample2 <- rexp(n2, 1 / 10)     # Sample 50 wait times
  n50.mean[i] <- mean(sample2)    # Store the mean of these values
}
sim.means <- tibble(n5.mean, n50.mean)  # Store everything in a tibble
# Create a histogram to visualize the sampling distributions of the sample means
ggplot(data = sim.means) +
  geom_histogram(aes(x = n5.mean, y = after_stat(density)),
                 fill = "blue", color = "black", binwidth = 0.51,
                 alpha = 0.5) +  # alpha = transparency parameter
  geom_histogram(aes(x = n50.mean, y = after_stat(density)),
                 fill = "red", color = "black", binwidth = 0.51, alpha = 0.5) +
  labs(title = TeX(r'(Simulated Distribution of $\bar{X}_5$ and $\bar{X}_{50}$)'),
       subtitle = TeX(r'(Blue = $\bar{X}_5$, Red = $\bar{X}_{50}$)'),
       x = TeX(r'($\bar{X}$, Average Time between Queries (hrs))'),
       y = "Density") +
  theme_light()
[Figure: Simulated Distribution of $\bar{X}_5$ and $\bar{X}_{50}$ (Blue = $\bar{X}_5$, Red = $\bar{X}_{50}$). x-axis: $\bar{X}$, Average Time between Queries (hrs); y-axis: Density.]
Observations
1) Do the distributions of the sample means at sample size $n = 5$ or $n = 50$ appear to resemble the normal distribution? Note that this is just a visual inspection, and we may not be able to concretely conclude a normal distribution.
The distribution for n = 50 resembles a normal distribution more closely: its graph looks more symmetric than the other, with clearer symmetry around the mean at 10. For n = 5, the distribution is clearly skewed to the right and thus does not resemble a normal distribution.
2) How has the spread of the sampling distribution of $\bar{X}_n$ changed as we increased the sample size $n$?
The spread has decreased as n has increased.
3) Let's compare with some numerics:
(i) Probability that $\bar{X} \geq 12$ using the simulated data
(ii) Probability that $\bar{X} \geq 12$ using a normal approximation (whether appropriate or not)
In class, we showed that $E(\bar{X}) = \mu_X$ and $V(\bar{X}) = \sigma^2_X / n$, which we will use as the parameters of the normal distribution. That is, let's see how these simulated probabilities compare with those using $N(10, 100/5)$ and $N(10, 100/50)$.
# Estimate probabilities using the relative frequency of sample means of at least 12 among simulations
est.p5 <- sum(sim.means$n5.mean >= 12) / B    # Proportion of simulations with a sample mean of at least 12
est.p50 <- sum(sim.means$n50.mean >= 12) / B

# Probabilities if we approximated with a normal distribution
norm.p5 <- pnorm(12, 10, sd = sqrt(100 / 5), lower.tail = FALSE)
norm.p50 <- pnorm(12, 10, sd = sqrt(100 / 50), lower.tail = FALSE)
Measure         $P(\bar{X}_5 \geq 12)$    $P(\bar{X}_{50} \geq 12)$
Sample Size     5                         50
Simulated       0.2876                    0.0819
Using Normal    0.3273604                 0.0786496
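4) Based on the estimated probabilities, would it be wise to use a normal distribution to calculate probabilities involving sample means for both sample sizes?
Not for both sample sizes. It is more appropriate for n = 50, where the distribution more closely resembles a normal distribution and the normal probability (0.0786) is close to the simulated one (0.0819). For n = 5 it is not appropriate: the distribution does not resemble a normal distribution, and the normal probability (0.3274) is noticeably further from the simulated one (0.2876).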
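For this particular population the comparison can be taken one step further, because the sampling distribution of the mean happens to be known exactly: the sum of n independent exponential wait times with mean 10 follows a gamma distribution with shape n and rate 1/10, so the sample mean follows a gamma distribution with shape n and rate n/10. The sketch below uses only base R to place the exact probability next to the normal approximation; the helper name compare_tail() is hypothetical and not part of the lab code.

```r
# Sketch: exact vs normal-approximation tail probabilities for the mean of
# n exponential wait times (population mean 10, standard deviation 10).
# compare_tail() is a hypothetical helper, not part of the lab code.
compare_tail <- function(n, cutoff = 12, mu = 10, sigma = 10) {
  # Exact: the mean of n Exp(mean mu) draws is Gamma(shape = n, rate = n / mu)
  exact <- pgamma(cutoff, shape = n, rate = n / mu, lower.tail = FALSE)
  # Approximation: N(mu, sigma^2 / n), as used in the lab
  normal <- pnorm(cutoff, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
  c(n = n, exact = exact, normal = normal, abs_diff = abs(exact - normal))
}

results <- rbind(compare_tail(5), compare_tail(50))
round(results, 4)
```

The abs_diff column gives one numerical measure of how much the normal approximation improves between n = 5 and n = 50.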
What about other sample statistics?
The Central Limit Theorem provides us with an approximate sampling distribution for sample means only, and even then only when certain conditions are met. Consider the sampling distribution of the sample variance. Much like the sample mean:
• Every sample of size $n$ from a population or distribution is random
• Functions of random variables are also random variables with their own distributions
• The sample variance is a random variable whose distribution is generally unknown and difficult to derive
The sample variance is computed from a random sample $X_1, X_2, \dots, X_n$ as:
$$S^2_n = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
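As a quick check of the formula above, the sample variance can be computed by hand and compared with R's built-in var(). This is a minimal sketch, assuming one sample of 50 wait times from the exponential population with mean 10; the object names are illustrative.

```r
# Sketch: compute the sample variance "by hand" and compare with var().
set.seed(1206)
x <- rexp(50, rate = 1 / 10)  # one random sample of n = 50 wait times
n <- length(x)

s2_manual  <- sum((x - mean(x))^2) / (n - 1)  # formula above, divisor n - 1
s2_builtin <- var(x)                          # R's built-in sample variance

all.equal(s2_manual, s2_builtin)  # TRUE: var() uses the n - 1 divisor
```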
Let's simulate the sampling distribution of the sample variance using samples of 50 wait times and 10,000 simulations. With a large enough number of simulations, we should achieve a pretty good approximation of the actual sampling distribution of $S^2_n$:
# warning = F restricts the printing of warning messages.
# In the ggplot below, we restrict the plot to an interval on the x and y axes,
# which forces R to omit some of the data from the plot; this is what
# generates the warning message.

# Declare initial objects:
B <- 10000             # Simulation size
n <- 50                # Sample size
sim.var <- numeric(B)  # Vector to store the simulated sample variances

for (i in 1:B) {
  wait_times <- rexp(n, 0.1)     # Generate random waiting times
  sim.var[i] <- var(wait_times)  # Store the sample variance
}
# Store everything in a tibble
sim.data2 <- tibble(
  x = seq(0, 400, length.out = B),  # Values of x
  fx = 0.1 * exp(-x / 10),          # PDF evaluated at those values
  sim.var = sim.var                 # Simulated sample variances
)

# Plot simulated distribution of variance, compared with probability density of wait times
ggplot(sim.data2) +
  geom_histogram(aes(x = sim.var, y = after_stat(density)),
                 fill = "cadetblue3", color = "black", binwidth = 5) +
  geom_line(aes(x = x, y = fx), color = "orange") +               # Population PDF
  geom_vline(xintercept = 100, linewidth = 1.2, color = "red") +  # Locate population variance
  scale_y_continuous(limits = c(0, 0.05)) +
  scale_x_continuous(limits = c(0, 350)) +
  labs(x = "Sample Variance Wait Times (hrs-squared)",
       y = "Density",
       subtitle = "Overlaid with Population Distribution: Exp(10)",
       title = "Simulated Distribution of Sample Variance, n = 50") +
  theme_light()
[Figure: Simulated Distribution of Sample Variance, n = 50 (Overlaid with Population Distribution: Exp(10)). x-axis: Sample Variance Wait Times (hrs-squared); y-axis: Density.]
Observations:
1) Does the sampling distribution of the variance resemble the distribution of the population? What are the common outcomes of $X$ versus the common sample variance values?
No, it does not resemble the distribution of the population. Individual wait times $X$ are most often small, with a mean of 10, whereas the sample variances are concentrated around 100.
2) Does the distribution of sample variances seem to resemble a normal distribution?
It does not seem to resemble a normal distribution because it is skewed to the right.
3) Can CLT be applied to approximate distributions of sample variances? Why or why not?
Yes, as long as there is a large enough sample.
4) Based on the simulation, does it appear that the sample variance is an accurate estimate of the population variance? That is, is the variance in a random sample of 50 wait times representative of the true variance in the population?
Not reliably. The simulated sample variances are spread widely around the population variance of 100, so n = 50 is not large enough for a single sample variance to be accurate.
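One way to make this answer more concrete is to separate two questions: whether the sample variance is right on average, and how far a single sample variance tends to land from 100. The sketch below is a minimal illustration in base R under the same assumptions as the lab (exponential population with mean 10, n = 50, B = 10,000); sim_var is a new name so that the lab's sim.var is not overwritten.

```r
# Sketch: centre and spread of simulated sample variances (n = 50, B = 10,000),
# assuming the same exponential population (mean 10, variance 100) as the lab.
set.seed(1206)
B <- 10000
n <- 50
sim_var <- replicate(B, var(rexp(n, rate = 0.1)))

mean(sim_var)                     # long-run average of S^2, to compare with 100
sd(sim_var)                       # typical distance of one S^2 from its average
quantile(sim_var, c(0.05, 0.95))  # range covering the middle 90% of sample variances
```

If the average is close to 100 while the standard deviation is large, the estimator is centred correctly but any single sample of 50 wait times can still miss the true variance by a wide margin.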
Additional Exercises for Practice
1) The exponential distribution with $\theta = 10$ has a variance of 100. Use the simulation above to estimate the probability that the sample variance among 50 random observations will be more than 150. Would you consider this probability to be high or low?
2) In the Central Limit Theorem, we saw that as we increased the number of observations in each sample (i.e. the sample size), the distribution of sample means started to converge to a normal distribution. The distribution became narrower around the true population average, $\mu$. Will we see the same phenomenon with the sample variance?
(i) Pick a larger sample size, and repeat the lab activity to simulate the distribution of sample variances for the larger sample size.
(ii) Compare it to the one from the lab with $n = 50$.
(iii) How has the shape of the distribution changed? How has the spread changed? Are sample variances more or less accurate to the population variance of $\sigma^2 = 100$?
3) Suppose the response time $R$ is uniformly distributed on the interval $(0, b)$. If $b$ is unknown, a natural way to estimate it is to collect data on random wait times and use the longest wait time in the data as an estimate of $b$. Let's simulate this to see how well the maximum value of a data set behaves!
Here, the sample statistic is the maximum value of a random sample. That is, $T = \max(R_1, R_2, \dots, R_n)$. To be able to produce data to see how the maximum data point behaves, let's suppose $b = 8$. Using a sample size of 20 and a simulation size of 10,000, produce a simulated distribution of $T$, the random maximum value of a random sample.
#Initial Set-Up
# Simulation
# Plot Simulation
(a) How does the shape of the distribution of maximum values compare with the shape of the distribution of response times?
(b) Based on the simulated distribution, do you think the maximum of a data set is good at estimating the maximum wait time of 8 here?
(c) Does $T$ tend to vary far from the target value of 8?
(d) Based on your simulated data, does it look like $T$ will on average estimate $b$ correctly based on a sample of size 20?
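One possible way to fill in the three placeholder steps of exercise 3 is sketched below. It is an illustration only, not the lab's official solution: b = 8, n = 20 and B = 10,000 come from the exercise, while the object names and the use of base R's hist() instead of ggplot are choices made here.

```r
# Sketch of exercise 3: sampling distribution of T = max(R_1, ..., R_n),
# with R ~ Uniform(0, b). The object names are illustrative.

# Initial set-up
set.seed(1206)
b <- 8      # upper end of the uniform interval (supposed known for the simulation)
n <- 20     # sample size
B <- 10000  # simulation size

# Simulation: B samples of size n, keeping the maximum of each
sim_max <- replicate(B, max(runif(n, min = 0, max = b)))

# Plot simulation: density histogram of the simulated maxima
hist(sim_max, breaks = 50, freq = FALSE,
     main = "Simulated Distribution of the Sample Maximum, n = 20",
     xlab = "Maximum Response Time")
abline(v = b, col = "red", lwd = 2)  # target value b

mean(sim_max)  # average of the simulated maxima, to compare with b
```

Since no observation can exceed b, every simulated maximum lies below the red line, which is worth keeping in mind when answering (b) through (d).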