R Lab #4
Kenji Tan
December 6, 2022

Submission Instructions

1. At the end of this R Lab, submit BOTH your .rmd AND .pdf files.
2. This R Lab will be marked based on completion out of 3 points:
   - Were the exercises covered during your R Lab section completed in your file? Yes/Partially/No (1/0.5/0)
   - Did the student answer all lab observation questions? Yes/Partially/No (1/0.5/0)
   - Did the student remove eval = F from completed code chunks so that output is properly displayed in the .pdf file? Yes/No (0.5/0)
   - Were BOTH the completed .rmd AND corresponding .pdf files submitted? Yes/No (0.5/0)

In this R Lab, we will rely on simulation to:

1) Examine differences between a population distribution (i.e. the distribution of a random variable) and a sampling distribution (the distribution of sample statistics computed from samples drawn from a distribution)
2) Observe the effect of sample size on the convergence of the sampling distribution of the sample mean to the normal distribution
3) Use simulation to explore other sampling distributions where the CLT cannot be applied

Tools and Skills Covered in this R Lab

- Simulate sampling distributions in R by simulating a large number of random samples of size n from a population
- Use simulation to examine and compute probabilities from sampling distributions where the CLT cannot be applied

Loading Necessary Packages

```r
library(tidyverse)
library(latex2exp)
```

Recap: Central Limit Theorem (CLT)

The central limit theorem states that for a large enough sample size, the sample mean from random samples of the population will have a distribution that is approximately normal. This is true even if the distribution of the population we are sampling from is not normal. From the central limit theorem, we can define the following properties:

- The expected value (mean) of the population we are sampling from ($\mu$) will be the same as the expected value (long-run average) of the sample mean ($\mu_{\bar{X}}$). That is, $E(X) = E(\bar{X})$, where $X$ denotes the random variable with the population distribution.
- The variance of the sample means will be smaller than the variance of the population by a factor of $n$. That is, $\sigma^2_{\bar{X}} = \sigma^2 / n$. Be sure to differentiate $\sigma^2$, which describes the variability between individual outcomes in a population (i.e. the variance of $X$ from outcome to outcome), from $\sigma^2_{\bar{X}}$, which describes the variability of one sample mean from the next (i.e. how variable sample means are from each other).
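As a quick numeric illustration of these two properties, here is a minimal sketch; the Uniform(0, 1) population, the seed, the sample size of 30, and the simulation size of 10,000 are arbitrary choices for illustration, not part of the lab:

```r
# Minimal sketch (arbitrary setup): check E(X-bar) = mu and Var(X-bar) = sigma^2 / n.
# Population: Uniform(0, 1), so mu = 0.5 and sigma^2 = 1/12.
set.seed(1)
n <- 30
xbars <- replicate(10000, mean(runif(n)))  # 10,000 simulated sample means of size n
mean(xbars)  # should be close to mu = 0.5
var(xbars)   # should be close to sigma^2 / n = (1/12) / 30, about 0.00278
```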
The following example demonstrates how to apply the central limit theorem in R.

Applying the Central Limit Theorem in R

Suppose the time between queries coming in can be modeled by an exponential distribution with an average of 10 queries per hour. Then the exponential density function describes the possible times we can expect to wait between queries, along with a glimpse of which times are likelier than others. Let $X$ denote the random time between queries, measured in hours. Let's plot the distribution of $X$. We can visualize this distribution in two ways:

1) Graph the curve of the exponential probability density function: $f(x) = \frac{1}{10} e^{-x/10}$, $x > 0$ (an average of 10 queries per hour = an average of 1/10 of an hour per query)
2) Simulate a large enough data set from an exponential distribution. The law of large numbers tells us that as we increase the number of simulations, the relative frequencies of outcomes will converge to their probabilities. This means a large enough simulation of data points from the population will show us the approximate distribution of individual wait times.

```r
# Make this example reproducible
set.seed(1206)

# To plot the exact PDF, you can create a table of values of (x, f(x)) and plot
# the points using geom_line().
# To plot the histogram, let's use a simulation of B = 100,000 data points and
# plot the density histogram that shows how individual wait times are distributed.
# Note: all columns in a tibble must be the same length.
B <- 100000
sim_data <- tibble(x = seq(0, 50, length.out = B),
                   fx = 0.1 * exp(-x / 10),
                   time = rexp(B, rate = 1/10))  # rate = lambda; one query per 1/10 of an hr

# Create a density histogram to visualize the distribution of the time between
# queries and overlay it with the probability density function.
ggplot(data = sim_data) +
  geom_histogram(aes(x = time, y = ..density..),               # set the mapping parameters
                 fill = "blue", colour = "black", bins = 70) + # set bins and colour
  geom_line(aes(x = x, y = fx), size = 1.25, colour = "red") +
  labs(title = "Density Histogram and PDF of Exponentially Distributed Time",
       x = "Time (hrs)", y = "Density") +
  theme_light()
```

[Figure: density histogram of the simulated wait times overlaid with the exponential PDF; x-axis Time (hrs), y-axis Density.]
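As a quick check that the simulation matches the exact distribution, the sketch below (assuming the sim_data object from the chunk above) compares simulated relative frequencies with exact exponential probabilities; the cutoff of 12 hours is an arbitrary choice for illustration:

```r
# Sanity check, assuming sim_data from the chunk above.
mean(sim_data$time)                        # long-run average; should be near E(X) = 10
mean(sim_data$time > 12)                   # simulated relative frequency of waits over 12 hrs
pexp(12, rate = 1/10, lower.tail = FALSE)  # exact P(X > 12) = e^(-1.2), about 0.301
```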
Revisiting Past Observations: Does the data that was generated appear to follow the exponential distribution that it was drawn from? Yes.

It is important to note here that the data we are sampling are not normally distributed. When we have a large sample of observations from the population (in this case, we generated individual observations $X_1, X_2, \ldots, X_{100000}$), plotting the data allows us to approximate the distribution of the individual wait times. We've used this strategy to plot simulated distributions all term: larger simulations = more data to paint a more accurate picture of the distribution and/or probabilities!

Let's apply this same concept to simulate sampling distributions:

Simulated Distributions of $\bar{X}_n$ for Small and Large $n$

1) To generate an approximate sampling distribution, we need a large enough set of observations of the sample statistic (more observations = a more accurate histogram relative to the sampling distribution density)
2) To do this, we will first decide on the sample size to be used for the sample statistic, and on the simulation size

Sample size (n): the number of observations that comprises a single sample from the population. The sample statistic $T$ is computed from this random sample. Each $T_i = T(X_1, X_2, \ldots, X_n)$.

Simulation size (B): the number of times we repeat this random sampling. This is the number of sample statistics we will create. We will end up with $T_1, T_2, \ldots, T_B$ observations of the sample statistic.
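To make the distinction between n and B concrete, here is a toy sketch; the tiny numbers are arbitrary and purely illustrative. Each call to rexp(n, ...) draws ONE sample of size n, and repeating it B times yields B observations of the sample statistic:

```r
# Toy illustration of n (sample size) vs B (simulation size); numbers are arbitrary.
n <- 5
B <- 3
replicate(B, mean(rexp(n, rate = 1/10)))  # returns B = 3 simulated sample means
```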
```r
# Setting up initial objects
n5.mean  <- c()  # Empty vector to store means from samples of size 5
n50.mean <- c()  # Empty vector to store means from samples of size 50
B  <- 10000      # Simulation size
n1 <- 5          # Size of first sample
n2 <- 50         # Size of second sample

# For each simulation from 1 to B, take a random sample of size 5 and of size 50
# from the population. For each random sample, compute the sample statistic and
# store it. Then generate a histogram of the sample statistics to visualize
# (approximately) how the sample statistic values are distributed.
for (i in 1:B){                  # Repeat the process B times
  sample1 <- rexp(n1, 1/10)      # Sample 5 wait times
  n5.mean[i] <- mean(sample1)    # Store the mean of these values
  sample2 <- rexp(n2, 1/10)      # Sample 50 wait times
  n50.mean[i] <- mean(sample2)   # Store the mean of these values
}
sim.means <- tibble(n5.mean, n50.mean)  # Store everything in a tibble

# Create a histogram to visualize the sampling distribution of the sample means
ggplot(data = sim.means) +
  geom_histogram(aes(x = n5.mean, y = ..density..),
                 fill = 'blue', color = 'black', binwidth = 0.51,
                 alpha = 0.5) +  # transparency parameter
  geom_histogram(aes(x = n50.mean, y = ..density..),
                 fill = 'red', color = 'black', binwidth = 0.51, alpha = 0.5) +
  labs(title = TeX(r'(Simulated Distribution of $\bar{X}_5$ and $\bar{X}_{50}$)'),
       subtitle = TeX(r'(Blue = $\bar{X}_5$, Red = $\bar{X}_{50}$)'),
       x = TeX(r'($\bar{X}$, Average Time between Queries (hrs))'),
       y = 'Density') +
  theme_light()
```

[Figure: overlapping density histograms of $\bar{X}_5$ (blue) and $\bar{X}_{50}$ (red); x-axis $\bar{X}$, Average Time between Queries (hrs), y-axis Density.]

Observations

1) Do the distributions of the sample means at sample size n = 5 or n = 50 appear to resemble the normal distribution? Note that this is just a visual inspection, and we may not be able to concretely conclude a normal distribution.

The n = 50 distribution appears to resemble a normal distribution more closely, as its graph is more symmetrical, with clearer symmetry around the mean at 10. For n = 5, the distribution is clearly skewed to the right and thus does not resemble a normal distribution at all.

2) How has the spread of the sampling distribution of $\bar{X}_n$ changed as we increased the sample size n?

The spread has decreased as n has increased; a quick numeric check of this follows below.
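One way to quantify the change in spread is to compare the simulated standard deviations with the CLT prediction $\sigma/\sqrt{n}$. A minimal sketch, assuming the sim.means tibble from above (for Exp(rate = 1/10), $\sigma = 10$):

```r
# The CLT predicts SD(X-bar_n) = sigma / sqrt(n) = 10 / sqrt(n) for these wait times.
sd(sim.means$n5.mean)   # should be near 10 / sqrt(5),  about 4.47
sd(sim.means$n50.mean)  # should be near 10 / sqrt(50), about 1.41
```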
3) Let's compare with some numerics:

(i) the probability that $\bar{X} \geq 12$ using the simulated data
(ii) the probability that $\bar{X} \geq 12$ using a normal approximation (whether appropriate or not)

In class, we showed that $E(\bar{X}) = \mu_X$ and $V(\bar{X}) = \sigma^2_X / n$, which we will use as the parameters of the normal distribution. That is, let's see how these simulated probabilities compare with those using $N(10, 100/5)$ and $N(10, 100/50)$.

```r
# Estimate probabilities using the relative frequency of sample means of at
# least 12 among the simulations
est.p5  <- sum(sim.means$n5.mean  >= 12) / B  # Proportion of simulations with a value of at least 12
est.p50 <- sum(sim.means$n50.mean >= 12) / B

# Probabilities if we approximated with a normal distribution
norm.p5  <- pnorm(12, 10, sd = sqrt(100/5),  lower.tail = F)
norm.p50 <- pnorm(12, 10, sd = sqrt(100/50), lower.tail = F)
```

| Measure      | $P(\bar{X}_5 \geq 12)$ | $P(\bar{X}_{50} \geq 12)$ |
|--------------|------------------------|---------------------------|
| Sample Size  | 5                      | 50                        |
| Simulated    | 0.2876                 | 0.0819                    |
| Using Normal | 0.3273604              | 0.0786496                 |

4) Based on the estimated probabilities, would it be wise to use a normal distribution to calculate probabilities involving sample means for both sample sizes?

Not for both sample sizes. It would be more appropriate for n = 50, as that sampling distribution more closely resembles a normal distribution. For n = 5, however, it would not be appropriate, as that distribution does not resemble a normal one.

What about other sample statistics?

The Central Limit Theorem provides us with an approximate sampling distribution for sample means only, and even then only when certain conditions are met. Consider the sampling distribution of the sample variance. Much like the sample mean:

- Every random sample of size n from a population or distribution is random
- Functions of random variables are also random variables with their own distributions
- Sample variance is a random variable with a distribution that is unknown and not straightforward to derive

Sample variance is computed from a random sample $X_1, X_2, \ldots, X_n$ as:

$$S_n^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
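As a small check that this formula matches R's built-in var(), here is a minimal sketch; the numbers below are an arbitrary illustrative sample, not lab data:

```r
# The hand-computed sample variance formula agrees with the built-in var().
x <- c(4.2, 11.7, 6.3, 25.1, 9.8)        # arbitrary illustrative sample
sum((x - mean(x))^2) / (length(x) - 1)   # S_n^2 computed from the formula
var(x)                                   # built-in equivalent; identical result
```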
Let's simulate the sampling distribution of this statistic using samples of 50 wait times and 10,000 simulations. With a large enough number of simulations, we should achieve a pretty good approximation of the actual sampling distribution of $S_n^2$:

```r
# warning = F suppresses the printing of warning messages.
# In the ggplot below, we restrict the plot to an interval on the x and y axes,
# which forces R to omit some data from being plotted; that omission is what
# generates the warning message.

# Declare initial objects:
B <- 10000       # Simulation size
n <- 50
sim.var <- c()   # Empty vector to store values

for (i in 1:B){
  sample <- rexp(n, 0.1)     # Generate random waiting times
  sim.var[i] <- var(sample)  # Store the variance in the empty vector
}

# Store everything in a tibble
sim.data2 <- tibble(x = seq(0, 400, length.out = B),  # Values of x
                    fx = 0.1 * exp(-x / 10),          # PDF evaluated at those values
                    sim.var = sim.var)

# Plot the simulated distribution of the variance, compared with the
# probability density of the wait times
ggplot(sim.data2) +
  geom_histogram(aes(x = sim.var, y = ..density..),
                 fill = 'cadetblue3', color = 'black', binwidth = 5) +
  geom_line(aes(x = x, y = fx), color = 'orange') +               # Our PDF
  geom_vline(aes(xintercept = 100), size = 1.2, color = 'red') +  # Locate the population variance
  scale_y_continuous(limits = c(0, 0.05)) +
  scale_x_continuous(limits = c(0, 350)) +
  labs(x = 'Sample Variance of Wait Times (hrs-squared)',
       y = 'Density',
       subtitle = 'Overlaid with Population Distribution: Exp(10)',
       title = 'Simulated Distribution of Sample Variance, n = 50') +
  theme_light()
```

[Figure: density histogram of the simulated sample variances, overlaid with the Exp(10) PDF (orange) and a red vertical line at the population variance of 100; x-axis Sample Variance of Wait Times (hrs-squared), y-axis Density.]

Observations:

1) Does the sampling distribution of the variance resemble the distribution of the population? What are the common outcomes of $X$ versus the common sample variance values?

No, it does not resemble the distribution of the population. Common outcomes of $X$ are short wait times near 0 (with a mean of 10), while common sample variance values cluster around 100.

2) Does the distribution of sample variances seem to resemble a normal distribution?

It does not seem to resemble a normal distribution, because it is skewed to the right.

3) Can the CLT be applied to approximate distributions of sample variances? Why or why not?

No. The CLT gives an approximate sampling distribution for sample means only, so it cannot be applied directly to sample variances.
4) Based on the simulation, does it appear that the sample variance is an accurate estimate of the population variance? i.e., is the variance in a random sample of 50 wait times representative of the true variance in the population?

No. With n = 50, the simulated sample variances are spread widely around the true value of 100, so the variance of a single random sample may land far from the population variance.

Additional Exercises for Practice

1) The exponential distribution with $\theta = 10$ has a variance of 100. Use the simulation above to estimate the probability that the sample variance among 50 random observations will be more than 150. Would you consider this probability to be high or low?

2) In the Central Limit Theorem, we saw that as we increased the number of observations in each sample (i.e. the sample size), the distribution of sample means started to converge to a normal distribution. The distribution became narrower around the true population average, $\mu$. Will we see the same phenomenon with the sample variance?

(i) Pick a larger sample size, and repeat the lab activity to simulate the distribution of sample variances for the larger sample size.
(ii) Compare it to the one from the lab with n = 50.
(iii) How has the shape of the distribution changed? How has the spread? Are sample variances more or less accurate to the population variance of $\sigma^2 = 100$?

3) Suppose the response time R is uniformly distributed on the interval (0, b). If b is unknown, a natural way to try to estimate it is by collecting data of random wait times and using the longest wait time in the data as an estimate of b. Let's simulate this to see how well the maximum value of a data set behaves! Here, the sample statistic is the maximum value of a random sample. That is, $T = \max(R_1, R_2, \ldots, R_n)$. To be able to produce data to see how the maximum data point behaves, let's suppose b = 8. Using a sample size of 20 and a simulation size of 10,000, produce a simulated distribution of T, the random maximum value of a random sample.

```r
# Initial Set-Up

# Simulation

# Plot Simulation
```

(a) How does the shape of the distribution of maximum values compare with the shape of the response times?
(b) Based on the simulated distribution, do you think the maximum of a data set is good at estimating the maximum wait time of 8 here?
(c) Does T tend to vary far from the target value of 8?
(d) Based on your simulated data, does it look like T will on average estimate b correctly based on a sample of size 20?
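If it helps to get started on Exercise 3, here is one possible (not official) way to fill in the skeleton above; the seed, bin count, and plotting choices are arbitrary:

```r
# One possible sketch for Exercise 3, using b = 8, n = 20, B = 10,000 as stated.
set.seed(1206)  # arbitrary seed for reproducibility
B <- 10000      # Simulation size
n <- 20         # Sample size
b <- 8          # True upper endpoint of the Uniform(0, b) response times
sim.max <- replicate(B, max(runif(n, min = 0, max = b)))  # T = max of each random sample

ggplot(tibble(sim.max = sim.max)) +
  geom_histogram(aes(x = sim.max, y = ..density..),
                 fill = 'cadetblue3', color = 'black', bins = 50) +
  geom_vline(xintercept = b, size = 1.2, color = 'red') +  # target value b = 8
  labs(title = 'Simulated Distribution of the Sample Maximum, n = 20',
       x = 'Sample Maximum T', y = 'Density') +
  theme_light()
```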