R Lab #2 (PM)
Kenji Tan
2022-10-25

Submission Instructions

1. At the end of this R Lab, submit BOTH your .rmd AND .pdf files.
2. This R Lab will be marked based on completion out of 3 points:
   - Were the exercises covered during your R Lab section completed in your file? Yes/Partially/No (1/0.5/0)
   - Did the student include comments of their own, beyond those already provided in the skeleton, to refer back to for syntax usage? Yes/No (1/0)
   - Did the student remove eval = F from completed code chunks so that output is properly displayed in the .pdf file? Yes/No (0.5/0)
   - Were BOTH the completed .rmd AND the corresponding .pdf files submitted? Yes/No (0.5/0)

In this R Lab, we will use simulation to examine some common misconceptions about probability distributions. We will continue to use set.seed() to ensure results are reproducible. We begin with simulations involving discrete distributions such as the Poisson and Binomial distributions; some problems involving simple continuous distributions are left as take-home practice.

This R Lab consists of two in-lab parts and three take-home practice problems:

1) Examining the effect of summing two Poisson distributed random variables
2) Examining the effect of taking the ratio of two Poisson distributed random variables
3) For take-home practice: Use simulation to compare how well a $\text{Pois}(\lambda = np)$ distribution can approximate a $\text{Bin}(n, p)$ distribution when $n$ is large and $p$ is small.
4) For take-home practice: Use simulation to examine the distribution of the sum of two Binomial distributed random variables when trial sizes are different
5) For take-home practice: Use simulation to examine the distribution of the sum of two Binomial distributed random variables when success probabilities are different

Tools and Skills Covered in this R Lab

- Differentiate between plotting simulated data and plotting the exact probability mass function
- Adding columns to a tibble easily
- Build on our `ggplot2` skills with the addition of `geom_line()` and `geom_point()` layers
- New package alert! Begin the tutorial by installing `latex2exp` in your console (never include package installations in your R Markdown file: your file will fail to knit!)
- Use `latex2exp` to build mathematical labels into your plots for clearer communication

Installing and Loading Necessary Packages

Step 1: Install the latex2exp package in your console by running install.packages('latex2exp').
    # message = F omits any package loading output from being printed in the pdf
    library(tidyverse)
    library(latex2exp)

Examine: Sum of Two Poisson RVs

Recall the mechanic problem from last class: The number of repairs, $R$, fulfilled by a local mechanic each day follows a Poisson distribution with a mean of $3t$, where $t$ denotes the number of hours the mechanic is open on any particular day.

Let's expand on this problem by considering the following situation: On Mondays, the mechanic is open for 10 hours while on Tuesdays, the mechanic is open for 8 hours. On average, how many repairs can the mechanic expect to fulfill on Monday and Tuesday combined?

1) What properties can you use to find the average number of repairs the mechanic can expect to fulfill on Mondays and Tuesdays each week?
2) Use this property to find the average over the two days. What distribution do you think the combined total will follow?

Let $R_1$ represent the random number of repairs that arrive on Mondays and $R_2$ the random number of repairs that arrive on Tuesdays. We are given that each $R_i \sim \text{Pois}(3t)$, with $R_1 \sim \text{Pois}(30)$ and $R_2 \sim \text{Pois}(24)$. We are interested in how $R_1 + R_2$ behaves. To find the average of this sum, we can use properties of expected values:

$$
\begin{aligned}
E(R_1) &= \lambda_1 = 30 \\
E(R_2) &= \lambda_2 = 24 \\
E(R_1 + R_2) &= E(R_1) + E(R_2) \\
E(R_1 + R_2) &= 30 + 24 = 54
\end{aligned}
$$

It's possible that the total number of repairs over Monday and Tuesday, $T = R_1 + R_2$, can be modeled with $T \sim \text{Pois}(\lambda_1 + \lambda_2 = 54)$ (remember that the parameter $\lambda$ for the Poisson distribution is the same as the mean of the RV!). One of the assumptions of the Poisson distribution is that each period behaves independently and that the rate can be scaled up or down for the interval of interest. By the same principle, we could argue that the number of repairs over the two days can be modeled with $T \sim \text{Pois}(54)$. Let's verify this!

Sums of Two Poisson RVs via Simulation

Steps:

1. Simulate random daily repairs for Monday and for Tuesday. Choose a large number of simulations.
2. Compute the sums of the randomly generated daily repairs. These represent the random sums that could occur over the two-day interval.
3. Plot a relative frequency histogram of the sum of the two repairs (this is a density histogram with a bin width of 1, since Density x Bin Width = Relative Frequency (or Probability Mass)).
4. Add to your tibble a "table of values" of possible outcomes of $T \sim \text{Pois}(54)$ and the corresponding probability mass. You can then use this to add a graph of the exact probability mass function on top of the density histogram in (3).
5. Compare the graph of the simulated distribution of sums with the hypothesized distribution ($\text{Poisson}(\lambda = 54)$) and judge visually whether the hypothesized distribution seems like the correct model for the distribution of sums of two Poisson random variables that have different $\lambda$.

    sim.size = 10000 # larger simulation size = more accurate snapshot of the distribution
    lambda1 = 30     # parameters defined above
    lambda2 = 24
    set.seed(1025)

    weekday <- tibble(
      r1 = rpois(sim.size, lambda1), # take a sample of size sim.size from a Pois(lambda1)
      r2 = rpois(sim.size, lambda2), # sample from a Pois(24)
      sum = r1 + r2
    )

Plotting the simulations: Remember to set your figure alignment and figure dimensions in your R chunk! Reminder that fig.dim = c(width, height).

    hist <- ggplot(weekday) +
      geom_histogram(aes(x = sum, y = ..density..),
                     binwidth = 1, fill = 'thistle2', colour = 'black') +
      scale_x_continuous(breaks = seq(0, 180, by = 20)) +
      labs(x = 'Total Repairs on Mon-Tues',
           y = 'Relative Frequency',
           title = TeX(r'(Simulated Histogram of $R_1 + R_2, \, R_1 \sim Pois(30), \, R_2 \sim Pois(24)$)')) +
      # How do we use latex2exp in the title to display clearer labels?
      theme_light()
    # run 'hist' in console to view the plot

From the first assignment, one way to get the sum of your outcomes was to recreate your tibble to include the new data. You can also use the tibble naming system (the $ operator) to introduce new columns of data:

    weekday$t <- rpois(sim.size, lambda1 + lambda2)
    # Adds a new column with random values from a Poisson with parameter lambda1 + lambda2
    weekday$pmf <- dpois(weekday$t, lambda1 + lambda2)
    # dpois computes P(T = t) for the values provided in weekday$t

    hist +
      geom_line(data = weekday, aes(x = t, y = pmf), size = 1.25, colour = 'blue') +
      labs(subtitle = TeX(r'(Overlay with smoothed PMF of $Pois(\lambda = 54)$)'))

[Figure: Simulated histogram of $R_1 + R_2$, with $R_1 \sim \text{Pois}(30)$ and $R_2 \sim \text{Pois}(24)$, overlaid with the smoothed PMF of $\text{Pois}(\lambda = 54)$. x-axis: Total Repairs on Mon-Tues; y-axis: Relative Frequency.]
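The visual comparison above can be complemented with a quick numerical check. The following is a minimal sketch in base R, not part of the lab skeleton: the names `total`, `support`, `rel.freq`, `exact.pmf` and `max.abs.diff` are illustrative assumptions. It uses the fact that a Poisson RV has mean and variance both equal to $\lambda$, and evaluates the exact PMF on a fixed grid of outcomes rather than on randomly generated values.

```r
# Sketch only: variable names here are illustrative, not from the lab skeleton
set.seed(1025)
sim.size <- 10000
lambda1 <- 30
lambda2 <- 24

r1 <- rpois(sim.size, lambda1)
r2 <- rpois(sim.size, lambda2)
total <- r1 + r2

# For a Poisson RV the mean and the variance both equal lambda,
# so both summaries should be near lambda1 + lambda2 = 54
sim.mean <- mean(total)
sim.var <- var(total)

# Empirical relative frequency of every integer between the smallest
# and largest simulated total (zero counts included)
support <- min(total):max(total)
rel.freq <- as.numeric(table(factor(total, levels = support))) / sim.size

# Exact Pois(54) probability mass on the same grid of outcomes
exact.pmf <- dpois(support, lambda1 + lambda2)

# Largest gap between simulated and exact probability mass
max.abs.diff <- max(abs(rel.freq - exact.pmf))
print(c(mean = sim.mean, variance = sim.var, max.abs.diff = max.abs.diff))
```

A small `max.abs.diff` together with a mean and variance both close to 54 is consistent with the hypothesized $\text{Pois}(54)$ model, although agreement in a simulation is evidence rather than proof.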
Examine: Ratio of Two Poisson RVs

It looks like on Mondays, because the mechanic is open for longer hours, they are likelier to see a greater number of repairs than on Tuesdays. Let's investigate the distribution of $M = \frac{R_1}{R_2}$, the ratio of repairs on Mondays to repairs on Tuesdays. If $M = 1.5$, then Monday had 50% more repairs than Tuesday, while $M = 0.8$ means Monday had 20% fewer repairs than Tuesday.

1) Can the same properties be used to find the average ratio of repairs (Monday:Tuesday)?
2) If so, use this to find the average ratio of repairs of Monday compared to Tuesday. What distribution do you think the ratio of the two days might follow?

Ratio of Two Poisson RVs via Simulation

Steps:

1. Using the data in the tibble from the first exercise, compute the corresponding ratio of Monday repairs to Tuesday repairs. These represent the random ratios that could occur over the two-day interval.
2. Plot a density histogram of the ratio of the two days (recall that in a density histogram: Density x Bin Width = Relative Frequency (or Probability Mass) of Bin). Since we are mapping ratios, we will use a narrower binwidth, with bins measuring 0.25 units wide.

    weekday$ratio <- weekday$r1 / weekday$r2 # calculating ratio of repairs
    # weekday should have 6 columns now

    hist2 <- ggplot(weekday) +
      geom_histogram(aes(x = ratio, y = ..density..),
                     binwidth = 0.25,
                     # comment on choice of binwidth: why don't we use 1 here? (too wide for ratios)
                     fill = 'thistle2', colour = 'black') +
      scale_x_continuous(breaks = seq(0, 5, by = 0.5)) +
      labs(x = 'Ratio of Repairs (Mon:Tues)',
           y = 'Density',
           title = TeX(r'(Simulated Histogram of $\frac{R_1}{R_2}$)'),
           subtitle = TeX(r'($R_1 \sim Pois(\lambda = 30), \, R_2 \sim Pois(\lambda = 24)$)')) +
      # What is the latex2exp structure to include titles that have embedded latex?
      theme_light()
    hist2

[Figure: Simulated histogram of $\frac{R_1}{R_2}$, with $R_1 \sim \text{Pois}(\lambda = 30)$ and $R_2 \sim \text{Pois}(\lambda = 24)$. x-axis: Ratio of Repairs (Mon:Tues); y-axis: Density.]
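The simulated ratios can also be summarised numerically. The sketch below is an illustration in base R under the same parameters as the lab; the names `ratio.of.means`, `mean.of.ratios` and `prop.below.one` are assumptions introduced here, not part of the skeleton. It compares the average of the simulated ratios with the ratio of the two means and estimates how often Monday has fewer repairs than Tuesday.

```r
# Sketch only: variable names here are illustrative, not from the lab skeleton
set.seed(1025)
sim.size <- 10000
lambda1 <- 30
lambda2 <- 24

r1 <- rpois(sim.size, lambda1)
r2 <- rpois(sim.size, lambda2)
ratio <- r1 / r2

# Drop any non-finite ratios; R2 = 0 has probability exp(-24),
# so this filter is expected to remove nothing in practice
ratio <- ratio[is.finite(ratio)]

ratio.of.means <- lambda1 / lambda2   # E(R1) / E(R2) = 1.25
mean.of.ratios <- mean(ratio)         # simulation estimate of E(R1 / R2)
prop.below.one <- mean(ratio < 1)     # share of simulations with R1 < R2

print(c(ratio.of.means = ratio.of.means,
        mean.of.ratios = mean.of.ratios,
        prop.below.one = prop.below.one))
print(quantile(ratio, c(0.025, 0.5, 0.975)))
```

One way to read the result: because $1/x$ is convex, $E(1/R_2)$ exceeds $1/E(R_2)$ whenever $R_2$ is positive, so for independent $R_1$ and $R_2$ the average ratio tends to sit above $E(R_1)/E(R_2)$. This ignores the negligible chance that $R_2 = 0$.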
Observations:

1. The support of $R_1 + R_2$ is all integers $\geq 0$, while it appears that the support of $\frac{R_1}{R_2}$ is on the interval (0.5, 5). It seems that most of the time, $R_1 \geq R_2$. Looking at the distribution, we can see that $R_1$ is typically 1 to 1.5 times $R_2$, meaning that Mondays will typically have up to 50% more repairs than are received on the corresponding Tuesday. Less often, $R_1 < R_2$, and in those instances $R_1$ is usually only about 25% lower.

2. TRUE or FALSE: Sums of Poisson random variables, even with different average rates, appear to remain Poisson distributed. That is, if $X \sim \text{Pois}(\lambda_1)$ and $Y \sim \text{Pois}(\lambda_2)$ are independent, then $X + Y \sim \text{Pois}(\lambda_1 + \lambda_2)$.

   TRUE

3. TRUE or FALSE: Show using your simulation whether $E\left(\frac{R_1}{R_2}\right) = \frac{E(R_1)}{E(R_2)}$.

   $$\frac{E(R_1)}{E(R_2)} = \frac{30}{24} = 1.25$$

   $$E\left(\frac{R_1}{R_2}\right) \approx \frac{\text{sum(ratios)}}{\text{simsize}} = 1.3065508$$

   FALSE

4. TRUE or FALSE: Ratios of Poisson random variables, even with different average rates, appear to remain Poisson distributed. That is, if $X \sim \text{Pois}(\lambda_1)$ and $Y \sim \text{Pois}(\lambda_2)$, then $\frac{X}{Y} \sim \text{Pois}\left(\frac{\lambda_1}{\lambda_2}\right)$.

As an exercise, if you think that $\frac{R_1}{R_2}$ should follow a Poisson distribution,

1) Estimate the $\lambda$ or mean parameter of $\frac{R_1}{R_2}$
2) Overlay the graph of the $\text{Pois}(\lambda)$ PMF over your histogram of $\frac{R_1}{R_2}$, similar to the method from the first exercise for sums. To do this, add a new table of values to your tibble by:
   1. generating 10,000 Poisson observations using your estimated $\lambda$, and then
   2. calculating the corresponding PMF values for those observations.

   You can then plot the table of values over your histogram. Be extra mindful to adjust the histogram of $\frac{R_1}{R_2}$ to have a binwidth of 1 so everything is measured as an estimate of probability mass.
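The binwidth remark rests on the identity Density x Bin Width = Relative Frequency. The sketch below illustrates it with base R's `hist()` as a stand-in for `geom_histogram()`, whose bins may be positioned differently; `bin_summary`, `quarter.bins` and `unit.bins` are hypothetical names introduced only for this illustration.

```r
# Sketch only: bin_summary() is a hypothetical helper, not part of the lab
set.seed(1025)
sim.size <- 10000
ratio <- rpois(sim.size, 30) / rpois(sim.size, 24)
ratio <- ratio[is.finite(ratio)]  # guard against the very unlikely case R2 = 0

# Bin x into equal-width bins starting at 0 and report, for each bin,
# its midpoint, its density height and its relative frequency
bin_summary <- function(x, width) {
  breaks <- seq(0, ceiling(max(x) / width) * width, by = width)
  h <- hist(x, breaks = breaks, plot = FALSE)
  data.frame(mid = h$mids,
             density = h$density,
             rel.freq = h$counts / length(x))
}

quarter.bins <- bin_summary(ratio, 0.25)  # binwidth used for the ratio histogram
unit.bins <- bin_summary(ratio, 1)        # binwidth suggested for the PMF overlay

# With binwidth 0.25 the density height is 4 times the relative frequency;
# with binwidth 1 the two columns coincide
print(quarter.bins)
print(unit.bins)
```

With a binwidth of 1 the bar heights can be read directly as estimated probability mass, which is why the overlay of PMF values is only comparable on that scale.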
    # Skeleton: fill in the blanks and the ... placeholders, then remove eval = F
    lambda.est <-
    weekday$m <-
    weekday$mpmf <- dpois(weekday$m, lambda.est)

    ggplot(...) +
      geom_histogram(aes(...), binwidth = ..., fill = 'turquoise1', colour = 'black') +
      geom_point(aes(...), size = 1.25, colour = 'red') +
      scale_x_continuous(breaks = seq(0, 8, 0.5)) +
      labs(x = 'Ratio of Repairs (Mon:Tues)',
           y = 'Density',
           title = TeX(r'(Simulated Histogram of $\frac{R_1}{R_2}$)'),
           subtitle = TeX((paste0('Compared with ', r'($Pois(\lambda = $)',
                                  round(lambda.est, 2), ')')))) +
      theme_light()

Additional Exercises for Practice

1. $\text{Bin}(n, p)$ versus $\text{Pois}(\lambda = np)$

Last week, we derived the Poisson probability mass function, which models the discrete number of occurrences over a continuous interval. We did this by arbitrarily dividing the continuous interval into $n$ subintervals while scaling the average rate of occurrence $\lambda$ to the size of each subinterval. When the subintervals become small enough, we recreate a scenario in which a Binomial distribution applies: $n$ 'trials', each with constant probability of success $p = \frac{\lambda}{n}$, and we observe the PMF of this binomial distribution as $n \to \infty$. This suggests that when $n$ is very large and $p$ is very small, binomial probabilities can instead be approximated by a Poisson probability mass function with $\lambda = np$. This is especially helpful since the Poisson PMF is easier to compute because it doesn't need to find nCx, which can be significantly more memory-intensive for large $n$. For now, let's use the available tools in our skill set to demonstrate this:

Steps:

1. Pick an appropriate number for the simulation size (larger = better accuracy), a large number for $n$ and a small value for $p$.
2. Simulate many data points from $\text{Bin}(n, p)$ and store them in a column called 'binomial'. Repeat the simulation with $\text{Pois}(\lambda = np)$ and store it in a column called 'poisson'. The goal here is to see whether the occurrence of outcomes is similar enough between these two distributions.
3. Plot the two histograms, adjusting opacity so both are visible, and compare them. Do they appear similar enough that we can reasonably use one PMF in place of the other?

    # Skeleton: fill in the blanks and the ... placeholders
    sim.size =
    n =
    p =
    sim.data <- tibble(
      binomial = ,
      poisson =
    )

    ggplot(sim.data) +
      geom_histogram(...) +
      geom_histogram(...) +
      labs(x = 'Number of Occurrences',
           y = 'Relative Frequency',
           subtitle = TeX((paste0("Turquoise = Bin(", n, ", ", p, "), Red = Pois(",
                                  r'($\lambda = $)', n*p, ")"))),
           title = 'Comparing Poisson to Binomial with Large n, Small p') +
      # latex2exp is also compatible with paste0, which allows you to embed R variables
      # in your plot titles as well. The above subtitle will update the plot labels
      # for every change in n and p you make above.
      theme_light()

Verify with some calculations: Let $X \sim \text{Bin}(n, p)$ and $Y \sim \text{Pois}(np)$. Find the following probabilities and compute the difference in probabilities between the two distributions. Based on your observations, does it seem reasonable to use the Poisson model to compute Binomial probabilities when $n$ is large and $p$ is small?

a) $P(X \leq 35)$ versus $P(Y \leq 35)$
b) $P(55 \leq X \leq 70)$ versus $P(55 \leq Y \leq 70)$

2. Sums of Binomial Random Variables with Different p

We saw in the first example in the R Lab that the sum of two Poisson distributed quantities, even if they have different rate parameters, will remain Poisson distributed.
That is, if $X \sim \text{Pois}(\lambda_1)$ and $Y \sim \text{Pois}(\lambda_2)$, then $X + Y \sim \text{Pois}(\lambda_1 + \lambda_2)$. Will we see a similar property when we sum two binomially distributed quantities?

Steps:

1. Pick two different trial sizes: $n$ and $m$.
2. Pick two different success probabilities: $p_1$ and $p_2$.
3. Using simulation and histograms of your simulated data, reach a conclusion for each of the following cases:
   (i) Summing two binomial quantities that have the same success probability but different trial sizes. Do you think this sum will still be binomially distributed? Why or why not? If you think it will still have a binomial distribution, what are the parameters of this distribution? Overlay the simulated distribution of sums with the PMF of your hypothesized distribution.
   (ii) Summing two binomial quantities that have the same trial size but different success probabilities. Do you think this sum will still be binomially distributed? Why or why not? If you think it will still have a binomial distribution, what are the parameters of this distribution? Overlay the simulated distribution of sums with the PMF of your hypothesized distribution.

What conclusions can you draw about taking the sum of two binomial RVs:

a) When will it result in another binomial quantity?
b) When will the sum fail to have a binomial distribution?
c) Provide a contextual example (e.g. drawing a card, flipping a coin, etc.) that can demonstrate your conclusions in (a) and (b).
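One possible starting point for cases (i) and (ii) is to compare the simulated mean and variance of each sum with those of a candidate binomial model before drawing any histograms. The sketch below is only an illustration: the values of `n`, `m`, `p1` and `p2` are hypothetical choices, and the candidate models (a binomial on the combined number of trials, with the averaged success probability `p.bar` in case (ii)) are assumptions made here, not answers given by the lab.

```r
# Sketch only: parameter values and candidate models are hypothetical choices
set.seed(1025)
sim.size <- 10000
n <- 20     # hypothetical trial size
m <- 35     # hypothetical second trial size
p1 <- 0.1   # hypothetical success probability
p2 <- 0.9   # hypothetical second success probability

# Case (i): same success probability, different trial sizes
sum.same.p <- rbinom(sim.size, n, p1) + rbinom(sim.size, m, p1)

# Case (ii): same trial size, different success probabilities
sum.diff.p <- rbinom(sim.size, n, p1) + rbinom(sim.size, n, p2)

# Candidate binomial models chosen to match the mean of each sum:
# (i)  Bin(n + m, p1)
# (ii) Bin(2n, p.bar) with p.bar the average of p1 and p2
p.bar <- (p1 + p2) / 2

comparison <- data.frame(
  case = c("same p, different n", "same n, different p"),
  sim.mean = c(mean(sum.same.p), mean(sum.diff.p)),
  candidate.mean = c((n + m) * p1, 2 * n * p.bar),
  sim.var = c(var(sum.same.p), var(sum.diff.p)),
  candidate.var = c((n + m) * p1 * (1 - p1), 2 * n * p.bar * (1 - p.bar))
)
print(comparison)
```

If a candidate's variance is far from the simulated variance even though the means agree, that candidate is a poor model for the sum. Agreement of mean and variance alone does not establish a distribution, so the PMF overlay requested in the steps is still needed.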