HWK5

School

University of Wisconsin, Madison

Course

371

Subject

Statistics

Date

Apr 3, 2024

Uploaded by AmbassadorRaven4190

Stat 371 Homework #5
Rachel Marten

Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.

If an exercise asks you to use R, include a copy of all relevant code and output in your submitted homework file. You can copy/paste your code, take screenshots, or compile your work in an Rmarkdown document.

If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or use manual calculations on your exams, so practice accordingly.

You must include an explanation and/or intermediate calculations for an exercise to be complete.

Be sure to submit the HWK5 Autograde Quiz, which will give you ~20 of your 40 accuracy points.

50 points total: 40 points accuracy and 10 points completion

Sampling Distributions and CLT

Exercise 1. Retail stores experience their heaviest volume of transactions that include returns on December 26th and December 27th each year. The distribution for the Number of Items Returned (X) by Macy's customers who do a return transaction on those days last year is given in the table below. It has mean $\mu = 2.61$ and variance $\sigma^2 = 1.80$.

Number of Items Returned in Transaction (x)    Probability
1                                              0.25
2                                              0.28
3                                              0.20
4                                              0.17
5                                              0.08
6                                              0.02

vals <- c(1, 2, 3, 4, 5, 6)
probs <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)
EV_pop <- sum(vals * probs)
Var_pop <- sum(probs * (vals - EV_pop)^2)

a. Is this population distribution left skewed, symmetric, or right skewed? How do you know?

This population is right skewed. Most of the probability sits on the small values (1 to 3 items account for 0.73 of transactions), which makes the distribution heavy on the left side of the graph and leaves a tail on the right toward 6. The mean of 2.61 is also above the median of 2, which is consistent with a right-skewed distribution.

b. What proportion of returns had three or more items?

47% of returns had three or more items: 0.20 + 0.17 + 0.08 + 0.02 = 0.47.

sum(probs[vals >= 3])
## [1] 0.47

c. Identify which histogram below displays (1) the population X values, (2) the simulated sampling distribution of the sample mean $\bar{X}$, (3) the simulated sampling distribution of the sample total T. Briefly explain how you know.

c. Histogram A corresponds to the simulated sampling distribution of the sample total. It is centered near 117.45, which is the expected total for 45 transactions (45 * 2.61). Histogram B corresponds to the population X values. It shows each possible number of items returned with its approximate proportion. Histogram C corresponds to the simulated sampling distribution of the sample mean. Its center matches the expected value of the sample mean, 2.61.

d. Describe the sampling distribution (shape, mean, and standard deviation) of the sample mean number of items returned in 45 return transactions
$\bar{X} = \frac{X_1 + X_2 + \dots + X_{45}}{45}$ according to theory. Make sure to name any theorems you are using.

d. By the Central Limit Theorem (CLT), the sampling distribution of the sample mean is approximately normal, because the sample size n = 45 is large and the transactions are treated as independent draws from the same population. It has a mean of 2.61 and a standard deviation of 0.2 [sqrt(1.8/45)].

e. What is the probability that the mean number of items returned in the 45 return transactions reviewed will be 3 or more items?

1 - pnorm(3, 2.61, sqrt(1.8 / 45))
## [1] 0.02558806

f. Explain why the value you found in (e) was so much smaller than the value found in (b).

f. Part (b) is about a single transaction, while part (e) is about the average of 45 transactions. Averaging reduces variability: a single transaction has a standard deviation of sqrt(1.8), about 1.34, but the sample mean has a standard deviation of only 0.2. A sample mean of 3 is therefore about 1.95 standard deviations above 2.61, which is far less likely than one return having 3 or more items.

g. Consider the total number of items returned in 45 customer return transactions. Describe the sampling distribution (shape, center, and spread) of the total number of items returned $T = X_1 + X_2 + \dots + X_{45}$. Make sure to name any theorems you are using.

g. By the CLT, the sampling distribution of the total is approximately normal, because the sample size n = 45 is large. It has a mean of 117.45 [2.61 * 45] and a standard deviation of 9 [sqrt(1.8 * 45)].

h. Find an upper bound b such that the total number of items returned in 45 customers' return transactions will be less than b with probability 0.95.

qnorm(0.95, 117.45, 9)
## [1] 132.2537

Interval estimation for a population mean

Exercise 2. Consider the tree data set in R, trees. trees is a data frame object, which contains multiple variables. We can access a specific variable's data by using the $ symbol.

For example:

# The data frame contains 3 columns (vectors)
trees
# This is how to access the "Girth" data specifically
trees$Girth
# We can use this vector in our usual R functions
mean(trees$Girth)
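Looking back at Exercise 1, the theoretical answers in parts (d) through (h) can be checked with a short simulation. The sketch below is not part of the assignment: the seed, the number of replications (`reps`), and the variable names are arbitrary choices made for illustration. It draws many samples of 45 transactions from the population table and compares the simulated mean and total with the CLT values.

```r
set.seed(371)   # arbitrary seed, used only so the run is reproducible

vals  <- c(1, 2, 3, 4, 5, 6)
probs <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)
n     <- 45      # transactions per sample
reps  <- 10000   # number of simulated samples (an assumption)

# Each column of the matrix is one simulated sample of 45 transactions
samples <- matrix(sample(vals, n * reps, replace = TRUE, prob = probs),
                  nrow = n)
totals <- colSums(samples)   # simulated sample totals T
means  <- totals / n         # simulated sample means

c(mean(means), sd(means))     # theory: 2.61 and 0.2
c(mean(totals), sd(totals))   # theory: 117.45 and 9
mean(means >= 3)              # compare with the normal approximation in (e)
quantile(totals, 0.95)        # compare with the bound found in (h)
```

Because the total only takes whole-number values, the simulated probability and quantile will not match the normal approximation exactly, but they should be close.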
a. Construct histograms and qqnorm plots for all three of the quantitative variables recorded on the 31 trees. For which of the three variables do we have the strongest evidence that the population of values may not be well approximated by a normal random variable?

volumes <- trees$Volume
(x_bar <- mean(volumes))
## [1] 30.17097
(s <- sd(volumes))
## [1] 16.43785
(n <- length(volumes))
## [1] 31
par(mfrow = c(1, 2))
hist(volumes)
qqnorm(volumes); qqline(volumes)

Girth <- trees$Girth
(x_bar <- mean(Girth))
## [1] 13.24839
(s <- sd(Girth))
## [1] 3.138139
(n <- length(Girth))
## [1] 31
par(mfrow = c(1, 2))
hist(Girth)
qqnorm(Girth); qqline(Girth)

Height <- trees$Height
(x_bar <- mean(Height))
## [1] 76
(s <- sd(Height))
## [1] 6.371813
(n <- length(Height))
## [1] 31
par(mfrow = c(1, 2))
hist(Height)
qqnorm(Height); qqline(Height)

Volume provides the strongest evidence that the population of values may not be well approximated by a normal random variable.

b. Since n = 31 for each of these variables, we believe the CLT will make $\bar{X}$ approximately normal even for the possibly non-normal populations referenced above. Construct 90% t confidence intervals "by hand" for all three variables in the trees data set. Summaries of the variables are given below and you should use an R function to find the relevant t critical value.

mean(trees$Girth); sd(trees$Girth); length(trees$Girth)
## [1] 13.24839
## [1] 3.138139
## [1] 31
mean(trees$Height); sd(trees$Height); length(trees$Height)
## [1] 76
## [1] 6.371813
## [1] 31
mean(trees$Volume); sd(trees$Volume); length(trees$Volume)
## [1] 30.17097
## [1] 16.43785
## [1] 31

# girth
# standard error
se <- 3.138139 / sqrt(31)
# critical value: a 90% two-sided interval leaves 5% in each tail
t_crit <- qt(0.95, df = 30)
# margin of error
moe <- 1.697261 * 0.5636
1.697261 * 0.5636
## [1] 0.9565763
# upper bound
13.25 + 0.9565763
## [1] 14.20658
# lower bound
13.25 - 0.9565763
## [1] 12.29342

# height
6.371813 / sqrt(31) # se
## [1] 1.144411
qt(0.95, 30) # critical value
## [1] 1.697261
1.697261 * 1.144411 # margin of error
## [1] 1.942364
76 + 1.942364 # upper bound
## [1] 77.94236
76 - 1.942364 # lower bound
## [1] 74.05764

# girth
3.138139 / sqrt(31) # se
## [1] 0.5636264
qt(0.95, 30) # critical value
## [1] 1.697261
1.697261 * 0.5636 # margin of error
## [1] 0.9565763
13.25 + 0.9565763 # upper bound
## [1] 14.20658
13.25 - 0.9565763 # lower bound
## [1] 12.29342

# volume
16.43785 / sqrt(31) # se
## [1] 2.952325
qt(0.95, 30) # critical value
## [1] 1.697261
1.697261 * 2.95 # margin of error
## [1] 5.00692
30.17 + 5.00692 # upper bound
## [1] 35.17692
30.17 - 5.00692 # lower bound
## [1] 25.16308

c. Construct the same confidence intervals that you constructed in (b) above using the t.test() command in R. Confirm that you get very similar endpoints.

t.test(trees, conf.level = 0.9)
##
##  One Sample t-test
##
## data:  trees
## t = 13.447, df = 92, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  34.8879 44.7250
## sample estimates:
## mean of x
##  39.80645

Passing the whole data frame pools all 93 measurements from the three columns (df = 92), so this interval does not correspond to any of the three intervals in (b). To reproduce those intervals, t.test() has to be run on one column at a time, for example t.test(trees$Girth, conf.level = 0.9).

d. Suppose this data came from 31 trees cut down by a single logger. How does that affect the conclusions we can draw? Suppose this data came from 31 trees selected at the saw mill from a variety of logging companies, how does that affect the conclusions we can draw?

d. If all 31 trees were cut down by a single logger, the sample would be biased and would not represent a broader population of trees. The logger most likely cut the trees within the same area, so they share the same climate and environment and may not be independent of each other. The findings would describe only the trees that this logger harvests and could not be applied to a larger population. If the trees were selected at the saw mill from a variety of logging companies, there would be less bias. The sample would be closer to iid, with each tree grown independently of the others, and it would better represent the population of trees arriving at the mill.

e. Suppose the 31 trees in the trees data set is a random sample from those at a saw mill. The mill would like to use this sample to estimate the proportion of trees that they have at their mill with Volume over 65 cubic ft. Use the code below to determine what count of trees in this sample have Volume over 65 cubic ft. Then, explain why they should not do a large-sample z confidence interval for the proportion of trees at their mill with Volume over 65 cubic feet with this sample of 31 trees.

e. Very few trees in the sample have a volume over 65 cubic feet, so they should not do a large-sample z confidence interval.
Only 1 of the 31 trees is over 65, so the sample contains 1 success and 30 failures. The normal approximation behind the large-sample z interval needs a reasonably large count of both successes and failures, and 1 success is far too few.

sum(trees$Volume > 65)
## [1] 1
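The reasoning in part (e) can be written as an explicit check of the success and failure counts. This is a minimal sketch: the threshold `min_count = 10` is an assumed rule of thumb (courses and textbooks use different cutoffs, often 5, 10, or 15), and the variable names are illustrative only.

```r
# Count of trees in the sample with Volume over 65 cubic ft
successes <- sum(trees$Volume > 65)
n_trees   <- nrow(trees)
failures  <- n_trees - successes

c(successes = successes, failures = failures)

# Assumed rule of thumb: both counts should reach min_count
# before using a large-sample z interval for a proportion
min_count <- 10
large_sample_ok <- successes >= min_count && failures >= min_count
large_sample_ok
```

With only 1 success, the check fails under any of the usual cutoffs, which matches the conclusion above.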
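For part (c) of Exercise 2, one way to compare t.test() with the by-hand intervals is to run it on each column separately. The sketch below is an illustration rather than the assigned solution; the helper name `t_interval_by_hand` is hypothetical. It computes each 90% interval both ways so the endpoints can be compared side by side.

```r
# Hypothetical helper: x_bar +/- t* x s / sqrt(n) for a two-sided interval
t_interval_by_hand <- function(x, conf_level = 0.90) {
  n      <- length(x)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)
}

results <- list()
for (v in c("Girth", "Height", "Volume")) {
  x <- trees[[v]]
  results[[v]] <- rbind(
    t.test  = as.numeric(t.test(x, conf.level = 0.90)$conf.int),
    by_hand = t_interval_by_hand(x)
  )
}

# One 2 x 2 matrix per variable: rows are methods, columns are lower/upper
results
```

The two rows for each variable should agree closely. Small differences from the intervals in (b) come from the rounded means and standard errors used there.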