HWK2_324_Soln

School

University of Wisconsin, Madison

Course

324

Subject

Statistics

Date

Feb 20, 2024

Statistics 324 Homework #2 SOLUTIONS

*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.
*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.
*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or use manual calculations on your exams, so practice accordingly.
*You must include an explanation and/or intermediate calculations for an exercise to be complete.
*Be sure to submit the HWK2 Autograde Quiz, which will give you ~20 of your 40 accuracy points.
*50 points total: 40 points accuracy, and 10 points completion

Summarizing Data Numerically and Graphically (I)

Exercise 1: A company that manufactures toilets claims that its new pressure-assisted toilet reduces the average amount of water used by more than 0.5 gallons per flush when compared to its current model. The company selects 20 toilets of the current type and 19 of the new type and measures the amount of water used when each toilet is flushed once. The number of gallons measured for each flush is recorded below. The measurements are also given in flush.csv.

Current Model: 1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56
New Model: 1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36

a. Use R to create histograms to display the sample data from each model (any kind of histogram that you want since sample sizes are very similar). Have identical x and y axis scales so the two groups' values are more easily compared. Include useful titles.
Curr <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)
New <- c(1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36)
length(Curr); length(New)
## [1] 20
## [1] 19
# This is where I put the data into long format to then create the csv file
#gallons=c(Curr, New)
#Model=c(rep("Current",20), rep("New",19))
#flush_data<-data.frame(Model, gallons)
#View(flush_data)
#write.csv(flush_data, "flush.csv", row.names=FALSE)
# or to get the data from the csv:
flush_df = read.csv("flush.csv", header=TRUE)
Curr_df = subset(flush_df, Model == "Current")
Curr_gall = Curr_df$gallons
New_df = subset(flush_df, Model == "New")
New_gall = New_df$gallons
mean(flush_df$gallons); sd(flush_df$gallons)
## [1] 1.327692
## [1] 0.3422561
# same breaks and ylim in both panels so the two histograms are directly comparable
par(mfrow=c(1, 2), mar=c(5.1, 4.1, 4.1, 2.1))
hist(Curr, breaks=seq(0.5, 2.5, 0.2), main="Current Model", xlab="gallons", ylim=c(0, 8))
hist(New, breaks=seq(0.5, 2.5, 0.2), main="New Model", xlab="gallons", ylim=c(0, 8))
[Figure: side-by-side frequency histograms titled "Current Model" and "New Model"; x-axis: gallons (0.5 to 2.5); y-axis: Frequency (0 to 8)]
par(mfrow=c(1, 1))

b. Compare the shape of the gallons per flush from the two types of toilets observed in this experiment.
Both of the histograms are roughly symmetric. The Current Model has one primary peak around 1.6 and the New Model has a primary peak around 1.

c. Compute the mean and median gallons flushed for the Current and New Model toilets using the built-in R functions. Compare the measures of center across the two groups and comment on how that relationship is evident in the histograms.
mean(Curr); median(Curr)
## [1] 1.5765
## [1] 1.595
mean(Curr_gall); median(Curr_gall) # making sure csv data gives same summaries
## [1] 1.5765
## [1] 1.595
mean(New); median(New)
## [1] 1.065789
## [1] 1.03
mean(New_gall); median(New_gall) # making sure csv data gives same summaries
## [1] 1.065789
## [1] 1.03
Current: mean: 1.5765, median: 1.595. New: mean: 1.065789, median: 1.03.
For both models, the mean and median are close to one another, which is consistent with the roughly symmetric shapes of the data. The centers of the Current Model data are higher than those of the New Model data (by roughly 0.5 gallons), which can be seen as a shift to the right for the Current Model histogram.

d. Compute (using built-in R function) and compare the sample standard deviation of gallons flushed by the current and new model toilets. Comment on how the relative size of these values can be identified from the histograms.
sd(Curr)
## [1] 0.2456843
sd(Curr_gall)
## [1] 0.2456843
sd(New)
## [1] 0.2058941
sd(New_gall)
## [1] 0.2058941
Current SD: 0.2457. New SD: 0.2059.
The standard deviation of the New Model gallons flushed is smaller than that of the Current Model. This relationship could be predicted from the histograms, since the histogram of the New Model data is more tightly clustered around its center. Based on these samples, the gallons flushed look to be more predictable/consistent with the New Model: there is less variability in the outcome.

e. Use R to create side-by-side boxplots of the two sets in R so they are easily comparable.
boxplot(Curr, New, names=c("Current", "New"), ylab="water flushed (gallons)", main="Toilet Water Flushed by Model", ylim=c(0.5, 2.3))
# label each box with its five-number summary
text(y=fivenum(Curr), labels=fivenum(Curr), x=0.6)
text(y=fivenum(New), labels=fivenum(New), x=2.4)
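The side-by-side boxplot in part e can also be drawn from the long-format data using R's formula interface. The following is only a sketch: it assumes the data frame is built the same way as in the commented-out code from part a, and the name `bp` is introduced here purely for illustration.

```r
# Sample data from Exercise 1
Curr <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54,
          1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)
New <- c(1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14,
         1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36)

# Long format: one row per flush, with a column naming the model
flush_data <- data.frame(
  Model   = rep(c("Current", "New"), times = c(length(Curr), length(New))),
  gallons = c(Curr, New)
)

# gallons ~ Model draws one box per model on a shared y-axis
bp <- boxplot(gallons ~ Model, data = flush_data,
              ylab = "water flushed (gallons)",
              main = "Toilet Water Flushed by Model",
              ylim = c(0.5, 2.3))

# boxplot() invisibly returns the plotted statistics, one column per group:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```

Because neither sample has values beyond the 1.5 IQR fences, each column of `bp$stats` should match `fivenum()` for that group.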
[Figure: side-by-side boxplots titled "Toilet Water Flushed by Model"; y-axis: water flushed (gallons). Labeled five-number summaries: Current 1.23, 1.39, 1.595, 1.73, 2.11; New 0.71, 0.91, 1.03, 1.255, 1.36]
par(mfrow=c(1, 2), mar=c(4.5, 4.5, 2.5, 1.5))
boxplot(Curr, ylim=c(0.5, 2.5), xlab="Current", ylab="gallons")
text(y=fivenum(Curr), labels=fivenum(Curr), x=0.7)
boxplot(New, ylim=c(0.5, 2.5), xlab="New", ylab="gallons")
text(y=fivenum(New), labels=fivenum(New), x=0.7)
[Figure: the same two boxplots drawn in separate panels, "Current" and "New"; y-axis: gallons (0.5 to 2.5), with the same five-number summaries labeled]
par(mfrow=c(1, 1))

f. Explain why there are no values shown as a dot on the Current Model flush boxplot. To what values do the Current Model flush boxplot whiskers extend? (Use R for your boxplot calculations and type=2 for quantiles)
quantile(Curr, type=2)
##    0%   25%   50%   75%  100%
## 1.230 1.390 1.595 1.730 2.110
IQR(Curr, type=2)
## [1] 0.34
sort(Curr)
## [1] 1.23 1.23 1.25 1.25 1.32 1.46 1.48 1.49 1.54 1.56 1.63 1.63 1.67 1.67 1.72
## [16] 1.74 1.76 1.85 1.94 2.11
Current: $1.5 \times \text{IQR} = 1.5 \times 0.34 = 0.51$, so any value above $Q_3 + 1.5 \times 0.34 = 1.73 + 0.51 = 2.24$ or below $Q_1 - 1.5 \times 0.34 = 1.39 - 0.51 = 0.88$ would be marked as an outlier. Since all Current flush values are between 0.88 and 2.24, no points are shown as dots. The whiskers therefore extend to the max (2.11) and min (1.23) of the sample.

g. What would be the mean and median gallons flushed if we combined the two data sets into one large data set with 39 observations? Show how the mean can be calculated from all observations in one vector or the summary measures in part (c) along with the sample sizes. Explain why the median of the combined set cannot be computed based on (c).
combined = c(Curr, New)
mean(combined); median(combined)
## [1] 1.327692
## [1] 1.28
(mean(Curr) * length(Curr) + mean(New) * length(New)) / (length(Curr) + length(New))
## [1] 1.327692
mean: 1.327692 is the weighted average of the means found in part c: $\frac{20 \times 1.5765 + 19 \times 1.065789}{39} = \frac{51.77999}{39} = 1.327692$. Notice that it is not the simple average of the two sample means (1.321144), since the samples are not the same size.
There is no good way to get the combined median from the two medians alone, because the summaries do not tell us how the data overlap between the two groups. R finds a combined median of 1.28.

Exercise 2
An elementary school surveys its families and tabulates the number of children reported in each household.
A frequency histogram summarizes the data received:
Children = c(rep(1, 47), rep(2, 70), rep(3, 45), rep(4, 23), rep(5, 11), rep(6, 5), rep(7, 2))
hist(Children, breaks=seq(0.5, 7.5, 1), labels=TRUE, ylim=c(0, 80), main="Number of Children in Household", ylab="Number of Households")
[Figure: histogram titled "Number of Children in Household"; x-axis: Children (1 to 7); y-axis: Number of Households (0 to 80); bar labels 47, 70, 45, 23, 11, 5, 2]

a. Consider a randomly chosen household, Household A. Identify whether the events "Household A has 1 Child" and "Household A has 2 Children" are (i) independent, (ii) mutually exclusive, (iii) both independent and mutually exclusive or (iv) neither independent nor mutually exclusive. Explain how you know.
(ii) Mutually exclusive (and therefore dependent), since a household cannot have both exactly 1 and exactly 2 children. Both events have nonzero probability, so the probability of both occurring (0) cannot equal the product of their probabilities.

b. Suppose the principal chooses a random family from those at the school to call each day and each family is equally likely to be chosen on the first day of school. What is the probability that the family has more than two (2) children?
(45+23+11+5+2)/203 = 0.4236453
sum(Children > 2)
## [1] 86
(45 + 23 + 11 + 5 + 2)
## [1] 86
length(Children)
## [1] 203
(45 + 23 + 11 + 5 + 2) / 203
## [1] 0.4236453

c. Suppose the principal randomly chooses a family to call from those at the school that they have not already called. What is the probability that all of the families called the first 5 days of school had a single (1) child in the household? Is this a highly likely or unlikely event?
This is sampling without replacement, so the count of single-child families and the count of remaining families each drop by one after every call. This is a very unlikely event, since 0.0005611291 (about 0.06%) is very close to 0.
(47 / 203) * (46 / 202) * (45 / 201) * (44 / 200) * (43 / 199)
## [1] 0.0005611291

Exercise 3
You are adding Badger-themed bedazzle to your striped overalls and are using both red and white beads. You are interested in how the size of the bag of beads you select your beads from changes the probability of outcomes of interest. Compute the probability for outcomes a and b using three different sampling strategies each time.
(Small Pop) drawing without replacement from a small population where the bag of beads contains 6 White beads and 4 Red beads.
(Large Pop) drawing without replacement from a large population where the bag of beads contains 600 White beads and 400 Red beads.
(Same Pop) drawing from a population where the bag of beads always contains 60% White and 40% Red beads.

Example: Consider choosing 3 beads. Calculate the probability of selecting no white beads.
Small Pop: $P([RRR]) = \frac{4}{10} \times \frac{3}{9} \times \frac{2}{8} = 0.03333333$
Large Pop: $P([RRR]) = \frac{400}{1000} \times \frac{399}{999} \times \frac{398}{998} = 0.06371181$
Same Pop: $P([RRR]) = 0.4 \times 0.4 \times 0.4 = 0.064$
(4 / 10) * (3 / 9) * (2 / 8)
## [1] 0.03333333
(400 / 1000) * (399 / 999) * (398 / 998)
## [1] 0.06371181
.4 * .4 * .4
## [1] 0.064

a. Consider choosing 3 beads. Calculate the probability of selecting exactly 1 white bead.
Small Pop: $P([WRR \text{ or } RWR \text{ or } RRW]) = 3 \times \frac{6}{10} \times \frac{4}{9} \times \frac{3}{8} = 0.3$
Large Pop: $P([WRR \text{ or } RWR \text{ or } RRW]) = 3 \times \frac{600}{1000} \times \frac{400}{999} \times \frac{399}{998} = 0.2881439$
Same Pop: $P([WRR \text{ or } RWR \text{ or } RRW]) = 3 \times 0.6 \times 0.4 \times 0.4 = 0.288$
3 * (6 / 10) * (4 / 9) * (3 / 8)
## [1] 0.3
3 * (600 / 1000) * (400 / 999) * (399 / 998)
## [1] 0.2881439
3 * 0.6 * 0.4 * 0.4
## [1] 0.288

b. Consider choosing 3 beads. Calculate the probability of selecting at least 1 white bead.
Small Pop: $1 - P(0W) = 1 - P(RRR) = 1 - \frac{4 \times 3 \times 2}{10 \times 9 \times 8} = 1 - 0.03333333 = 0.9666667$
Large Pop: $1 - 0.06371181 = 0.9362882$
Same Pop: $1 - 0.064 = 0.936$
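The hand calculations for the example and parts a and b can be cross-checked with R's built-in distribution functions. This is a sketch rather than part of the required solution: it assumes the without-replacement strategies are modeled with the hypergeometric distribution (`dhyper`) and the fixed 60/40 strategy with the binomial (`dbinom`). The helper name `exactly_white` is made up for illustration.

```r
# Probability of drawing exactly `w` white beads in `draws` draws
# under each of the three sampling strategies.
exactly_white <- function(w, draws = 3) {
  c(
    small = dhyper(w, m = 6,   n = 4,   k = draws),  # 6 white, 4 red
    large = dhyper(w, m = 600, n = 400, k = draws),  # 600 white, 400 red
    same  = dbinom(w, size = draws, prob = 0.6)      # always 60% white
  )
}

exactly_white(0)      # example: no white beads (RRR)
exactly_white(1)      # part a: exactly 1 white bead
1 - exactly_white(0)  # part b: at least 1 white bead
```

In `dhyper(x, m, n, k)`, `m` is the number of white beads in the bag, `n` the number of red beads, and `k` the number of beads drawn.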
1 - (4 / 10) * (3 / 9) * (2 / 8)
## [1] 0.9666667
1 - (400 / 1000) * (399 / 999) * (398 / 998)
## [1] 0.9362882
1 - .4 * .4 * .4
## [1] 0.936

c. Consider sampling without replacement. Does drawing from a population that is small or large relative to your sample size result in a probability that is closest to the probability when sampling with replacement?
Drawing from a population that is large relative to the sample size results in a probability closest to the probability when sampling with replacement. When you remove one (or more) observations from a large population, the population's composition barely changes, while a small population can change drastically.

Exercise 4
Six hundred (600) paving stones were examined for cracking and discoloration. Eighteen (18) were found to be cracked and 24 were found to be discolored. A total of 562 stones were neither cracked nor discolored.

a. Create a 2-way table to organize the counts of stones in each of the 4 combinations of Cracked/Not Cracked and Discolored/Not Discolored.

                 Cracked   Not Cracked   Total
Discolored             4            20      24
Not Discolored        14           562     576
Total                 18           582     600

b. What is the probability that a randomly sampled paving stone from this set is discolored and not cracked?
ANSWER: $20/600 = 0.03333$

c. What is the probability that a randomly sampled paving stone from this group is cracked or discolored?
ANSWER: $38/600 = 0.06333$

d. What is the probability that in a random sample of 3 paving stones from this set (without replacement), at least one of the three is cracked or discolored?
ANSWER: P(at least 1 of three cracked or discolored) = 1 - P(0 are cracked or discolored) = $1 - \frac{562}{600} \times \frac{561}{599} \times \frac{560}{598} = 0.17849$
Here's a simulation to check my work:
#1: Cracked or Discolored
pop = c(rep(1, 38), rep(0, 562))
nsamp = 5000000
num_CorD = rep(9, nsamp) # placeholder values, overwritten in the loop
for (i in 1:nsamp){
  samp = sample(pop, 3, replace=FALSE)
  num_CorD[i] = sum(samp)
}
hist(num_CorD, labels=TRUE)
# three equivalent ways to estimate P(at least 1 cracked or discolored)
sum(num_CorD == 1 | num_CorD == 2 | num_CorD == 3) / nsamp
sum(num_CorD >= 1) / nsamp
1 - sum(num_CorD == 0) / nsamp
#or using all 4 classifications:
pop = rep(c("C_D", "NC_D", "C_ND", "NC_ND"), times=c(4, 20, 14, 562))
nsamp = 1000000
have_CorD = rep(9, nsamp) # placeholder values, overwritten in the loop
for (i in 1:nsamp){
  samp = sample(pop, 3, replace=FALSE)
  num_CorD = sum((samp == "C_D") | (samp == "NC_D") | (samp == "C_ND"))
  have_CorD[i] = num_CorD > 0
}
sum(have_CorD) / nsamp

e. What is the probability that a randomly sampled paving stone from this group has discoloration, given we know that it is cracked?
$P(\text{discolored} \mid \text{cracked}) = 4/18 = 0.2222$

f. Is being discolored and cracked independent in this set of 600 paving stones?
We could check in a number of ways, but let's check whether $P(\text{discolored} \mid \text{cracked})$ equals $P(\text{discolored})$: $P(\text{discolored} \mid \text{cracked}) = 4/18 = 0.2222 \neq P(\text{discolored}) = 24/600 = 0.04$. So no, there is not independence in this group of 600 pavers: there is a higher rate of discoloration among the cracked stones.

g. Now suppose in another group of 600 paving stones, forty-eight (48) were found to be cracked and 25 were found to be discolored. How many stones would be cracked and discolored if the events discolored and cracked are independent in this group of 600 stones? Make sure to show how you calculated your answer.

                 Cracked   Not Cracked   Total
Discolored            ??                    25
Not Discolored
Total                 48           552     600

ANSWER: 2. Algebra: under independence, $P(\text{Cracked}) = 48/600 = 0.08 = P(\text{Cracked} \mid \text{Discolored}) = x/25$. Solving for $x$: $\frac{48}{600} = \frac{x}{25} \Rightarrow 48 \times 25 = 600x \Rightarrow x = \frac{48 \times 25}{600} = 2$
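Two of the Exercise 4 answers can be verified exactly, without simulation. The sketch below is an illustrative check, not part of the original solution. It assumes part d is modeled as a hypergeometric draw and that the part g table is filled in from its margins. The names `p_at_least_one`, `expected_both` and `tab_g` are introduced here for illustration.

```r
# Part d: 38 stones are cracked or discolored, 562 are neither; draw 3
# without replacement. P(at least one) = 1 - P(none of the 3 are marked).
p_at_least_one <- 1 - dhyper(0, m = 38, n = 562, k = 3)
p_at_least_one

# Part g: under independence, count(C and D) = n * P(C) * P(D)
n_total <- 600
expected_both <- n_total * (48 / n_total) * (25 / n_total)
expected_both

# Fill in the rest of the 2-way table from the margins
tab_g <- matrix(
  c(expected_both,      25 - expected_both,
    48 - expected_both, n_total - 48 - 25 + expected_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Discolored", "Not Discolored"),
                  c("Cracked", "Not Cracked"))
)
tab_g

# Independence means every cell proportion equals the product of its
# row and column proportions
all.equal(tab_g / n_total,
          outer(rowSums(tab_g), colSums(tab_g)) / n_total^2,
          check.attributes = FALSE)
```

The exact part d value should agree with the simulated proportions to within simulation error.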