STAT 2080 Practice Final Exam: Key Concepts and Questions

STAT 2080 Practice Final Exam August, 2023

DALHOUSIE UNIVERSITY
FACULTY OF SCIENCE
Department of Mathematics and Statistics

STAT 2080 / MATH 2080 Practice Final Examination
Date and Time: August 2023

NAME (PRINT CLEARLY):
BANNER ID:
SIGNATURE:

You may use a 2-sided formula sheet and a calculator. There are 13 pages in this exam. The number of points allocated to each problem is indicated. To get maximum credit, SHOW ALL YOUR WORK.

Question:   1   2   3   4   5   6   7   8   9  10  11  Total
Points:     6   5   4   2  10   8   6  12   7   6   5     71
Score:

1. Answer the following multiple choice questions (circle the correct one).

(a) (1 point) If two explanatory variables have a correlation coefficient of 1, this can cause a problem in a linear model because of:
    A. non-independence
    B. non-normality
    C. multicollinearity
    D. all of the above

(b) (1 point) If $R^2 = 0$ in a linear model with one explanatory variable, what does that tell us?
    A. SSR = 0
    B. The proportion of variation explained by the linear model is zero.
    C. There is no linear relationship between x and y
    D. all of the above

(c) (1 point) In hypothesis testing, as the test statistic gets larger
    A. The p value gets larger
    B. The p value gets smaller
    C. The p value may get smaller or larger
    D. The linear relationship grows stronger

(d) (1 point) Non-parametric statistical tests are used
    A. because they are fast
    B. because they do not assume a specific distribution for the data
    C. because they have a lower probability of Type I error
    D. because they have a lower probability of Type II error

(e) (1 point) In one-way ANOVA, MSTr represents:
    A. The variation between groups
    B. The variation within groups
    C. The test statistic
    D. The pooled sample standard deviation

(f) (1 point) A 90% confidence interval
    A. is wider than a 95% confidence interval
    B. contains the true parameter estimate with 90% probability
    C. contains 90% of the data
    D. is narrower than a 95% confidence interval

2. A study on reaction times was conducted. The reaction times of 21 professional athletes and 21 members of the general public were tested and recorded. A 90% confidence interval for the true mean difference in reaction time in milliseconds ($\mu_p - \mu_g$, professional minus general) was found to be $(-30, -10)$.

(a) (1 point) Is this a matched pairs or pooled standard deviation confidence interval?

Solution: Pooled standard deviation.

(b) (1 point) What is the mean difference of the reaction times between professional athletes and the general public?

Solution: The interval is $\bar{x} \pm \text{critical value} \times SE$, so adding its two endpoints cancels the margin of error:
$(\bar{x} - \text{critical value} \times SE) + (\bar{x} + \text{critical value} \times SE) = 2\bar{x} = -30 + (-10) = -40$
$\bar{x} = -40/2 = -20$

(c) (1 point) What is the standard error that was used to build the confidence interval?

Solution: $t_{0.1/2,\,40} = 1.684$
$-20 - 1.684 \times SE = -30$
$SE = \frac{-10}{-1.684} = 5.94$

(d) (2 points) Test the hypothesis that professional athletes have a lower reaction time than the general public using $\alpha = 0.05$.

Solution: $H_0: \mu_p - \mu_g = 0$ vs. $H_a: \mu_p - \mu_g < 0$
$t = \frac{-20}{5.94} = -3.37$
This is a one-sided test, so the critical value is $-t_{0.05,\,40} = -1.684$. Since $-3.37 < -1.684$ we reject $H_0$; there is evidence that professional athletes have lower reaction times.
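The arithmetic in question 2 can be checked with a short Python sketch. This is only an illustration, not part of the exam: the variable names are made up, and the critical value 1.684 is copied from the solution's t table lookup rather than computed.

```python
# Recover the point estimate, standard error and t statistic from the
# reported 90% confidence interval (-30, -10).
lower, upper = -30.0, -10.0
t_crit = 1.684                      # t table value for df = 40 (from the solution)

mean_diff = (lower + upper) / 2     # midpoint of the interval
margin = (upper - lower) / 2        # half-width = t_crit * SE
se = margin / t_crit
t_stat = mean_diff / se             # statistic for H0: mu_p - mu_g = 0

print(f"mean difference = {mean_diff:.2f}")
print(f"standard error  = {se:.2f}")
print(f"t statistic     = {t_stat:.2f}")
```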

3. Use the following two plots to answer the questions below:

[Figure: two panels. Left panel: "x vs. y" (axes x and y). Right panel: "Fitted Values vs. Residuals" (axes Fitted Values and Residuals).]

(a) (1 point) The point marked by the * in the left hand plot is an outlier removed from the linear model. Would removing the outlier increase, decrease or have no effect on Pearson's correlation coefficient for x and y?

Solution: Increase

(b) (1 point) How would including the outlier impact the slope of the linear model?

Solution: It would increase the slope

(c) (2 points) Based on the two plots, does a linear model appear to fit the data well? Why or why not?

Solution: Based on the two plots, a linear model seems like a poor fit. The residual plot and the plot of the data both suggest a non-linear trend between x and y.

4. (2 points) Sketch an example of a fitted values vs. residuals plot in the empty plot below that breaks the assumption of constant variance.


[Figure: empty plot titled "Fitted Values vs. Residuals" (axes Fitted Values and Residuals).]

Solution: Something like:

[Figure: example sketch on the "Fitted Values vs. Residuals" axes.]

5. A biologist spent a week collecting and categorizing various wild flower samples. They have noticed that all the samples are either red, blue or yellow in color and have four, eight or ten petals. The biologist is interested in determining whether or not there is a relationship between color and the number of petals. The table of the biologist's counts for each combination of factors is given below:

                   petals
colors     eight    four     ten
blue         143      23      39
red           66      10      23
yellow        78       7      11

(a) (6 points) Test the biologist's question of interest using $\alpha = 0.1$:

Solution: $H_0$: There is no relationship between number of petals and color vs. $H_a$: There is a relationship between number of petals and color.

Use the two-way contingency table test. First find the row and column sums as well as the total count.

Row Sums:

blue     red    yellow
 205      99        96

Column Sums:

eight    four    ten
  287      40     73

And the total is 400. Then find the expected counts for each cell:

$e_{ij} = \frac{(i\text{th row count})(j\text{th column count})}{\text{total count}}$

            eight    four      ten
blue       147.09    20.5     37.4
red         71.03     9.9    18.07
yellow      68.88     9.6    17.52

Then find the test statistic:

$\chi^2_{obs} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}} = 6.5273$

Compare to the $\chi^2_{(r-1)(c-1)} = \chi^2_4$ distribution; the p value is greater than 0.1 from the table. We fail to reject the null hypothesis: there is no evidence of a relationship between color and number of petals.

(b) (4 points) The biologist now wants to test the idea that blue flowers can be found 50% of the time and red and yellow flowers 25% of the time each. Perform the hypothesis test to help the biologist. Use $\alpha = 0.1$.

Solution: $H_0: p_b = 0.5,\ p_r = 0.25,\ p_y = 0.25$ vs. $H_a$: the probabilities are not as stated.

First find the expected counts:

$n = 400$
$e_b = 400(0.5) = 200$
$e_y = e_r = 400(0.25) = 100$

Then calculate the test statistic:

$\chi^2_{obs} = \frac{(205 - 200)^2}{200} + \frac{(99 - 100)^2}{100} + \frac{(96 - 100)^2}{100} = 0.295$

We compare this to the $\chi^2_2$ distribution; the p value is greater than 0.1 from the table, so we fail to reject the null hypothesis.
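Both chi-square statistics in question 5 can be reproduced with plain Python. The sketch below is illustrative only: the variable names are hypothetical, and the critical values 7.779 and 4.605 are standard chi-square table entries for an upper-tail area of 0.10 (4 and 2 degrees of freedom), typed in by hand rather than computed.

```python
# Observed counts: colour (rows) by number of petals (columns).
observed = {
    "blue":   {"eight": 143, "four": 23, "ten": 39},
    "red":    {"eight": 66,  "four": 10, "ten": 23},
    "yellow": {"eight": 78,  "four": 7,  "ten": 11},
}

row_sums = {colour: sum(row.values()) for colour, row in observed.items()}
col_sums = {}
for row in observed.values():
    for petals, count in row.items():
        col_sums[petals] = col_sums.get(petals, 0) + count
total = sum(row_sums.values())

# (a) Test of independence: expected count = row sum * column sum / total.
chi2_indep = 0.0
for colour, row in observed.items():
    for petals, x in row.items():
        e = row_sums[colour] * col_sums[petals] / total
        chi2_indep += (x - e) ** 2 / e
df_indep = (len(row_sums) - 1) * (len(col_sums) - 1)

# (b) Goodness of fit for the hypothesised colour proportions.
probs = {"blue": 0.5, "red": 0.25, "yellow": 0.25}
chi2_gof = sum(
    (row_sums[c] - total * p) ** 2 / (total * p) for c, p in probs.items()
)
df_gof = len(probs) - 1

# Table values for an upper-tail area of 0.10 (assumed from a chi-square table).
CRIT_10 = {4: 7.779, 2: 4.605}
print(f"independence: chi2 = {chi2_indep:.3f}, reject = {chi2_indep > CRIT_10[df_indep]}")
print(f"goodness of fit: chi2 = {chi2_gof:.3f}, reject = {chi2_gof > CRIT_10[df_gof]}")
```

The independence statistic differs from 6.5273 only in the fourth significant figure, because the code does not round the expected counts.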

6. Use the following data to answer the questions below:

x     y
4    12
2     4
9    10
1     3
5     9

(a) (3 points) Calculate Pearson's correlation coefficient r. Say whether the data has a positive or negative linear relationship or none at all.

Solution: $\sum x = 21,\ \sum x^2 = 127,\ \sum y = 38,\ \sum y^2 = 350,\ \sum xy = 194$

$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 127 - \frac{(21)^2}{5} = 38.8$

$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 350 - \frac{(38)^2}{5} = 61.2$

$SS_{xy} = \sum xy - \frac{1}{n} \sum x \sum y = 194 - \frac{1}{5}(21)(38) = 34.4$

$r = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}} = \frac{34.4}{\sqrt{38.8 \times 61.2}} = 0.706$

There is a positive linear relationship.

(b) (3 points) Calculate the estimates of $\beta_1$ and $\beta_0$ for a linear model line of best fit and write down the equation of the line.

Solution: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{34.4}{38.8} = 0.887$

$\hat{\beta}_0 = \frac{\sum y}{n} - \hat{\beta}_1 \frac{\sum x}{n} = \frac{38}{5} - 0.887 \cdot \frac{21}{5} = 3.87$

$\hat{y} = 3.87 + 0.887x$

(c) (1 point) Predict y for when x = 3.

Solution: $\hat{y} = 3.87 + 0.887(3) = 6.531$

(d) (1 point) Calculate the residual for x = 4.

Solution: $\hat{y} = 3.87 + 0.887(4) = 7.418$
$\hat{\varepsilon} = 12 - 7.418 = 4.582$
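As a check on question 6, here is a minimal Python sketch of the same sums-of-squares calculations. The function and variable names are illustrative. Because the code carries the unrounded coefficients through, its prediction and residual differ from the hand-rounded values in the third decimal place.

```python
from math import sqrt

x = [4, 2, 9, 1, 5]
y = [12, 4, 10, 3, 9]
n = len(x)

# Corrected sums of squares and cross products.
ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = ss_xy / sqrt(ss_xx * ss_yy)          # Pearson's correlation
b1 = ss_xy / ss_xx                       # slope estimate
b0 = sum(y) / n - b1 * sum(x) / n        # intercept estimate


def predict(x_new):
    """Fitted value from the least-squares line."""
    return b0 + b1 * x_new


residual_at_4 = 12 - predict(4)          # observed y at x = 4 is 12

print(f"r = {r:.3f}, b1 = {b1:.3f}, b0 = {b0:.3f}")
print(f"prediction at x = 3: {predict(3):.3f}")
print(f"residual at x = 4:   {residual_at_4:.3f}")
```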


7. (6 points) The biologist from question 5 is back and needs your help. For the last 8 years they have counted the number of offspring born to 3 different gorilla troops. They are interested in determining whether the average number of offspring each year among troops is different or not. Below is the number of offspring born in each troop for the past eight years:

troop1    troop2    troop3
     3         2         3
     1         2         2
     1         2         3
     1         4         1
     3         3         3
     2         1         1
     1         6         2
     2         3         0

Using $\alpha = 0.05$ determine whether the average number of offspring in each troop every year is different or not.

Solution: Since this is count data, use the Kruskal-Wallis test.

$H_0: \mu_1 = \mu_2 = \mu_3$ vs. $H_a: \mu_i \neq \mu_j$ for some $i \neq j$

First rank the data from smallest to largest (tied values share the average of their ranks):

troop1    troop2    troop3
    19        12        19
     5        12        12
     5        12        19
     5        23         5
    19        19        19
    12         5         5
     5        24        12
    12        19         1

Get the average rank of each group:

troop1    10.25
troop2    15.75
troop3    11.5

Then calculate the test statistic:

$K = \frac{12}{n(n+1)} \sum_{i=1}^{k} n_i \left( \bar{R}_i - \frac{n+1}{2} \right)^2$

$K = \frac{12}{600} \left[ 8(10.25 - 12.5)^2 + 8(15.75 - 12.5)^2 + 8(11.5 - 12.5)^2 \right] = 2.66$

Then we compare to $\chi^2_{k-1} = \chi^2_2$; the p value is greater than 0.1, so we fail to reject the null hypothesis.
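The ranking step in question 7 is where hand calculations most often go wrong, so a small Python sketch of the same procedure may help. It follows the solution's formula as written, with average ranks for ties and no tie correction. The names are illustrative only.

```python
# Yearly offspring counts per troop (columns of the data table).
troops = {
    "troop1": [3, 1, 1, 1, 3, 2, 1, 2],
    "troop2": [2, 2, 2, 4, 3, 1, 6, 3],
    "troop3": [3, 2, 3, 1, 3, 1, 2, 0],
}

pooled = sorted(v for obs in troops.values() for v in obs)
n = len(pooled)

# Average rank for each distinct value: tied observations share the mean
# of the positions they occupy in the sorted pooled sample.
rank_of = {}
for value in set(pooled):
    first = pooled.index(value) + 1      # 1-based position of the first tie
    count = pooled.count(value)
    rank_of[value] = first + (count - 1) / 2

mean_rank = {
    name: sum(rank_of[v] for v in obs) / len(obs) for name, obs in troops.items()
}

# Kruskal-Wallis statistic in the form used in the solution.
K = 12 / (n * (n + 1)) * sum(
    len(obs) * (mean_rank[name] - (n + 1) / 2) ** 2 for name, obs in troops.items()
)

print(mean_rank)
print(f"K = {K:.2f}")
```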

8. Data was collected on the final selling price of 50 homes recently sold in the Halifax area. In addition, the number of bedrooms, the size of the house in square feet, and whether or not the house has a garage were recorded.

(a) (2 points) What kind of variable is required to represent if the house has a garage or not in a linear model? Show how you would represent it in the model.

Solution: It needs to be a dummy variable or an indicator variable. It would be represented as something like (or the opposite)

$x_{garage} = \begin{cases} 1 & \text{if it has a garage} \\ 0 & \text{if it does not have a garage} \end{cases}$

(b) (2 points) The following linear models were fit to the data and the SSEs recorded. Let $y$ be the final sale price in tens of thousands, $x_1$ be the number of bedrooms, $x_2$ be the size of the house in square feet and $x_3$ be whether or not the house has a garage. Are all the models here nested within each other? Why or why not?

Model                                                                                         SSE
1. $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_1 + \varepsilon$    14052.29
2. $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$                      37210.43
3. $y = \beta_0 + \beta_2 x_2 + \varepsilon$                                                  1399015
4. $y = \beta_0 + \beta_1 x_1 + \varepsilon$                                                  497424.3

Solution: Yes, every model is nested within the first (full) model. Setting $\beta_4 = 0$ gives the second model, setting $\beta_1$, $\beta_3$ and $\beta_4$ to zero gives the third model, and setting $\beta_2$, $\beta_3$ and $\beta_4$ to zero gives the fourth model. Note that the third and fourth models are not nested within each other, since neither can be obtained from the other by setting coefficients to zero.

(c) (4 points) Using the table in part (b), test whether the interaction between number of bedrooms and the square footage of the house is significant or not at $\alpha = 0.1$.

Solution: Use the partial F test.

$H_0: \beta_4 = 0$ vs. $H_a: \beta_4 \neq 0$

$F_{obs} = \frac{(SSE_2 - SSE_1)/(k - m)}{MSE_1}$

Here $k = 4$, $m = 3$.

$MSE_1 = \frac{SSE_1}{n - k - 1} = \frac{14052.29}{50 - 4 - 1} = 312.27$

$F_{obs} = \frac{(37210.43 - 14052.29)/(4 - 3)}{312.27} = 74.16$

Comparing to $F_{k-m,\,n-k-1} = F_{1,45}$ shows that the p value is less than 0.001, so the interaction is significant.

(d) (4 points) For these linear models, $SST = 2022185$. Which of these models has the highest $R^2$ and the highest adjusted $R^2$?

Solution:

$R^2_1 = 1 - \frac{14052.29}{2022185} = 0.9931$

$R^2_2 = 1 - \frac{37210.43}{2022185} = 0.9816$

$R^2_3 = 1 - \frac{1399015}{2022185} = 0.3081$

$R^2_4 = 1 - \frac{497424.3}{2022185} = 0.7540$

With $k$ the number of explanatory terms in each model:

$R^2_{adj,1} = 1 - (1 - R^2_1) \frac{n-1}{n-k-1} = 1 - (1 - 0.9931) \frac{49}{45} = 0.9925$

$R^2_{adj,2} = 1 - (1 - R^2_2) \frac{n-1}{n-k-1} = 1 - (1 - 0.9816) \frac{49}{46} = 0.9804$

$R^2_{adj,3} = 1 - (1 - R^2_3) \frac{n-1}{n-k-1} = 1 - (1 - 0.3081) \frac{49}{48} = 0.2937$

$R^2_{adj,4} = 1 - (1 - R^2_4) \frac{n-1}{n-k-1} = 1 - (1 - 0.7540) \frac{49}{48} = 0.7489$

Model 1 has the highest $R^2$ and adjusted $R^2$.
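The model comparisons in question 8 reduce to a few ratios of sums of squares. The Python sketch below recomputes them from the SSE table, using $n - k - 1$ error degrees of freedom throughout. The dictionary layout and function names are assumptions made for illustration.

```python
n = 50
sst = 2022185.0

# name: (SSE, number of explanatory terms k), copied from the table in part (b)
models = {
    "model1": (14052.29, 4),
    "model2": (37210.43, 3),
    "model3": (1399015.0, 1),
    "model4": (497424.3, 1),
}


def r_squared(sse):
    return 1 - sse / sst


def adj_r_squared(sse, k):
    # Ratio of mean squares: (SSE / error df) over (SST / total df).
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


r2 = {name: r_squared(sse) for name, (sse, k) in models.items()}
adj = {name: adj_r_squared(sse, k) for name, (sse, k) in models.items()}

# Partial F test for the interaction: model2 (reduced) against model1 (full).
sse_full, k_full = models["model1"]
sse_red, k_red = models["model2"]
mse_full = sse_full / (n - k_full - 1)
f_obs = ((sse_red - sse_full) / (k_full - k_red)) / mse_full

for name in models:
    print(f"{name}: R2 = {r2[name]:.4f}, adjusted R2 = {adj[name]:.4f}")
print(f"partial F = {f_obs:.2f} on {k_full - k_red} and {n - k_full - 1} df")
```

Small differences from the hand calculation in the fourth decimal place come from rounding $R^2$ before adjusting it.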


9. An experiment was run testing the leavening power of 5 different brands of baking powder. The same cupcake recipe (making one dozen) was baked with each of the powders and the height of each cupcake measured (in mm). A linear model was fit to the data using R and the output is shown below.

    Call:
    lm(formula = y ~ powders)

    Residuals:
        Min      1Q  Median      3Q     Max
    -6.0810 -2.1314  0.3569  1.8447  6.4687

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)     29.5154     0.8333  35.420  < 2e-16 ***
    powderspowder2  20.1413     1.1785  17.091  < 2e-16 ***
    powderspowder3  24.0130     1.1785  20.376  < 2e-16 ***
    powderspowder4  15.6148     1.1785  13.250  < 2e-16 ***
    powderspowder5   6.0756     1.1785   5.155 3.55e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.887 on 55 degrees of freedom
    Multiple R-squared:  0.912,  Adjusted R-squared:  0.9056
    F-statistic: 142.5 on 4 and 55 DF,  p-value: < 2.2e-16

(a) (1 point) What is the mean height of cupcakes baked with baking powder 4?

Solution: $\bar{x}_4 = 29.5154 + 15.6148 = 45.1302$ mm

(b) (1 point) What is the mean height of cupcakes baked with baking powder 1?

Solution: $\bar{x}_1 = 29.5154$ mm

(c) (1 point) What is the SSE of the regression?

Solution: $SSE = (\text{Residual Std. Error})^2 \times df_{Err} = 2.887^2 \times 55 = 458.4123$

(d) (3 points) Test whether baking powder number 5 increases average cupcake height by 5 mm or not using $\alpha = 0.05$.

Solution: $H_0: \beta_4 = 5$ vs. $H_a: \beta_4 \neq 5$

$t = \frac{\hat{\beta}_4 - \beta_{4,0}}{SE(\hat{\beta}_4)} = \frac{6.0756 - 5}{1.1785} = 0.913$

Then compare to $t_{n-k-1} = t_{55}$. The one-tailed area is between 0.1 and 0.25, so the two-sided p value is between 0.2 and 0.5 and we fail to reject the null hypothesis.

(e) (1 point) Construct the 95% confidence interval for the slope of baking powder 3.

Solution: Since there is no $t_{55}$ on the table, round down and use $t_{0.05/2,\,40} = 2.021$.

$24.0130 \pm 2.021 \times 1.1785 = (21.63, 26.39)$
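The quantities in question 9 are all simple functions of the printed R summary. The sketch below transcribes the relevant numbers by hand into Python; it does not refit the model, and the names are illustrative. The critical value 2.021 is the table value the solution uses for 40 degrees of freedom.

```python
# Values transcribed from the R summary output above.
intercept = 29.5154                      # mean height for powder 1 (baseline)
coef = {                                 # name: (estimate, standard error)
    "powder2": (20.1413, 1.1785),
    "powder3": (24.0130, 1.1785),
    "powder4": (15.6148, 1.1785),
    "powder5": (6.0756, 1.1785),
}
resid_se, df_err = 2.887, 55

# (a) Group mean = baseline mean + dummy-variable coefficient.
mean_powder4 = intercept + coef["powder4"][0]

# (c) SSE recovered from the residual standard error.
sse = resid_se ** 2 * df_err

# (d) t statistic for H0: the powder 5 effect equals 5 mm.
est5, se5 = coef["powder5"]
t_powder5 = (est5 - 5) / se5

# (e) 95% interval for the powder 3 coefficient, with the table value for 40 df.
t_crit = 2.021
est3, se3 = coef["powder3"]
ci_powder3 = (est3 - t_crit * se3, est3 + t_crit * se3)

print(f"mean height, powder 4: {mean_powder4:.4f} mm")
print(f"SSE: {sse:.4f}")
print(f"t for powder 5: {t_powder5:.3f}")
print(f"95% CI, powder 3: ({ci_powder3[0]:.2f}, {ci_powder3[1]:.2f})")
```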

10. A study was conducted with two factors, A and B, with 2 and 3 levels respectively. 15 replications of each combination of factors were used. The sums of squares for a two-way ANOVA with interaction are given below:

Source           SS
A                12.99
B                147.21
Interaction      2.02
Errors           358.75

(a) (2 points) Is the interaction significant using $\alpha = 0.05$?

Solution: $H_0: \gamma_{ij} = 0$ for all $i, j$ vs. $H_a: \gamma_{ij} \neq 0$ for some $i, j$

The test for the interaction is

$F_{obs} = \frac{MS_{Interaction}}{MSE}$

The degrees of freedom of the interaction is

$df_{inter} = (A - 1)(B - 1) = (1)(2) = 2$

So $MS_{Interaction} = 2.02/2 = 1.01$.

The degrees of freedom for the errors is

$df_{Err} = AB(K - 1) = (2)(3)(14) = 84$

So $MSE = \frac{358.75}{84} = 4.27$ and

$F_{obs} = 1.01/4.27 = 0.237$

This comes from an $F_{2,84}$ distribution. Using the more conservative 60 denominator degrees of freedom from the table, we see that the p value is greater than 0.25, so we fail to reject $H_0$ and the interaction is not significant.

(b) (4 points) Construct the two-way ANOVA table without the interaction term. Are any of the factor effects significant at $\alpha = 0.05$?

Solution:

Source    df             SS                        MS                   F
A         1              12.99                     12.99                12.99/4.20 = 3.09
B         2              147.21                    73.61                73.61/4.20 = 17.52
Error     84 + 2 = 86    358.75 + 2.02 = 360.77    360.77/86 = 4.20

From the F table, Factor A has a p value > 0.05 and so is not significant (compare to $F_{1,86}$ but use $F_{1,60}$ from the table). Factor B has a p value < 0.01 from the table (compare to $F_{2,86}$), which is significant.
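The ANOVA bookkeeping in question 10 can be laid out as a short Python sketch, first with the interaction term and then with the interaction pooled into the error. The names are illustrative. The F ratios in part (b) come out slightly higher than the hand values because the code does not round the pooled mean square to 4.2.

```python
a_levels, b_levels, reps = 2, 3, 15
ss = {"A": 12.99, "B": 147.21, "AB": 2.02, "Error": 358.75}

df = {
    "A": a_levels - 1,
    "B": b_levels - 1,
    "AB": (a_levels - 1) * (b_levels - 1),
    "Error": a_levels * b_levels * (reps - 1),
}

# (a) Interaction test in the full model.
mse_full = ss["Error"] / df["Error"]
f_inter = (ss["AB"] / df["AB"]) / mse_full

# (b) Additive model: the interaction SS and df are pooled into the error.
ss_err_pooled = ss["Error"] + ss["AB"]
df_err_pooled = df["Error"] + df["AB"]
mse_pooled = ss_err_pooled / df_err_pooled
f_a = (ss["A"] / df["A"]) / mse_pooled
f_b = (ss["B"] / df["B"]) / mse_pooled

print(f"interaction: F = {f_inter:.3f} on {df['AB']} and {df['Error']} df")
print(f"A: F = {f_a:.2f} on {df['A']} and {df_err_pooled} df")
print(f"B: F = {f_b:.2f} on {df['B']} and {df_err_pooled} df")
```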


11. A running coach has developed two new training plans. To assess the effectiveness of her new plans she recorded the time it took for her students to run 10 km after a few months of training. One group of 15 students was given the old training plan and used as a control. The control students averaged 43 minutes with a standard deviation of 5 minutes. 16 students used the first new plan and they averaged 10 km in 40 minutes with a standard deviation of 6 minutes. Her remaining 12 students used the other new plan and averaged 10 km in 39 minutes with a standard deviation of 7 minutes.

(a) (4 points) Construct the 90% simultaneous confidence intervals for the average 10 km running time for students using each new plan vs. the old plan.

Solution: 90% CI, so $\alpha = 0.1$. Find the pooled variance:

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2}{n_1 + n_2 + n_3 - k} = \frac{(14)5^2 + (15)6^2 + (11)7^2}{15 + 16 + 12 - 3} = \frac{1429}{40} = 35.725$

We need to correct $\alpha$ to account for the fact that we are doing two comparisons:

$\alpha^* = \frac{0.10}{2} = 0.05$

Then $t_{\alpha^*/2,\,n-k} = t_{0.05/2,\,40} = 2.021$ from the t table. The confidence intervals are:

$(43 - 40) \pm 2.021 \times \sqrt{35.725} \sqrt{1/15 + 1/16} = (-1.341, 7.341)$

and

$(43 - 39) \pm 2.021 \times \sqrt{35.725} \sqrt{1/15 + 1/12} = (-0.678, 8.678)$

(b) (1 point) At $\alpha = 0.1$, do either of the training plans result in a statistically significant average running time different from the control plan?

Solution: No, since both 90% simultaneous confidence intervals contain zero.
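The Bonferroni-corrected intervals in question 11 can be reproduced with the following Python sketch. The group labels and the helper function `ci_vs_control` are hypothetical names chosen for illustration, and the critical value 2.021 is the t table entry quoted in the solution.

```python
from math import sqrt

# name: (sample size, mean time in minutes, standard deviation)
groups = {
    "control": (15, 43.0, 5.0),
    "plan1": (16, 40.0, 6.0),
    "plan2": (12, 39.0, 7.0),
}
k = len(groups)
n_total = sum(n for n, _, _ in groups.values())

# Pooled variance across all three groups, on n_total - k degrees of freedom.
sp2 = sum((n - 1) * s ** 2 for n, _, s in groups.values()) / (n_total - k)

# Bonferroni: two comparisons at overall alpha = 0.10 gives alpha* = 0.05,
# so each interval uses t_{0.025, 40} = 2.021 (table value from the solution).
t_crit = 2.021


def ci_vs_control(name):
    """Interval for (control mean - plan mean) using the pooled variance."""
    n_c, mean_c, _ = groups["control"]
    n_p, mean_p, _ = groups[name]
    margin = t_crit * sqrt(sp2) * sqrt(1 / n_c + 1 / n_p)
    diff = mean_c - mean_p
    return diff - margin, diff + margin


for plan in ("plan1", "plan2"):
    lo, hi = ci_vs_control(plan)
    print(f"control - {plan}: ({lo:.3f}, {hi:.3f}), contains zero: {lo < 0 < hi}")
```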
