Assignment5

STAT 3340 Assignment 5, Fall 2023 - due Thursday, Nov 23, at 11:59 PM
Hetvi parsana
Banner: B00877530
======================================================================

1. The data set "fish" has data on fish lengths, age and water temperature. The following reads the data, centres the age variable by subtracting its mean, and calculates the square of the centred age variable.

fish = read.csv("http://chase.mathstat.dal.ca/~bsmith/stat3340/Data/fish.csv", header = TRUE)
age = fish$age
age = age - mean(age)
age2 = age^2
length = fish$length
temp = fish$temp

The following fits the linear model $\text{length} = \beta_0 + \beta_1\,\text{temp} + \epsilon$ and displays the summary output.

lm1 = lm(length ~ temp)
summary(lm1)

##
## Call:
## lm(formula = length ~ temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2825.3  -839.9   302.0  1121.1  1602.3
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6455.11    2661.53   2.425   0.0199 *
## temp         -120.39      95.25  -1.264   0.2136
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1351 on 40 degrees of freedom
## Multiple R-squared:  0.03841, Adjusted R-squared:  0.01437
## F-statistic: 1.598 on 1 and 40 DF,  p-value: 0.2136

anova(lm1)

## Analysis of Variance Table
##
## Response: length
##           Df   Sum Sq Mean Sq F value Pr(>F)
## temp       1  2915414 2915414  1.5976 0.2136
## Residuals 40 72993418 1824835

Following are a plot of residuals vs fitted values, and a normal probability plot of the residuals.

fit1 = fitted(lm1)
e1 = residuals(lm1)
plot(fit1, e1, xlab = "fitted values", ylab = "residuals")
qqnorm(e1)
qqline(e1)

1a) Comment briefly on the plots. Do one or more of the assumptions of the linear model appear to be violated? Which one(s)?

The residuals vs. fitted values plot shows no distinct pattern, and the points are widely dispersed. This suggests that the data may conform to the assumptions of linearity and constant variance, although the points spread out more in the central range of fitted values.

In the normal quantile-quantile plot, most of the points lie close to the reference line, which is what we expect from normally distributed residuals. However, there is a noticeable departure from the line at the lower left end of the plot. This departure raises some doubt about the normality of the residuals: it is not severe, but it is not negligible either.

1b) Following is an added variable plot which helps to decide whether age should be added to the model, and to determine the functional form of age to use, e.g. linear, quadratic, cubic. The points on the plot are coloured according to the value of temp.

lm2 = lm(age ~ temp)
summary(lm2)

##
## Call:
## lm(formula = age ~ temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -71.799 -35.965  -0.836  35.998  75.053
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   54.685     85.936   0.636    0.528
## temp          -1.963      3.075  -0.638    0.527
##
## Residual standard error: 43.62 on 40 degrees of freedom
## Multiple R-squared:  0.01008, Adjusted R-squared:  -0.01466
## F-statistic: 0.4074 on 1 and 40 DF,  p-value: 0.5269

anova(lm2)

## Analysis of Variance Table
##
## Response: age
##           Df Sum Sq Mean Sq F value Pr(>F)
## temp       1    775  775.13  0.4074 0.5269
## Residuals 40  76097 1902.43

e2 = residuals(lm2)
plot(e2, e1, col = temp)
Which functional form seems more appropriate, a linear or a quadratic term?

The curvature in the added variable plot implies that a quadratic term for age is more appropriate than a linear term alone. The visible curve indicates that the association between age and fish length, after adjusting for temp, is not linear. Adding a term for age squared (age^2) to the model should therefore represent the observed trend more effectively.

2. In class we talked about how we can consider regression of $y$ on $X_1$ and $X_2$ to be the result of three regressions. In this question we apply this approach where $y$ is length, $X_1$ is temp and $X_2$ is age.

2a) lm1 contains the result of regressing length on temp, with the residuals stored in e1.

2b) lm2 contains the result of regressing age on temp, with the residuals stored in e2.

2c) Regress the residuals e1 on the residuals e2. Do not include an intercept. Use the formula lm(e1 ~ e2 - 1). Print the summary and anova outputs.

lm3 = lm(e1 ~ e2 - 1)
summary(lm3)

##
## Call:
## lm(formula = e1 ~ e2 - 1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -998.44 -413.05   37.35  282.92 1211.70
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## e2   28.551      1.875   15.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 517.1 on 41 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8461
## F-statistic:   232 on 1 and 41 DF,  p-value: < 2.2e-16

anova(lm3)

## Analysis of Variance Table
##
## Response: e1
##           Df   Sum Sq  Mean Sq F value    Pr(>F)
## e2         1 62029733 62029733  231.97 < 2.2e-16 ***
## Residuals 41 10963685   267407
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
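The three-stage construction in 2a) to 2c) can be checked on simulated data. The following is a minimal sketch and not part of the assignment: the variables x1, x2 and y are hypothetical stand-ins for temp, age and length, and the numbers used to generate them are arbitrary. It compares the no-intercept regression of residuals on residuals with a direct fit of the two-predictor model.

```r
# Simulated stand-in for the fish data; every name and value here is hypothetical.
set.seed(1)
n  <- 42
x1 <- rnorm(n, mean = 28, sd = 2)            # plays the role of temp
x2 <- 0.5 * x1 + rnorm(n, sd = 40)           # plays the role of age
y  <- 2000 - 60 * x1 + 28 * x2 + rnorm(n, sd = 500)

# Stages 1 and 2: remove the effect of x1 from the response and from x2.
r1 <- residuals(lm(y ~ x1))
r2 <- residuals(lm(x2 ~ x1))

# Stage 3: regress residuals on residuals, without an intercept.
stage3 <- lm(r1 ~ r2 - 1)

# Direct fit of the full model for comparison.
full <- lm(y ~ x1 + x2)

b_stage3   <- unname(coef(stage3)["r2"])
b_full     <- unname(coef(full)["x2"])
rss_stage3 <- sum(residuals(stage3)^2)
rss_full   <- sum(residuals(full)^2)

# Standard errors differ only through the residual degrees of freedom:
# stage 3 uses n - 1, the full model uses n - 3.
se_stage3 <- summary(stage3)$coefficients["r2", "Std. Error"]
se_full   <- summary(full)$coefficients["x2", "Std. Error"]

print(all.equal(b_stage3, b_full))                              # same slope
print(all.equal(rss_stage3, rss_full))                          # same residual SS
print(all.equal(se_stage3 * sqrt((n - 1) / (n - 3)), se_full))  # rescaled SE
```

The same degrees-of-freedom scaling links the standard errors reported in 2c) and 2d): 1.875 × sqrt(41/39) ≈ 1.922.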
2d) Fit the model including age and temperature, and show the summary and anova outputs.

lmfull = lm(length ~ age + temp, data = fish)
summary(lmfull)

##
## Call:
## lm(formula = length ~ age + temp, data = fish)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -998.44 -413.05   37.35  282.92 1211.70
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2604.323   1076.324   2.420   0.0203 *
## age           28.551      1.922  14.854   <2e-16 ***
## temp         -64.345     37.575  -1.712   0.0948 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 530.2 on 39 degrees of freedom
## Multiple R-squared:  0.8556, Adjusted R-squared:  0.8482
## F-statistic: 115.5 on 2 and 39 DF,  p-value: < 2.2e-16

anova(lmfull)

## Analysis of Variance Table
##
## Response: length
##           Df   Sum Sq  Mean Sq  F value  Pr(>F)
## age        1 64120749 64120749 228.0902 < 2e-16 ***
## temp       1   824397   824397   2.9325 0.09475 .
## Residuals 39 10963685   281120
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2e) Show that the coefficient of age in lmfull is the same as that in the regression of e1 on e2.

Ans: The coefficient of age equals 28.551 in both cases.

2f) Use {Step 3} in the notes to show that the intercept and the coefficient of temp in the lmfull fit are the same as those reconstructed from the three stage regression process. (This is what we did in class with the tree data. That is, substitute for $e_1$ and $e_2$ in the equation $e_1 = b\,e_2$, where $b$ is the coefficient from the 3rd regression. Isolate length on the left hand side, and calculate the regression coefficients on the right hand side.)

2g) Show that the residual sum of squares from the third regression equals that of the lm fit to the full model.

Ans: The error SS equals 10963685 in both cases.
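For 2f), the substitution can be written out as follows. This is a sketch based on the rounded coefficients printed above, so agreement is only to about two decimal places. It assumes that lm2 was fit to the centred age vector created in question 1 (written $\text{age}_c$ here, a label introduced for this illustration), whereas lmfull, called with data = fish, picked up the uncentred age column.

```latex
\begin{align*}
e_1 &= \text{length} - (6455.11 - 120.39\,\text{temp}), \\
e_2 &= \text{age}_c - (54.685 - 1.963\,\text{temp}), \\
e_1 &= 28.551\, e_2 \\
\Rightarrow\quad \text{length}
  &= (6455.11 - 28.551 \times 54.685)
     + 28.551\,\text{age}_c
     + (-120.39 + 28.551 \times 1.963)\,\text{temp} \\
  &\approx 4893.80 + 28.551\,\text{age}_c - 64.34\,\text{temp}.
\end{align*}
```

The reconstructed temp coefficient agrees with the lmfull value of -64.345 up to rounding. The reconstructed intercept, 4893.80, applies to centred age. Under the assumption above, the lmfull intercept of 2604.323 differs from it by 28.551 times the mean age, which would put the mean age at roughly (4893.80 - 2604.323) / 28.551 ≈ 80.19. That mean is not printed anywhere in the output, so it is an inference from the coefficients rather than a reported value.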
2h) Show that $SSR(\beta_2 \mid \beta_1)$, the extra regression sum of squares explained by age, is the same in the third regression as in the anova output for the full model.

Ans: The regression sum of squares is 62029733 in both cases. It is the e2 line in the anova output of the third regression, and it equals the drop in residual sum of squares when age is added to the model that already contains temp: 72993418 - 10963685 = 62029733. Because anova(lmfull) lists age before temp, its age line (64120749) is the sum of squares for age alone. The value 62029733 is the age line of the anova table in 3a, where temp is entered first.

3. It is apparent from the added variable plot in 1b that a nonlinear term in age should be added.

3a) The following fits the model $y = \beta_0 + \beta_1\,\text{temp} + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \epsilon$, evaluates the fitted values and the residuals, plots residuals (on y axis) vs fitted values (on x axis), and shows a normal QQ plot of the residuals.

# enter your work here
lmbig = lm(length ~ temp + age + age2)
summary(lmbig)

##
## Call:
## lm(formula = length ~ temp + age + age2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -933.85 -226.43   31.48  217.42  851.29
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5754.3065   734.8577   7.831 1.90e-09 ***
## temp         -80.3907    26.0001  -3.092  0.00372 **
## age           29.1300     1.3271  21.950  < 2e-16 ***
## age2          -0.2259     0.0340  -6.645 7.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 365.3 on 38 degrees of freedom
## Multiple R-squared:  0.9332, Adjusted R-squared:  0.9279
## F-statistic:   177 on 3 and 38 DF,  p-value: < 2.2e-16

anova(lmbig)

## Analysis of Variance Table
##
## Response: length
##           Df   Sum Sq  Mean Sq F value    Pr(>F)
## temp       1  2915414  2915414  21.848 3.654e-05 ***
## age        1 62029733 62029733 464.844 < 2.2e-16 ***
## age2       1  5892884  5892884  44.161 7.454e-08 ***
## Residuals 38  5070801   133442
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fbig = fitted(lmbig)
rbig = residuals(lmbig)
qqnorm(rbig); qqline(rbig)
plot(fbig, rbig)
Normal Q-Q Plot
The residuals mostly follow the reference line, but there is a slight deviation at the upper end of the plot. This deviation indicates that the residuals might be slightly skewed, or that outliers are affecting the normality assumption.

Residuals vs. Fitted Values Plot
The plot displays a fairly random scatter of residuals across the range of fitted values, which generally suggests that the model assumptions are not seriously violated. However, the residuals are somewhat more variable at higher fitted values, which could hint at heteroscedasticity or non-linearity.

3b) Now do the same for the model $y = \beta_0 + \beta_1\,\text{temp} + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \beta_4\,\text{temp} \times \text{age} + \beta_5\,\text{temp} \times \text{age}^2 + \epsilon$, which includes the interaction of age and temperature, and the interaction of $\text{age}^2$ and temperature. That is, use the R code "lm(length ~ temp + age + age2 + temp:age + temp:age2)".

lmbig_mix = lm(length ~ temp + age + age2 + temp:age + temp:age2, data = fish)
summary(lmbig_mix)
##
## Call:
## lm(formula = length ~ temp + age + age2 + temp:age + temp:age2,
##     data = fish)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -763.10 -194.30   -3.18  181.55  858.38
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1621.77199 1723.65000   0.941  0.35303
## temp         -15.50754   62.21333  -0.249  0.80457
## age           60.77248   17.32830   3.507  0.00123 **
## age2          -0.59354    0.44709  -1.328  0.19267
## temp:age      -1.13763    0.62542  -1.819  0.07724 .
## temp:age2      0.01294    0.01616   0.801  0.42836
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 354.8 on 36 degrees of freedom
## Multiple R-squared:  0.9403, Adjusted R-squared:  0.932
## F-statistic: 113.4 on 5 and 36 DF,  p-value: < 2.2e-16

anova(lmbig_mix)

## Analysis of Variance Table
##
## Response: length
##           Df   Sum Sq  Mean Sq  F value    Pr(>F)
## temp       1  2915414  2915414  23.1599 2.661e-05 ***
## age        1 62029733 62029733 492.7615 < 2.2e-16 ***
## age2       1  5892884  5892884  46.8128 5.296e-08 ***
## temp:age   1   458279   458279   3.6405   0.06439 .
## temp:age2  1    80775    80775   0.6417   0.42836
## Residuals 36  4531747   125882
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fbig_interact = fitted(lmbig_mix)
rbig_interact = residuals(lmbig_mix)
qqnorm(rbig_interact)
qqline(rbig_interact)
plot(fbig_interact, rbig_interact, xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs Fitted Values")
Normal Quantile-Quantile Plot
Most of the points lie close to the reference line. However, there is a noticeable deviation in the upper section of the plot. This could indicate a few outliers or a slight departure from normality in the residuals, though the deviation is not pronounced.

Residuals vs. Fitted Values Plot
The points appear to be randomly dispersed, which is the desired outcome. However, the spread of the residuals increases at larger fitted values, which may indicate heteroscedasticity.