Problem Set 5

School: University of Texas, Dallas
Course: 6359
Subject: Statistics
Date: Apr 3, 2024

Uploaded by CaptainChimpanzeeMaster771

Problem 1

Problem 1a

> # Load the necessary library for linear regression
> library(stats)
>
> mpg_data <- read.csv('mpg.csv')
>
> # Check the structure of the data
> str(mpg_data)
'data.frame':   1000 obs. of  4 variables:
 $ MPG    : int  24 32 60 56 28 58 51 51 46 36 ...
 $ Freeway: int  0 0 1 1 0 1 0 1 0 0 ...
 $ AC     : int  0 1 0 1 0 0 1 0 1 0 ...
 $ Speed  : int  25 31 52 68 27 66 44 78 40 31 ...
>
> # Fit a linear regression model
> model <- lm(MPG ~ Freeway + AC + Speed, data = mpg_data)
>
> # Print the summary of the regression model
> summary(model)

Call:
lm(formula = MPG ~ Freeway + AC + Speed, data = mpg_data)

Residuals:
    Min      1Q  Median      3Q     Max
-27.573  -7.469   2.011   9.042  16.104

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.56637    1.43217  22.739  < 2e-16 ***
Freeway      5.37573    1.52431   3.527  0.00044 ***
AC          -3.55279    0.68807  -5.163 2.93e-07 ***
Speed        0.21081    0.03661   5.759 1.13e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.84 on 996 degrees of freedom
Multiple R-squared:  0.3056,    Adjusted R-squared:  0.3035
F-statistic: 146.1 on 3 and 996 DF,  p-value: < 2.2e-16

Interpreting coefficients:

Predicted MPG = 32.57 + 5.38 x Freeway - 3.55 x AC + 0.21 x Speed

As an example, if Freeway (F) is 1, AC (A) is 0, and Speed (S) is 50, the predicted MPG from the unrounded coefficients is 32.56637 + 5.37573 + 0.21081 x 50, which is approximately 48.48.

Problem 1b

> # Scatter plot of MPG against Speed
> plot(mpg_data$Speed, mpg_data$MPG, main = "Scatter Plot of MPG against Speed",
+      xlab = "Speed (mph)", ylab = "MPG", pch = 16, col = "blue")

Since the relationship appears nonlinear, a quadratic term for Speed is added to capture the potential curvature.

> # Fit a model with a quadratic term for Speed
> model_with_quadratic <- lm(MPG ~ Freeway + AC + Speed + I(Speed^2), data = mpg_data)
> # Print the summary of the new model
> summary(model_with_quadratic)

Call:
lm(formula = MPG ~ Freeway + AC + Speed + I(Speed^2), data = mpg_data)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9206 -0.4032  0.0342  0.4929  2.1024

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.992e+01  2.340e-01 -213.33   <2e-16 ***
Freeway     -1.513e+00  1.206e-01  -12.54   <2e-16 ***
AC          -3.965e+00  5.391e-02  -73.55   <2e-16 ***
Speed        3.743e+00  9.251e-03  404.67   <2e-16 ***
I(Speed^2)  -3.119e-02  7.765e-05 -401.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8488 on 995 degrees of freedom
Multiple R-squared:  0.9957,    Adjusted R-squared:  0.9957
F-statistic: 5.819e+04 on 4 and 995 DF,  p-value: < 2.2e-16
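The fitted quadratic implies a speed at which predicted MPG peaks. The following is a minimal sketch of that calculation, using only the two Speed coefficients copied from the summary output; the variable names are illustrative and not part of the original solution.

```r
# Coefficients copied (as printed) from summary(model_with_quadratic)
b_speed  <- 3.743      # linear Speed term
b_speed2 <- -0.03119   # quadratic Speed term

# For b1*s + b2*s^2 with b2 < 0, the peak is where the derivative is zero:
# b1 + 2*b2*s = 0  =>  s = -b1 / (2*b2)
peak_speed <- -b_speed / (2 * b_speed2)

# Marginal effect of one extra mph at a given speed s is b1 + 2*b2*s
marginal_effect <- function(s) b_speed + 2 * b_speed2 * s

print(round(peak_speed, 1))
print(marginal_effect(c(30, 60, 75)))
```

With these rounded coefficients the peak falls at roughly 60 mph: predicted MPG rises with Speed below that point and falls above it, which a single linear Speed term cannot represent.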

Adding the quadratic term for Speed substantially improves the model fit: the adjusted R-squared rises from 0.3035 to 0.9957 and the residual standard error falls from 10.84 to 0.8488, capturing the curvature in the relationship between Speed and MPG.

Problem 1c

> # Fit a linear regression model with the new variable
> model_with_quadratic_and_original <- lm(MPG ~ Freeway + AC + Speed + I(Speed^2), data = mpg_data)
> # Print the summary of the new model
> summary(model_with_quadratic_and_original)

Call:
lm(formula = MPG ~ Freeway + AC + Speed + I(Speed^2), data = mpg_data)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9206 -0.4032  0.0342  0.4929  2.1024

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.992e+01  2.340e-01 -213.33   <2e-16 ***
Freeway     -1.513e+00  1.206e-01  -12.54   <2e-16 ***
AC          -3.965e+00  5.391e-02  -73.55   <2e-16 ***
Speed        3.743e+00  9.251e-03  404.67   <2e-16 ***
I(Speed^2)  -3.119e-02  7.765e-05 -401.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8488 on 995 degrees of freedom
Multiple R-squared:  0.9957,    Adjusted R-squared:  0.9957
F-statistic: 5.819e+04 on 4 and 995 DF,  p-value: < 2.2e-16

Model Assessment:
• R-squared: The adjusted R-squared is 0.9957, indicating that the model explains nearly all of the variability in MPG.
• F-statistic: The highly significant F-statistic suggests that the overall model is statistically significant.

Comparison with Previous Model:
• This model has the same formula as the quadratic model in Problem 1b, so its coefficients, adjusted R-squared, and F-statistic are identical to that fit.
• Compared with the linear model in Problem 1a, the adjusted R-squared rises from 0.3035 to 0.9957 and the residual standard error falls from 10.84 to 0.8488, so the gain in explanatory power is large, not marginal.
• The quadratic term for Speed is highly significant (t = -401.66), and the Freeway coefficient changes sign (from 5.38 to -1.51) once the curvature in Speed is accounted for.
• The model without the quadratic term is simpler to interpret, but given the size of the improvement the quadratic model is the better choice unless simplicity is the overriding goal.

Problem 1d

> # Fit a linear regression model with interaction terms
> model_with_interactions <- lm(MPG ~ Freeway * AC * Speed + I(Speed^2), data = mpg_data)
> # Print the summary of the new model


> summary(model_with_interactions)

Call:
lm(formula = MPG ~ Freeway * AC * Speed + I(Speed^2), data = mpg_data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.07487 -0.40798 -0.00478  0.39050  2.22863

Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)
(Intercept)      -4.915e+01  4.359e-01 -112.772   <2e-16 ***
Freeway           3.102e-01  1.316e+00    0.236   0.8137
AC               -4.167e+00  3.232e-01  -12.891   <2e-16 ***
Speed             3.700e+00  2.300e-02  160.884   <2e-16 ***
I(Speed^2)       -3.056e-02  3.144e-04  -97.207   <2e-16 ***
Freeway:AC        1.272e+00  6.324e-01    2.012   0.0445 *
Freeway:Speed    -3.862e-02  2.475e-02   -1.560   0.1190
AC:Speed          3.461e-03  9.044e-03    0.383   0.7020
Freeway:AC:Speed -1.740e-02  1.173e-02   -1.483   0.1383
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8462 on 991 degrees of freedom
Multiple R-squared:  0.9958,    Adjusted R-squared:  0.9958
F-statistic: 2.928e+04 on 8 and 991 DF,  p-value: < 2.2e-16

Model Assessment:
• R-squared: The adjusted R-squared is 0.9958, indicating a high level of explanatory power. However, it is nearly identical to the adjusted R-squared of the quadratic model (0.9957).
• F-statistic: The F-statistic is highly significant, indicating that the overall model is statistically significant.

Comparison with Previous Models:
• The addition of interaction terms does not substantially improve the model's explanatory power compared to the model with only the quadratic term for Speed (adjusted R-squared 0.9958 versus 0.9957; residual standard error 0.8462 versus 0.8488).
• Of the four interaction terms, only Freeway:AC is significant at the 5% level (p = 0.0445). The interactions add complexity, and the marginal change in adjusted R-squared suggests that the model's ability to explain variability in MPG has not meaningfully increased.

Problem 1e

Considering the trade-off between model complexity and interpretability, the model with the quadratic term for Speed (Model 2) is a reasonable choice. It introduces nonlinearity without adding much complexity, its interpretation is still straightforward, and it captures the curvature in the relationship between Speed and MPG.
However, if the interactions in Model 3 are of specific interest and have practical implications, and the additional complexity is justified, then Model 3 could be considered. It's essential to assess whether the added complexity aligns with the goals of the analysis and if the improvement in explanatory power is meaningful.
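One way to make the comparison between Model 2 and Model 3 concrete is a partial F-test for the four added interaction terms. The sketch below reconstructs that test from the printed residual standard errors and degrees of freedom, so the result is only approximate (the inputs are rounded to four digits); the variable names are illustrative assumptions.

```r
# Values copied (as printed, hence rounded) from the two summary() outputs
rse_quad <- 0.8488; df_quad <- 995   # quadratic model (Model 2)
rse_int  <- 0.8462; df_int  <- 991   # interaction model (Model 3)

# Residual sum of squares = RSE^2 * residual degrees of freedom
rss_quad <- rse_quad^2 * df_quad
rss_int  <- rse_int^2  * df_int

# Partial F statistic for the extra terms in the larger model
extra_terms <- df_quad - df_int
f_stat <- ((rss_quad - rss_int) / extra_terms) / (rss_int / df_int)
p_val  <- pf(f_stat, extra_terms, df_int, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

With the data loaded, `anova(model_with_quadratic, model_with_interactions)` performs the same nested-model test without the rounding error. A small p-value here would indicate that the interactions are jointly detectable with 1000 observations, which is a separate question from whether the gain in fit is large enough to justify the extra complexity.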

Problem 1f

> # Obtain 95% confidence intervals for the coefficients
> conf_intervals <- confint(model_with_interactions)
> # Print the confidence intervals
> print(conf_intervals)
                        2.5 %        97.5 %
(Intercept)      -50.00890198 -48.298251698
Freeway           -2.27215901   2.892576831
AC                -4.80115470  -3.532532765
Speed              3.65445971   3.744709807
I(Speed^2)        -0.03118169  -0.029947653
Freeway:AC         0.03136636   2.513169946
Freeway:Speed     -0.08718406   0.009949595
AC:Speed          -0.01428628   0.021208682
Freeway:AC:Speed  -0.04042393   0.005618530

Interpretation:

1. Intercept:
• The 95% confidence interval for the intercept is approximately (−50.01, −48.30).
• Interpretation: We are 95% confident that the true mean MPG when all predictors are zero falls within this interval.

2. Freeway:
• The 95% confidence interval for the coefficient of Freeway is (−2.27, 2.89).
• Interpretation: Because the model includes interactions, this coefficient is the effect of freeway driving on MPG when AC = 0 and Speed = 0. We are 95% confident that it lies between -2.27 and 2.89; the interval includes zero.

3. AC:
• The 95% confidence interval for the coefficient of AC is (−4.80, −3.53).
• Interpretation: We are 95% confident that the effect of having the AC on (compared to off) on MPG, when Freeway = 0 and Speed = 0, is between -4.80 and -3.53.

4. Speed:
• The 95% confidence interval for the coefficient of Speed is (3.65, 3.74).
• Interpretation: We are 95% confident that the linear Speed coefficient is between 3.65 and 3.74. Because Speed also enters through the quadratic and interaction terms, the effect of a one-unit increase in Speed depends on the current Speed and on Freeway and AC.

5. I(Speed^2):
• The 95% confidence interval for the coefficient of the quadratic term for Speed is (−0.03118, −0.02995).
• Interpretation: We are 95% confident that the coefficient of the quadratic term is between -0.03118 and -0.02995. The interval is entirely negative, so the Speed curve bends downward.

6. Interaction Terms (Freeway:AC, Freeway:Speed, AC:Speed, Freeway:AC:Speed):
• The confidence intervals for these terms give the range of plausible values for how the effect of one variable changes with another. Only the Freeway:AC interval (0.03, 2.51) excludes zero; the intervals for Freeway:Speed, AC:Speed, and Freeway:AC:Speed all include zero.
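The intervals reported by confint() for a linear model are the estimate plus or minus a t quantile times the standard error. As a rough check, the sketch below rebuilds the AC interval from the estimate and standard error printed in the Problem 1d summary; the variable names are illustrative, and the inputs are rounded as printed.

```r
# AC row copied (as printed) from summary(model_with_interactions)
est      <- -4.167    # estimate
se       <- 0.3232    # standard error
df_resid <- 991       # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se

print(round(ci, 2))
```

The result agrees with the confint() row for AC to two decimal places, which confirms how the table above is constructed.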
Problem 1g

> # Create a data frame with the specific values for prediction
> new_data <- data.frame(Freeway = 1, AC = 0, Speed = 60)
>
> # Obtain the prediction and prediction interval
> prediction <- predict(model_with_interactions, newdata = new_data, interval = "prediction", level = 0.90)
> # Print the prediction interval
> print(prediction)
       fit      lwr      upr
1 60.78186 59.38079 62.18293

Interpreting the results:

1. Point Prediction (fit):
• The point prediction, or fitted value, is approximately 60.78 MPG for a vehicle driving on a freeway without AC at 60 mph.

2. Prediction Interval (lwr to upr):
• The 90% prediction interval is given by the lower and upper bounds.
• Lower Bound (lwr): Approximately 59.38 MPG.
• Upper Bound (upr): Approximately 62.18 MPG.

We are 90% confident that the observed MPG of an individual vehicle driving on a freeway without AC at 60 mph will fall within the range of approximately 59.38 to 62.18 MPG. This means that if we were to observe multiple vehicles with the same characteristics (freeway driving, no AC, 60 mph), we would expect about 90% of their observed MPG values to fall within this prediction interval.

Problem 2

Problem 2a

When assessing the performance of a linear regression model, both R-squared and Adjusted R-squared are important metrics, but they serve slightly different purposes.

R-squared (coefficient of determination) is the proportion of the variance in the dependent variable (response) that is explained by the independent variables (predictors) in the model. It ranges from 0 to 1, with higher values indicating a better fit. However, R-squared never decreases when predictors are added, even if those predictors contribute nothing to explaining the variability in the response, so selecting models by R-squared alone encourages overfitting.

Adjusted R-squared addresses this issue by penalizing the inclusion of irrelevant predictors. It accounts for the number of predictors relative to the sample size and adjusts R-squared accordingly, so it can decrease when a predictor adds too little explanatory power. This makes it a more conservative measure of model fit, and it is particularly useful when comparing models with different numbers of predictors.
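The penalty described above can be written as Adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The following sketch applies that standard formula to the R-squared values printed in Problem 1; the helper name adj_r2 is an illustrative assumption.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Inputs copied from the Problem 1 summaries (n = 1000 observations)
linear_model    <- adj_r2(0.3056, n = 1000, p = 3)  # Freeway + AC + Speed
quadratic_model <- adj_r2(0.9957, n = 1000, p = 4)  # adds I(Speed^2)

print(round(c(linear = linear_model, quadratic = quadratic_model), 4))
```

Both values match the Adjusted R-squared lines in the summary output after rounding. With 1000 observations and only a few predictors the penalty is small, which is why R-squared and Adjusted R-squared are so close in every model here.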
In summary, Adjusted R-squared provides a more balanced assessment of a model's performance by considering not only the amount of variance explained but also the number of predictors involved. It helps prevent the overestimation of model performance that can occur when relying solely on R-squared.

Problem 2b

Relying solely on visual inspection and drawing a line through data, known as eyeballing, has limitations in linear regression. It is subjective and lacks precision, as different interpretations may


lead to different lines. Eyeballing does not provide an optimal fit or quantitative measures, and it cannot assess the significance of relationships. Linear regression, in contrast, uses least squares to determine the best-fitting line, minimizing the sum of squared differences between observed and predicted values. It offers quantitative measures such as coefficients, standard errors, and p-values, providing a more accurate and objective estimate of relationships. Least squares is itself sensitive to outliers, but residual diagnostics make influential points visible, and the fitted model can be validated and used to predict new data, enhancing the reliability of results. In essence, while visual inspection can offer a preliminary understanding, linear regression ensures a more rigorous, objective, and statistically grounded analysis of relationships in the data.
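The contrast can be illustrated with a small sketch. The data and the "eyeballed" line below are hypothetical, invented purely for illustration; the point is only that the least-squares line has a sum of squared errors no larger than any other straight line.

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares fit and its sum of squared errors
fit <- lm(y ~ x)
sse_fit <- sum(residuals(fit)^2)

# A hypothetical eyeballed line: intercept 0, slope 2
sse_eye <- sum((y - (0 + 2 * x))^2)

print(coef(fit))
print(c(least_squares = sse_fit, eyeballed = sse_eye))
```

Even when the eyeballed line looks close, the fitted line achieves a sum of squared errors that is at least as small, and summary(fit) additionally reports standard errors and p-values that no hand-drawn line can supply.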