Course DTSA5003 (Statistics), Poolesville High School. Uploaded Jun 3, 2024.

C1M3_peer_reviewed

May 29, 2024

# Module 3: Peer Reviewed Assignment

## 1.0.1 Outline

The objectives for this assignment:

1. Learn how to read and interpret p-values for coefficients in R.
2. Apply partial F-tests to compare different models.
3. Compute confidence intervals for model coefficients.
4. Understand model significance using the overall F-test.
5. Observe the variability of coefficients using simulated data.

General tips:

1. Read the questions carefully to understand what is being asked.
2. This work will be reviewed by another human, so make sure your explanations and answers are clear and concise.

```r
# Load required packages
library(ggplot2)
```

## 1.1 Problem 1: Individual t-tests

The dataset below measures the chewiness (mJ) of different berries along with their sugar equivalent and salt (NaCl) concentration. Let's use these data to create a model to finally understand chewiness. Here are the variables:

1. nacl: salt concentration (NaCl)
2. sugar: sugar equivalent
3. chewiness: chewiness (mJ)

Dataset source: I. Zouid, R. Siret, F. Jourjion, E. Mehinagic, L. Rolle (2013). "Impact of Grapes Heterogeneity According to Sugar Level on Both Physical and Mechanical Berries Properties and their Anthocyanins Extractability at Harvest," Journal of Texture Studies, Vol. 44, pp. 95-103.

### 1. (a) Simple linear regression (SLR) parameters

In the code below, we load the data and fit an SLR model, using chewiness as the response and sugar as the predictor. The summary of the model is printed. Let α = 0.05.
Look at the results and answer the following questions:

* What is the hypothesis test related to the p-value 2.95e-09? Clearly state the null and alternative hypotheses and the decision made based on the p-value.
* Does this mean the coefficient is statistically significant?
* What does it mean for a coefficient to be statistically significant?

```r
# Load the data
chew.data = read.csv("berry_sugar_chewy.csv")
chew.lmod = lm(chewiness ~ sugar, data = chew.data)
summary(chew.lmod)
```

```
Call:
lm(formula = chewiness ~ sugar, data = chew.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4557 -0.5604  0.1045  0.5249  1.9559

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.662878   0.756610  10.128  < 2e-16 ***
sugar       -0.022797   0.003453  -6.603 2.95e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9178 on 88 degrees of freedom
Multiple R-squared:  0.3313, Adjusted R-squared:  0.3237
F-statistic: 43.59 on 1 and 88 DF,  p-value: 2.951e-09
```

1. The p-value tests the null hypothesis that sugar has no effect on chewiness. Null hypothesis: the slope for sugar is zero, i.e., sugar has no effect on chewiness. Alternative hypothesis: the slope is nonzero, i.e., there is a relationship between chewiness and sugar. Since the p-value (2.95e-09) is far below α = 0.05, we reject the null hypothesis.
2. The very low p-value (≪ 0.05) indicates that the relationship between chewiness and sugar is statistically significant.
3. Statistical significance means that the null hypothesis can be rejected, i.e., the data provide strong evidence that the value of chewiness increases or decreases with the value of sugar.

### 1. (b) MLR parameters

Now let's see if the second predictor/feature nacl is worth adding to the model. In the code below, we create a second linear model fitting chewiness as the response with sugar and nacl as predictors.

Look at the results and answer the following questions:

* Which, if any, of the slope parameters are statistically significant?
* Did the statistical significance of the parameter for sugar stay the same, when compared to 1(a)? If the statistical significance changed, explain why it changed. If it didn't change, explain why it didn't change.
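Before fitting the larger model, here is a quick cross-check of the 1(a) results. The sketch below uses only the numbers reported in the 1(a) summary (the estimate, standard error, and residual degrees of freedom are copied from that output): the t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability from a t distribution with 88 degrees of freedom.

```r
# Reproduce the t value and p-value for sugar from the 1(a) summary
est <- -0.022797   # coefficient estimate for sugar
se  <- 0.003453    # standard error for sugar
df  <- 88          # residual degrees of freedom (n - 2)

t.val <- est / se
p.val <- 2 * pt(abs(t.val), df, lower.tail = FALSE)  # two-sided p-value

round(t.val, 3)   # approximately -6.603, matching the summary
signif(p.val, 3)  # approximately 2.95e-09, matching the summary
```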
```r
chew.lmod.2 = lm(chewiness ~ ., data = chew.data)
summary(chew.lmod.2)
```

```
Call:
lm(formula = chewiness ~ ., data = chew.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3820 -0.6333  0.1234  0.5231  1.9731

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.1107    13.6459  -0.521    0.604
nacl          0.6555     0.6045   1.084    0.281
sugar        -0.4223     0.3685  -1.146    0.255

Residual standard error: 0.9169 on 87 degrees of freedom
Multiple R-squared:  0.3402, Adjusted R-squared:  0.325
F-statistic: 22.43 on 2 and 87 DF,  p-value: 1.395e-08
```

Neither slope parameter is statistically significant in this model (p = 0.281 for nacl and p = 0.255 for sugar, both above α = 0.05). The statistical significance of the sugar parameter therefore did not stay the same: it decreased with the addition of nacl. This decrease could be due to multiple reasons: (1) nacl and sugar might be correlated, which inflates the standard errors of the individual coefficients, and (2) the more complex model aims to explain more variance of the response variable, but the added parameter (nacl) does not add substantial new information, and this can affect the perceived significance of the original parameter.

### 1. (c) Model Selection

Determine which of the two models we should use. Explain how you arrived at your conclusion and write out the actual equation for your selected model.

```r
anova(chew.lmod, chew.lmod.2)
```

```
Analysis of Variance Table

  Res.Df      RSS Df Sum of Sq        F    Pr(>F)
1     88 74.12640
2     87 73.13801  1 0.9883882 1.175719 0.2812249
```

chew.lmod should be used, as the partial F-test p-value Pr(>F) = 0.281 for adding nacl is well above 0.05, so the larger model chew.lmod.2 offers no significant improvement. The selected model is:

chewiness = 7.663 - 0.0228 * sugar

### 1. (d) Parameter Confidence Intervals

Compute 95% confidence intervals for each parameter in your selected model. Then, in words, state what these confidence intervals mean.

```r
confint(chew.lmod)
```
```
                  2.5 %      97.5 %
(Intercept)  6.15927388  9.16648152
sugar       -0.02965862 -0.01593536
```

(1) Intercept: we can be 95% confident that the intercept parameter falls between 6.159 and 9.166. (2) sugar (slope): we can be 95% confident that the sugar parameter falls between -0.0296 and -0.0159. More precisely, if we repeatedly drew new samples and recomputed these intervals, about 95% of the intervals would contain the true parameter value.

## 2 Problem 2: Variability of Slope in SLR

In this exercise we'll look at the variability of slopes of simple linear regression models fitted to realizations of simulated data. Write a function, called sim_data(), that returns a simulated sample of size n = 20 from the model Y = 1 + 2.5X + ε, where ε is iid N(0, 1). We will then use this generative function to understand how fitted slopes can vary, even for the same underlying population.

```r
sim_data <- function(n = 20, var = 1, beta.0 = 1, beta.1 = 2.5){
    # BEGIN SOLUTION HERE
    x = seq(-1, 1, length.out = n)
    e = rnorm(n, 0, sqrt(var))     # stray extra comma in rnorm() removed
    y = beta.0 + beta.1 * x + e    # use the function arguments rather than hardcoded values
    # END SOLUTION HERE
    data = data.frame(x = x, y = y)
    return(data)
}
```

### 2. (a) Fit a slope

Execute the following code to generate 20 data points, fit a simple linear regression model, and plot the results. Just based on this plot, how well does our linear model fit the data?

```r
data = sim_data()
lmod = lm(y ~ x, data = data)
ggplot(aes(x = x, y = y), data = data) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#CFB87C")
```
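As a quick sanity check on the generator, we can fit an SLR model to many simulated samples and confirm that the fitted slopes center on the true slope β1 = 2.5. This is a sketch, not part of the required assignment code; it repeats the sim_data() definition so the cell is self-contained.

```r
# Self-contained copy of the data generator from above
sim_data <- function(n = 20, var = 1, beta.0 = 1, beta.1 = 2.5){
    x = seq(-1, 1, length.out = n)
    e = rnorm(n, 0, sqrt(var))
    data.frame(x = x, y = beta.0 + beta.1 * x + e)
}

set.seed(5003)  # arbitrary seed, for reproducibility

# Fit an SLR model to each of 500 independent samples and keep the slopes
slopes <- replicate(500, coef(lm(y ~ x, data = sim_data()))[2])

mean(slopes)  # close to the true slope 2.5
sd(slopes)    # sampling variability of the fitted slope
```

The average fitted slope sits near 2.5, while the standard deviation shows how much any single fitted slope can wander from the truth.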
The linear model fits the data reasonably well.

### 2. (b) Do the slopes change?

Now we want to see how the slope of our line varies with different random samples of data. Call our data generation function 50 times to gather 50 independent samples. Then we can fit a SLR model to each of those samples and plot the resulting slopes. The function below performs this for us. Experiment with different variances and report on what effect that has on the spread of the slopes.

```r
gen_slopes <- function(num.slopes = 50, var = 1, num.samples = 20){
    g = ggplot()
    # Repeat the sampling for the number of slopes
    for (ii in 1:num.slopes){
        # (The source is truncated at this point; the loop body below is one
        # plausible completion: draw a sample and overlay its fitted line.)
        data = sim_data(n = num.samples, var = var)
        g = g + geom_smooth(aes(x = x, y = y), data = data, method = "lm",
                            formula = y ~ x, se = FALSE, color = "#CFB87C")
    }
    return(g)
}
```
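To report the effect of the error variance numerically rather than visually, the spread of the fitted slopes can be compared directly. This is a self-contained sketch: sim_data() is repeated inline, and the helper slope_sd() is ours, not part of the assignment.

```r
# Self-contained copy of the data generator from above
sim_data <- function(n = 20, var = 1, beta.0 = 1, beta.1 = 2.5){
    x = seq(-1, 1, length.out = n)
    data.frame(x = x, y = beta.0 + beta.1 * x + rnorm(n, 0, sqrt(var)))
}

# Standard deviation of fitted slopes for a given error variance
slope_sd <- function(var, num.slopes = 200){
    sd(replicate(num.slopes, coef(lm(y ~ x, data = sim_data(var = var)))[2]))
}

set.seed(5003)  # arbitrary seed, for reproducibility
slope_sd(var = 1)  # smaller spread of slopes
slope_sd(var = 4)  # larger spread: the slope's sd scales with the error sd
```

Quadrupling the error variance roughly doubles the standard deviation of the fitted slopes, since the slope's standard error is proportional to the error standard deviation.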