Course: DTSA5003
Subject: Statistics
Date: Jun 3, 2024
C1M3_peer_reviewed
May 29, 2024
1 Module 3: Peer Reviewed Assignment
1.0.1 Outline:
The objectives for this assignment:
1. Learn how to read and interpret p-values for coefficients in R.
2. Apply Partial F-tests to compare different models.
3. Compute confidence intervals for model coefficients.
4. Understand model significance using the Overall F-test.
5. Observe the variability of coefficients using the simulated data.
General tips:
1. Read the questions carefully to understand what is being asked.
2. This work will be reviewed by another human, so make sure that you are clear and concise in your explanations and answers.
[16]:
# Load Required Packages
library(ggplot2)
1.1 Problem 1: Individual t-tests
The dataset below measures the chewiness (mJ) of different berries along with their sugar equivalent and salt (NaCl) concentration. Let’s use these data to create a model to finally understand chewiness.
Here are the variables:
1. nacl: salt concentration (NaCl)
2. sugar: sugar equivalent
3. chewiness: chewiness (mJ)
Dataset Source: I. Zouid, R. Siret, F. Jourjion, E. Mehinagic, L. Rolle (2013). “Impact of Grapes
Heterogeneity According to Sugar Level on Both Physical and Mechanical Berries Properties and
their Anthocyanins Extractability at Harvest,” Journal of Texture Studies, Vol. 44, pp. 95-103.
1. (a) Simple linear regression (SLR) parameters
In the below code, we load in the data and fit a SLR model to it, using chewiness as the response and sugar as the predictor. The summary of the model is printed. Let $\alpha = 0.05$.
Look at the results and answer the following questions:
* What is the hypothesis test related to the p-value 2.95e-09? Clearly state the null and alternative hypotheses and the decision made based on the p-value.
* Does this mean the coefficient is statistically significant?
* What does it mean for a coefficient to be statistically significant?
[4]:
# Load the data
chew.data = read.csv("berry_sugar_chewy.csv")
chew.lmod = lm(chewiness ~ sugar, data = chew.data)
summary(chew.lmod)
Call:
lm(formula = chewiness ~ sugar, data = chew.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4557 -0.5604  0.1045  0.5249  1.9559

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.662878   0.756610  10.128  < 2e-16 ***
sugar       -0.022797   0.003453  -6.603 2.95e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9178 on 88 degrees of freedom
Multiple R-squared:  0.3313, Adjusted R-squared:  0.3237
F-statistic: 43.59 on 1 and 88 DF,  p-value: 2.951e-09
1. The p-value 2.95e-09 belongs to the t-test for the sugar slope. Null hypothesis: $H_0: \beta_{sugar} = 0$ (sugar has no linear effect on chewiness). Alternative hypothesis: $H_1: \beta_{sugar} \neq 0$ (there is a linear relationship between chewiness and sugar). Because 2.95e-09 < 0.05, we reject the null hypothesis.
2. Yes. The very low p-value (far below 0.05) indicates that the sugar coefficient is statistically significant.
3. A coefficient is statistically significant when the null hypothesis that it equals zero can be rejected at the chosen significance level, i.e., the data give strong evidence that chewiness changes, on average, as sugar changes.
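As a cross-check on the summary table, the t value and p-value for sugar can be reproduced from the printed estimate and standard error. This is a minimal sketch that assumes only the rounded numbers displayed in the output (the variable names are illustrative and not part of the assignment), so the last digit can differ slightly from R's own result.

```r
# Values copied from the printed summary of chew.lmod (rounded as displayed)
est <- -0.022797   # estimated sugar slope
se  <- 0.003453    # standard error of the slope
df  <- 88          # residual degrees of freedom

# t statistic for H0: slope = 0
t.stat <- est / se

# Two-sided p-value from the t distribution with df degrees of freedom
p.val <- 2 * pt(-abs(t.stat), df = df)

print(t.stat)
print(p.val)
```

The decision rule is then a direct comparison of p.val with 0.05.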
1. (b) MLR parameters
Now let’s see if the second predictor/feature nacl is worth adding to the model. In the code below, we create a second linear model fitting chewiness as the response with sugar and nacl as predictors.
Look at the results and answer the following questions:
* Which, if any, of the slope parameters are statistically significant?
* Did the statistical significance of the parameter for sugar stay the same, when compared to 1 (a)? If the statistical significance changed, explain why it changed. If it didn’t change, explain why it didn’t change.
[5]:
chew.lmod.2 = lm(chewiness ~ ., data = chew.data)
summary(chew.lmod.2)
Call:
lm(formula = chewiness ~ ., data = chew.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3820 -0.6333  0.1234  0.5231  1.9731

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.1107    13.6459  -0.521    0.604
nacl          0.6555     0.6045   1.084    0.281
sugar        -0.4223     0.3685  -1.146    0.255

Residual standard error: 0.9169 on 87 degrees of freedom
Multiple R-squared:  0.3402, Adjusted R-squared:  0.325
F-statistic: 22.43 on 2 and 87 DF,  p-value: 1.395e-08
Neither slope parameter is statistically significant at the 0.05 level in this model (p = 0.281 for nacl and p = 0.255 for sugar). The statistical significance of the parameter “sugar” changed: it was significant in 1 (a) (p = 2.95e-09) and is no longer significant after nacl is added. The most likely reason is that nacl and sugar are strongly correlated, so the model cannot separate their individual effects; this shows up in the standard error of the sugar slope, which grew from 0.003453 to 0.3685. The overall F-test is still significant (p = 1.395e-08) and the multiple R-squared barely moved (0.3313 to 0.3402), which suggests that the added parameter (nacl) contributes little new information about the response beyond what sugar already provides.
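The effect of correlated predictors on standard errors can be seen in a small simulation. The sketch below uses made-up data (not the berry dataset), with hypothetical variables x1, x2 and y, where x2 is almost a copy of x1; it compares the standard error of the x1 slope with and without x2 in the model.

```r
# Simulated illustration (not the berry data): two nearly identical predictors
set.seed(1)
n  <- 90
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1
y  <- 1 - 0.5 * x1 + rnorm(n)    # y depends on x1 only

fit.one  <- lm(y ~ x1)           # x1 alone
fit.both <- lm(y ~ x1 + x2)      # x1 together with its near-copy

# Standard error of the x1 slope in each model
se.one  <- summary(fit.one)$coefficients["x1", "Std. Error"]
se.both <- summary(fit.both)$coefficients["x1", "Std. Error"]

print(cor(x1, x2))
print(c(se.one = se.one, se.both = se.both))
```

For the assignment data, the analogous check would be cor(chew.data$sugar, chew.data$nacl); a value close to 1 or -1 would support the collinearity explanation.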
1. (c) Model Selection
Determine which of the two models we should use. Explain how you
arrived at your conclusion and write out the actual equation for your selected model.
[6]:
anova(chew.lmod, chew.lmod.2)
A anova: 2 × 6
   Res.Df  RSS       Df     Sum of Sq  F         Pr(>F)
   <dbl>   <dbl>     <dbl>  <dbl>      <dbl>     <dbl>
1  88      74.12640  NA     NA         NA        NA
2  87      73.13801  1      0.9883882  1.175719  0.2812249
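The reduced model chew.lmod should be used. The partial F-test compares chew.lmod with chew.lmod.2; its p-value, Pr(>F) = 0.281, is well above 0.05, so we fail to reject the null hypothesis that the nacl coefficient is zero, and adding nacl does not significantly improve the fit. The selected model is: chewiness = 7.663 - 0.023*sugar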
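The partial F statistic in the anova() table can be recomputed by hand from the two residual sums of squares: F = ((RSS_reduced - RSS_full) / (df_reduced - df_full)) / (RSS_full / df_full). The sketch below assumes only the values printed in the table; the variable names are illustrative.

```r
# Values copied from the anova() table
rss.reduced <- 74.12640; df.reduced <- 88   # chew.lmod   (sugar only)
rss.full    <- 73.13801; df.full    <- 87   # chew.lmod.2 (sugar + nacl)

# Numerator: drop in RSS per added parameter
# Denominator: residual mean square of the full model
num.df <- df.reduced - df.full
f.stat <- ((rss.reduced - rss.full) / num.df) / (rss.full / df.full)

# Upper-tail probability of the F distribution
p.val <- pf(f.stat, num.df, df.full, lower.tail = FALSE)

print(f.stat)
print(p.val)
```

Because only one parameter is added, this F statistic is approximately the square of the nacl t value (1.084) in the summary of chew.lmod.2, and the two tests give the same p-value.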
1. (d) Parameter Confidence Intervals
Compute 95% confidence intervals for each parameter in your selected model. Then, in words, state what these confidence intervals mean.
[6]:
# Your Code Here
confint(chew.lmod)
A matrix: 2 × 2 of type dbl
             2.5 %        97.5 %
(Intercept)   6.15927388   9.16648152
sugar        -0.02965862  -0.01593536
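(1) Intercept: we can be 95% confident that the intercept parameter falls between 6.159 and 9.166. (2) sugar (slope): we can be 95% confident that the sugar parameter falls between -0.0297 and -0.0159; since this interval excludes zero, it agrees with the significant t-test in 1 (a). Here “95% confident” means that, if the data were collected many times and an interval computed each time, about 95% of those intervals would contain the true parameter.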
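The intervals returned by confint() follow the usual formula estimate ± t(0.975, df) × standard error. As a rough check, the sketch below rebuilds them from the rounded estimates and standard errors printed in the summary of chew.lmod; the variable names are illustrative, and small differences in the last digits come from rounding.

```r
# Estimates and standard errors copied from the summary of chew.lmod
est <- c("(Intercept)" = 7.662878, sugar = -0.022797)
se  <- c("(Intercept)" = 0.756610, sugar = 0.003453)

# Critical value of the t distribution with 88 residual degrees of freedom
t.crit <- qt(0.975, df = 88)

# 95% confidence limits for each parameter
ci <- cbind(lower = est - t.crit * se, upper = est + t.crit * se)
print(ci)
```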
2 Problem 2: Variability of Slope in SLR
In this exercise we’ll look at the variability of slopes of simple linear regression models fitted to realizations of simulated data.
Write a function, called sim_data(), that returns a simulated sample of size $n = 20$ from the model $Y = 1 + 2.5X + \epsilon$ where $\epsilon \overset{iid}{\sim} N(0, 1)$. We will then use this generative function to understand how fitted slopes can vary, even for the same underlying population.
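[7]:
sim_data <- function(n=20, var=1, beta.0=1, beta.1=2.5){
    # BEGIN SOLUTION HERE
    # Evenly spaced predictor values on [-1, 1]
    x = seq(-1, 1, length.out = n)
    # Normal errors with mean 0 and variance var (rnorm() takes a standard deviation)
    e = rnorm(n, 0, sqrt(var))
    # Use the function arguments instead of hard-coded coefficients
    y = beta.0 + beta.1 * x + e
    # END SOLUTION HERE
    data = data.frame(x = x, y = y)
    return(data)
}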
2. (a) Fit a slope
Execute the following code to generate 20 data points, fit a simple linear regression model and plot the results.
Just based on this plot, how well does our linear model fit the data?
[11]:
data = sim_data()
lmod = lm(y ~ x, data = data)
ggplot(aes(x = x, y = y), data = data) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#CFB87C")
The linear model reasonably fits the data.
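A visual judgement can be backed up with a couple of numbers from the fitted model. The sketch below is one possible check, not part of the assignment: it redefines a compact sim_data() so that it runs on its own, fixes an arbitrary seed (an assumption, chosen only for reproducibility), and reports the R-squared and the 95% confidence interval for the slope, which can be compared with the true slope of 2.5.

```r
# Arbitrary seed so the simulated sample is reproducible
set.seed(42)

# Compact copy of the data generator, repeated here so the block is self-contained
sim_data <- function(n = 20, var = 1, beta.0 = 1, beta.1 = 2.5) {
  x <- seq(-1, 1, length.out = n)
  y <- beta.0 + beta.1 * x + rnorm(n, 0, sqrt(var))
  data.frame(x = x, y = y)
}

data <- sim_data()
lmod <- lm(y ~ x, data = data)

# Share of the variance in y explained by the line
r2 <- summary(lmod)$r.squared

# 95% confidence interval for the fitted slope
slope.ci <- confint(lmod)["x", ]

print(r2)
print(slope.ci)
```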
2. (b) Do the slopes change?
Now we want to see how the slope of our line varies with different random samples of data. Call our data generation function 50 times to gather 50 independent samples. Then we can fit a SLR model to each of those samples and plot the resulting slope. The function below performs this for us.
Experiment with different variances and report on what effect that has on the spread of the slopes.
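Separately from the plotting function used in the assignment, the effect of the error variance can be summarised numerically. For a fixed design, the standard deviation of the fitted slope is sqrt(σ² / Sxx), where Sxx is the sum of squared deviations of x from its mean, so the spread should grow with the square root of the variance. The sketch below is an assumption-laden illustration: sim_slopes() is a hypothetical helper, the seed is arbitrary, and it uses 200 samples per variance rather than 50 to make the comparison more stable.

```r
# Arbitrary seed for reproducibility
set.seed(123)

# Hypothetical helper: simulate num.slopes samples from Y = 1 + 2.5 X + e
# and return the fitted slope from each one
sim_slopes <- function(num.slopes = 200, var = 1, n = 20) {
  x <- seq(-1, 1, length.out = n)
  replicate(num.slopes, {
    y <- 1 + 2.5 * x + rnorm(n, 0, sqrt(var))
    unname(coef(lm(y ~ x))[2])
  })
}

vars <- c(0.25, 1, 4)

# Theoretical standard deviation of the slope: sqrt(var / Sxx)
x   <- seq(-1, 1, length.out = 20)
sxx <- sum((x - mean(x))^2)
theory.sd <- sqrt(vars / sxx)

# Observed standard deviation of the simulated slopes for each variance
observed.sd <- sapply(vars, function(v) sd(sim_slopes(var = v)))

print(data.frame(var = vars, observed.sd, theory.sd))
```

Quadrupling the variance should roughly double the spread of the slopes, while their centre stays near the true value of 2.5.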
[15]:
gen_slopes <- function(num.slopes=50, var=1, num.samples=20){
    g = ggplot()
    # Repeat the sample for the number of slopes
    for(ii in 1:num.slopes){