Assignment 5 F23

School

University of Waterloo

Course

231

Subject

Statistics

Date

Feb 20, 2024

Uploaded by zhangjames617

STAT 231 Fall 2023 Assignment 5

Assignment 5 is due on Tuesday November 30 at 11:00am Eastern Time.

Your assignment must be typed. You may create your document in Word, Google Docs, LaTeX or any other word processor. The requirement to type your assignment is to facilitate the grading so that the marked assignments can be returned to you in a timely fashion. It is also useful for you to gain some experience in creating a document containing mathematical expressions. Two documents have been posted in the Assignment 1 folder in LEARN on how to use the equation editor in Word. If you wish to use LaTeX then you may find Overleaf particularly useful for this. See https://www.overleaf.com/edu/uwaterloo

Upload your assignment to Crowdmark as a pdf file. You can upload your assignment as one document or individually for each problem. If you upload one document then you must drag and drop the pages for each problem to the appropriate question as indicated in Crowdmark. This is extremely important since dealing with assignments which are left as one document requires extra time and effort by the markers. Be sure to upload your assignment well in advance of the due time since uploading an assignment of many pages to Crowdmark requires time.

In addition to submitting your assignment component to Crowdmark, you must submit your assignment as a single pdf document to the Assignment 5 Dropbox in LEARN to facilitate the running of your assignment through plagiarism detection software. Your submissions to Crowdmark and the LEARN Dropbox must be identical.

Please do not include these two pages of information or any instructions given for each problem in your assignment submission to Crowdmark and the LEARN Dropbox. Doing so means that your assignment is flagged by the Turnitin software used for checking plagiarism.

Many problems on this assignment indicate that your answers must be given in sentences.
This course emphasizes learning to communicate statistical concepts in sentences.

In some of the problems on this assignment you are asked to use R. Only the answers/results you obtain using R must be included in your Crowdmark pdf submission. Your R code must be uploaded as an R file to the Assignment 5 R Code Dropbox in LEARN. Effectively commenting your code is an important skill to develop. Markers will review your file and run it to verify that the answers match those in your Crowdmark submission and that the code runs without error. Your code must correctly find the answers needed to get the marks associated with the problems. Good commenting will allow the marker to more easily assign you a full score when reviewing your file. Please ensure that the code submitted in your R file is well commented.

Penalties:
(1) Answers which are not typed will not be marked and will receive a mark of zero.
(2) An assignment which is uploaded late to Crowdmark will be assigned a penalty of 5% per hour.
(3) An assignment which is left as a single document and not uploaded to the appropriate places in Crowdmark will be assigned a 10% overall penalty.
(4) An assignment which is submitted late to the Assignment 5 Dropbox in LEARN will be assigned a 5% overall penalty.

(5) If the file of R code is submitted late to the Assignment 5 R Code Dropbox in LEARN, then the assignment will be assigned a 5% overall penalty.
(6) Answers which are required to be written in sentences but are not in sentences will be assigned a 5% overall penalty.
(7) Assignments which include R code in the Crowdmark submission will be assigned a 5% overall penalty.

Checklist to complete for this assignment:
• Upload the pdf of your assignment to Crowdmark by the deadline.
• Upload the pdf file of your assignment to the Assignment 5 Dropbox in LEARN by the deadline.
• Upload the R file of your R code to the Assignment 5 R Code Dropbox in LEARN by the deadline.

This assignment is based on the material in Chapters 1-4, Sections 5.1-5.3, and Sections 6.1-6.4 of the STAT 231 Course Notes.

Coursework 5 Assignment Component Learning Outcomes

Here are the intended learning outcomes for this assignment component. Try to identify the learning outcomes which are achieved by each of the given problems. Enjoy 😊
• Fit and analyse a simple linear regression model.
• Interpret diagnostic plots and make recommendations on how model issues can be fixed.
• Perform two-sample testing for independent samples and make conclusions based on the observed data.

Problem 1: Regression models

The purpose of this problem is to look at fitting regression models using the shiny app https://shiny.math.uwaterloo.ca/sas/stat231/regressionmodels/

The app explores regression models by generating two variates, x and y, according to an underlying relationship. In the first setting ('Random') there is no relationship between x and y. In the second setting ('Linear') there is a linear relationship between x and y. In the third setting ('Quadratic') there is a quadratic relationship between x and y. If you select Linear or Quadratic you can specify the true parameters that relate x and y in the models.

The data are generated as follows. First, a sample of the specified size is generated for x from a Uniform distribution. The y values are then generated depending on the true relationship and the specified parameters. For example, if you choose 'Linear' for the true relationship, and set the y intercept to 1 and the coefficient of x to 2, then the y values are observations from a Gaussian distribution with mean equal to $1 + 2x$ and standard deviation equal to $\sigma$. The variance of the error term can also be controlled with a slider, and you can specify whether the error is homoscedastic (constant variance) or heteroscedastic (non-constant variance).

Once the true relationship is specified, you can then choose whether to fit a linear or quadratic model to the data, and view the resulting model statistics as well as look at various residual plots. Try experimenting with the different inputs and see what happens in the results!

Several questions below ask you to include the plots generated by the Shiny app. You can do this by either right-clicking the graph and selecting 'save as', or by taking a screenshot. Include only the plot in your screenshot.

All written answers must be in full sentences. Please do not include any instructions in your assignment submission to Crowdmark or the LEARN Dropbox.
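The data-generating process described above can be imitated in R. The following is a minimal sketch only, not the app's actual code: the Uniform(0, 1) range for x, the seed, the particular form of the heteroscedastic error, and all object names are assumptions made for illustration. The standardized residuals are computed as residual divided by $s_e$, which is assumed here to be the Course Notes definition.

```r
# Sketch of the 'Linear' setting: Y_i ~ G(alpha + beta * x_i, sigma).
# ASSUMPTIONS: x ~ Uniform(0, 1), the seed, and the heteroscedastic form
# below are illustrative choices, not taken from the Shiny app.
set.seed(231)
n     <- 125   # sample size
alpha <- 1     # true y intercept
beta  <- 2     # true coefficient of x
sigma <- 0.6   # standard deviation of the error term

x <- runif(n, min = 0, max = 1)

# Homoscedastic case: constant standard deviation sigma
y_homo <- rnorm(n, mean = alpha + beta * x, sd = sigma)

# Heteroscedastic case (hypothetical form): standard deviation grows with x
y_hetero <- rnorm(n, mean = alpha + beta * x, sd = sigma * (0.2 + 2 * x))

fit_homo   <- lm(y_homo ~ x)
fit_hetero <- lm(y_hetero ~ x)

# Standardized residuals, assumed here to be residual / s_e
rstar_homo   <- residuals(fit_homo) / summary(fit_homo)$sigma
rstar_hetero <- residuals(fit_hetero) / summary(fit_hetero)$sigma

# Residual plots against x, side by side, each with the line r* = 0
par(mfrow = c(1, 2))
plot(x, rstar_homo, main = "Homoscedastic errors",
     xlab = "x", ylab = "Standardized residual")
abline(h = 0, col = "red")
plot(x, rstar_hetero, main = "Heteroscedastic errors",
     xlab = "x", ylab = "Standardized residual")
abline(h = 0, col = "red")
par(mfrow = c(1, 1))
```

Changing `n`, `sigma`, or the mean function and rerunning the script plays roughly the same role as moving the sliders and pressing the resample button in the app.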
(a) On the Shiny app select quadratic as the true relationship, 125 as the sample size, -2 as the y intercept, 0 as the coefficient of $x$, -1 as the coefficient of $x^2$, 0.6 as the standard deviation, homoscedastic as the variability behaviour, and linear as the model to fit. Move the sample size up and down from 125 and see what happens. (Note: you do not need to write anything for this, just observe what changes in the plots.) Now, set the sample size back to 125 and answer the following questions:

(i) Hit the resample! button several times until you view a scatterplot which you think clearly displays a quadratic relationship between the explanatory variate x and the response variate y. Insert the following 3 plots in your assignment:
(1) the scatterplot,
(2) the plot of the standardized residuals versus x, and
(3) the qqplot of the standardized residuals.

(ii) The numerical output provided by the Shiny app just below the plots under the title “Coefficients” assumes the model

$$Y_i \sim G(\alpha + \beta x_i, \sigma), \quad i = 1, 2, \ldots, n \text{ independently}$$

where the $x_i$'s are assumed to be known constants. Use this output to test the hypothesis of no relationship between the response variate and the explanatory variate ($H_0: \beta = 0$). Be sure to include the value of the t test statistic

$$\frac{|\hat{\beta} - 0|}{s_e / \sqrt{S_{xx}}}$$

the degrees of freedom associated with the test statistic, the p-value, and the conclusion. Use Table 5.1 in the Course Notes to make your conclusion.

(iii) Consider the statement: "There is no evidence of a relationship between the explanatory variate x and the response variate y for these data." Discuss whether there is evidence to support this statement. Your answer should only discuss the evidence presented in your answers to parts (i) and (ii).

(iv) Now select quadratic as the model to fit. (Note: you should only change the model to fit to quadratic. You should not change the true relationship.) Do not change any other inputs. Insert the following 3 plots in your assignment:
(1) the scatterplot,
(2) the plot of the standardized residuals versus x, and
(3) the qqplot of the standardized residuals.

(v) By comparing the plots you included as your answers to parts (i) and (iv), discuss whether the linear or quadratic model is a more appropriate fit for your data.

(b) On the Shiny app select linear as the true relationship, 160 as the sample size, -2 as the y intercept, 2 as the coefficient of x, 0.5 as the standard deviation, heteroscedastic as the variability behaviour, linear as the model to fit, and standardized residuals versus fitted as the residual plot. Move the standard deviation slider up and down and see what happens. (Note: you do not need to write anything for this, just observe what changes in the plots.) Now, set the standard deviation back to 0.5 and answer the following questions:

(i) Hit the resample! button several times until you view a plot which you think clearly displays the heteroscedastic behaviour of the variability. In your assignment insert this scatterplot and the corresponding residual plot which you think most clearly illustrates heteroscedasticity.

(ii) Using words a non-statistician would understand, describe how the plots can be used as evidence that there is heteroscedasticity in your data.

(iii) Leaving all other selections on the app at the same values as in (b), select qqplot of standardized residuals as the residual plot. Hit the resample! button several times. Choose one qqplot which you think illustrates heteroscedasticity and insert it in your assignment.

(iv) Explain how heteroscedasticity in the data is illustrated in the qqplot of standardized residuals which you chose in (b)(iii).
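The t test statistic for $H_0: \beta = 0$ used in Problem 1(a)(ii) can be reproduced by hand from a fitted model, which may help when reading the “Coefficients” output. The sketch below uses simulated data; the Uniform(-1, 1) range for x, the seed, and the object names are assumptions for illustration and are not the app's settings or output.

```r
# Hand computation of |beta_hat - 0| / (s_e / sqrt(S_xx)) for simulated data.
# ASSUMPTIONS: x ~ Uniform(-1, 1) and the seed are illustrative choices.
set.seed(1)
n <- 125
x <- runif(n, min = -1, max = 1)
y <- rnorm(n, mean = -2 - x^2, sd = 0.6)   # true relationship is quadratic

fit <- lm(y ~ x)                            # fit a straight line anyway

beta_hat <- unname(coef(fit)["x"])          # least squares estimate of slope
s_e      <- summary(fit)$sigma              # sqrt(SSE / (n - 2))
S_xx     <- sum((x - mean(x))^2)

t_obs   <- abs(beta_hat - 0) / (s_e / sqrt(S_xx))
df      <- n - 2
p_value <- 2 * pt(t_obs, df = df, lower.tail = FALSE)

# The same quantities appear in the coefficient table of summary(lm(...))
coefs <- summary(fit)$coefficients
print(c(t_by_hand = t_obs, t_from_summary = abs(coefs["x", "t value"])))
print(c(p_by_hand = p_value, p_from_summary = coefs["x", "Pr(>|t|)"]))
```

The hand computation and the `summary` output agree because the standard error reported for the slope is $s_e / \sqrt{S_{xx}}$.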

In Problems 2 and 3 you will continue to analyse variates in your data set. Make sure that you use the same data set that you generated, saved, and uploaded to the LEARN Dropbox as part of the Prerequisite Assignment.

Your commented R code should only be included in the R file that you upload to the LEARN Dropbox. Do not include your R code in your answers submitted to Crowdmark.

All written answers must be in full sentences. Please do not include any instructions in your assignment submission to Crowdmark or the LEARN Dropbox.

Problem 2: Simple Linear Regression – Wrist circumference and Right foot length

The purpose of this problem is to use R to fit a simple linear regression model. You are also asked to check the model, test the hypothesis of no relationship, construct confidence intervals for the slope and the mean response, and construct a prediction interval for a future response. See Sections 6.1 – 6.3 of the Course Notes. You may find it helpful to refer to the material posted on LEARN in the 'Assignment 5 R Tutorial' folder. All written answers must be in full sentences.

In this problem you will examine the data for the variate Wrist.circumference (What is the circumference of your left wrist? Answer in centimetres to one decimal place.) and the variate Right.foot.length (What is the length of your right foot, without a shoe? Answer to the nearest centimetre.) for students aged greater than 14.

The following R code stores the subset of observations used in this analysis in the vectors y (right foot length) and x (wrist circumference).

    # Determine the complete cases for Age, Wrist circumference and Right foot length
    cc <- complete.cases(dataset$Age, dataset$Wrist.circumference, dataset$Right.foot.length)
    # Right foot length for students aged greater than 14
    y <- dataset$Right.foot.length[cc & dataset$Age > 14]
    # Their corresponding Wrist circumference
    x <- dataset$Wrist.circumference[cc & dataset$Age > 14]
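To see what the subsetting code above does, it can be run on a small made-up data frame. The data frame `toy` and its values are hypothetical and stand in for `dataset`; the helper vector `keep` is introduced only for readability.

```r
# Hypothetical toy data standing in for `dataset` (values are made up).
toy <- data.frame(
  Age                 = c(15, 13, 16, NA, 17),
  Wrist.circumference = c(15.2, 14.0, NA, 16.1, 15.8),
  Right.foot.length   = c(24, 22, 25, 26, 27)
)

# TRUE for rows where all three variates are observed (no NA)
cc <- complete.cases(toy$Age, toy$Wrist.circumference, toy$Right.foot.length)

# Combine with the age condition; FALSE & NA is FALSE, so rows with a
# missing Age are dropped instead of producing NA in the index
keep <- cc & toy$Age > 14

y <- toy$Right.foot.length[keep]     # response: right foot length
x <- toy$Wrist.circumference[keep]   # explanatory: wrist circumference

print(data.frame(x = x, y = y))
print(length(y))                     # number of observations used
```

Only rows that are complete and satisfy the age condition remain, so `x` and `y` always have the same length and line up observation by observation.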

(a) Complete the following statements:
(i) My student ID number is __________.
(ii) The variate Wrist.circumference is a __________ variate.
(iii) The variate Right.foot.length is a __________ variate.
(iv) The number of observations for which Wrist.circumference and Right.foot.length are both observed for students aged greater than 14 is n = __________.

(b) Assume the variate Wrist.circumference is the explanatory variate x and the variate Right.foot.length is the response variate y. Assume the simple linear regression model

$$Y_i \sim G(\alpha + \beta x_i, \sigma), \quad i = 1, 2, \ldots, n \text{ independently}$$

where the $x_i$'s are assumed to be known constants.

(i) The parameters $\beta$, $\mu(x) = \alpha + \beta x$, and $\sigma$ correspond to what attributes of interest in the study population? Be sure to define the study population.

(ii) Run the command summary(lm(y~x)) and any other commands needed to complete the table below. Values in the table may be rounded to 3 decimal places for convenience.

    $(\bar{x}, \bar{y})$                                            __________
    sample correlation coefficient                                  __________
    maximum likelihood estimate of the intercept $\alpha$           __________
    least squares estimate of the slope $\beta$                     __________
    equation of the fitted line: $y = \hat{\alpha} + \hat{\beta} x$ __________
    unbiased estimate of $\sigma$                                   __________
    estimate of the standard deviation of $\tilde{\beta}$           __________

(iii) Use the output from (ii) to test the hypothesis $H_0: \beta = 0$ (the hypothesis of no relationship). Be sure to include the value of the t test statistic

$$\frac{|\hat{\beta} - 0|}{s_e / \sqrt{S_{xx}}}$$

the degrees of freedom associated with the test statistic, the p-value, and the conclusion. Use Table 5.1 in the Course Notes to make your conclusion.

(iv) Insert the following plots in your assignment:
(1) scatterplot of your data with the fitted line superimposed and labeled
(2) plot of the standardized residuals versus the explanatory variate with the line “r* = 0” added and labeled
(3) plot of the standardized residuals versus the fitted values with the line “r* = 0” added and labeled
(4) qqplot of the standardized residuals with qqline added

To receive full marks:
• plots must have titles
• axes must be labeled
• added lines must be in a contrasting colour and labeled
• plot must occupy 1/3 to 1/2 of a page both vertically and horizontally

(v) Why do the plots in (2) and (3) look nearly identical? What is the difference?

(vi) For plots (1), (3) and (4) indicate which assumptions of the simple linear regression model are being checked, what you expect to see for each plot if the model assumptions hold, and what you observe for your data. (See Section 6.3.)

(vii) What is your conclusion regarding the fit of the simple linear regression model to your data? If the linear regression model is not a good fit then what do these plots suggest is wrong with the model?

(viii) Discuss what the results of your model fit tell you about the relationship between the variate Wrist.circumference and the variate Right.foot.length. Your answer should be phrased in 'real-world' terms, that is, discussing the actual variate names and not simply referring to algebraic notation. Are you surprised by your results? Why or why not?

(ix) Use the output from the R command confint to give a 99% confidence interval for the slope $\beta$.

(x) Use the R command predict to determine an estimate of the mean response $\mu(x) = \alpha + \beta x$ for x = 12 and also for x = 16. Use the R command predict to determine a 95% confidence interval for the mean response $\mu(x) = \alpha + \beta x$ for x = 12 and also for x = 16. Which of the two confidence intervals is wider? Explain in statistical (not mathematical) terms why this should be the case.

(xi) Use R to obtain a 90% confidence interval for $\sigma$.

(xii) Use the R command predict to determine an estimate of the predicted right foot length of a student whose wrist circumference is x = 12. Use the R command predict to determine a 95% prediction interval for the predicted right foot length of a student whose wrist circumference is x = 12. Explain in statistical (not mathematical) terms why this interval is wider than the interval in (x) when x = 12.
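The R commands named in parts (ix) to (xii) can be tried on simulated data before being applied to your own data set. The sketch below is illustrative only: the simulated values, the seed, and the object names are assumptions, and the interval for $\sigma$ assumes the pivotal quantity $(n-2)S_e^2/\sigma^2$ has a chi-squared distribution with $n-2$ degrees of freedom, as for the Gaussian simple linear regression model.

```r
# ASSUMPTION: simulated stand-ins for wrist circumference (x) and
# right foot length (y); these are not values from any real data set.
set.seed(2)
n <- 60
x <- runif(n, min = 12, max = 18)
y <- rnorm(n, mean = 5 + 1.2 * x, sd = 1.5)
fit <- lm(y ~ x)

# 99% confidence interval for the slope beta
ci_beta <- confint(fit, "x", level = 0.99)

# Estimates and 95% confidence intervals for the mean response mu(x)
new_x   <- data.frame(x = c(12, 16))
ci_mean <- predict(fit, newdata = new_x,
                   interval = "confidence", level = 0.95)

# Estimate and 95% prediction interval for a future response at x = 12
pi_new <- predict(fit, newdata = data.frame(x = 12),
                  interval = "prediction", level = 0.95)

# 90% confidence interval for sigma based on the chi-squared(n - 2) pivot
s_e      <- summary(fit)$sigma
df       <- n - 2
ci_sigma <- s_e * sqrt(df / qchisq(c(0.95, 0.05), df = df))

print(ci_beta)
print(ci_mean)    # columns: fit (estimate), lwr, upr
print(pi_new)
print(ci_sigma)   # lower and upper limits
```

In the `predict` output the column `fit` is the point estimate and `lwr` and `upr` are the interval limits; the column name in `newdata` must match the explanatory variate name used in the `lm` formula.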

(c) Now assume that the variate Right.foot.length is the explanatory variate y and the variate Wrist.circumference is the response variate x. In other words, assume the simple linear regression model

$$X_i \sim G(a + b y_i, \sigma_X), \quad i = 1, 2, \ldots, n \text{ independently}$$

where the $y_i$'s are assumed to be known constants.

(i) Complete the following table based on the output from the R command summary(lm(x~y)). Values in the table may be rounded to 3 decimal places for convenience.

    maximum likelihood estimate of the parameter $a$        __________
    maximum likelihood estimate of the parameter $b$        __________
    equation of the fitted line: $x = \hat{a} + \hat{b} y$  __________
    unbiased estimate of $\sigma_X$                         __________

(ii) For your data set are the estimates of $\sigma$ and $\sigma_X$ the same? Explain why or why not. (Hint: Remember what these parameters represent in their respective models.)

(iii) Use R to create a plot which includes all of the following:
• a scatterplot of your data $(x_i, y_i)$, $i = 1, 2, \ldots, n$
• the fitted line $y = \hat{\alpha} + \hat{\beta} x$
• the fitted line $x = \hat{a} + \hat{b} y$ (Note: To plot this line on the scatterplot you need to rearrange this equation into the form $y = b + mx$.)
• the point $(\bar{x}, \bar{y})$ clearly labeled

To receive full marks:
• plots must have titles
• axes must be labeled
• lines must be in different colours and labeled
• point of intersection must be clearly labeled
• plot must occupy 1/3 to 1/2 of a page both vertically and horizontally

(iv) You will notice that the two fitted lines in (iii) intersect at a point. What are the coordinates of this point? (Note: If they do not, then you have done something wrong and you will need to go back and carefully check your work.)

(v) Suppose you are given a set of data $(x_i, y_i)$, $i = 1, 2, \ldots, n$ and you fit the two lines $y = \hat{\alpha} + \hat{\beta} x$ and $x = \hat{a} + \hat{b} y$. Show why these two fitted lines intersect at the point $(\bar{x}, \bar{y})$. Note: Your proof must be for a general set of points, not just for your data set.
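The rearrangement mentioned in (c)(iii) can be sketched in R as follows. This is an illustration on simulated data, not a model answer: the data, seed, colours, and object names are assumptions, and the rearrangement assumes $\hat{b} \neq 0$.

```r
# ASSUMPTION: simulated data standing in for (wrist, foot) measurements.
set.seed(3)
n <- 50
x <- runif(n, min = 12, max = 18)
y <- rnorm(n, mean = 5 + 1.2 * x, sd = 1.5)

fit_yx <- lm(y ~ x)   # y = alpha_hat + beta_hat * x
fit_xy <- lm(x ~ y)   # x = a_hat + b_hat * y

a_hat <- unname(coef(fit_xy)[1])
b_hat <- unname(coef(fit_xy)[2])

# Rearranging x = a_hat + b_hat * y (b_hat != 0) gives
# y = -a_hat / b_hat + (1 / b_hat) * x
int_xy   <- -a_hat / b_hat
slope_xy <- 1 / b_hat

plot(x, y, main = "Two fitted regression lines (simulated data)",
     xlab = "x (explanatory in y ~ x)", ylab = "y (explanatory in x ~ y)")
abline(fit_yx, col = "blue")                          # y regressed on x
abline(a = int_xy, b = slope_xy, col = "red")         # x regressed on y
points(mean(x), mean(y), pch = 19)
text(mean(x), mean(y), labels = "(xbar, ybar)", pos = 4)
legend("topleft", legend = c("y on x", "x on y (rearranged)"),
       col = c("blue", "red"), lty = 1)
```

`abline` draws a line from an intercept `a` and slope `b` on the current axes, so the second line must be expressed with y as a function of x even though it was fitted with x as the response.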

Problem 3: Two sample Gaussian model

The purpose of this problem is to use R to conduct a two sample t test. See Section 6.4 of the Course Notes and the Chapter 6 Problems. All written answers must be in full sentences.

In this problem you will examine the data for the variate Wrist.circumference for students aged less than 13. You are asked to compare the mean of the variate Wrist.circumference if the student answered “right” to the question “Are you right-handed, left-handed or ambidextrous?” with the mean of the variate Wrist.circumference if the student answered “left”.

The following R code stores the observations for Wrist.circumference for students aged less than 13 who answered “right” in the vector y1.

    cc <- complete.cases(dataset$Age, dataset$Wrist.circumference, dataset$Handedness)
    y1 <- dataset$Wrist.circumference[cc & dataset$Age < 13 & dataset$Handedness == "right"]

(a) Assume a $G(\mu_1, \sigma_1)$ model for the variate Wrist.circumference for the students aged less than 13 who answered “right” to the question “Are you right-handed, left-handed or ambidextrous?”.

(i) Indicate how the parameters $\mu_1$ and $\sigma_1$ relate to the attributes of interest in the study population. Be sure to define the study population.

(ii) Complete the following table for this variate. Values in the table may be rounded to 3 decimal places for convenience.

    number of observations      __________
    sample mean                 __________
    sample median               __________
    sample standard deviation   __________
    sample skewness             __________
    sample kurtosis             __________
    minimum observation         __________
    maximum observation         __________

(iii) Insert the qqplot for this variate. If you notice any unusual behaviour in your plot, discuss what may be causing this unusual behaviour.

(iv) How well does the Gaussian model fit these data? Use the graphical and numerical summaries to justify your answer. Clearly indicate what you observe for these data and what you expect to see for Gaussian data. Comment on any unusual behaviour in the plot.

The following R code stores the observations for Wrist.circumference for students aged less than 13 who answered “left” in the vector y2.

    cc <- complete.cases(dataset$Age, dataset$Wrist.circumference, dataset$Handedness)
    y2 <- dataset$Wrist.circumference[cc & dataset$Age < 13 & dataset$Handedness == "left"]

(b) Assume a $G(\mu_2, \sigma_2)$ model for the variate Wrist.circumference for the students aged less than 13 who answered “left” to the question “Are you right-handed, left-handed or ambidextrous?”.

(i) Indicate how the parameters $\mu_2$ and $\sigma_2$ relate to the attributes of interest in the study population.

(ii) Complete the following table for this variate. Values in the table may be rounded to 3 decimal places for convenience.

    number of observations      __________
    sample mean                 __________
    sample median               __________
    sample standard deviation   __________
    sample skewness             __________
    sample kurtosis             __________
    minimum observation         __________
    maximum observation         __________

(iii) Insert the qqplot for this variate. If you notice any unusual behaviour in your plot, discuss what may be causing this unusual behaviour.

(iv) How well does the Gaussian model fit these data? Use the graphical and numerical summaries to justify your answer. Clearly indicate what you observe for these data and what you expect to see for Gaussian data. Comment on any unusual behaviour in the plot.

(c) Insert the side-by-side boxplots comparing these variates. Describe all the information about the differences and similarities between these two data sets that you can, based only on these boxplots.
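The numerical and graphical summaries requested in parts (a) to (c) can be produced along the following lines. This is a sketch on simulated data: the vectors, the seed, and the helper `summarise_variate` are hypothetical, and the skewness and kurtosis formulas (third and fourth central moments divided by powers of the second central moment, each with divisor n) are assumed to match the Course Notes definitions, so check them against your notes.

```r
# ASSUMPTION: simulated stand-ins for the "right" and "left" groups.
set.seed(4)
y1 <- round(rnorm(80, mean = 14.5, sd = 1.2), 1)
y2 <- round(rnorm(15, mean = 14.8, sd = 1.4), 1)

# Hypothetical helper returning the eight table entries for one variate.
summarise_variate <- function(y) {
  m2 <- mean((y - mean(y))^2)   # second central moment (divisor n)
  c(n        = length(y),
    mean     = mean(y),
    median   = median(y),
    sd       = sd(y),           # sample standard deviation (divisor n - 1)
    skewness = mean((y - mean(y))^3) / m2^(3 / 2),
    kurtosis = mean((y - mean(y))^4) / m2^2,
    min      = min(y),
    max      = max(y))
}

print(round(summarise_variate(y1), 3))
print(round(summarise_variate(y2), 3))

# Gaussian qqplot with reference line for one group
qqnorm(y1, main = "Gaussian qqplot: right-handed group (simulated)")
qqline(y1, col = "red")

# Side-by-side boxplots of the two groups
boxplot(list(right = y1, left = y2),
        main = "Wrist circumference by handedness (simulated)",
        xlab = "Handedness", ylab = "Wrist circumference (cm)")
```

Rounding the simulated values to one decimal place mimics measurements recorded to one decimal place, which can produce visible steps in a qqplot.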

Note: Even if you conclude the Gaussian models do not fit your data well, continue to do the following analyses based on the Gaussian models. Note that this would not be done in a real world analysis!

(d) Determine a 95% confidence interval for $\sigma_1$ and a 95% confidence interval for $\sigma_2$. Based on these intervals, is it reasonable to assume that $\sigma_1 = \sigma_2$? Give reasons for your answers.

(e) Explain what the null hypothesis $H_0: \mu_1 = \mu_2$ means in real-world terms, that is, explain what this null hypothesis means in words that a non-statistician would understand.

(f) Use the R command t.test to test the hypothesis $H_0: \mu_1 = \mu_2$. Regardless of your answer to (d) assume that $\sigma = \sigma_1 = \sigma_2$ to do this question. Be sure to include the pooled estimate of $\sigma$, the observed value of the test statistic

$$D = \frac{|\bar{Y}_1 - \bar{Y}_2|}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

the degrees of freedom associated with the test statistic, the p-value, and the conclusion. Does your conclusion agree with what you observed for the side-by-side boxplots in (c)?

(g) If it is not reasonable to assume that $\sigma_1 = \sigma_2$ then the asymptotic pivotal quantity

$$D = \frac{|\bar{Y}_1 - \bar{Y}_2|}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$

which has an approximate $G(0, 1)$ distribution if $n_1$ and $n_2$ are large, could be used to test $H_0: \mu_1 = \mu_2$. Use this test statistic to test $H_0: \mu_1 = \mu_2$. Be sure to include the observed value of the test statistic, the p-value and the conclusion. Is your conclusion the same as in (f)?
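The calculations in parts (d), (f) and (g) might be organised in R roughly as below. This is a sketch under stated assumptions: the two samples are simulated, the helper `ci_sigma` and all object names are hypothetical, and the interval for each $\sigma_j$ assumes the pivotal quantity $(n_j - 1)S_j^2/\sigma_j^2$ has a chi-squared distribution with $n_j - 1$ degrees of freedom.

```r
# ASSUMPTION: simulated stand-ins for the "right" (y1) and "left" (y2) groups.
set.seed(5)
y1 <- rnorm(80, mean = 14.5, sd = 1.2)
y2 <- rnorm(15, mean = 14.8, sd = 1.4)
n1 <- length(y1)
n2 <- length(y2)

# (d) Confidence interval for sigma from the chi-squared(n - 1) pivot
ci_sigma <- function(y, level = 0.95) {
  df <- length(y) - 1
  p  <- (1 - level) / 2
  sd(y) * sqrt(df / qchisq(c(1 - p, p), df = df))   # lower, upper
}
print(ci_sigma(y1))
print(ci_sigma(y2))

# (f) Pooled two sample t test, assuming sigma_1 = sigma_2
pooled   <- t.test(y1, y2, var.equal = TRUE)
s_p      <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))
d_pooled <- abs(mean(y1) - mean(y2)) / (s_p * sqrt(1 / n1 + 1 / n2))
print(c(pooled_sd = s_p, d = d_pooled,
        df = unname(pooled$parameter), p_value = pooled$p.value))

# (g) Asymptotic Gaussian test without assuming equal standard deviations
d_asym <- abs(mean(y1) - mean(y2)) / sqrt(var(y1) / n1 + var(y2) / n2)
p_asym <- 2 * pnorm(d_asym, lower.tail = FALSE)
print(c(d = d_asym, p_value = p_asym))
```

`t.test` with `var.equal = TRUE` reports a signed statistic, so its absolute value is what corresponds to $D$ in (f); the p-value in (g) is computed from the $G(0, 1)$ distribution with `pnorm` rather than from a t distribution.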
