STAT 425 (Sections 2UG, 2GR, 3UG, 3GR), Spring 2023
Homework 5
Due: April 17, 11:59 PM (US Central)

Please submit your assignment electronically as a PDF document, using the appropriate interface on Canvas. Remember to include relevant computer output.

1. The list object PAC from R package chemometrics contains “GC-retention indices” (component y) along with “molecular descriptors” (component X) for 209 “polycyclic aromatic compounds.” (Note that component X has the form of a matrix, with rows corresponding to the compounds and columns to the descriptors.) The goal is to predict the GC-retention index using the descriptors.

After installing and loading the chemometrics package, run the following R code to create a data frame that will be easier to use:

    data(PAC)
    PAC.df <- as.data.frame(cbind(y=PAC$y,PAC$X))

(a) [2 pts] Examine the R summary of the following fit:

    lm(y ~ ., data=PAC.df)

What is unusual about it? Why did that happen? (You do not need to include the R summary output in your submission.)

(b) [1 pt] Run the following R code to perform forward selection (by AIC) with up to 60 iterations:

    stepPAC <- step(lm(y ~ 1, data=PAC.df), scope=formula(PAC.df), direction="forward", trace=0, steps=60)

How many predictors were added?

(c) [1 pt] Plot the observed GC-retention index values versus their predicted values from the stepPAC model.

(d) [1 pt] Run the following R code to compute 10-fold crossvalidated RMSEP for the stepPAC model:

    library(boot)
    sqrt(cv.glm(PAC.df, glm(formula(stepPAC), data=PAC.df), K=10)$delta[1])

What value do you obtain?

(e) [1 pt] Using function pcr from R package pls (which you may need to install and load), fit principal component regression (PCR) models with up to 60 PCs. Make sure to use scale=TRUE and validation="CV". Show your R code.

(f) [3 pts] Compute and plot (10-fold crossvalidated) RMSEP for the PCR models versus their number of PCs (up to 60). (Use function RMSEP with argument estimate="CV", then plot the object it produces.) How many PCs are in the model with the best RMSEP? What is your value of RMSEP for that best model?

(g) [2 pts] Plot the observed GC-retention index values versus their predicted values from your best PCR model (the one you found in the previous part). Comparing this with the plot from part (c), which seems to be the better predictor: the forward-selected model or the best PCR model?

2. In the following, use the seatpos data set from R package faraway, with hipcenter as the response and all other variables as possible predictors.

(a) [2 pts] Using function lm.ridge from R package MASS, fit ridge regression models for λ values ranging from 0 to 50. Display a ridge trace plot of the coefficients over that λ range. Show your R code.

(b) [2 pts] For your models, plot the generalized crossvalidation (GCV) criterion value versus the λ value (over the λ range from the previous part). Approximately which λ value minimizes the criterion?

(c) [2 pts] For the minimizing λ from the previous part, display the ridge regression estimates of the coefficients for the original predictors. (Function coef returns these for all λ values used, so you will have to extract the ones for the minimizing λ.) Do any of these estimates exactly equal zero?

(d) [2 pts] Using function lars from R package lars, fit lasso regression models, also computing (10-fold crossvalidated) RMSEP values for them. Then plot RMSEP versus the lasso “index” s (which is such that s = 0 for the intercept-only model, and s = 1 for the least squares solution). Show your R code.

(e) [2 pts] From the previous part, approximately what value of s minimizes RMSEP? What is that minimum value of RMSEP?

(f) [2 pts] Using the minimizing s from the previous part, display the lasso coefficient estimates for the predictors. Which predictors are selected?

(g) [1 pt] By applying plot to the output of lars (from part (d)), display a kind of lasso trace plot.

3. The data set in the file named bullsbears.txt has the Height (inches) and Weight (lb) of players on the Chicago Bulls men’s basketball team (Team=Bulls) and the Chicago Bears American football team (Team=Bears) from a previous season. You will fit a homogeneity-of-regressions (“ANCOVA”) model.

(a) [2 pts] Fit the regression of log(Weight) on log(Height) and Team. (Make Team a factor variable such that its dummy variable equals 1 for Bulls.) Give a summary of the results. Is Team significant?

(b) [2 pts] Plot the (parallel) regression lines representing the relationship between log(Weight) and log(Height) for each team, according to your model from the previous part. (The two lines should be on the same plot.) Label the lines.

(c) [2 pts] Fit the regression of log(Weight) on log(Height), Team, and the interaction between log(Height) and Team. Give a summary of the results. Is the interaction significant?

(d) [2 pts] Plot the (not parallel) regression lines representing the relationship between log(Weight) and log(Height) for each team, according to your model from the previous part that includes an interaction. (The two lines should be on the same plot.) Label the lines.

4. In a pilot study of treatment for facial acne vulgaris, 10 subjects were instructed to topically apply, twice daily for 4 weeks, each of two different formulations (#1 and #2) of benzoyl peroxide 10.0% cream. Each formulation was assigned to one side of the face (left or right), with the side determined at random for each subject, for the duration of the study. Severity of acne was monitored from the beginning (Day 0) to the end (Day 28). The following table lists, for each formulation and each subject, the difference in the number of papules observed on Day 28 versus on Day 0 (before application) for the side of the subject’s face to which the formulation was applied. [1]

    Subject:          1    2    3    4    5    6    7    8    9   10
    Formulation #1:   0   -7    1   -4    3  -10    1   -3   -4   -7
    Formulation #2:  -6    1   -9    3    6    5    3    0   -8   -4

(a) [2 pts] Carefully enter this data into R. Display the R object(s) you create.

(b) [2 pts] Perform an appropriate t-test for whether there is any mean difference in the papule differences of the two formulations. Display a summary of the results. What do you conclude?

(c) [2 pts] Perform an approximate randomization test (based on simulating the t-statistic under re-randomization) for whether there is any mean difference. What is your approximate p-value? Show the R code you used to produce it. Does your conclusion change?

(d) [2 pts] Display a histogram of the randomization distribution of the t-statistic, based on your simulation of the previous part. Show the R code you used to produce it.

5. [GRADUATE SECTION ONLY] Recall Mallows’ $C_p$ criterion. Let the “full” model (on which $\hat{\sigma}^2$ is based) have $p_{\text{full}}$ variables ($p'_{\text{full}} < n$), and a residual sum of squares $\mathrm{RSS}_{\text{full}}$.

(a) [2 pts] Show that $C_{p_{\text{full}}} = p'_{\text{full}}$.

(b) [3 pts] For $p < p_{\text{full}}$, express $C_p$ exclusively in terms of $p'$, $p'_{\text{full}}$, and the F-statistic, call it $F_p$, for testing the $p$-variable reduced model versus the “full” model. [Hint: $\mathrm{RSS}_p = (\mathrm{RSS}_p - \mathrm{RSS}_{\text{full}}) + \mathrm{RSS}_{\text{full}}$, and express $\hat{\sigma}^2$ using $\mathrm{RSS}_{\text{full}}$.]

(c) [3 pts] The $F_{\nu_1,\nu_2}$ distribution has expected value $\nu_2/(\nu_2 - 2)$ (when $\nu_2 > 2$). Use this and your expression from the previous part to derive the expected value of $C_p$ for a correct $p$-variable model (when the expected value exists, and assuming $F_p$ really does have the null F distribution).

(d) [2 pts] Justify the statement that “if all of the predictors that are left out have coefficients near 0, the expected value of $C_p$ is approximately $p'$.” Under what condition(s) does it tend to be true?

[1] Modified as described from original data source: ClinicalTrials.gov, Identifier: NCT00787943, last update May 1, 2015
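For reference in problem 5, a common way to write the criterion is sketched below. It assumes the convention that $p'$ counts the regression parameters of the $p$-variable model (so $p' = p + 1$ when an intercept is included) and that $\hat{\sigma}^2$ is the residual mean square of the full model. The assignment does not restate the definition, so check this form against the course notes before relying on it.

$$
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} + 2p' - n,
\qquad
\hat{\sigma}^2 = \frac{\mathrm{RSS}_{\text{full}}}{n - p'_{\text{full}}}
$$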

Some comments:

As appropriate, use the function read.table or the function read.csv to read an external text data file into R.

Unless otherwise stated, use a 5% level (α = 0.05) in all tests.
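A minimal sketch of reading a text data file with read.table and read.csv follows. The file layout (a header row, whitespace- or comma-separated columns) is an assumption, and the rows written to the temporary files are made-up placeholder values, not the contents of bullsbears.txt or any other course file.

```r
# Toy stand-in for an external data file. These rows are hypothetical
# placeholder values used only to make the example runnable.
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("Team Height Weight",
             "Bulls 78 210",
             "Bears 74 250",
             "Bulls 81 230"), tmp_txt)

# Whitespace-delimited file with a header row (assumed layout).
dat <- read.table(tmp_txt, header = TRUE, stringsAsFactors = FALSE)

# The same toy data saved as comma-separated values and read with read.csv,
# which assumes header = TRUE and sep = "," by default.
tmp_csv <- tempfile(fileext = ".csv")
write.csv(dat, tmp_csv, row.names = FALSE)
dat_csv <- read.csv(tmp_csv, stringsAsFactors = FALSE)

# Inspect column names and types before modeling.
str(dat)
```

For a real file, replace the temporary path with the file's actual location and check with str or head that the columns were parsed as intended.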