hw8-2fin (docx)

School: Texas A&M University
Course: 112
Subject: Statistics
Date: Feb 20, 2024


Homework 8 for Stat 312
100 pts total (50-30-20 pts)
Name

Download the RData file "BreastCancer.RData" (which includes an R data frame "cancer") and load it into your workspace. We will use this dataset for all questions. This dataset is downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra).

If you have difficulty loading the RData file, below are 3 methods for loading RData files. Feel free to skip to Question 1 if you already understand this very well.

Method 1. In RStudio click Session -> Load Workspace and then locate and open the file "BreastCancer.RData".

Method 2. You need to know the FULL path of the file "BreastCancer.RData". This may not be easy, especially on a Mac computer. Assume the full path is "XXX/Stat312/Data/BreastCancer.RData". Then in the RStudio Console run

    load("XXX/Stat312/Data/BreastCancer.RData")

Method 3.
(1) In RStudio create an R Script file (click File -> New File -> R Script) and save it under a folder, say, "FolderX". (You need to do this anyway to write your R code for the homework.)
(2) In File Manager, move "BreastCancer.RData" to "FolderX".
(3) In RStudio click Session -> Set Working Directory -> To Source File Location.
(4) Run the following command

    load("BreastCancer.RData")

1. A simple linear regression model can be written as $y_i = \beta_0 + \beta_1 x_i + E_i$ (for the $i$th sample). Run the following code to perform a linear regression using the cancer data and answer the questions. For parts (iii) and (iv), attach your R code.

    fit <- lm(HOMA ~ Leptin, data = cancer)
    summary(fit)
    plot(cancer$Leptin, cancer$HOMA)
    abline(fit, col = 'blue')
    subX = which.max(cancer$HOMA)
    points(cancer$Leptin[subX], cancer$HOMA[subX], col = 'red', pch = 19)

[Figure: scatter plot of cancer$HOMA (y-axis) against cancer$Leptin (x-axis), with the fitted regression line in blue and the individual subX marked in red.]

(i) What is the response variable and what is the explanatory variable?

The response variable is HOMA and the explanatory variable is Leptin.

(ii) According to the output of the summary function, what are the numerical values of the intercept and the slope of the blue line?

The intercept is 1.04159 and the slope is 0.06212.

(iii) subX is the index of the individual who is represented by the red dot at the top of the plot. Hence, its x-coordinate is given by cancer$Leptin[subX] and its y-coordinate by cancer$HOMA[subX]. Find the fitted value for this individual. (Hint: You can either use the formula $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ or use the fitted.values component of the lm output.)

The fitted value is 5.445.

(iv) Use the result of part (iii) to compute the residual for the individual subX. (Hint: The value you get should be equal to fit$residuals[subX].)

Residual = 19.605
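Parts (iii) and (iv) ask for R code, which is not attached here. The following is a minimal sketch of how the fitted value and residual could be computed, both from the formula and from the components of the lm output. The variable names (`fitted_formula`, `fitted_lm`, `resid_manual`, `resid_lm`) are illustrative, and the fallback data frame is a made-up stand-in that is used only when `cancer` has not been loaded, so its numbers carry no meaning.

```r
# Assumes BreastCancer.RData has been loaded so that `cancer` exists.
# Fallback: a tiny made-up data frame (NOT the real data) so the sketch runs.
if (!exists("cancer")) {
  cancer <- data.frame(
    Leptin = c(5, 8, 12, 20, 33, 47, 60, 85),
    HOMA   = c(0.6, 1.1, 0.9, 2.4, 3.0, 25.0, 4.1, 6.5)
  )
}

fit  <- lm(HOMA ~ Leptin, data = cancer)
subX <- which.max(cancer$HOMA)   # index of the individual with the largest HOMA

# (iii) Fitted value, computed two equivalent ways
b <- coef(fit)                   # b["(Intercept)"] and b["Leptin"]
fitted_formula <- unname(b["(Intercept)"] + b["Leptin"] * cancer$Leptin[subX])
fitted_lm      <- unname(fit$fitted.values[subX])

# (iv) Residual = observed value - fitted value
resid_manual <- cancer$HOMA[subX] - fitted_formula
resid_lm     <- unname(fit$residuals[subX])

print(c(fitted = fitted_formula, residual = resid_manual))
```

With the real data loaded, `fitted_formula` and `resid_manual` should agree with `fit$fitted.values[subX]` and `fit$residuals[subX]`, as the hint in part (iv) suggests.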

Residual = observed value - fitted value = 25.05034 - 5.445 = 19.605

(v) Report the p-value in the last line of the summary output. This is the p-value for the null hypothesis that $\beta_1 = 0$ (i.e. x and y are unrelated). According to the p-value, do you think the two variables HOMA and Leptin are related to each other?

There is a relationship between the two variables HOMA and Leptin.

2. Use lm to fit the following linear regression model (E represents the random error),

$\log(y) = \beta_0 + \beta_1 \log(x) + E$

where x still denotes the variable Leptin and y denotes the variable HOMA. Then plot log(cancer$HOMA) against log(cancer$Leptin) and add a regression line to your plot. Attach your code and plot. (Hint: Mimic the code in Question 1. You just need to use the log function at a few places.)
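Neither the p-value for Question 1(v) nor the code for Question 2 is attached here. A minimal sketch of both, mimicking the Question 1 code, might look as follows. The names `p_overall`, `p_slope` and `fit_log` are illustrative, and the fallback data frame is a made-up stand-in used only when `cancer` has not been loaded, so any numbers it produces are meaningless.

```r
# Assumes BreastCancer.RData has been loaded so that `cancer` exists.
# Fallback: a tiny made-up data frame (NOT the real data) so the sketch runs.
if (!exists("cancer")) {
  cancer <- data.frame(
    Leptin = c(5, 8, 12, 20, 33, 47, 60, 85),
    HOMA   = c(0.6, 1.1, 0.9, 2.4, 3.0, 25.0, 4.1, 6.5)
  )
}

# Question 1(v): the p-value on the last line of summary(fit) is the
# overall F-test p-value, recomputed here from the stored F statistic.
fit <- lm(HOMA ~ Leptin, data = cancer)
f   <- summary(fit)$fstatistic
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# With a single explanatory variable it equals the slope's t-test p-value.
p_slope <- summary(fit)$coefficients["Leptin", "Pr(>|t|)"]
print(p_overall)

# Question 2: same model, with both variables on the log scale
fit_log <- lm(log(HOMA) ~ log(Leptin), data = cancer)
print(summary(fit_log))

plot(log(cancer$Leptin), log(cancer$HOMA))
abline(fit_log, col = 'blue')   # regression line on the log-log plot
```

A small p-value (conventionally below 0.05) would support the conclusion that HOMA and Leptin are related.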


3. The Classification variable in the cancer data gives the disease status of each individual (0: healthy controls; 1: cancer patients). Use glm to fit a multiple logistic regression model using Classification as the response variable $Y$, Glucose as the explanatory variable $X_1$ and Resistin as the explanatory variable $X_2$. That is,

$\log \frac{P(y_i = 1 \mid x_{1i}, x_{2i})}{P(y_i = 0 \mid x_{1i}, x_{2i})} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i},$

where $y_i$ is the value of Classification for the $i$th individual, $x_{1i}$ is the value of Glucose and $x_{2i}$ is the value of Resistin for the $i$th individual.

(i) Attach your code and report the estimates for $\beta_0$, $\beta_1$, $\beta_2$.

$\beta_0 = -7.72704$
$\beta_1 = 0.07697$
$\beta_2 = 0.05095$
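Part (i) asks for code, which is not attached here. A minimal sketch of a logistic regression fit with glm is given below. The name `fit_glm` is illustrative, and the fallback data are simulated with arbitrary made-up coefficients, used only when `cancer` has not been loaded, so the estimates they produce are unrelated to the ones reported above.

```r
# Assumes BreastCancer.RData has been loaded so that `cancer` exists.
# Fallback: simulated stand-in data (NOT the real data) so the sketch runs.
if (!exists("cancer")) {
  set.seed(312)
  n <- 100
  cancer <- data.frame(
    Glucose  = runif(n, 60, 200),
    Resistin = runif(n, 3, 80)
  )
  # Arbitrary coefficients, chosen only to give a mix of 0s and 1s
  cancer$Classification <- rbinom(n, 1, plogis(-5 + 0.04 * cancer$Glucose))
}

# Logistic regression: binomial family with its default logit link
fit_glm <- glm(Classification ~ Glucose + Resistin,
               data = cancer, family = binomial)

# Estimates of beta_0, beta_1, beta_2, in that order
print(coef(fit_glm))

# Fitted probabilities P(y_i = 1 | x_1i, x_2i) for each individual
p_hat <- predict(fit_glm, type = "response")
```

The same estimates also appear in the "Estimate" column of `summary(fit_glm)`.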

(ii) Suppose $\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} = 3$. Then what would be the value of $P(y_i = 1 \mid x_{1i}, x_{2i})$? (No coding is needed except that you may want to use R to compute the exponential function.)

P = 0.953
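The value follows from solving the log-odds equation for the probability: if $\log\frac{p}{1-p} = 3$, then $p = \frac{e^{3}}{1 + e^{3}}$. A short sketch of that arithmetic in base R is given below. The variable names are illustrative, and `plogis` is R's built-in logistic distribution function, included here only as a cross-check.

```r
eta <- 3                                 # value of the linear predictor

# Invert the logit by hand: p = e^eta / (1 + e^eta)
p_manual <- exp(eta) / (1 + exp(eta))

# Same quantity via the built-in logistic CDF
p_builtin <- plogis(eta)

print(round(p_manual, 3))                # 0.953
```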