HW1 LR

School: New York University
Course: MISC
Subject: Statistics
Date: Apr 3, 2024
Pages: 16
HW1

The prostate dataset comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. The data contain the following variables:

lcavol: log(cancer volume)
lweight: log(prostate weight)
age: age
lbph: log(benign prostatic hyperplasia amount)
svi: seminal vesicle invasion
lcp: log(capsular penetration)
gleason: Gleason score
pgg45: percentage Gleason scores 4 or 5
lpsa: log(prostate specific antigen)

Question 1: Some of the variables in the dataset have undergone log transformations. Compare the distributions of log-transformed "prostate specific antigen" and log-transformed "cancer volume" with their distributions without transformation. Based on the histograms, do you think that the transformations were justified?

Hint:
1. The following R statement will load the prostate data frame: data("prostate", package = "faraway")
2. Assume that the variables were transformed using the natural logarithm.
3. Relevant R functions: hist(), exp(). https://astrostatistics.psu.edu/su07/R/html/base/html/Log.html

rm(list = ls())
data("prostate", package = "faraway")
par(mfrow = c(2, 2))
# Enter your code below
hist(prostate$lpsa, main = "Log PSA", xlab = "log(PSA)")
hist(prostate$lcavol, main = "Log Cancer Volume", xlab = "log(cancer volume)")
hist(exp(prostate$lpsa), main = "Original PSA", xlab = "PSA")
hist(exp(prostate$lcavol), main = "Original Cancer Volume", xlab = "cancer volume")

Compared with the original distributions, which are strongly right-skewed, the log-transformed distributions are far more symmetric and roughly bell-shaped. The transformations therefore appear justified: they reduce skewness and help stabilize variance.

Question 2: Calculate the correlation between all the variables.

Hint: Relevant function: cor

# Enter your code below
corr_var <- cor(prostate)
print(corr_var)

##              lcavol      lweight       age        lbph         svi         lcp
## lcavol   1.00000000  0.194128387 0.2249999  0.02734971  0.53884500  0.67531058
## lweight  0.19412839  1.000000000 0.3075247  0.43493174  0.10877818  0.10023889
## age      0.22499988  0.307524741 1.0000000  0.35018592  0.11765804  0.12766778
## lbph     0.02734971  0.434931744 0.3501859  1.00000000 -0.08584327 -0.00699944
## svi      0.53884500  0.108778185 0.1176580 -0.08584327  1.00000000  0.67311122
## lcp      0.67531058  0.100238891 0.1276678 -0.00699944  0.67311122  1.00000000
## gleason  0.43241705 -0.001283003 0.2688916  0.07782044  0.32041222  0.51482991
## pgg45    0.43365224  0.050846195 0.2761124  0.07846000  0.45764762  0.63152807
## lpsa     0.73446028  0.354121818 0.1695929  0.17980950  0.56621818  0.54881316
##              gleason     pgg45      lpsa
## lcavol   0.432417052 0.4336522 0.7344603
## lweight -0.001283003 0.0508462 0.3541218
## age      0.268891599 0.2761124 0.1695929
## lbph     0.077820444 0.0784600 0.1798095
## svi      0.320412221 0.4576476 0.5662182
## lcp      0.514829912 0.6315281 0.5488132
## gleason  1.000000000 0.7519045 0.3689867
## pgg45    0.751904512 1.0000000 0.4223157
## lpsa     0.368986693 0.4223157 1.0000000

Which pair of variables has the highest R² (exclude any correlation between a variable and itself)?

Hint: You can find the highest R² manually or using R statements.

# Enter your code below
diag(corr_var) <- 0
highcorr <- max(corr_var)
indices <- which(corr_var == highcorr, arr.ind = TRUE)
var1 <- rownames(corr_var)[indices[1, 1]]
var2 <- colnames(corr_var)[indices[1, 2]]
cat("Pair of variables with the highest R^2 (excluding self-correlation):\n")

## Pair of variables with the highest R^2 (excluding self-correlation):

cat(var1, "and", var2, "with r =", highcorr)

## pgg45 and gleason with r = 0.7519045

Note that 0.7519045 is the correlation r, not R²; squaring it gives R² ≈ 0.565. Because squaring is monotone in |r| and no negative correlation in this matrix is stronger in magnitude, pgg45 and gleason remain the pair with the highest R².
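Taking the maximum of the raw correlation matrix works here only because the strongest correlations happen to be positive; a more robust variant (a sketch, not part of the original submission) squares the matrix first so that a strong negative correlation could not be missed:

```r
# Find the pair with the highest squared correlation (R^2 per pair).
r2 <- cor(prostate)^2          # squaring makes negative correlations comparable
diag(r2) <- 0                  # exclude self-correlation
idx <- which(r2 == max(r2), arr.ind = TRUE)[1, ]
cat(rownames(r2)[idx[1]], "and", colnames(r2)[idx[2]],
    "with R^2 =", max(r2), "\n")
```

With the prostate data this selects the same pair (pgg45 and gleason), but it also reports the value on the R² scale the question asks about.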
Question 3: Create a scatter plot matrix for all the variables.

Hint: Relevant function: pairs

# Enter your code below
pairs(prostate)

Question 4: List all the variables (excluding lpsa) in descending order (from largest to smallest) according to their squared correlation (r²) with lpsa. The table should include two columns: the variable name and correlation.

Hint: You may order the variables manually or using R statements.

# creating a data frame to store the correlations. The frame will have a row for each variable
corDf <- data.frame(Variable = names(prostate))
# Enter your code below
corr <- cor(prostate$lpsa, prostate)[1, ]   # correlations of lpsa with every variable
corr <- corr[names(corr) != "lpsa"]         # exclude lpsa itself, as the question requires
corDf <- data.frame(Variable = names(corr), Correlation = corr^2)  # squared correlations
corDf <- corDf[order(corDf$Correlation, decreasing = TRUE), ]
print(corDf)

##         Variable Correlation
## lcavol    lcavol  0.53943191
## svi          svi  0.32060303
## lcp          lcp  0.30119588
## pgg45      pgg45  0.17835059
## gleason  gleason  0.13615118
## lweight  lweight  0.12540226
## lbph        lbph  0.03233146
## age          age  0.02876175

Question 5: Fit eight regression models. The response (dependent) variable in all the models should be lpsa.
The first model should include only the variable with the highest correlation as the predictor (based on the table that you created).
The second model should include the highest and the second-highest variables as predictors.
The third model should include all the predictors used in the second model and the variable with the third-highest correlation as predictors.
The fourth model…
The eighth and last model should include all the variables.
Store each model object in an R variable. You do not need to show any outputs.

# Enter your code below
m1 <- lm(lpsa ~ lcavol, data = prostate)
m2 <- lm(lpsa ~ lcavol + svi, data = prostate)
m3 <- lm(lpsa ~ lcavol + svi + lcp, data = prostate)
m4 <- lm(lpsa ~ lcavol + svi + lcp + pgg45, data = prostate)
m5 <- lm(lpsa ~ lcavol + svi + lcp + pgg45 + gleason, data = prostate)
m6 <- lm(lpsa ~ lcavol + svi + lcp + pgg45 + gleason + lweight, data = prostate)
m7 <- lm(lpsa ~ lcavol + svi + lcp + pgg45 + gleason + lweight + lbph, data = prostate)
m8 <- lm(lpsa ~ lcavol + svi + lcp + pgg45 + gleason + lweight + lbph + age, data = prostate)

Question 6: Extract the residual standard error (RSE) and R² for each model and put these values into a table. The table should have three columns: the model number (1 to 8), RSE, and R². Create two plots. The x-axis should show the model number in both plots. The y-axis should show the RSE in the first plot and R² in the second.

# creating a data frame to store the RSE and correlation of each regression
# The frame will have a row for each regression. The regressions will be numbered from 1 to 8
regDf <- data.frame(RegNo = 1:8)
# adding two columns for storing the errors and correlations:
regDf$rse <- NA
regDf$rsq <- NA
# Enter your code below
models <- list(m1, m2, m3, m4, m5, m6, m7, m8)  # a list avoids get(paste0("m", i)) lookups
regDf$rse <- sapply(models, function(m) summary(m)$sigma)
regDf$rsq <- sapply(models, function(m) summary(m)$r.squared)
print(regDf)

##   RegNo       rse       rsq
## 1     1 0.7874994 0.5394319
## 2     2 0.7556684 0.5803761
## 3     3 0.7587017 0.5815006
## 4     4 0.7568200 0.5880516
## 5     5 0.7609184 0.5881043
## 6     6 0.7208619 0.6343914
## 7     7 0.7166818 0.6426346
## 8     8 0.7084155 0.6547541

Creating the plots:

par(mfrow = c(1, 2))
# Enter your code below
plot(regDf$RegNo, regDf$rse, type = "b", xlab = "Model Number",
     ylab = "Residual Standard Error", main = "RSE vs Model Number")
plot(regDf$RegNo, regDf$rsq, type = "b", xlab = "Model Number",
     ylab = "R-squared", main = "R-squared vs Model Number")

Question 7: Show the model summary of model 8. Should the gleason variable be kept in the model? Based on the table and plots, what happened to the RSE and R² when the gleason variable was introduced?

# Enter your code below
summary(m8)

## 
## Call:
## lm(formula = lpsa ~ lcavol + svi + lcp + pgg45 + gleason + lweight + 
##     lbph + age, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7331 -0.3713 -0.0170  0.4141  1.6381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.669337   1.296387   0.516  0.60693    
## lcavol       0.587022   0.087920   6.677 2.11e-09 ***
## svi          0.766157   0.244309   3.136  0.00233 ** 
## lcp         -0.105474   0.091013  -1.159  0.24964    
## pgg45        0.004525   0.004421   1.024  0.30886    
## gleason      0.045142   0.157465   0.287  0.77503    
## lweight      0.454467   0.170012   2.673  0.00896 ** 
## lbph         0.107054   0.058449   1.832  0.07040 .  
## age         -0.019637   0.011173  -1.758  0.08229 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared:  0.6548, Adjusted R-squared:  0.6234 
## F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16

adjustedRSQbefore <- summary(m4)$adj.r.squared  # adjusted R^2 without gleason
adjustedRSQafter <- summary(m5)$adj.r.squared   # adjusted R^2 with gleason
cat("Model 4 (without 'gleason' variable):\n")

## Model 4 (without 'gleason' variable):

print(regDf[4, ])

##   RegNo     rse       rsq
## 4     4 0.75682 0.5880516

cat("Model 5 (after 'gleason' was added):\n")

## Model 5 (after 'gleason' was added):

print(regDf[5, ])

##   RegNo       rse       rsq
## 5     5 0.7609184 0.5881043

plot(m4)
plot(m5)
The p-value of the gleason coefficient (0.77503) is far above the 0.05 significance level, so gleason is not statistically significant and should not be kept in the model. Before gleason was added (model 4), RSE = 0.75682 and R² = 0.5880516; after adding it (model 5), RSE = 0.7609184 and R² = 0.5881043. The RSE increased slightly and R² was essentially unchanged, so introducing gleason provides no meaningful improvement in fit.
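This conclusion can also be checked formally with a partial F-test on the nested models with and without gleason (a sketch, not part of the original submission; it assumes m4 and m5 from Question 5 are still in the workspace):

```r
# Partial F-test: does adding gleason to model 4 significantly improve fit?
# H0: the gleason coefficient is zero given lcavol, svi, lcp, and pgg45.
anova(m4, m5)
```

A large p-value in the anova() row for the larger model would confirm that gleason contributes no significant explanatory power beyond the predictors already in model 4.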