Final 2020 Keys

Name:
uniqname:

BIOSTAT 651 Applied Statistics II: Extensions of Linear Regression
Final Exam
April 29, 2020, 8am to 6pm

If you have any questions about a problem, contact me via email at veerab@umich.edu (subject line: 651 Final Exam).

Please clearly mark your answers and append your code/output at the end.

Show all your work for partial points unless indicated otherwise.
Question    Points Possible    Points Received
1           20
2           15
3           10
4           40
5           15
Total       100
1. (20 points) Case-control studies

   1. (3 points) Describe the basic structure of a case-control study. When is such a study design appropriate?

   2. (5 points) Define the odds ratio, $\phi$, in the $2 \times 2$ table below.

                                       Disease status (Y)
                                       Absent (Y=0)    Present (Y=1)
      Risk factor    Absent (X=0)      $\pi_{00}$      $\pi_{01}$
                     Present (X=1)     $\pi_{10}$      $\pi_{11}$

      What constraint(s) should the probabilities $\pi_{00}$, $\pi_{01}$, $\pi_{10}$ and $\pi_{11}$ satisfy in a case-control study?

   3. (3 points) Under what assumption(s) is it appropriate to use a logistic regression framework for analyses of case-control studies?

   4. (9 points) Using the table below, determine the log-likelihood and score functions for the model

      $\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha + \beta X$, where $\pi_i = P(Y = 1 \mid X)$.

               Y=0         Y=1         Total
      X=0      $n_{00}$    $n_{01}$    $n_0$
      X=1      $n_{10}$    $n_{11}$    $n_1$

      Obtain the MLE of $\beta$, $\hat{\beta}$, and show that $\exp(\hat{\beta})$ is the same as the sample odds ratio, $n_{00} n_{11} / (n_{01} n_{10})$.
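The identity in part 4 can be checked numerically. The following is a minimal R sketch, not part of the exam: the cell counts are hypothetical and chosen only for illustration. It fits the logistic model to the grouped 2 x 2 table and compares $\exp(\hat{\beta})$ with the cross-product ratio.

```r
# Hypothetical 2x2 counts (illustrative only, not from the exam)
n00 <- 40; n01 <- 20   # X = 0: controls (Y = 0), cases (Y = 1)
n10 <- 15; n11 <- 30   # X = 1: controls (Y = 0), cases (Y = 1)

# Grouped binomial data: one row per level of the risk factor
tab <- data.frame(X = c(0, 1),
                  cases = c(n01, n11),
                  controls = c(n00, n10))

# Logistic regression of disease status on the risk factor
fit <- glm(cbind(cases, controls) ~ X, family = binomial, data = tab)

or_mle    <- unname(exp(coef(fit)["X"]))   # exp(beta-hat)
or_sample <- (n00 * n11) / (n01 * n10)     # sample cross-product ratio

print(c(or_mle = or_mle, or_sample = or_sample))
```

Because the model has two parameters and the table has two rows, the fit is saturated, which is why the two quantities agree up to numerical tolerance.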
2. (15 points) Consider a model $y_i \sim \text{Binomial}(n_i, p_i)$, $i = 1, \ldots, N$, where the $y_i$ are independent. Let $\beta = (\beta_1, \cdots, \beta_p)^T$ be a parameter vector and let $x_i = (x_{i1}, \cdots, x_{ip})^T$ be a covariate vector for observation $i$.

   1. (3 points) Consider $F(y) = 1 - e^{-\lambda y}$, the cdf of an exponential random variable with mean $1/\lambda > 0$. Compute the exponential quantile function $F^{-1}(p)$, $0 < p < 1$, and verify whether it is a valid link function for a GLM with binary response.

   2. (2 points) Using the link function $F^{-1}(\cdot)$ computed above, show that the success probability $p_i$ can be defined as

      $p_i = n_i \left(1 - e^{-\lambda \beta^T x_i}\right)$, $i = 1, \ldots, N$.

   3. (5 points) Suppose, under the logistic model, $p_i = e^{\beta^T x_i} / (1 + e^{\beta^T x_i})$. Write down the log-likelihood for $\beta$ and hence show that the deviance, $D(\beta)$, is given by

      $D(\beta) = 2 \sum_{i=1}^{N} \left[ y_i \left\{ \text{logit}(\hat{p}_i) - \beta^T x_i \right\} + n_i \left\{ \log(1 - \hat{p}_i) + \log(1 + e^{\beta^T x_i}) \right\} \right]$,

      where $\hat{p}_i = y_i / n_i$, $i = 1, \ldots, N$, and $\text{logit}(p) = \log(p / (1 - p))$.

   4. (5 points) Under the logistic model, with $p_i = e^{\beta^T x_i} / (1 + e^{\beta^T x_i})$, show that the maximum likelihood estimator of $\beta$ satisfies the $p$ equations

      $\sum_{i=1}^{N} x_{ij} (y_i - \hat{\mu}_i) = 0$, $j = 1, \ldots, p$,

      where the $\hat{\mu}_i$ should be defined in your answer.
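As a sanity check on parts 3 and 4, the closed-form deviance and the score equations can be evaluated numerically. This is a minimal R sketch on hypothetical grouped data (the values of `x`, `n`, `y` are invented for illustration, with $0 < y_i < n_i$ so that $\text{logit}(\hat{p}_i)$ is finite); it takes $\hat{\mu}_i = n_i \hat{p}_i(\hat{\beta})$, which is one natural reading of part 4.

```r
# Hypothetical grouped binomial data (illustrative only)
x <- c(-1, 0, 1, 2, 3)
n <- c(20, 25, 30, 25, 20)
y <- c(3, 8, 15, 18, 17)

fit  <- glm(cbind(y, n - y) ~ x, family = binomial)
Xmat <- model.matrix(fit)
eta  <- as.vector(Xmat %*% coef(fit))   # beta-hat^T x_i
phat <- y / n                           # saturated-model estimate of p_i

logit <- function(p) log(p / (1 - p))

# Deviance from the closed-form expression in part 3, evaluated at beta-hat
D_formula <- 2 * sum(y * (logit(phat) - eta) +
                     n * (log(1 - phat) + log(1 + exp(eta))))

# Score equations from part 4 with mu-hat_i = n_i * fitted probability
mu_hat <- n * fitted(fit)
score  <- as.vector(crossprod(Xmat, y - mu_hat))

print(c(formula = D_formula, glm = deviance(fit)))
print(score)
```

The first printed pair should agree, and the score vector should be zero up to the convergence tolerance of `glm`.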
3. (10 points) Suppose $Y$ has a negative binomial distribution with probability mass function

   $f(y \mid k, p) = \frac{\Gamma(y + k)}{\Gamma(k)\, \Gamma(y + 1)} (1 - p)^k p^y$, $y = 0, 1, \ldots$,   (1)

   where $p \in (0, 1)$ is a parameter, $k$ is a known positive constant and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$ is the gamma function.

   1. (3 points) Show that (1) may be written as a GLM distribution

      $f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$.

      Determine $\theta$ as a function of $p$.

   2. (5 points) Hence prove that the mean and variance of the negative binomial are given by

      $E(Y) = \frac{kp}{1 - p}$ and $V(Y) = \frac{kp}{(1 - p)^2}$.

   3. (2 points) Discuss briefly when the negative binomial is useful as a GLM.
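The stated moments can be checked numerically. The sketch below is illustrative only: `k` and `p` are hypothetical values, and it assumes the choice $\theta = \log p$, $b(\theta) = -k \log(1 - e^{\theta})$, $a(\phi) = 1$, which follows from taking logs in (1). Note that R's `dnbinom` is parameterised by the complementary probability, so `prob = 1 - p` matches (1).

```r
# Hypothetical parameter values, chosen only for illustration
k <- 3
p <- 0.4

# pmf (1) via dnbinom: size = k, prob = 1 - p gives (1 - p)^k p^y
y <- 0:2000                          # truncation; the tail mass beyond is negligible
f <- dnbinom(y, size = k, prob = 1 - p)

mean_num <- sum(y * f)
var_num  <- sum((y - mean_num)^2 * f)

# Exponential-family check: E(Y) = b'(theta), V(Y) = a(phi) b''(theta)
theta <- log(p)
b  <- function(th) -k * log(1 - exp(th))
h  <- 1e-4
b1 <- (b(theta + h) - b(theta - h)) / (2 * h)               # numerical b'
b2 <- (b(theta + h) - 2 * b(theta) + b(theta - h)) / h^2    # numerical b''

print(c(mean_num = mean_num, b1 = b1, formula = k * p / (1 - p)))
print(c(var_num = var_num, b2 = b2, formula = k * p / (1 - p)^2))
```

All three entries in each printed row should agree up to the finite-difference error.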
4. (40 points) A retrospective study was carried out by the University of Adelaide on a random sample of graduate students. Each student was followed for 50 years after graduation and classified as dead or alive. Data are contained in the file Adelaide1.txt, with columns YEAR, DEPT, SURVIVORS, TOTAL. The following model is of interest:

   $\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 (YEAR_i - 1900) + \beta_2 ART_i + \beta_3 MED_i + \beta_4 ENG_i$,

   where $YEAR_i$ is the year of graduation; $ART_i = 1$ if the student graduated from the Arts Department and 0 otherwise; $MED_i$ and $ENG_i$ are defined analogously. Note that $\pi_i = \pi(x_i) = P(Y_i = 1 \mid x_i)$, with $Y_i = 1$ if the graduate survives (0 if not). You can use R or any other software to implement the model. Please attach the code and output with your exam, either as an .Rmd file or in any other format.

   [Part A (27 points)] For the above generalized linear model,

   1. (3 points) Define $\eta_i$, $\mu_i$, $v(\mu_i)$.

   2. (5 points) Sketch out an Iteratively Re-weighted Least Squares (IRWLS) algorithm for estimating $\beta = (\beta_0, \ldots, \beta_4)^T$. Specifically, (i) choose a starting value, (ii) give the update formula, and (iii) set up a stopping criterion. Compute $\hat{\beta}$ using IRWLS. Print the iteration history. Your output should provide (for $j = 0, \ldots, 4$): $\hat{\beta}_j$, $\widehat{SE}(\hat{\beta}_j)$, the Wald statistic for the test of $H_0: \beta_j = 0$ and the corresponding p-value.

   3. (5 points) Test $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ using the Wald, score and likelihood ratio tests.

   4. (5 points) Give an interpretation for the regression coefficients $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$.

   5. (3 points) Suppose that you replaced $(YEAR - 1900)$ with $YEAR$ in the model. Describe whether and how each of $\beta_0, \ldots, \beta_4$ would change as a result of the re-coding. In each case, provide a brief justification of your claim.

   6. (3 points) Suppose that, instead, $Y_i$ was redefined such that it represented an indicator for death. How would this affect each of $\beta_0, \ldots, \beta_4$? Briefly justify your claim.

   7. (3 points) Suppose that there were 20 Engineering graduates in the 1946 cohort. Estimate the number of 50-year survivors along with the corresponding 95% confidence interval.

   [Part B (13 points)] Now consider a saturated model for the Adelaide data.

   1. (3 points) Write out an equation for a saturated model. How many parameters does this model contain? Hint: take a careful look at the data set before answering.

   2. (5 points) Compute parameter estimates for the saturated model you listed in #1 above.

   3. (3 points) State the most prominent good property of this model, and the most obvious bad property.

   4. (2 points) What role does this model play in hypothesis testing in logistic regression?
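The IRWLS steps asked for in Part A, item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the exam's reference solution: the data frame `dat` is hypothetical (Adelaide1.txt is not reproduced here, and the baseline department label "SCI" is invented), the function name `irwls_logit` is made up for this sketch, the starting value is $\beta = 0$, and the stopping rule is a maximum absolute change in $\beta$ below `tol`.

```r
# Hypothetical grouped data standing in for Adelaide1.txt (NOT the real data)
dat <- data.frame(
  YEAR      = c(1938, 1938, 1942, 1942, 1946, 1946, 1950, 1950),
  DEPT      = c("ART", "SCI", "MED", "ENG", "ART", "MED", "SCI", "ENG"),
  SURVIVORS = c(16, 9, 12, 17, 19, 14, 13, 21),
  TOTAL     = c(30, 14, 20, 22, 25, 18, 15, 24)
)

# Design matrix: intercept, YEAR - 1900, and department indicators
X <- cbind(Intercept = 1,
           YEAR1900  = dat$YEAR - 1900,
           ART       = dat$DEPT == "ART",
           MED       = dat$DEPT == "MED",
           ENG       = dat$DEPT == "ENG")
y <- dat$SURVIVORS
n <- dat$TOTAL

irwls_logit <- function(X, y, n, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                         # (i) starting value
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)                  # linear predictor
    p   <- plogis(eta)                            # fitted probabilities
    w   <- n * p * (1 - p)                        # working weights
    z   <- eta + (y - n * p) / w                  # working response
    # (ii) update: weighted least squares of z on X
    beta_new <- as.vector(solve(crossprod(X, w * X), crossprod(X, w * z)))
    cat(sprintf("iter %2d: %s\n", iter,
                paste(format(beta_new, digits = 6), collapse = "  ")))
    done <- max(abs(beta_new - beta)) < tol       # (iii) stopping criterion
    beta <- beta_new
    if (done) break
  }
  p  <- plogis(as.vector(X %*% beta))
  V  <- solve(crossprod(X, (n * p * (1 - p)) * X))  # inverse Fisher information
  se <- sqrt(diag(V))
  z_stat <- beta / se
  data.frame(term = colnames(X), estimate = beta, se = se,
             wald_z = z_stat, p_value = 2 * pnorm(-abs(z_stat)))
}

res <- irwls_logit(X, y, n)
print(res)

# Cross-check against R's built-in fit of the same model
fit <- glm(cbind(SURVIVORS, TOTAL - SURVIVORS) ~ I(YEAR - 1900) +
             I(DEPT == "ART") + I(DEPT == "MED") + I(DEPT == "ENG"),
           family = binomial, data = dat)
max_diff <- max(abs(res$estimate - coef(fit)))
```

For the canonical logit link, IRWLS coincides with Fisher scoring, so `max_diff` should be close to zero. Other starting values or stopping rules (for example, a relative change in the deviance) are equally defensible.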
5. (15 points) Consider the Poisson regression example we covered in class, wherein we analyzed a coronary heart disease (CHD) data set. Recall that the study observed $n = 3{,}154$ males ages 40-50. In the study from which the table is derived, men were followed for an average of 8 years, and the number of CHD cases was recorded. The following model was fitted (with an offset):

   $\log \mu_i = \log T_i + \beta_0 + \beta_1 I(TYPE_i = A) + \beta_2 HBP_i + \beta_3 I(CIG_i = 1) + \beta_4 I(CIG_i = 2) + \beta_5 I(CIG_i = 3)$

   The dataset details and analysis outputs are attached at the end.

   1. (5 points) Interpret $\exp\{\hat{\beta}_2\}$ from the R output.

   2. (5 points) We noted that there might be an issue with overdispersion in this analysis. Suppose the scale parameter estimate is 1.9282. Redo the Wald test for $H_0: \beta_3 = 0$, this time correcting for overdispersion.

   3. (5 points) Based on these data and analyses, one of the investigators is planning a future study of non-smoking Type A individuals with high blood pressure. What is the predicted rate (per person-year (PY)) for non-smoking Type A individuals with high blood pressure? S/he has received funding for 5 years of follow-up. Based on the results from the current analysis, how many men should s/he enroll in the study if s/he wants to observe at least 50 cases of CHD?
Biostatistics 651 Final Exam
Question 5: Poisson Regression: Dataset details and analysis outputs

1. Introduction

We analyze the coronary heart disease (CHD) data. The study observed n = 3154 males ages 40-50. The study featured a prospective cohort design, with staggered entry and the observation period concluding on 12/31/70. Men were followed for an average of 8 years, and the number of CHD cases was recorded. The data include the following variables (other risk factors of interest are also included):

  - Behavior type (A and B)
  - High blood pressure (HBP): 1 indicates blood pressure (BP) > 140; 0 otherwise
  - Cigarette usage (CIG): 0 = low; 1 = medium; 2 = high
  - Number of CHD cases
  - Follow-up time (in person-years, PY)

2. Data Analysis

    if (!require("pacman")) install.packages("pacman")
    ## Loading required package: pacman
    pacman::p_load(knitr, car)
    # boot: logit function
    df <- read.csv("CHD_Poisson.csv")   # file name must be quoted
    kable(df)

    TYPE  HBP  CIG  CHD      PY
    A       1    0   29  1251.9
    A       1    1   21   640.0
    A       1    2    7   374.5
    A       1    3   12   338.2
    A       0    0   41  4451.1
    A       0    1   24  2243.5
    A       0    2   27  1153.6
    A       0    3   17   925.0
    B       1    0    8  1366.8
    B       1    1    9   497.0
    B       1    2    3   238.1
    B       1    3    7   146.3
    B       0    0   20  5268.2
    B       0    1   16  2542.0
    B       0    2   13  1140.7
    B       0    3    3   614.6

Model:

    # factor() guards against TYPE being read as character (the default in R >= 4.0)
    df$TYPE <- relevel(factor(df$TYPE), ref = "B")
    df$CIG  <- factor(df$CIG, levels = c("0", "1", "2", "3"))
    df$HBP  <- factor(df$HBP, levels = c("0", "1"))
(a) Fitting the following main effects model using an offset.

Model: $\log\left(\frac{\mu_i}{T_i}\right) = \log(\mu_i) - \log(T_i) = \log(\lambda_i) = \beta_0 + \beta_1 I(TYPE_i = A) + \beta_2 HBP + \beta_3 I(CIG_i = 1) + \beta_4 I(CIG_i = 2) + \beta_5 I(CIG_i = 3)$

    # Specify the offset term in the formula
    glm.Offset1 <- glm(data = df, CHD ~ TYPE + HBP + CIG + offset(log(PY)),
                       family = poisson(link = log))
    summary(glm.Offset1)

    ##
    ## Call:
    ## glm(formula = CHD ~ TYPE + HBP + CIG + offset(log(PY)), family = poisson(link = "log"),
    ##     data = df)
    ##
    ## Deviance Residuals:
    ##      Min        1Q    Median        3Q       Max
    ## -2.22781  -0.78956  -0.00962   0.91396   2.14434
    ##
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)  -5.4738     0.1407 -38.900  < 2e-16 ***
    ## TYPEA         0.7566     0.1362   5.556 2.76e-08 ***
    ## HBP1          0.7576     0.1293   5.860 4.64e-09 ***
    ## CIG1          0.3904     0.1566   2.494 0.012645 *
    ## CIG2          0.7186     0.1740   4.129 3.64e-05 ***
    ## CIG3          0.7400     0.1904   3.887 0.000101 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for poisson family taken to be 1)
    ##
    ##     Null deviance: 119.061  on 15  degrees of freedom
    ## Residual deviance:  19.282  on 10  degrees of freedom
    ## AIC: 101.59
    ##
    ## Number of Fisher Scoring iterations: 4

(b) Overdispersion analysis

$H_0$: the model fits the data well.

Comparing the deviance of the fitted model with that of the saturated model gives LRT = 19.2822.

Since 19.2822 > 18.3 = $\chi^2_{10,\, 0.05}$, reject $H_0$ at $\alpha = 0.05$.

Scale parameter estimate = 19.2822 / 10 = 1.92822
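The numbers above are enough to sketch the calculations asked for in Question 5, parts 2 and 3. The R snippet below is a hedged illustration rather than an official key. It assumes the usual quasi-likelihood correction (standard errors inflated by the square root of the scale estimate), treats "non-smoking" as CIG = 0, and reads "observe at least 50 cases" as an expected count of 50. All variable names are introduced here for illustration.

```r
# Values copied from the glm summary printed above
b0        <- -5.4738   # (Intercept)
b_typeA   <-  0.7566   # TYPEA
b_hbp     <-  0.7576   # HBP1
b_cig1    <-  0.3904   # CIG1 (beta_3)
se_cig1   <-  0.1566   # model-based SE of beta_3
scale_hat <-  1.9282   # residual deviance / df

# Wald test of H0: beta_3 = 0 with the SE inflated by sqrt(scale)
se_adj <- se_cig1 * sqrt(scale_hat)
z_adj  <- b_cig1 / se_adj
p_adj  <- 2 * pnorm(-abs(z_adj))

# Predicted CHD rate per person-year: Type A, HBP = 1, CIG = 0
rate <- exp(b0 + b_typeA + b_hbp)

# Smallest enrolment with an expected count of at least 50 cases in 5 years
n_needed <- ceiling(50 / (5 * rate))

print(c(se_adj = se_adj, z_adj = z_adj, p_adj = p_adj))
print(c(rate_per_PY = rate, n_needed = n_needed))
```

Under these assumptions the corrected test of $\beta_3$ is no longer significant at the 0.05 level, in contrast to the uncorrected z value of 2.494. The enrolment figure ignores drop-out and the sampling variability of the case count, so a real design would likely add a margin.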