W22

pdf

School

University of California, Santa Barbara

Course

127

Subject

Statistics

Date

Jan 9, 2024

Type

pdf

Pages

8

Uploaded by BailiffBoulderPanther4

Solutions: PSTAT 127 Midterm, W22, February 10

Clear working must be shown to receive credit. The maximum possible score is 100. There are four questions with multiple parts. Answer all parts, starting each question on a new page. Read the exam carefully and make sure you answer all components, which may span multiple pages. Upload your solutions as a pdf file to Gradescope. You may annotate this template, or write on paper and then scan and upload. Tables are attached at the end of this question sheet. Best wishes during this midterm exam!

1. [20 points] Consider a single random variable $Y$ with probability mass function

$$P(Y = y) = \begin{cases} \exp\{ -(\lambda - 1) + y \ln(\lambda - 1) - \ln(y!) \} & \text{if } y \in \{0, 1, 2, 3, \ldots\} \\ 0 & \text{otherwise} \end{cases}$$

where parameter $\lambda > 1$ (so that $\ln(\lambda - 1)$ is defined), and ln(.) is natural log (i.e., log base e).

Clearly specify the following components when writing the above distribution in natural exponential family form.

(a) Write down the canonical parameter $\theta$ in terms of $\lambda$:
    $\theta = \ln(\lambda - 1)$

(b) Now find $\lambda$ in terms of $\theta$ (i.e., write $\lambda$ as a function of $\theta$):
    $\theta = \ln(\lambda - 1) \iff \lambda = \exp(\theta) + 1$

(c) $b(\theta)$ =
    $b(\theta) = \lambda - 1 = \exp(\theta)$

(d) $\phi$ =
    $\phi = 1$

(e) $a(\phi)$ =
    $a(\phi) = 1$

(f) $c(y, \phi)$ =
    $c(y, \phi) = -\ln(y!)$

(g) In the following, use natural exponential family formulae from PSTAT 127. Write down the formulae used and show clear working.

  i. Find $E(Y)$ in terms of $\lambda$.
    $E(Y) = b'(\theta) = \dfrac{\partial \exp(\theta)}{\partial \theta} = \exp(\theta) = \lambda - 1$

  ii. Find $Var(Y)$ in terms of $\lambda$.
    $Var(Y) = a(\phi)\, b''(\theta) = 1 \times \dfrac{\partial^2 \exp(\theta)}{\partial \theta^2} = \exp(\theta) = \lambda - 1$
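As a check on parts (a) to (f), the pmf can be matched term by term against the natural exponential family form. The sketch below assumes the course's form is $f(y; \theta, \phi) = \exp\{(y\theta - b(\theta))/a(\phi) + c(y, \phi)\}$, which is consistent with the formulae $E(Y) = b'(\theta)$ and $Var(Y) = a(\phi)\, b''(\theta)$ used in part (g). The symbols $\theta$ (canonical parameter) and $\lambda$ (original parameter) are assumed names.

```latex
\begin{align*}
P(Y = y) &= \exp\{\, y \ln(\lambda - 1) - (\lambda - 1) - \ln(y!) \,\} \\
         &= \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}, \\
\text{with}\quad
\theta &= \ln(\lambda - 1), \quad
b(\theta) = e^{\theta} = \lambda - 1, \quad
a(\phi) = 1, \quad
c(y, \phi) = -\ln(y!).
\end{align*}
```

Read this way, $Y$ is a Poisson random variable with mean $\lambda - 1$, which agrees with $E(Y) = Var(Y) = \lambda - 1$ in part (g).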
2. [20 points] For homoskedastic (constant variance) uncorrelated Gaussian linear regression, you wrote models in the form $Y = X\beta + \epsilon$ where $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.

(a) If $X$ is $n \times p$ where $n > p = 5$ and $rank(X) = 3$, can we estimate the parameter vector $\beta$ uniquely by ordinary least squares without adding any constraints?
    No

  i. If you answer "yes", then write down the formulae for the ordinary least squares estimator of $\beta$ (you may have simply called this the least squares estimator).

  ii. If you answer "no", then clearly explain why not.
    $X$ does not have full column rank: $rank(X) = 3 < p$. $X'X$ is singular, so its inverse does not exist, therefore you cannot estimate $\beta$ by $(X'X)^{-1} X'Y$.
    Fine if they write equivalent reasons, e.g., a unique solution to the normal equations does not exist, where they indicate what the normal equations are.

(b) When writing the above model in terms of the 3 GLM components in PSTAT 127, what link function does the above model use? No derivation is needed if you know the answer.
    Identity link, i.e., the function $g(.)$ defined as $g(\mu) = \mu$.
    Please give full credit if they write either "Identity link" or if they state "$g(\mu) = \mu$".

(c) When writing the above model in terms of the 3 GLM components in PSTAT 127, what is the natural exponential family parameter $\phi$ in terms of the notation used above? Note, I am asking for the true parameter $\phi$, which may be an unknown parameter; I am not asking for an estimator of $\phi$. No derivation is needed if you immediately know the answer.
    $\phi = \sigma^2$

(d) What is the distribution of the $i$th element of random vector $Y$ (i.e., the distribution of $Y_i$ for some $i \in \{1, \ldots, n\}$), where $x_i^T$ is the $i$th row of matrix $X$? No derivation is needed if you immediately can write out the distribution with its associated parameters in terms of notation provided, or its associated degrees of freedom (whichever of these is appropriate for this distribution).
    $Y_i \sim N(\mu_i, \sigma^2)$, where $\mu_i = x_i^T \beta$
    It is fine if students do not write "$\mu_i$" in their answer, and immediately write in terms of $x_i^T \beta$.
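The rank argument in 2(a) can be seen numerically. The following is a minimal sketch on a made-up design matrix (the columns `x1` to `x5`, the seed, and the simulated response are all hypothetical, not part of the exam): five columns are built from only three linearly independent ones, so the rank is 3 and `lm()` cannot estimate all five coefficients.

```r
# Hypothetical design matrix: n = 10 rows, p = 5 columns, rank 3.
set.seed(127)
n  <- 10
x1 <- rep(1, n)
x2 <- rnorm(n)
x3 <- rnorm(n)
# Columns 4 and 5 are exact linear combinations of the first three.
X <- cbind(x1, x2, x3, x4 = x2 + x3, x5 = x1 - 2 * x2)

rank_X <- qr(X)$rank        # numerical rank of X
y      <- rnorm(n)          # arbitrary simulated response
fit    <- lm(y ~ X - 1)     # least squares without an extra intercept
n_na   <- sum(is.na(coef(fit)))  # coefficients lm() could not estimate

rank_X
n_na
```

Here `lm()` reports `NA` for the aliased coefficients rather than returning a unique estimate, which mirrors the statement that $(X'X)^{-1}$ does not exist when $rank(X) < p$.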
3. [25 points] Suppose that y, x and w are each vectors of length n = 50, where each element in y is a number in {0, 1, 2, ..., 15}, and every element of w is strictly positive.

(a) Write down the models and assumptions for each of models "fit1" and "fit2" in the 3 components of the GLM, and explain the differences between these models and assumptions.

    wnew = log(w)
    fit1 <- glm( y ~ x + wnew, family = binomial( link = "logit" ) )  ## but read the note below
    fit2 <- glm( y ~ x + wnew, family = gaussian( link = "identity" ) )

Note: I graded the random component in fit1 very leniently, since I had a typo in the fit1 question above. Since this was my error, I gave you full credit for the fit1 random component (and fit3) if what you wrote made sense, given the typo, even if you wrote a different distribution within the random component of fit1. Specifically, I omitted a revision in the question where I specified

    y.mat = cbind( y, 15 - y )

followed by

    fit1 <- glm( y.mat ~ x + wnew, family = binomial( link = "logit" ) )

Many students answered as I had planned; however, since the error was my own, no points were deducted for anyone if they entered a different distribution for fit1. During the midterm, I don't recall any emails asking specifically about the way the binomial data were entered into R; however, please speak with me if you have concerns/questions.

The solution I intended was

fit1:
  Random Component: $Y_i \overset{indep}{\sim} Binomial(m_i, \pi_i)$. Let $U_i = Y_i / m_i$, with $m_i = 15$. Let $\mu_i = E(U_i)$. Then $\mu_i = \pi_i$.
  Systematic Component: $\eta_i = \beta_0 + \beta_1 x_i + \beta_2 \ln(w_i)$
  Link function: $g(\mu_i) = logit(\mu_i) = \ln\left( \dfrac{\mu_i}{1 - \mu_i} \right) = \eta_i$
  for $i \in \{1, \ldots, 50\}$, where $y_i$, $x_i$ and $w_i$ are the $i$th elements of y, x, and w respectively.

fit2:
  Random Component: $Y_i \overset{indep}{\sim} N(\mu_i, \sigma^2)$ where $\mu_i = E(Y_i)$
  Systematic Component: $\eta_i = \beta_0 + \beta_1 x_i + \beta_2 \ln(w_i)$
  Link function: $g(\mu_i) = \mu_i = \eta_i$
  for $i \in \{1, \ldots, 50\}$.
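To make the intended specification concrete, here is a minimal sketch that fits both models to simulated data. The data, the seed, and the coefficient values used to simulate (`-0.5`, `0.8`, `0.3`) are hypothetical; only the `glm()` calls follow the corrected commands in the note above.

```r
# Simulated data matching the setup: n = 50, y in {0, ..., 15}, w > 0.
set.seed(1)
n    <- 50
x    <- rnorm(n)
w    <- rexp(n) + 0.1                 # strictly positive
wnew <- log(w)
eta  <- -0.5 + 0.8 * x + 0.3 * wnew   # hypothetical true linear predictor
y    <- rbinom(n, size = 15, prob = plogis(eta))

# Binomial response entered as (successes, failures) with m_i = 15.
y.mat <- cbind(y, 15 - y)
fit1  <- glm(y.mat ~ x + wnew, family = binomial(link = "logit"))

# Gaussian model for the raw counts with identity link.
fit2  <- glm(y ~ x + wnew, family = gaussian(link = "identity"))

coef(fit1)
coef(fit2)
```

Both fits share the same systematic component; they differ in the random component and the link, so the fit1 coefficients are on the log-odds scale while the fit2 coefficients are on the scale of the counts.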
(b) Suppose I want to test model fit1 (above) against model fit3 (below) using a nested model hypothesis test.

fit3 command with typo corrected:

    fit3 <- glm( y.mat ~ x, family = binomial( link = "logit" ) )

  i. Write down what GLM component/s differ between models fit1 and fit3, and write out the corresponding component/s for model fit3.
    Only the systematic component differs. In fit3, $\eta_i = \beta_0 + \beta_1 x_i$.

  ii. Which of the models corresponds to your null hypothesis?
    fit3 is the null hypothesized model.

  iii. Would I use a Chi-Square test or an F-test for my nested model hypothesis test? Explain why.
    Chi-square test, since $\phi$ is known for the Binomial distribution.

  iv. Write down the degrees of freedom, and the critical value (from tables) for your rejection region if using $\alpha = 0.01$. Tables are appended at end of question sheet. You do not need to write out the test statistic here, only the information requested.
    DF = 1, Critical value: $\chi^2_{1,\, \alpha = 0.01} = 6.63$.
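The tabulated critical value can be cross-checked in R. This is a small sketch: `qchisq()` is base R, while the helper `reject_null()` is a hypothetical name introduced here to show how the deviance difference would be compared with the critical value.

```r
alpha   <- 0.01
df_diff <- 3 - 2   # parameters in fit1 (3) minus parameters in fit3 (2)

# Upper-alpha critical value of the chi-square distribution with 1 df.
crit <- qchisq(1 - alpha, df = df_diff)
round(crit, 2)

# Hypothetical helper: reject the null model (fit3) when the drop in
# residual deviance from fit3 to fit1 exceeds the critical value.
reject_null <- function(dev_null, dev_alt) (dev_null - dev_alt) > crit
```

With fitted model objects available, `anova(fit3, fit1, test = "Chisq")` carries out the same deviance comparison.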
4. [35 points] Consider 5 randomly selected seniors in high school and 5 randomly selected freshmen at university. The education level (variable "educ") of each student is denoted by "H" (high school) or "U" (university). Each student recorded the total number of hours they spent on social media in a specified 7 day week (variable "media"), and also the total number of texts they received within the same week (variable "texts"). The data are given in Table 1, followed by R code and results.

Table 1: Data on student education, social media time, and texts received

    student ID   educ   media   texts
    1            H      15      142
    2            H      12      64
    3            H      6       5
    4            H      9       27
    5            H      14      101
    6            U      11      65
    7            U      7       22
    8            U      12      68
    9            U      16      247
    10           U      5       13

The following R commands were run, and an edited summary is presented.

    educ = factor( c( "H", "H", "H", "H", "H", "U", "U", "U", "U", "U" ) )
    media = c( 15, 12, 6, 9, 14, 11, 7, 12, 16, 5 )
    texts = c( 142, 64, 5, 27, 101, 65, 22, 68, 247, 13 )
    fit1 <- glm( texts ~ -1 + media + educ, family = poisson( link = "log" ) )
    > summary(fit1)

    Coefficients:
          Estimate Std. Error z value Pr(>|z|)
    media  0.28378    0.01382  20.528  < 2e-16 ***
    educH  0.67740    0.19346   3.502 0.000463 ***
    educU  0.96589    0.19720   4.898 9.68e-07 ***
    ---
    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 5618.972  on 10  degrees of freedom
    Residual deviance:    7.499  on  7  degrees of freedom
    AIC: 70.143
(a) Write down the model and assumptions corresponding to fit1. Clearly define all the notation you use in the context of the problem.

Answer: Let $Y_i$ be the random variable giving the texts received within a week by the student with ID number $i \in \{1, \ldots, 10\}$, and $\mu_i = E(Y_i)$. Model "fit1" is:

$Y_i \overset{indep}{\sim} Poisson(\mu_i)$ for $i = 1, \ldots, 10$

with

$\log_e(\mu_i) = \ln(\mu_i) = \alpha_H I_H(i) + \alpha_U I_U(i) + \beta\, media_i$

where

$$I_H(i) = \begin{cases} 1 & \text{if student } i \text{ in High School} \\ 0 & \text{otherwise} \end{cases}
\quad \text{and} \quad
I_U(i) = \begin{cases} 1 & \text{if student } i \text{ in University} \\ 0 & \text{otherwise} \end{cases}$$

and $media_i$ is the number of hours student $i$ spent on social media that week.

Notes for grading:
  • There is no intercept $\beta_0$ in the systematic component, since I removed this using the "-1" in the glm formula.
  • They may write $Y_i$ as the response random variable provided they define this in the context of the problem, or they may use equivalent well-defined notation; for example they may use $texts_i$ as the response, since that variable name is in the question.
  • They need to specify that the $Y_i$'s are independent within their answer.
  • They may write $\log(\mu_i)$ instead of either $\log_e(\mu_i)$ or $\ln(\mu_i)$, since we use log to represent natural log throughout this course, unless we specify a different base.
  • It is fine if they combine the link (natural log) and systematic components within one equation as I did above, since I didn't ask them to label the 3 components of the GLM.

Alternatively, they may separate and label the 3 components of the GLM. After first defining $Y_i$ and $i$ in the context of the problem as done above (or using $texts_i$ or other clearly defined notation for the response), the 3 GLM components are

Random Component: $Y_i \overset{indep}{\sim} Poisson(\mu_i)$ for $i = 1, \ldots, 10$, where $i$ represents student ID (or row in the data table).

Systematic Component: $\eta_i = \alpha_H I_H(i) + \alpha_U I_U(i) + \beta\, media_i$, where $I_H(i)$ = 1 if student $i$ in High School, 0 otherwise, and $I_U(i)$ = 1 if student $i$ in University, 0 otherwise.

Link Function: $g(\mu_i) = \log(\mu_i) = \eta_i$.
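The effect of the "-1" in the formula can be inspected through the design matrix. The sketch below uses the `educ` and `media` vectors from the question and base R's `model.matrix()`; the object name `X` is just an illustrative choice.

```r
educ  <- factor(c("H", "H", "H", "H", "H", "U", "U", "U", "U", "U"))
media <- c(15, 12, 6, 9, 14, 11, 7, 12, 16, 5)

# Design matrix for the systematic component of fit1 (no intercept column).
X <- model.matrix(~ -1 + media + educ)

colnames(X)   # media, plus one indicator column per education level
head(X, 3)
```

With the intercept removed, R codes `educ` with one indicator column per level, which corresponds to the two indicator functions for High School and University in the answer above.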
(b) Based on model fit1, will the estimated log of expected number of texts (per week) for a university student lie on a straight line when plotted versus hours on social media that week? Explain.

Yes, these will lie on a straight line (since on the link scale). Specifically, if a randomly selected student is a university student, then $\log(\mu) = \alpha_U + \beta\, media$ in terms of the notation defined earlier, where $\mu$ is their expected number of texts received when on social media for $media$ hours.

(c) Based on results provided for "fit1", estimate the expected number of texts that a randomly selected high school senior who spends 6 hours per week on social media will receive in that week. Show clear working, including the formulae you use.

For a high school student with 6 hours on social media, we estimate their expected number of texts as

$\hat{\mu}_{H,6} = \exp(\hat{\alpha}_H \times 1 + \hat{\alpha}_U \times 0 + \hat{\beta} \times 6) = \exp(0.67740 + 0 + 0.28378 \times 6) \approx 10.806$

(d) Based on results provided for "fit1", estimate the probability that a randomly selected high school senior who spends 6 hours per week on social media will receive exactly 5 texts in that week. Show clear working, including the formulae you use.

Let $W$ be the random variable corresponding to the number of texts this student receives, and $\lambda = E(W)$. Then $\hat{\lambda} = \widehat{E(W)} = \hat{\mu}_{H,6}$ calculated in the previous part, and

$P(W = y) = \dfrac{e^{-\lambda} \lambda^y}{y!}$, for any $y \in \{0, 1, 2, 3, \ldots\}$.

For $y = 5$, we estimate $P(W = 5)$ by

$\widehat{P(W = 5)} = \dfrac{e^{-\hat{\lambda}} \hat{\lambda}^5}{5!} = \dfrac{e^{-10.806} (10.806)^5}{5!} = 0.0249$ (rounded).

(e) Suppose another GLM (fit2) is fitted to the same data, using the same random component, and that fit2 results have AIC = 110. Which model would you prefer between fit1 and fit2, based only on AIC?

Prefer fit1, since it has the lower AIC value (70.143 versus 110).
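The arithmetic in parts (c) and (d) can be reproduced from the reported coefficient estimates. This is a short sketch using only base R; the variable names (`b_media`, `a_H`, `mu_hat`, `p5`) are illustrative choices, and the estimates are copied from the summary output rather than refitted.

```r
# Estimates copied from the summary(fit1) output.
b_media <- 0.28378
a_H     <- 0.67740

# (c) Inverse log link: estimated mean texts for a high school senior
#     with 6 hours of social media.
mu_hat <- exp(a_H + b_media * 6)

# (d) Estimated Poisson probability of exactly 5 texts.
p5 <- dpois(5, lambda = mu_hat)

round(mu_hat, 3)
round(p5, 4)
```

`dpois(5, lambda = mu_hat)` evaluates the Poisson pmf at 5 with the estimated mean plugged in, which is the same plug-in calculation written out by hand in part (d).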