School

University of California, Santa Barbara *

Course

127

Subject

Statistics

Date

Jan 9, 2024

Uploaded by BailiffBoulderPanther4

PSTAT 127, Winter 2021
Sketch solutions to Additional Practice Problems that are posted To Help in Your Midterm Preparation.

Note: Every midterm exam is different. These exercises should only be considered additional practice problems. All material from lectures, sections, homeworks and reading is examinable, including the prerequisite PSTAT 120A/B and 126 foundational concepts we have built on (since the Gaussian linear regression model from 126 is a special case of the GLMs we study in 127).

This study source was downloaded by 100000875045266 from CourseHero.com on 10-29-2023 17:08:00 GMT -05:00 https://www.coursehero.com/file/81313248/Solutions-AdditionalPracticeProblemsForMidtermPreppdf/

1. Consider a single random variable $Y$ with probability mass function

$$P(Y = y) = (1 - p)^y \, p, \quad \text{if } y \in \{0, 1, 2, 3, \ldots\},$$

for $p \in (0, 1)$. Write the above pmf in the exponential family form provided above. Show clear working.

$$P(Y = y) = (1 - p)^y \, p = \exp\left[\ln\left((1 - p)^y \, p\right)\right] = \exp\left[\, y \ln(1 - p) + \ln(p) \,\right]$$

See below for the identification of the specific parts of the exponential family form.

Let $\mu = E(Y)$. Answer or fill in the following:

(a) Write down the canonical parameter $\theta$ in terms of $p$: $\theta = \ln(1 - p)$.

(b) Now find $p$ in terms of $\theta$ (i.e., write $p$ as a function of $\theta$):
$\theta = \ln(1 - p) \iff e^{\theta} = 1 - p \iff p = 1 - e^{\theta}$

(c) $b(\theta) = -\ln(p) = -\ln(1 - e^{\theta})$

(d) $\phi = 1$

(e) $a(\phi) = 1$

(f) $c(y, \phi) = 0$
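The identification in Problem 1 can be checked numerically. The sketch below assumes the exponential family form is the usual $\exp[(y\theta - b(\theta))/a(\phi) + c(y, \phi)]$, which is consistent with parts (c) to (f) but is not reproduced in this handout. The value `p = 0.3` and all variable names are illustrative choices, not part of the original question. It also uses the standard facts that $b'(\theta)$ and $b''(\theta)\,a(\phi)$ give the mean and variance.

```r
# Illustrative check of the Problem 1 identification (p = 0.3 is an arbitrary choice).
p     <- 0.3
theta <- log(1 - p)                      # canonical parameter from part (a)
b     <- function(th) -log(1 - exp(th))  # b(theta) from part (c)

y <- 0:10
pmf_expfam <- exp(y * theta - b(theta))  # a(phi) = 1 and c(y, phi) = 0
pmf_direct <- dgeom(y, prob = p)         # R's geometric pmf: p * (1 - p)^y on y = 0, 1, 2, ...
print(all.equal(pmf_expfam, pmf_direct))

# Central-difference approximations to b'(theta) and b''(theta)
h  <- 1e-4
b1 <- (b(theta + h) - b(theta - h)) / (2 * h)
b2 <- (b(theta + h) - 2 * b(theta) + b(theta - h)) / h^2
print(c(b_prime = b1, mean_formula = (1 - p) / p))
print(c(b_double_prime = b2, var_formula = (1 - p) / p^2))
```

If the identification is right, the two pmf vectors agree and the numerical derivatives match the geometric mean $(1-p)/p$ and variance $(1-p)/p^2$.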
2. Suppose that the number of dead budworms and number of alive budworms are counted in each of 12 containers exposed to toxins, with log-dosages of the toxin given by ldose. The sex of budworms within each of the containers, and the dosage of toxins are given below. The number of dead budworms are given by "numdead", and the number of alive budworms are given by "numalive".

    ldose <- rep(0:5, 2)
    numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
    sex <- factor(rep(c("M", "F"), c(6, 6)))
    SF <- cbind(numdead, numalive=20-numdead)
    budworm5 <- glm(SF ~ -1 + sex + ldose, family=binomial)

    > ldose
     [1] 0 1 2 3 4 5 0 1 2 3 4 5
    > sex
     [1] M M M M M M F F F F F F
    Levels: F M
    > summary(budworm5)

    Call:
    glm(formula = SF ~ -1 + sex + ldose, family = binomial)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -1.10540  -0.65343  -0.02225   0.48471   1.42944

    Coefficients:
          Estimate Std. Error z value Pr(>|z|)
    sexF   -3.4732     0.4685  -7.413 1.23e-13 ***
    sexM   -2.3724     0.3855  -6.154 7.56e-10 ***
    ldose   1.0642     0.1311   8.119 4.70e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)
    ...

(a) Write down the model fitted in "budworm5", and all the model assumptions, including distributional assumptions.

Let $Z_i$ be the random variable giving the number dead in container $i \in \{1, \ldots, 12\}$, and let $\pi_i$ be the probability that a randomly selected budworm in container $i$ dies. Model "budworm5" is:

$$Z_i \overset{\text{indep}}{\sim} \text{Binomial}(20, \pi_i) \quad \text{for } i = 1, \ldots, 12,$$

and let $Y_i = Z_i / 20$ for $i = 1, \ldots, 12$,

$$\text{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \ln(\text{odds}_i) = \alpha_F I_F(i) + \alpha_M I_M(i) + \beta \, \text{ldose}_i$$

where

$$I_F(i) = \begin{cases} 1 & \text{if container } i \text{ contains Females} \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad I_M(i) = \begin{cases} 1 & \text{if container } i \text{ contains Males} \\ 0 & \text{otherwise.} \end{cases}$$
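To connect the indicator notation in (a) with what R actually fits, the following sketch refits the model from the data listed in the question and inspects the design matrix. The object names `X` and `eta` are introduced here for illustration only.

```r
# Refit model budworm5 from the data given in the question.
ldose   <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex     <- factor(rep(c("M", "F"), c(6, 6)))
SF      <- cbind(numdead, numalive = 20 - numdead)
budworm5 <- glm(SF ~ -1 + sex + ldose, family = binomial)

# With the intercept removed (-1), the design matrix holds one indicator
# column per sex (sexF, sexM) plus the ldose column.
X <- model.matrix(budworm5)
print(X)

# Linear predictor on the logit scale: indicator columns times their
# coefficients plus the ldose slope times ldose.
eta <- as.numeric(X %*% coef(budworm5))
print(all.equal(eta, as.numeric(predict(budworm5, type = "link"))))
print(round(coef(budworm5), 4))
```

The columns `sexF` and `sexM` play the role of the two indicator functions in the model statement, so each sex gets its own intercept while both share one slope for ldose.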
(b) Use model "budworm5" to estimate the odds of a female budworm dying if "ldose" equals 3.

From (a),

$$\widehat{\text{odds}}_i = \exp\left(\hat{\alpha}_F I_F(i) + \hat{\alpha}_M I_M(i) + \hat{\beta} \, \text{ldose}_i\right).$$

So, for a container of 20 female budworms, at ldose = 3, an estimate of the odds of a randomly selected female budworm dying equals

$$\widehat{\text{odds}}(\text{Female}, \text{ldose} = 3) = \exp(-3.4732 + 1.0642(3)) = 0.7553 \quad \text{(after rounding).}$$

(c) Use model "budworm5" to estimate the expected number of budworms that would die in a container containing 20 female budworms, if ldose equals 3.

If container $i$ is the container with 20 females at ldose = 3, then

$$\hat{E}(\text{number of females dead in Female container at ldose} = 3) = 20 \hat{\pi}_i,$$

where $\hat{\pi}_i$ is the estimated probability of death corresponding to the estimated odds in part (2b). If we use notation $\hat{\gamma}_i = \widehat{\text{odds}}(\text{Female}, \text{ldose} = 3)$, then

$$\frac{\hat{\pi}_i}{1 - \hat{\pi}_i} = \hat{\gamma}_i \iff \hat{\pi}_i = \hat{\gamma}_i (1 - \hat{\pi}_i) \iff \hat{\pi}_i (1 + \hat{\gamma}_i) = \hat{\gamma}_i \iff \hat{\pi}_i = \frac{\hat{\gamma}_i}{1 + \hat{\gamma}_i}.$$

So $\hat{\pi}_i = \dfrac{0.7553}{1 + 0.7553} = 0.4303$, and

$$\hat{E}(\text{number of females dead in Female container at ldose} = 3) = 20 \hat{\pi}_i \approx 20(0.4303) \approx 8.6059.$$

Note: the expected value does not need to be an integer. Several students took the additional incorrect step of concluding that the expected number equals 9. Please ask if you have questions about expected values.

(d) What is the change in the odds of a female budworm dying if ldose increases from 3 to 4? Show working.

$$\text{odds}(\text{Female}, \text{ldose} = 3) = \exp(\alpha_F + \beta(3))$$

$$\text{odds}(\text{Female}, \text{ldose} = 4) = \exp(\alpha_F + \beta(4)) = \text{odds}(\text{Female}, \text{ldose} = 3) \, \exp(\beta)$$

So the odds of a Female dying at ldose = 4 is $\exp(\beta)$ times the odds for a Female dying at ldose = 3. In terms of the estimates based on these data: $\exp(\hat{\beta}) = \exp(1.0642) = 2.8985$ (rounded).

Alternatively,

$$\widehat{\text{odds}}(\text{Female}, \text{ldose} = 4) - \widehat{\text{odds}}(\text{Female}, \text{ldose} = 3) = \exp\left(\hat{\alpha}_F + \hat{\beta}(3)\right)\left(\exp(\hat{\beta}) - 1\right) = 1.4340.$$

(e) What is the change in the probability of a female budworm dying if ldose increases from 3 to 4? Show working.
If $\gamma_{F,j} = \text{odds}(\text{Female}, \text{ldose} = j)$, and $\pi_{F,j}$ = the probability a randomly selected Female dies at ldose = $j$, then

$$\frac{\pi_{F,j}}{1 - \pi_{F,j}} = \gamma_{F,j} \iff \pi_{F,j} = \frac{\gamma_{F,j}}{1 + \gamma_{F,j}}.$$

So

$$\pi_{F,4} = \frac{\gamma_{F,4}}{1 + \gamma_{F,4}} = \frac{\gamma_{F,3} \, e^{\beta}}{1 + \gamma_{F,3} \, e^{\beta}}.$$

From (2c), $\hat{\pi}_{F,3} = 0.4303$. Similarly, $\hat{\pi}_{F,4} = \dfrac{2.1893}{1 + 2.1893} \approx 0.6865$. So $\hat{\pi}_{F,4} - \hat{\pi}_{F,3} = 0.6865 - 0.4303 = 0.2562$.

(f) Would the difference between predicted probabilities of death for males and females be the same at ldose=1, and at ldose=3? Why or why not? Draw a sketch/sketches with clear labels to justify your answer.

The differences between the log(odds) between males and females would be the same at ldose=1 and ldose=3. However, the differences between predicted probabilities of death would not be equal, since logit is a non-linear transformation. In your sketch, the Male and Female prediction lines should be parallel on the logit scale, but not on the probability scale. You must clearly label all axes, not only sketch the pattern.

You didn't need to include the following, but for completeness: the predictions based on the current data show that $\hat{\pi}_{M,1} - \hat{\pi}_{M,3} \neq \hat{\pi}_{F,1} - \hat{\pi}_{F,3}$:

$$\widehat{\text{odds}}_{M,1} = \exp\left(\hat{\alpha}_M + \hat{\beta}(1)\right) = \exp(-2.3724 + 1.0642(1)) \approx 0.2703, \quad \text{so } \hat{\pi}_{M,1} \approx \frac{0.2703}{1 + 0.2703} = 0.2128$$

$$\widehat{\text{odds}}_{M,3} = \exp\left(\hat{\alpha}_M + \hat{\beta}(3)\right) = \exp(-2.3724 + 1.0642(3)) \approx 2.2710, \quad \text{so } \hat{\pi}_{M,3} \approx \frac{2.2710}{1 + 2.2710} = 0.6943$$

$$\widehat{\text{odds}}_{F,1} = \exp\left(\hat{\alpha}_F + \hat{\beta}(1)\right) = \exp(-3.4732 + 1.0642(1)) \approx 0.0899, \quad \text{so } \hat{\pi}_{F,1} \approx \frac{0.0899}{1 + 0.0899} = 0.0825$$

$$\widehat{\text{odds}}_{F,3} = \exp\left(\hat{\alpha}_F + \hat{\beta}(3)\right) = \exp(-3.4732 + 1.0642(3)) \approx 0.7553, \quad \text{so } \hat{\pi}_{F,3} \approx \frac{0.7553}{1 + 0.7553} = 0.4303 \text{ (rounded)}$$
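The hand calculations in parts (b) to (f) of Problem 2 can be reproduced in a few lines. This is a minimal sketch that starts from the rounded coefficient estimates printed in the summary output, so results can differ from a full-precision refit in the last digit. The names `alpha_F`, `alpha_M`, `beta`, and `odds` are illustrative.

```r
# Rounded estimates copied from the summary(budworm5) output.
alpha_F <- -3.4732
alpha_M <- -2.3724
beta    <-  1.0642

# Estimated odds of death for a given intercept and log-dose.
odds <- function(alpha, ldose) exp(alpha + beta * ldose)

odds_F3 <- odds(alpha_F, 3)            # part (b), about 0.7553
prob_F3 <- plogis(alpha_F + beta * 3)  # plogis is the inverse logit; about 0.4303
expected_dead <- 20 * prob_F3          # part (c), about 8.6 (not rounded to an integer)

odds_ratio <- exp(beta)                        # part (d), multiplicative change
odds_diff  <- odds(alpha_F, 4) - odds_F3       # part (d), additive change
prob_diff  <- plogis(alpha_F + beta * 4) - prob_F3  # part (e)

# Part (f): male minus female gap at ldose = 1 and ldose = 3, on both scales.
gap_logit <- c(d1 = (alpha_M + beta * 1) - (alpha_F + beta * 1),
               d3 = (alpha_M + beta * 3) - (alpha_F + beta * 3))
gap_prob  <- c(d1 = plogis(alpha_M + beta * 1) - plogis(alpha_F + beta * 1),
               d3 = plogis(alpha_M + beta * 3) - plogis(alpha_F + beta * 3))

print(round(c(odds_F3 = odds_F3, prob_F3 = prob_F3, expected_dead = expected_dead,
              odds_ratio = odds_ratio, odds_diff = odds_diff, prob_diff = prob_diff), 4))
print(gap_logit)  # same gap at both doses on the logit scale
print(gap_prob)   # different gaps on the probability scale
```

Using `plogis` avoids converting odds to probabilities by hand, and the two `gap_` vectors show the point of part (f): the lines are parallel on the logit scale only.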
3. Suppose that Mary has data $\{(x_i, y_i) : i = 1, \ldots, 200\}$, where $y_i \in (0, 1)$ for all $i \in \{1, \ldots, 200\}$. She recently started learning R, and considers fitting two models (fit1 and fit2 below):

    ynew <- log( y/(1-y) )   # Note: the default log in R is the natural log (i.e., log base e)
    fit1 <- lm( ynew ~ x )   # Note that fit1 is using R command "lm"
    # ==========================
    fit2 <- glm( y ~ x, family = gaussian(link="logit") )

Write down the models and assumptions for each of models "fit1" and "fit2", and explain the differences between these models and their assumptions. [25 points]

Note that the observed $y_i$ values all lie in the open interval $(0, 1)$, i.e., strictly between 0 and 1: $0 < y_i < 1$ for all $i = 1, \ldots, 200$, based on the question above.

fit1:

$$\text{logit}(Y_i) = \alpha + \beta x_i + \epsilon_i, \quad \text{where } \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2) \text{ and } \text{logit}(Y_i) = \ln\left(\frac{Y_i}{1 - Y_i}\right),$$

or equivalently: if $Z_i = \text{logit}(Y_i)$, then $Z_i \overset{\text{indep}}{\sim} N(\alpha + \beta x_i, \sigma^2)$ for $i \in \{1, \ldots, 200\}$.

Note that $Z_i = \text{logit}(Y_i)$ is Gaussian here, and that $Y_i = \text{logit}^{-1}(\alpha + \beta x_i + \epsilon_i)$.

fit1 could be written as a Gaussian Linear Model (in the sense of 126 material) for transformed response $Z_i = \text{logit}(Y_i)$, and therefore also as a GLM for transformed response variable $Z_i$ with identity link.

If you simulated from this model, the resulting $y_i$ values would all lie in the open interval $(0, 1)$.

fit2:

$$Y_i \overset{\text{indep}}{\sim} N(\mu_i, \sigma^2) \quad \text{where } \mu_i = E(Y_i),$$

$$\text{logit}(\mu_i) = \gamma + \delta x_i, \quad \text{where } \text{logit}(\mu_i) = \ln\left(\frac{\mu_i}{1 - \mu_i}\right) \text{ for } i \in \{1, \ldots, 200\}.$$

That is,

$$Y_i \overset{\text{indep}}{\sim} N\left(\frac{e^{\gamma + \delta x_i}}{1 + e^{\gamma + \delta x_i}}, \; \sigma^2\right).$$

Notice that here $Y_i = \text{logit}^{-1}(\gamma + \delta x_i) + \epsilon_i$, with $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$.

fit2 is a GLM with link $g = \text{logit}$, and $Y_i$ is Gaussian here.

Note: if you were simulating from model fit2, some of the resulting simulated $y_i$ values might actually take on values outside $(0, 1)$.
A common error: some students write down the models and assumptions, but forget to expand on the differences between the models as asked.
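The two simulation remarks in Problem 3 can be illustrated directly. In the sketch below, the parameter values (intercept 0.5, slope 1, the two standard deviations), the range of `x`, and the seed are all made-up choices for illustration. Only the structure of the two models comes from the solution above.

```r
# Simulate from each model structure with hypothetical parameter values.
set.seed(127)
n <- 200
x <- runif(n, min = -2, max = 2)

# fit1-type model: Gaussian noise is added on the logit scale, then the
# inverse logit (plogis) maps the result back into (0, 1).
y_fit1 <- plogis(0.5 + 1.0 * x + rnorm(n, mean = 0, sd = 1))

# fit2-type model: the mean is the inverse logit of the linear predictor,
# and Gaussian noise is added on the response scale.
y_fit2 <- plogis(0.5 + 1.0 * x) + rnorm(n, mean = 0, sd = 0.2)

print(range(y_fit1))
print(range(y_fit2))
print(all(y_fit1 > 0 & y_fit1 < 1))    # fit1-type draws stay inside (0, 1)
print(sum(y_fit2 <= 0 | y_fit2 >= 1))  # count of fit2-type draws outside (0, 1)
```

The only difference between the two generating lines is where the noise term sits relative to `plogis`, which is exactly the difference between transforming the response and transforming the mean.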
4. A study was run to investigate whether a statistical model could be used to predict the probability of a household purchasing a new car within a 12-month period, based both on the income of the household and on the age of the oldest car belonging to the household at the start of that 12-month period. Data were collected from a random sample of $n$ households. Each household was asked the age of their oldest automobile (variable labelled "age", measured in years), and their income (variable labelled "income"). One year later, a follow-up visit asked if the household had bought a new car in that 12-month period (variable "purchase", coded as "1" if they had bought a new car, and "0" otherwise). Several models were fitted using R. Read the statistical analysis results below, and then answer the questions that follow.

    > fit1 <- glm(purchase ~ income + age, family=binomial(link=logit))
    > summary(fit1)

    Call:
    glm(formula = purchase ~ income + age, family = binomial(link = logit))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.6189  -0.8949  -0.5880   0.9653   2.0846

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -4.73931    2.10195  -2.255   0.0242 *
    income       0.06773    0.02806   2.414   0.0158 *
    age          0.59863    0.39007   1.535   0.1249
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 44.987  on 32  degrees of freedom
    Residual deviance: 36.690  on 30  degrees of freedom
    AIC: 42.69

    Number of Fisher Scoring iterations: 4

(a) What is the value of $n$ (i.e., how many households are in our random sample)?

33

(b) Write down model "fit1" and the assumptions of this model. Define all notation that you use.

Let $Y_i$ = the random variable corresponding to "purchase" for the $i$th household, for $i = 1, \ldots, n$ where $n = 33$.

$$Y_i \overset{\text{indep}}{\sim} \text{Bernoulli}(\pi_i) = \text{Binomial}(1, \pi_i), \quad \text{where } \mu_i = E(Y_i) = \pi_i = P(Y_i = 1 \mid x_{i1}, x_{i2}).$$

In model "fit1",

$$g(\mu_i) = \text{logit}(\mu_i) = \text{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

where $x_{i1}$ = income of the owner of the $i$th household, and $x_{i2}$ = age of the oldest automobile in the $i$th household, for $i \in \{1, \ldots, 33\}$.
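The answer to part (a) follows from the degrees of freedom printed in the summary: a null model with only an intercept leaves $n - 1$ degrees of freedom. The short sketch below spells out that arithmetic and adds a consistency check on the reported AIC. The check assumes the responses are ungrouped 0/1 values (as the coding of "purchase" suggests), in which case the log-likelihood equals minus half the deviance. All variable names are illustrative.

```r
# Values copied from the summary(fit1) output.
null_df   <- 32      # null deviance degrees of freedom
resid_df  <- 30      # residual deviance degrees of freedom
resid_dev <- 36.690  # residual deviance

n <- null_df + 1     # intercept-only model uses one parameter, so null df = n - 1
k <- n - resid_df    # number of regression coefficients in fit1

# For ungrouped binary data, AIC = residual deviance + 2 * (number of coefficients).
aic_check <- resid_dev + 2 * k

print(c(n = n, k = k, aic_check = aic_check))
```

The reconstructed AIC agrees with the 42.69 printed by R, which supports reading the data as 33 individual 0/1 responses with three coefficients.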
(c) Based on the parameter estimates in model fit1, explain how the predicted odds of buying a car would differ for 2 households that have identical incomes but cars that differ in ages by one year.

Consider household A which has income $x_1 = a$ and oldest automobile $x_2 = c$ years old. Then, using the parameter estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$, the estimated odds of household A buying a new car in the next year is

$$\widehat{\text{odds}}(\text{household A buys a new car}) = \exp[\hat{\beta}_0 + \hat{\beta}_1 a + \hat{\beta}_2 c].$$

Consider household B which has income $x_1 = a$ and oldest automobile $x_2 = (c + 1)$ years old. Then the estimated odds of household B buying a new car is

$$\widehat{\text{odds}}(\text{household B buys a new car}) = \exp[\hat{\beta}_0 + \hat{\beta}_1 a + \hat{\beta}_2 (c + 1)] = \widehat{\text{odds}}(\text{household A buys a new car}) \, \exp[\hat{\beta}_2].$$

That is, the estimated odds that household B buys a new car in the next year is $\exp[\hat{\beta}_2] = \exp[0.59863] \approx 1.820$ times the odds that household A buys a new car in the next year.

It is good if you also note that age is not significant in this model (in addition to giving the answer above).

    > fit2 <- update(fit1, .~.-age)
    > summary(fit2)

    Call:
    glm(formula = purchase ~ income, family = binomial(link = logit))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.5883  -0.8430  -0.7121   0.9262   1.7688

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.98079    0.85720  -2.311   0.0208 *
    income       0.04342    0.02011   2.159   0.0308 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 44.987  on 32  degrees of freedom
    Residual deviance: 39.305  on 31  degrees of freedom
    AIC: 43.305

    Number of Fisher Scoring iterations: 4

    # ------------------------------------------------------------------
    > round( vcov(fit2), digits=5 )   # vcov provides the variance-covariance matrix of the
                                      # main parameters in model fit2
                (Intercept)  income
    (Intercept)     0.73478 -0.0154
    income         -0.01540  0.0004

(d) Using the results of model "fit2" (i.e., assuming model "fit2" is the correct model for the answer to this part of the question), find a 99% confidence interval estimate for $\beta_{\text{income}}$. (Note: I would provide you with tables if asking this question in the midterm.)

Using the approximate normality of $\hat{\beta}$, the form of a 99% confidence interval is

$$\left[\hat{\beta}_{\text{income}} \pm z_{0.005} \, se(\hat{\beta}_{\text{income}})\right],$$

where $z_{0.005} \approx 2.575$, $\hat{\beta}_{\text{income}} = 0.04342$, and $se(\hat{\beta}_{\text{income}}) = 0.02011$.

Thus a 99% approximate confidence interval for $\beta_{\text{income}}$ is

$$[0.04342 - 2.575(0.02011), \; 0.04342 + 2.575(0.02011)] \approx [-0.00836, \; 0.09520].$$
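The odds multiplier in (c) and the Wald interval in (d) can be recomputed from the printed estimates. This sketch uses only numbers shown in the R output above. The variable names are illustrative, and `qnorm` is used alongside the table value 2.575, so the interval endpoints differ slightly in the fifth decimal place depending on which critical value is used.

```r
# Part (c): estimates copied from summary(fit1).
b_age  <- 0.59863
se_age <- 0.39007
odds_multiplier <- exp(b_age)   # about 1.820 per extra year of car age
z_age <- b_age / se_age         # Wald z statistic, about 1.535
p_age <- 2 * pnorm(abs(z_age), lower.tail = FALSE)  # two-sided p-value, about 0.125

# Part (d): estimates copied from summary(fit2).
b_inc  <- 0.04342
se_inc <- 0.02011
z_table <- 2.575                # table value used in the solution
z_exact <- qnorm(1 - 0.01 / 2)  # critical value for a 99% interval, about 2.5758

ci_table <- b_inc + c(-1, 1) * z_table * se_inc
ci_exact <- b_inc + c(-1, 1) * z_exact * se_inc

print(round(c(odds_multiplier = odds_multiplier, z_age = z_age, p_age = p_age), 4))
print(round(ci_table, 5))
print(round(ci_exact, 5))
```

Both versions of the interval contain 0, and the standard error agrees with the square root of the `income` entry on the diagonal of the printed `vcov(fit2)` matrix up to the rounding in that output.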
(e) i. Explain why I receive the following warning message when I run the following anova command in R.

    > anova(fit2, fit1, test="F")
    Analysis of Deviance Table

    Model 1: purchase ~ income
    Model 2: purchase ~ income + age
      Resid. Df Resid. Dev Df Deviance      F Pr(>F)
    1        31     39.305
    2        30     36.690  1   2.6149 2.6149 0.1059
    Warning message:
    using F test with a 'binomial' family is inappropriate

The approximate F test is inappropriate when you know the value of $\phi$. We know that $\phi = 1$ for the binomial distribution. We should do an approximate nested $\chi^2$ test.

ii. How should I modify this anova command in order to obtain the p-value from an approximate hypothesis test of $H_0$: Model corresponding to "fit2", versus $H_a$: Model corresponding to "fit1" that we studied in lecture (assuming that the random component distribution assumption in the GLM commands in this question is correct)?

Replace the command anova(fit2, fit1, test="F") by anova(fit2, fit1, test="Chisq") since $\phi$ is a known constant (1) in the binomial.

Note: as part of this question I could have given you tables and asked you also to perform the test you suggested above by hand, based on the R results presented earlier in this question, i.e., write down the corresponding appropriate test statistic and its approximate distribution under the null hypothesis, state your decision rule if testing at $\alpha = 0.01$, calculate the observed value of this test statistic, and state your conclusion. You must be able to use standard normal, t, chi-squared or F distribution tables to find p-values if asked, and recognize which is the correct table to use.
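The by-hand version of the nested test described in the note can be sketched from the deviances printed earlier. This is an illustration under the stated setup ($H_0$: fit2, $H_a$: fit1, $\phi = 1$ known, testing at $\alpha = 0.01$). The variable names are illustrative, and the conclusion printed at the end is computed here rather than taken from the original solutions.

```r
# Residual deviances and degrees of freedom copied from the R output.
dev_fit2 <- 39.305; df_fit2 <- 31   # smaller model (H0): purchase ~ income
dev_fit1 <- 36.690; df_fit1 <- 30   # larger model (Ha):  purchase ~ income + age

# Test statistic: drop in deviance, approximately chi-squared under H0
# with degrees of freedom equal to the number of extra parameters.
lrt_stat <- dev_fit2 - dev_fit1
lrt_df   <- df_fit2 - df_fit1

alpha    <- 0.01
crit     <- qchisq(1 - alpha, df = lrt_df)                   # rejection threshold
p_value  <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
reject   <- lrt_stat > crit

print(c(statistic = lrt_stat, df = lrt_df, critical = crit, p_value = p_value))
print(reject)
```

The p-value matches the one in the anova table because, with a single degree of freedom and $\phi = 1$, the F statistic printed there equals the deviance difference; the chi-squared reference distribution is the appropriate one to cite.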