EC295_assign3SOL

Individual Assignment 3 - SOLUTION
EC295 Winter 2022

Assignment Description

In this assignment you are asked to manipulate data, estimate statistical relationships, and interpret the findings. The main goal behind the assignment is to help you get more comfortable applying statistical methods and using software, but also to think about a policy-relevant topic that economists actively research today.

The questions below guide you through the process of statistical estimation. You are provided the relevant Stata commands you will need, some of which you will not have seen before. It will therefore be useful for you to use the “help” function in Stata, and/or to look up the command in the Stata reference manuals (which are available within Stata as PDFs), or Google. You are also, as always, welcome to ask me for help.

I strongly suggest that you start this assignment early because it will not be possible (in my opinion) to do well if you start close to the due date. There are parts that you may find difficult; you will want to identify them and leave enough time to ask questions if necessary.

Assignment Instructions

Data analysis

In mylearningspace, you will find a datafile called “assign3.dta” that contains the data for this assignment. Download the dataset to your computer and make note of the folder where you save it. I have also provided a template dofile that all students must use to write their assignment dofile (if you are using R, you will need to recreate something similar to this). Store it in the same folder where you put your data. You will need to manipulate that template in the following way:

- Rename the file from “assign3 template.do” to your last name followed by your student number (no spaces).
- After cd, replace INSERT THE PATH TO THE FOLDER WHERE YOU STORE THE DATA with the path to the folder where you stored the datasets. Do not remove the quotation marks.
- After log using, replace INSERT YOUR LAST NAME AND STUDENT NUMBER HERE with your last name and student number, with no space between the two. Do not remove the quotation marks.
- After set seed, replace INSERT YOUR STUDENT NUMBER HERE with your full student number.

Leave all other commands and comments untouched. You should type in your Stata commands below the line that says “Insert your stata commands below here”, but above “Insert your stata commands above here”.
Note that the set seed and sample commands will take a random 95% subsample of the data that is different for every student. For this reason, the numbers in your output will not match those of any other student. Be mindful of this if you are comparing your work with your peers.

Submission

You are required to submit three documents:

a) A report containing your answers to all the questions. I outline below how I would like your report to look. The overall goal is that the answers to each question must be easily identifiable in a readable, professional-looking document. Submit to Gradescope. You will receive a 10% penalty if you fail to submit to Gradescope.
b) Stata dofile. Submit electronically using the dropbox in mylearningspace.
c) Stata log file. Submit electronically using the dropbox in mylearningspace.

In the report described in (a) above, please answer all questions in the same order as they are stated on the question sheet. For each question and sub-question, include the relevant Stata code (if any) that you used, the output generated by that command if there was any, and an interpretation if you are asked to provide it. For example, if you were answering the following hypothetical question, it might look like this:

************************************************************************************
1) Locate the variable y
a. Using the tab command, provide a frequency distribution for y

Stata commands: tab y;

Output:

              y |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |     23,844       10.05       10.05
              2 |    138,568       58.40       68.45
              3 |      9,049        3.81       72.26
              4 |     63,162       26.62       98.88
              5 |      2,651        1.12      100.00
    ------------+-----------------------------------
          Total |    237,274      100.00
************************************************************************************

You could also format your own output tables rather than copying and pasting Stata output if you find it easier. As long as the questions are answered in order, and the Stata commands and associated output for each sub-question are clear, it will be fine.
A note on plagiarism: this is an independent assignment, which I expect you to complete on your own. It is plagiarism to copy someone else’s work verbatim, which includes Stata dofiles. Any work you submit should be yours only.

Thank you note: I am very grateful to Professor Justin Smith for sharing his class material. This assignment represents a modified version of a STATA homework developed by him.

Each sub question is worth 5 points, for a total of 60.

1) (5 Points) Provide a table of basic summary statistics for your data and briefly describe your findings for each variable.

    . summarize

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
        nprevist |      2,999      10.995    3.668144          0         35
         alcohol |      2,999    .0193398    .1377392          0          1
     birthweight |      2,999    3383.145    592.1488        425       5755
          smoker |      2,999    .1940647    .3955449          0          1
            educ |      2,999     12.9113    2.154197          3         17
    -------------+---------------------------------------------------------
             age |      2,999    26.89296    5.358982         14         44
    marital_st~s |      2,999    2.166722     .769377          1          3

The data contain the following information:

nprevist: number of prenatal visits. On average, women attend 11 of these, though some attend none and others attend as many as 35. The standard deviation is about 3.7 visits.

alcohol: a dummy variable indicating that the mother drank alcohol during pregnancy. The table above shows that about 2% of women drank while pregnant.

birthweight: the weight in grams of the baby when born. The average baby weighs about 3,400 grams, though babies can weigh as little as 425 grams and as much as 5,755 grams. The standard deviation is about 600 grams.

smoker: a dummy variable indicating that a woman smoked while pregnant. The table shows that about 19% of mothers engaged in this behaviour.

educ: number of years of education of the mother. Mothers have about 13 years of schooling on average, ranging from 3 to 17 years. The standard deviation is about 2.2 years.

age: age in years of the mother. The data indicate that the average mother is just under 27 years old, though some are as young as 14 and others are as old as 44. There is fairly wide variation in age at birth, as given by the standard deviation of 5.36.

marital status: a categorical variable (1 = single, 2 = married, 3 = divorced), so its mean of 2.17 has no direct interpretation. The frequency table reported in part 2(d) shows that 22.6% of the mothers are single, 38.1% are married, and 39.3% are divorced.

2) Suppose that you are interested in exploring whether mother’s alcohol consumption while pregnant affects a child’s birthweight.
a) (5 Points) Estimate and interpret the parameters in the following baseline specification:

$$birthweight_i = \beta_0 + \beta_1 \, alcohol_i + u_i$$

Assume heteroskedasticity when calculating the standard errors.

    . regress birthweight alcohol, robust

    Linear regression                               Number of obs     =      2,999
                                                    F(1, 2997)        =       3.13
                                                    Prob > F          =     0.0769
                                                    R-squared         =     0.0011
                                                    Root MSE          =     591.91

    ------------------------------------------------------------------------------
                 |               Robust
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         alcohol |  -144.8952      81.89    -1.77   0.077    -305.4615    15.67107
           _cons |   3385.947   10.90489   310.50   0.000     3364.565    3407.329
    ------------------------------------------------------------------------------

The coefficient on alcohol is the difference in average birthweight between children of mothers who drank while pregnant and children of mothers who did not. Mothers who drank while pregnant have children whose weight is, on average, about 145 grams lower than children of mothers who did not drink. The intercept is the average birthweight for children of mothers who did not drink (about 3,386 grams).

b) (5 Points) Test whether the effect of consuming alcohol on birthweight is statistically significant at the 1% significance level.

Using the regression output above, we see that the p-value on the alcohol coefficient is 0.077. Since this is larger than the 1% significance level, the effect of consuming alcohol on birthweight is not statistically significant (we fail to reject the null that it is zero).

We could alternatively find the critical value for a 1% significance level. The corresponding two-sided critical value is

    . di invttail(2999-2,0.005)
    2.5774708

Since the absolute value of the t-statistic (1.77) is smaller than the critical value (2.58), we fail to reject the null hypothesis that the effect is zero.

c) Estimate an alternative specification that extends the baseline specification to include the variable smoker. Assume heteroskedasticity when calculating the standard errors.

i) (5 Points) Interpret all the parameters of the model.

    . regress birthweight alcohol smoker, robust

    Linear regression                               Number of obs     =      2,999
                                                    F(2, 2996)        =      44.85
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0289
                                                    Root MSE          =     583.74

    ------------------------------------------------------------------------------
                 |               Robust
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         alcohol |  -57.73739   77.38407    -0.75   0.456    -209.4687    93.99389
          smoker |  -251.0802   26.86998    -9.34   0.000    -303.7656   -198.3947
           _cons |   3432.987   11.94117   287.49   0.000     3409.573    3456.401
    ------------------------------------------------------------------------------

The coefficient on alcohol (about -58 grams) is the average birthweight for mothers who drink while pregnant minus the average birthweight for mothers who do not drink while pregnant, holding smoking status constant. Likewise, the coefficient on smoker (about -251 grams) is the difference in average birthweight between mothers who smoke while pregnant and those who do not, holding alcohol status constant. The intercept (about 3,433 grams) is the average birthweight for a mother who neither smokes nor drinks.

ii) (5 Points) Compare the estimate of the coefficient on alcohol between specifications and comment on omitted variables bias.

In the model that excludes smoker, the estimated coefficient on alcohol is much more negative (-144.9 versus -57.7), which suggests a downward omitted variables bias. The regression above shows that smoking has a large negative effect on birthweight, and mothers who drink are also more likely to smoke (i.e. alcohol and smoker are positively correlated). A negative effect of the omitted variable combined with a positive correlation with the included regressor gives a negative bias. In the simple regression, the coefficient on alcohol therefore partly measures the effect of drinking and partly the effect of smoking, making the alcohol effect look larger than it is.

iii) (5 Points) Using measures of fit we have discussed in class compare the fit of the model in 2a to that of 2c.

The key measures of fit are $R^2$ and the SER.

In terms of the $R^2$, notice that it is higher in the multiple regression. This reflects the fact that adding a variable to a regression can never lower the fraction of the variation in birthweight that is explained by the model, so $R^2$ cannot fall when variables are added. In this multiple regression, variation in smoking and alcohol behavior explains about 2.9% of the variation in birthweight. Using just alcohol, it was 0.1%.

In terms of the SER, recall that this measures the typical distance of data points from the regression line. The SER shrinks from 591.91 in the simple regression to 583.74 in the multiple regression. The easiest way to understand this is to remember that $SER = \sqrt{SSR/(n-k-1)}$: the square root of the sum of squared residuals (SSR) divided by the degrees of freedom. When you add a variable to the model, SSR falls because you have taken a variable that was previously unobserved (and therefore part of the residual) and made it observed (drawing it out of the residuals). This means there is less variation in the residual, causing SSR to fall. The degrees of freedom also fall by one, so the SER is not guaranteed to shrink, but here the drop in SSR dominates. The outcome is that the data points are, on average, closer to the regression line in the multiple regression.
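As a rough check on the SER argument, the sketch below backs out the SSR implied by the Root MSE values reported in 2a and 2c. It is written in Python purely as a calculator and is not part of the Stata workflow the assignment asks for. The inputs are the rounded figures printed by Stata, so the results are approximate, and the names `models` and `implied_ssr` are illustrative.

```python
# Reported values copied from the two Stata outputs above (rounded as printed).
n = 2999  # number of observations in both regressions

models = {
    "2a: alcohol only":     {"k": 1, "r2": 0.0011, "ser": 591.91},
    "2c: alcohol + smoker": {"k": 2, "r2": 0.0289, "ser": 583.74},
}


def implied_ssr(ser, n, k):
    """Back out SSR from SER = sqrt(SSR / (n - k - 1))."""
    return ser ** 2 * (n - k - 1)


for name, m in models.items():
    m["ssr"] = implied_ssr(m["ser"], n, m["k"])
    # TSS does not depend on the regressors, so SSR / (1 - R2) should
    # give roughly the same number for both models (up to rounding).
    m["tss"] = m["ssr"] / (1 - m["r2"])
    print(f"{name}: SSR ~ {m['ssr']:.4e}, implied TSS ~ {m['tss']:.4e}")
```

The implied SSR is smaller in the specification with smoker, while the implied total sum of squares is essentially the same in both, which is what the explanation above relies on.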
d) (5 Points) Estimate another alternative specification that extends the regression in (c) to include measures of education, age, and marital status. In your specification, use the “married” category as the reference category. Compare to the previous specifications and comment further on omitted variables bias. Assume heteroscedasticity when calculating the standard errors.

    . ta marital, gen(mar)

          =1 if |
     single, =2 |
    if married, |
          =3 if |
       divorced |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        679       22.64       22.64
              2 |      1,141       38.05       60.69
              3 |      1,179       39.31      100.00
    ------------+-----------------------------------
          Total |      2,999      100.00

    . rename mar1 single
    . rename mar2 married
    . rename mar3 divorced

    . regress birthweight alcohol smoker educ age single divorced, robust

    Linear regression                               Number of obs     =      2,999
                                                    F(6, 2992)        =      27.84
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0577
                                                    Root MSE          =     575.39

    ------------------------------------------------------------------------------
                 |               Robust
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         alcohol |  -37.68205   76.76566    -0.49   0.624    -188.2009    112.8368
          smoker |  -186.6101   28.23914    -6.61   0.000    -241.9802     -131.24
            educ |   6.933331   5.632159     1.23   0.218    -4.109966    17.97663
             age |  -2.276274   2.486453    -0.92   0.360    -7.151604    2.599055
          single |  -230.7544   34.04825    -6.78   0.000    -297.5147    -163.994
        divorced |    26.1848   22.95292     1.14   0.254    -18.82031     71.1899
           _cons |   3433.736   86.86598    39.53   0.000     3263.413    3604.059
    ------------------------------------------------------------------------------

When the additional control variables are included, the estimated effect of alcohol moves even closer to zero (from -57.7 to -37.7). That suggests that even after controlling for smoker, there was still some downward omitted variables bias related to the mother’s background. It is likely that mothers who drink score lower on several background measures (i.e. drinking is negatively correlated with those measures), and these background factors tend to be positively related to birthweight, so omitting them makes the alcohol effect look more negative.

e) (5 Points) In the previous regression, precisely interpret the coefficients on the single and divorced dummy variables. Test the hypothesis that the effect on birthweight is the same for single and divorced mothers at the 5% significance level.

The coefficient on single (-230.75) is the average birthweight of single mothers minus the average birthweight of married mothers, holding everything else constant. Likewise, the coefficient on
divorced is the average birthweight of divorced mothers minus the average birthweight of married mothers, holding everything else constant.

The null and alternative hypotheses are $H_0: \beta_{single} = \beta_{divorced}$ and $H_1: \beta_{single} \neq \beta_{divorced}$. We can perform this test in STATA (F-test).

    . test single=divorced

     ( 1)  single - divorced = 0

           F(  1,  2992) =   58.55
                Prob > F =    0.0000

The reported p-value is 0.0000. Since this is smaller than the 5% significance level, we reject the null hypothesis that the effect on birthweight is the same for single and divorced mothers.

f) (5 Points) Suppose you decided to use the “single” category as the reference category in question (d); derive by hand the intercept and all slope estimates (coefficients on alcohol, smoker, education, age, married and divorced dummy variables).

The previous model was:

$$birthweight_i = \beta_0 + \beta_1 alcohol_i + \beta_2 smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

The new model is:

$$birthweight_i = \gamma_0 + \gamma_1 alcohol_i + \gamma_2 smoker_i + \gamma_3 educ_i + \gamma_4 age_i + \gamma_5 married_i + \gamma_6 divorced_i + u_i$$

Let’s first interpret the coefficients from the old model.

$\beta_0$: average birthweight for married mothers (the base category), when all other independent variables are zero.
$\beta_5$: average birthweight of single mothers minus the average birthweight of married mothers, holding everything else constant.
$\beta_6$: average birthweight of divorced mothers minus the average birthweight of married mothers, holding everything else constant.

Then, with all other independent variables at zero, the average birthweight of single mothers is $\beta_0 + \beta_5$ and the average birthweight of divorced mothers is $\beta_0 + \beta_6$.

Let’s now find the coefficients in the new regression:

The coefficients on alcohol, smoker, educ, and age are the same as in (d); they do not change ($\hat{\gamma}_j = \hat{\beta}_j$ for $j = 1, \dots, 4$).

The intercept, $\gamma_0$, is the average birthweight for single mothers (the base category), when all other independent variables are zero. Then, $\hat{\gamma}_0 = \hat{\beta}_0 + \hat{\beta}_5 = 3433.736 - 230.7544 = 3202.98$.
The coefficient on married, $\gamma_5$, is the average birthweight of married mothers minus the average birthweight of single mothers, holding everything else constant. It is the negative of the coefficient on single in the previous regression: $\hat{\gamma}_5 = -\hat{\beta}_5 = 230.7544$.

Lastly, the coefficient on divorced, $\gamma_6$, is the average birthweight of divorced mothers minus the average birthweight of single mothers, holding everything else constant. From the previous regression, $\hat{\gamma}_6 = \hat{\beta}_6 - \hat{\beta}_5 = 26.1848 - (-230.7544) = 256.939$.

We can verify the result using STATA.

    . regress birthweight alcohol smoker educ age married divorced, robust

    Linear regression                               Number of obs     =      2,999
                                                    F(6, 2992)        =      27.84
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0577
                                                    Root MSE          =     575.39

    ------------------------------------------------------------------------------
                 |               Robust
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         alcohol |  -37.68205   76.76566    -0.49   0.624    -188.2009    112.8368
          smoker |  -186.6101   28.23914    -6.61   0.000    -241.9802     -131.24
            educ |   6.933331   5.632159     1.23   0.218    -4.109966    17.97663
             age |  -2.276274   2.486453    -0.92   0.360    -7.151604    2.599055
         married |   230.7544   34.04825     6.78   0.000      163.994    297.5147
        divorced |   256.9392   33.57875     7.65   0.000     191.0994     322.779
           _cons |   3202.982   76.84889    41.68   0.000       3052.3    3353.664
    ------------------------------------------------------------------------------

g) (5 Points) Test whether drinking and smoking are jointly statistically significant at the 1% significance level. Use the regression in (d).

This is a test of a joint hypothesis, so we need an F-test. The “test” command performs this F-test. The reported p-value is 0.0000, which is below 1%, so we strongly reject the null that both effects are zero. Thus, the variables are jointly significant.

    . test alcohol smoker

     ( 1)  alcohol = 0
     ( 2)  smoker = 0

           F(  2,  2992) =   22.50
                Prob > F =    0.0000

h) (5 Points) Re-estimate the regression in (d) assuming homoscedasticity. Test whether any variable in the regression is significant. Derive the homoscedasticity only F-statistic by hand (i.e., do not use STATA to calculate it). Notice that you will have to run two regressions to compute the F-statistic.

    . regress birthweight alcohol smoker educ age single divorced //unrestricted

          Source |       SS           df       MS      Number of obs   =     2,999
    -------------+----------------------------------   F(6, 2992)      =     30.53
           Model |  60640280.1         6  10106713.3   Prob > F        =    0.0000
        Residual |   990578985     2,992  331075.864   R-squared       =    0.0577
    -------------+----------------------------------   Adj R-squared   =    0.0558
           Total |  1.0512e+09     2,998  350640.182   Root MSE        =    575.39

    ------------------------------------------------------------------------------
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         alcohol |  -37.68205   77.10625    -0.49   0.625    -188.8687    113.5046
          smoker |  -186.6101   27.95404    -6.68   0.000    -241.4212    -131.799
            educ |   6.933331   5.622676     1.23   0.218    -4.091372    17.95803
             age |  -2.276274   2.307151    -0.99   0.324    -6.800038    2.247489
          single |  -230.7544   31.08617    -7.42   0.000    -291.7068    -169.802
        divorced |    26.1848   23.90707     1.10   0.273    -20.69117    73.06076
           _cons |   3433.736   82.52552    41.61   0.000     3271.924    3595.549
    ------------------------------------------------------------------------------
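The summary figures in the header of the output above can be reproduced from its sums of squares. The following is a minimal sketch, in Python rather than Stata and intended only as a by-hand check; it assumes the Model and Residual sums of squares exactly as printed, and the variable names are illustrative.

```python
import math

# Values copied from the ANOVA block of the homoskedastic regression above.
ess = 60640280.1    # Model SS (explained sum of squares)
ssr = 990578985.0   # Residual SS (sum of squared residuals)
n, k = 2999, 6      # observations and number of slope coefficients

tss = ess + ssr
r2 = ess / tss
ser = math.sqrt(ssr / (n - k - 1))  # Root MSE

# Homoskedasticity-only F for H0: all six slopes are zero.
# Under H0 the restricted model has only an intercept, so its R2 is 0.
f_from_ss = (ess / k) / (ssr / (n - k - 1))
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(r2, 4), round(ser, 2), round(f_from_ss, 2), round(f_from_r2, 2))
```

Both versions of the F-statistic are algebraically the same quantity, so they agree up to floating-point error; small differences from a hand calculation arise only when the rounded $R^2$ of 0.0577 is used instead of the unrounded ratio.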
We need the $R^2$ from the restricted and unrestricted models to compute the homoscedasticity only F-statistic. However, we know that the $R^2$ from the restricted regression is zero, since the model has no independent variables once we impose the restrictions under the null. Hence, we only need the $R^2$ from the original (unrestricted) regression. The actual value of the F-statistic is:

$$F = \frac{0.0577/6}{(1-0.0577)/(2{,}999-6-1)} \approx 30.53$$

The critical value for a 5% significance level is

    . di invFtail(6,2999-6-1,0.05)
    2.101613

Since the F-statistic (30.53) is larger than the critical value (2.10), we reject the null hypothesis. We can verify the answer using the test command in STATA. The reported p-value is 0.0000, which means that we reject the null hypothesis at the 1%, 5%, and 10% significance levels.

    . test alcohol smoker educ age single divorced

     ( 1)  alcohol = 0
     ( 2)  smoker = 0
     ( 3)  educ = 0
     ( 4)  age = 0
     ( 5)  single = 0
     ( 6)  divorced = 0

           F(  6,  2992) =   30.53
                Prob > F =    0.0000

i) (5 Points) Using an F-test (approach 1 in the lecture notes), test whether the effect of smoking on birthweight is 6 times the effect of drinking on birthweight. Use the results from the regression in 2.d.

This is a single hypothesis test involving two parameters (q=1). We can use the F-test or t-test to answer this. The “test” command performs this F-test.

    . test smoker=6*alcohol

     ( 1)  - 6*alcohol + smoker = 0

           F(  1,  2992) =    0.01
                Prob > F =    0.9322

The p-value on this test is 0.932, meaning that we fail to reject the null that the effect of smoking on birthweight is 6 times the effect of drinking on birthweight.

j) (5 Points) Using a t-test (approach 2 in the lecture notes), test whether the effect of smoking on birthweight is 6 times the effect of drinking on birthweight. Report the results of any auxiliary regressions you need to run to complete this test. Use the regression in (d).

Suppose the regression we are estimating is
$$birthweight_i = \beta_0 + \beta_1 alcohol_i + \beta_2 smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

The null hypothesis we are testing is therefore that $6\beta_1 = \beta_2$. Alternatively, we could state this null hypothesis as $\beta_2 - 6\beta_1 = 0$.

To implement the t-test, define $\theta = \beta_2 - 6\beta_1$, which means our test is now whether $\theta = 0$ versus the alternative that it is not. Rearranging the $\theta$ equation to $\theta + 6\beta_1 = \beta_2$, and then substituting into the regression above, we get

$$birthweight_i = \beta_0 + \beta_1 alcohol_i + (\theta + 6\beta_1) smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

Expanding out the bracket and rearranging we get

$$birthweight_i = \beta_0 + \beta_1 alcohol_i + \theta \, smoker_i + 6\beta_1 smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

$$birthweight_i = \beta_0 + \beta_1 (alcohol_i + 6 \, smoker_i) + \theta \, smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

Let $alc\_smoke_i = alcohol_i + 6 \, smoker_i$; then the regression we need to run is

$$birthweight_i = \beta_0 + \beta_1 \, alc\_smoke_i + \theta \, smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

Then, do a t-test on the coefficient in front of smoker.

    . gen alc_smoke = alcohol + 6*smoker

    . regress birthweight alc_smoke smoker educ age single divorced, robust

    Linear regression                               Number of obs     =      2,999
                                                    F(6, 2992)        =      27.84
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0577
                                                    Root MSE          =     575.39

    ------------------------------------------------------------------------------
                 |               Robust
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       alc_smoke |  -37.68205   76.76566    -0.49   0.624    -188.2009    112.8368
          smoker |   39.48222   464.2455     0.09   0.932    -870.7904    949.7549
            educ |   6.933331   5.632159     1.23   0.218    -4.109966    17.97663
             age |  -2.276274   2.486453    -0.92   0.360    -7.151604    2.599055
          single |  -230.7544   34.04825    -6.78   0.000    -297.5147    -163.994
        divorced |    26.1848   22.95292     1.14   0.254    -18.82031     71.1899
           _cons |   3433.736   86.86598    39.53   0.000     3263.413    3604.059
    ------------------------------------------------------------------------------

Here, we see that the estimate of $\theta$ is 39.48222. The p-value on this test is 0.932, the same as for the F-test in (i), meaning that we fail to reject the null that $\beta_2 - 6\beta_1 = 0$.

Note that you can alternatively define $\theta = 6\beta_1 - \beta_2$. Rearranging the $\theta$ equation to $(\theta + \beta_2)/6 = \beta_1$, and then substituting into the regression above, we get

$$birthweight_i = \beta_0 + \frac{\theta + \beta_2}{6} alcohol_i + \beta_2 smoker_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

Expanding out the bracket and rearranging we get

$$birthweight_i = \beta_0 + \theta \, \frac{alcohol_i}{6} + \beta_2 \left( smoker_i + \frac{alcohol_i}{6} \right) + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

Let $alc\_smoke2_i = smoker_i + alcohol_i/6$ and $alc\_2_i = alcohol_i/6$; then the regression we need to run is

$$birthweight_i = \beta_0 + \theta \, alc\_2_i + \beta_2 \, alc\_smoke2_i + \beta_3 educ_i + \beta_4 age_i + \beta_5 single_i + \beta_6 divorced_i + u_i$$

Then, do a t-test on the coefficient in front of alc_2. The conclusion is the same (the estimate only changes sign), see below.

    . gen alc_2=alcohol/6
    . gen alc_smoke2 = alcohol/6 + smoker

    . regress birthweight alc_smoke2 alc_2 educ age single divorced, robust

    Linear regression                               Number of obs     =      2,999
                                                    F(6, 2992)        =      27.84
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0577
                                                    Root MSE          =     575.39

    ------------------------------------------------------------------------------
                 |               Robust
     birthweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
      alc_smoke2 |  -186.6101   28.23914    -6.61   0.000    -241.9802     -131.24
           alc_2 |  -39.48224   464.2455    -0.09   0.932    -949.7549    870.7904
            educ |   6.933331   5.632159     1.23   0.218    -4.109966    17.97663
             age |  -2.276274   2.486453    -0.92   0.360    -7.151604    2.599055
          single |  -230.7544   34.04825    -6.78   0.000    -297.5147    -163.994
        divorced |    26.1848   22.95292     1.14   0.254    -18.82031     71.1899
           _cons |   3433.736   86.86598    39.53   0.000     3263.413    3604.059
    ------------------------------------------------------------------------------
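As a sanity check on the two reparameterizations, the sketch below recomputes the transformed coefficient from the estimates in regression (d) and compares the squared t-statistic with the F-statistic from (i). It uses Python only as a calculator and is not part of the required Stata output. The inputs are the coefficients and robust standard error as printed, so agreement holds only up to rounding, and the variable names are illustrative.

```python
# Point estimates from regression (d) and the robust standard error
# reported for the reparameterized coefficient, copied as printed.
b_alcohol = -37.68205
b_smoker = -186.6101
se_theta = 464.2455

theta = b_smoker - 6 * b_alcohol      # first parameterization
theta_alt = 6 * b_alcohol - b_smoker  # second parameterization

t_stat = theta / se_theta
f_equiv = t_stat ** 2  # with a single restriction, F equals t squared

print(f"theta = {theta:.4f}, theta_alt = {theta_alt:.4f}")
print(f"t = {t_stat:.3f}, t^2 = {f_equiv:.4f}")
```

The two parameterizations differ only in sign, and the squared t-statistic rounds to the F-statistic of 0.01 reported by the test command in (i), which is why approaches 1 and 2 give the same p-value.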