Department of Economics, UN3412
Columbia University, Fall 2023

SOLUTIONS to Problem Set 2
Introduction to Econometrics (Erden, Section 1)
______________________________________________________________________________

Please make sure to select the page number for each question when you upload your solutions to Gradescope. Otherwise your answers are hard to grade, and you may lose points.

Part I
True, False, Uncertain with Explanation:

1. [graded] The assumption that the errors in a regression model have a normal distribution is necessary to obtain the result that the least squares estimator for the slope is unbiased.

FALSE. We do not need normality for unbiasedness. Note that

$$\hat\beta_{OLS} = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sum (X_i - \bar X)^2} = \frac{\sum (X_i - \bar X)\,[\beta (X_i - \bar X) + (u_i - \bar u)]}{\sum (X_i - \bar X)^2},$$

since $Y_i - \bar Y = \beta (X_i - \bar X) + (u_i - \bar u)$. Hence we can write

$$\hat\beta_{OLS} = \frac{\sum (X_i - \bar X)^2}{\sum (X_i - \bar X)^2}\,\beta + \frac{\sum (X_i - \bar X)(u_i - \bar u)}{\sum (X_i - \bar X)^2} = \beta + \frac{\sum (X_i - \bar X)\,u_i}{\sum (X_i - \bar X)^2},$$

since

$$\sum (X_i - \bar X)\,\bar u = \bar u \sum X_i - n \bar X \bar u = \bar u\,\frac{n}{n} \sum X_i - n \bar X \bar u = n \bar X \bar u - n \bar X \bar u = 0,$$

or similarly,

$$\sum (X_i - \bar X)\,\bar u = \bar u \sum X_i - n \bar X \bar u = \bar u \sum X_i - n\,\frac{\sum X_i}{n}\,\bar u = 0.$$

Then,

$$E[\hat\beta_{OLS}] = \beta + E\left[\frac{\sum (X_i - \bar X)\,u_i}{\sum (X_i - \bar X)^2}\right].$$

Applying the law of iterated expectations,

$$E[\hat\beta_{OLS}] = \beta + E\left[E\left[\frac{\sum (X_i - \bar X)\,u_i}{\sum (X_i - \bar X)^2} \,\Big|\, X_1, \dots, X_n\right]\right].$$

Under assumption LSA#2 (i.i.d. sampling), $E[u_i \mid X_1, \dots, X_n] = E[u_i \mid X_i]$, so we can write

$$E[\hat\beta_{OLS}] = \beta + E\left[\frac{\sum (X_i - \bar X)\,E[u_i \mid X_i]}{\sum (X_i - \bar X)^2}\right] = \beta,$$

because $E[u_i \mid X_i] = 0$ by LSA#1. Normality of the errors is not used anywhere in this argument.

2. [graded] The correct interpretation of an estimated regression coefficient is the following: $\hat\beta_j$ is the effect of a unit change in $X_j$ on $Y$ with all the other regressors moving in concert with $X_j$ according to their sample correlations.

FALSE. In the linear regression model, the correct interpretation is: $\beta_j$ is the marginal effect of $X_j$ on $Y$ holding all other regressors constant.

3. [graded] There are 20 stores along Madison Avenue. I want to estimate the effect of weather on sales at the stores, so I run one regression for each store. In each I regress sales on weather conditions. Claim: I would expect to get one slope coefficient significantly different from zero at the 5% level even if all the true (population) slope coefficients are zero.

TRUE. By definition, if we test the null hypothesis that $\beta = 0$ at a significance level of 5%, we reject the null 5% of the time when the null is true. With 20 such tests, the expected number of false rejections is 20 × 0.05 = 1.

4. [graded] In the above regression, I should also include variables for whether or not each store is running a sale. Otherwise, omitting that variable will bias my coefficient estimates on weather conditions.

UNCERTAIN. Omitting variables for whether or not each store is running a sale is a problem for estimating the effect of weather on sales only if both of the following hold:
  i. Whether or not a store is running a sale is correlated with weather.
  ii. Whether or not a store is running a sale affects sales.
Running a sale should increase sales, so (ii) is probably true. Therefore, if running a sale is correlated with weather, omitting the sale variables would generate biased estimates of the effect of weather on sales.

Part II

1. [graded] In Problem Set 1, last week, you calculated the intercept and slope of the sample regression of lung cancer deaths in 1950 on cigarettes consumed per capita in 1930 for the five countries given below:

  Observation #   Country         Cigarettes consumed per     Lung cancer deaths per million
                                  capita in 1930 (X)          people in 1950 (Y)
  1               Switzerland      530                        250
  2               Finland         1115                        350
  3               Great Britain   1145                        465
  4               Canada           510                        150
  5               Denmark          380                        165

This week, please calculate the same statistics using STATA. On the STATA output file, find and label the items.
  i) The sample means of X and Y, $\bar X$ and $\bar Y$.
  ii) The standard deviations of X and Y, $s_X$ and $s_Y$.
  iii) The correlation coefficient, r, between X and Y.
  iv) $\hat\beta_1$, the OLS estimated slope coefficient from the regression $Y_i = \beta_0 + \beta_1 X_i + u_i$.
  v) $\hat\beta_0$, the OLS estimated intercept term from the same regression.
  vi) $\hat Y_i$, i = 1,…, n, the predicted values for each country from the regression.
  vii) $\hat u_i$, the OLS residual for each country.

STATA HINTS: First load STATA and type “edit,” which brings up something that looks like a spreadsheet. Enter the smoking and cancer values in the first two columns. Double-click the column headers to enter variable names (e.g. “smoke”, “death”). Close the editor window when you are done. The following commands will be useful:

  list        lists the data (to be sure you typed it in correctly)
  summarize   computes sample means and standard deviations (the option “, detail” gives additional statistics, including the sample variance)
  correlate   produces correlation coefficients (with the option “, covariance” this command produces covariances)
  regress     estimates a regression by OLS
  predict     computes OLS predicted values and residuals

Note that STATA has on-line help. Do not be concerned if you do not yet understand all the statistics shown in the output; we will discuss them in class in due course.

Answers:

a) Listing of the data:

     +-------------------------+
     | country   cigs   deaths |
     |-------------------------|
  1. |   Switz    530      250 |
  2. | Finland   1115      350 |
  3. | Britain   1145      465 |
  4. |  Canada    510      150 |
  5. | Denmark    380      165 |
     +-------------------------+

b) Mean and standard deviation:

    . summarize cigs deaths;

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            cigs |         5         736    364.4071        380       1145
          deaths |         5         276    132.3537        150        465

c) Correlation coefficient:

    . * ----- compute correlation -----;
    . correlate cigs deaths;
    (obs=5)

                 |     cigs   deaths
    -------------+------------------
            cigs |   1.0000
          deaths |   0.9263   1.0000

d) OLS Regression:

    . regress deaths cigs;

          Source |       SS       df       MS              Number of obs =       5
    -------------+------------------------------           F(  1,     3) =   18.12
           Model |  60116.1644     1  60116.1644           Prob > F      =  0.0238
        Residual |  9953.83564     3  3317.94521           R-squared     =  0.8579
    -------------+------------------------------           Adj R-squared =  0.8106
           Total |       70070     4     17517.5           Root MSE      =  57.602

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            cigs |   .3364177   .0790347     4.26   0.024      .084894    .5879414
           _cons |   28.39656   63.61827     0.45   0.686    -174.0652    230.8583
    ------------------------------------------------------------------------------

$\hat\beta_0$ = 28.39656
$\hat\beta_1$ = .3364177

e) Predicted values and residuals:

    . predict dhat;
    (option xb assumed; fitted values)
    . generate uhat = deaths - dhat;
    . list deaths dhat uhat;

         +-------------------------------+
         | deaths       dhat        uhat |
         |-------------------------------|
      1. |    250    206.698    43.30205 |
      2. |    350   403.5023   -53.50232 |
      3. |    465   413.5948    51.40515 |
      4. |    150   199.9696   -49.96959 |
      5. |    165   156.2353    8.764709 |
         +-------------------------------+

In this table, the predicted values are dhat and the residuals are uhat.

2. [graded] Using the “graph twoway” command in STATA, graph the scatterplot of the five data points and the regression line. Interpret the sample slope and sample intercept.

Answers: Running the cigar.do file generates the graph. The command for it is

    graph twoway (scatter deaths cigs) (lfit deaths cigs)

[Figure: scatterplot of deaths (vertical axis) against cigs (horizontal axis) with the fitted regression line. The points are labeled Great Britain, Switzerland, Finland, Denmark, and Canada; the predicted value and the residual for Finland are marked.]

The estimated intercept, $\hat\beta_0$ = 28.4, is the value at which the regression line intersects the vertical axis. The slope of the regression line is 0.336, so an increase of one cigarette per capita is associated with an increase in the death rate of 0.336 lung cancer deaths per million.

cigar.do file:

    clear all
    *************************************************************
    * PS2-cigar.do
    * STATA calculations for W3412, problem set #2
    *************************************************************
    log using PS2-cigar.log, replace
    set more 1
    *************************************************************
    * read in data
    input str8 country cigs deaths
    "Switz" 530 250
    "Finland" 1115 350
    "Britain" 1145 465
    "Canada" 510 150
    "Denmark" 380 165
    end
    *
    list
    * ---- compute mean and variance -----
    summarize cigs deaths
    * ----- compute correlation -----
    correlate cigs deaths
    * ----- regression of death rate on cigarettes per capita -----
    regress deaths cigs
    * ----- compute predicted values and residuals -----
    predict dhat
    generate uhat = deaths - dhat
    list deaths dhat uhat
    * ------ scatterplot and regression line -----
    * (Stata command names are case-sensitive, so "graph" must be lowercase)
    graph twoway (scatter deaths cigs) (lfit deaths cigs)
    log close
    clear
    exit
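As a cross-check on the STATA output for the cigarette data, the same statistics can be computed directly from the textbook OLS formulas. The sketch below uses plain Python rather than STATA; the function name `ols_summary` and the dictionary keys are illustrative choices, not part of the problem set.

```python
from math import sqrt

# Data from the problem: cigarettes per capita in 1930 (x) and
# lung cancer deaths per million people in 1950 (y).
countries = ["Switzerland", "Finland", "Great Britain", "Canada", "Denmark"]
cigs = [530, 1115, 1145, 510, 380]
deaths = [250, 350, 465, 150, 165]


def ols_summary(x, y):
    """Sample moments and simple-regression OLS estimates for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and cross-products.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx                   # beta_1 hat
    intercept = y_bar - slope * x_bar     # beta_0 hat
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return {
        "x_bar": x_bar,
        "y_bar": y_bar,
        "sd_x": sqrt(s_xx / (n - 1)),     # sample standard deviations
        "sd_y": sqrt(s_yy / (n - 1)),
        "r": s_xy / sqrt(s_xx * s_yy),    # correlation coefficient
        "slope": slope,
        "intercept": intercept,
        "fitted": fitted,
        "residuals": residuals,
    }


result = ols_summary(cigs, deaths)
print("means:", result["x_bar"], result["y_bar"])
print("std devs:", round(result["sd_x"], 4), round(result["sd_y"], 4))
print("correlation:", round(result["r"], 4))
print("slope:", round(result["slope"], 7), "intercept:", round(result["intercept"], 5))
for name, fit, res in zip(countries, result["fitted"], result["residuals"]):
    print(f"{name:14s} fitted={fit:9.4f} residual={res:9.4f}")
```

The printed values should agree with the `summarize`, `correlate`, `regress`, and `predict` output up to rounding, and the residuals sum to zero, as they must for any OLS regression that includes an intercept.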
3. [graded] For many years, housing economists believed that households spend a constant fraction of income on housing, as in

    housing expenditure = β(income) + u

The file housing.dta contains housing expenditures (housing) and total expenditures (total) for a sample of 19th century Belgian workers collected by Edouard Ducpetiaux [1]. The differences in housing expenditures from one observation to the next are in the variable dhousing; the differences in total expenditures are in the variable dtotal.
(a) Compute the means of total expenditure and housing expenditure in this sample.
(b) Estimate β using total expenditure for total income.
(c) If income rises by 100 (it averages around 900 in this sample), what change in estimated expected housing expenditure results according to your estimate in (b)?
(d) Interpret the R².
(e) What economic argument would you make against housing absorbing a constant share of income?
(f) What are some determinants of housing captured by u?

[1] Edouard Ducpetiaux, Budgets Economiques de Classes de Ouvrieres en Belgique (Brussels, Hayaz 1855)

Solution:

a)

    > 12\Problem Set 2 Fall 2012\housing.dta", clear

    . sum housing total

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         housing |       162    72.54259    57.26064       7.25     450.52
           total |       162    902.8239    411.6408     377.06    2822.54

b)

    . reg housing total, noconstant

          Source |       SS       df       MS              Number of obs =     162
    -------------+------------------------------           F(  1,   161) =  296.98
           Model |  895121.769     1  895121.769           Prob > F      =  0.0000
        Residual |  485275.167   161  3014.13147           R-squared     =  0.6485
    -------------+------------------------------           Adj R-squared =  0.6463
           Total |  1380396.94   162  8520.96874           Root MSE      =  54.901

    ------------------------------------------------------------------------------
         housing |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           total |   .0749545   .0043495    17.23   0.000     .0663651    .0835439
    ------------------------------------------------------------------------------

c) Housing expenditure is expected to increase by about 7.5 (0.0749545 × 100).

d) Since this regression does not contain a constant, we cannot necessarily interpret the R² in the usual way (i.e. that 64.9% of the variation in housing expenditures is explained by the variation in income). Without a constant, STATA computes the total sum of squares around zero rather than around the sample mean, which inflates the R². To see this, run the regression including a constant; the R² is now 12.3%!

e) The relationship is more likely to be nonlinear; for example, the share of income spent on housing may fall as income rises.

f) Price, mortgage interest rates, location, etc. (answers will vary here)

.do file for this problem:

    clear all
    *************************************************************
    * PS2-housing.do
    * STATA calculations for W3412, problem set #2
    *************************************************************
    log using PS2-housing, replace
    set more 1
    *************************************************************
    * read in data
    use housing.dta, clear
    * ---- compute mean -----
    sum housing total
    * ----- regression of housing expenditure on total income -----
    reg housing total, noconstant
    log close
    clear
    exit

4. [graded] Suppose that a researcher, using wage data on 250 randomly selected male workers and 280 female workers, estimates the OLS regression

    $\widehat{WAGE}$ = 12.68 + 2.79 × MALE,  R² = 0.06,  SER = 3.1,

where WAGE is measured in $/hour and MALE is a binary variable that is equal to one if the person is male and zero if the person is female. The standard errors for the coefficients are 0.18 and 0.84, respectively. Define the wage gender gap as the difference in mean earnings between men and women.

a) What is the estimated wage gender gap? (This asks you to use the formula for the OLS regression coefficients and apply it to the case of a dummy regressor, i.e. the interpretation of $\hat\beta_1$ is not “slope” here; what is it in terms of conditional expectations?)

Let MALE be X and WAGE be Y.
When $X_i = 0$, $Y_i = \beta_0 + u_i$, so the mean of $Y_i$ is $\beta_0$; that is, $E(Y_i \mid X_i = 0) = \beta_0$.
When $X_i = 1$, $Y_i = \beta_0 + \beta_1 + u_i$, so the mean of $Y_i$ is $\beta_0 + \beta_1$; that is, $E(Y_i \mid X_i = 1) = \beta_0 + \beta_1$.
So $\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$ = population difference in group means. Hence, the estimated gender gap is $2.79 per hour.

b) Is the estimated gender gap significantly different from zero? (Compute the p-value for testing the null hypothesis that there is no gender gap.)

t-stat = (2.79 − 0)/0.84 = 3.32. From the standard normal distribution table, the two-sided p-value is 2Φ(−3.32) ≈ 0.001, so $\beta_1$ is significantly different from 0 at the 5% level (and at the 1% level).

c) In the sample, what is the mean wage of women? Of men?

$12.68 for women and $12.68 + $2.79 = $15.47 for men.

d) Another researcher uses the same data, but regresses WAGE on FEMALE, a variable that is equal to one if the person is female and zero if the person is male. What are the regression estimates (slope and constant) calculated from this regression? What is the R²?

$\widehat{WAGE}$ = 15.47 − 2.79 × FEMALE. The R² stays the same (0.06) because the fitted values, and hence the ESS (explained sum of squares), do not change.

5. A professor decides to run an experiment to measure the effect of time pressure on final exam scores. He gives each of the 400 students in his course the same final exam, but some students have 90 minutes to complete the exam while others have 120 minutes. Each student is randomly assigned one of the examination times based on the flip of a coin. Let $Y_i$ denote the number of points scored on the exam by the $i$th student ($i = 1, \dots, 400$), and let $X_i$ be the number of minutes that the student had to complete the exam (so $X_i = 90$ or $X_i = 120$). Consider the linear regression model

    $Y_i = \beta_0 + \beta_1 X_i + u_i$

(a) Explain what the term $u_i$ represents here. Why will different students have different values of $u_i$?

$u_i$ is the error term of the regression. It contains all the factors other than $X_i$ that affect the exam score. For example, in the case above, different students have different aptitude for the subject, different preparation times, different teaching assistants, etc. All these factors are part of the error term.

(b) Explain why $E(u_i \mid X_i) = 0$ in this setting.

Because the value of $X_i$ for each student was randomly assigned by the coin flip. This ensures that the groups receiving the two different treatments are roughly alike in every other respect. As such, any other variable that might affect $Y_i$ is uncorrelated with $X_i$.

(c) Suppose the estimated regression line is $\hat Y_i = 49 + 0.24 X_i$. Compute the predicted score of a student who was given 90 minutes to complete the exam.

$\hat Y$ = 49 + 0.24 × 90 = 70.6

(d) Based on the estimates in part (c), compute the estimated gain in score for a student who is given an additional 10 minutes on the exam.
$E[Y \mid X + 10] - E[Y \mid X]$ = 0.24 × 10 = 2.4

(e) Suppose the university administration would consider a policy change that requires all exams to take 240 minutes. What would the estimates in (c) literally suggest happens to students’ average scores in this case? Should we take this prediction seriously? Why or why not?

The estimates in regression (c) would tell us that the expected value of Y is 49 + 0.24 × 240 = 106.6. The estimate is not sound because applying the regression coefficients in (c) to this scenario is an erroneous extrapolation of the linear trend between 90 and 120 minutes. In other words, the result in (c) lacks external validity. We do not know how the trend extrapolates beyond this range. It is possible, for example, that after a threshold time there are no further gains in test scores.

The following questions will not be graded; they are for you to practice and will be discussed at the recitation:

6. [Practice question, not graded] SW Problem 4.1

(a) The predicted average test score is
    $\widehat{TestScore}$ = 520.4 − 5.82 × 22 = 392.36

(b) The predicted decrease in the classroom average test score is
    (−5.82 × 19) − (−5.82 × 23) = 23.28
or, stated as a change,
    $\Delta\widehat{TestScore}$ = (−5.82 × 23) − (−5.82 × 19) = −23.28

(c) Using the formula for $\hat\beta_0$, we know the sample average of the test scores across the 100 classrooms is
    $\overline{TestScore} = \hat\beta_0 + \hat\beta_1 \times \overline{CS}$ = 520.4 − 5.82 × 21.4 = 395.85

(d) Use the formula for the standard error of the regression (SER) to get the sum of squared residuals:
    SSR = (n − 2) × SER² = (100 − 2) × 11.5² = 12961
Use the formula for R² to get the total sum of squares:
    TSS = SSR / (1 − R²) = 12961 / (1 − 0.08) = 14088
The sample variance is $s_Y^2$ = TSS / (n − 1) = 14088 / 99 = 142.3. Thus, the standard deviation is $s_Y = \sqrt{s_Y^2}$ = 11.9.

7. [Practice question, not graded] SW Problem 4.3

8. [Practice question, not graded] Let KIDS denote the number of children born to a woman, and let EDUC denote years of education for the woman. A simple model relating fertility to years of education is

    KIDS = a + b × EDUC + u,

where u is the unobserved error term.

a) What kinds of factors are contained in u? Are these likely to be correlated with level of education?

Income, age, and family background (such as number of siblings) are just a few possibilities. It seems that each of these could be correlated with years of education. (Income and education are probably positively correlated; age and education may be negatively correlated because women in more recent cohorts have, on average, more education; and number of siblings and education are probably negatively correlated.)

b) Will simple regression of KIDS on EDUC uncover the ceteris paribus (‘all else equal’) effect of education on fertility? Explain.

Not if the factors we listed in part (a) are correlated with EDUC. Because we would like to hold these factors fixed, they are part of the error term. But if u is correlated with EDUC, then E(u|EDUC) is not zero, and thus OLS Assumption (A2) fails.

9. [Practice question, not graded] SW Problem 4.9

10. [Practice question, not graded] SW Empirical Exercise 4.1

(a) Yes, there appears to be a weak positive relationship.

(b) Malta is the “outlying” observation with a trade share of 2.

(c) $\widehat{Growth}$ = 0.64 + 2.31 × Tradeshare
Predicted growth (Trade Share = 1) = 0.64 + 2.31 × 1 = 2.95
Predicted growth (Trade Share = 0.5) = 0.64 + 2.31 × 0.50 = 1.80

(d) $\widehat{Growth}$ = 0.96 + 1.68 × Tradeshare
Predicted growth (Trade Share = 1) = 0.96 + 1.68 × 1 = 2.64
Predicted growth (Trade Share = 0.5) = 0.96 + 1.68 × 0.50 = 1.80

(e) Malta is an island nation in the Mediterranean Sea, south of Sicily. Malta is a freight transport site, which explains its large “trade share.” Many goods come into Malta (as imports) and are immediately transported to other countries (as exports from Malta). Thus, Malta’s imports and exports are unlike the imports and exports of most other countries. Malta should not be included in the analysis.

Calculations for this exercise are carried out in the STATA do file:

    # delimit ;
    set more off;
    set linesize 200;
    clear all;
    *************************************************************;
    * Empirical Exercise 4.1;
    *;
    * (Note: Change path name so that it is appropriate for your computer);
    log using EE_4_1.log,replace;
    *************************************************************;
    * (Note: Change path name so that it is appropriate for your computer);
    use /Users/mwatson/Dropbox/TB/4E//EE_Datasets/Growth.dta;
    desc;
    *************************************************************;
    summ;
    graph twoway scatter growth tradeshare;
    reg growth tradeshare, r;
    *graph twoway (lfit growth tradeshare) (scatter growth tradeshare);
    drop if tradeshare > 1.5;
    reg growth tradeshare if (tradeshare < 1.5), r;
    *graph twoway (lfit growth tradeshare if (tradeshare < 1.5)) (lfit growth tradeshare) (scatter growth tradeshare);
    log close;
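The predicted growth rates in parts (c) and (d) of Empirical Exercise 4.1 are simple evaluations of the two fitted lines. The short Python sketch below redoes that arithmetic using the rounded coefficients reported above; the helper name `predict_growth` is an illustrative assumption. Because the coefficients are rounded, the full-sample prediction at a trade share of 0.5 comes out as 1.795, which the solution reports as 1.80.

```python
def predict_growth(intercept, slope, trade_share):
    """Evaluate a fitted line Growth-hat = intercept + slope * Tradeshare."""
    return intercept + slope * trade_share


# Rounded coefficient estimates as reported in parts (c) and (d).
fitted_lines = {
    "all countries": (0.64, 2.31),
    "excluding Malta": (0.96, 1.68),
}

for label, (b0, b1) in fitted_lines.items():
    for trade_share in (1.0, 0.5):
        growth = predict_growth(b0, b1, trade_share)
        print(f"{label:16s} trade share {trade_share:.1f}: predicted growth {growth:.3f}")
```

Dropping the single outlier lowers the estimated slope from 2.31 to 1.68, yet the two lines give almost the same prediction at a trade share of 0.5, which sits inside the range where most of the data lie.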
11. [Practice question, not graded] SW Empirical Exercise 4.2

(a) The median height in the sample is 67 inches.

(b) The following table shows average annual earnings for taller and shorter workers:

                                 Annual Earnings ($)
  Height                         Mean      SE(Mean)   95% Confidence Interval for mean
  Height ≤ 67 inches             44,488    265.5      43,968 to 45,009
  Height > 67 inches             49,988    305.4      49,389 to 50,587
  Difference (Tall − Short)       5,499    404.7       4,706 to  6,293

The estimated average annual earnings are $44,488 for shorter workers and $49,988 for taller workers, a difference of $5,499. The 95% confidence interval for the difference is $4,706 to $6,293. The difference is large (more than 10% of average earnings), precisely estimated (a standard error of $405), and statistically significantly different from zero.

(c) Scatterplot
The data documentation reports that individual earnings were reported in 23 brackets, and a single average value is reported for earnings in the same bracket. Thus, the dataset contains 23 distinct values of earnings.

(d) The estimated regression is
    $\widehat{Earnings}$ = −512.7 + 707.7 × Height,  R² = 0.011
The estimated slope is 707.7 (dollars per year for each additional inch of height). The estimated earnings are

  Height (in inches)    Predicted earnings (in dollars per year)
  65                    45,486
  67                    46,901
  70                    49,024

(e) Recall that 1 cm = 0.394 inches. The estimated regression in (d), with units shown, is
    $\widehat{Earnings}$ ($) = −512.7 ($) + 707.7 ($/inch) × Height (inches),  R² (unit free) = 0.011, and SER = 26777 ($).
Note that
    707.7 ($/inch) × Height (inches) = 707.7 ($/inch) × (0.394 inch/cm) × Height (cm) = 278.8 ($/cm) × Height (cm)
So the regression is
    $\widehat{Earnings}$ ($) = −512.7 ($) + 278.8 ($/cm) × Height (cm),  R² (unit free) = 0.011, and SER = 26777 ($)

(f) The regression for females is
    $\widehat{Earnings}$ = 12650 + 511.2 × Height,  R² = 0.002
A woman who is one inch taller than average is predicted to have earnings that are $511.2 per year higher than average.

(g) The regression for males is
    $\widehat{Earnings}$ = −43130 + 1306.9 × Height,  R² = 0.021
A man who is one inch taller than average is predicted to have earnings that are $1306.9 per year higher than average.

(h) Height may be correlated with other factors that cause earnings. For example, height may be correlated with “strength,” and in some occupations, stronger workers may be more productive. There are many other potential factors that may be correlated with height and cause earnings, and you will investigate some of these in future exercises.

Calculations for this exercise are carried out in the STATA do file:

    # delimit ;
    set more off;
    set linesize 200;
    clear all;
    *************************************************************;
    * Empirical Exercise 4.2;
    *;
    log using EE_4_2.log,replace;
    *************************************************************;
    * (Note: Change path name so that it is appropriate for your computer);
    use /Users/mwatson/Dropbox/TB/4E//EE_Datasets/earnings_and_height.dta;
    desc;
    *************************************************************;
    summ;
    summarize height, detail;
    gen split = height > 67;
    ttest earnings, by(split) unequal unpaired;
    graph twoway (lfit earnings height) (scatter earnings height);
    ** All workers;
    reg earnings height, r;
    **females;
    reg earnings height if sex==0, r ;
    **males;
    reg earnings height if sex==1, r ;
    log close;
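The arithmetic in parts (b), (d), and (e) of Empirical Exercise 4.2 can be re-checked with a few lines of Python. This is a rough sketch under stated assumptions: it uses the rounded coefficients from the solution, it takes the all-workers intercept to be −512.7 (the sign implied by the tabulated predictions), and it assumes the two height groups are independent samples when combining their standard errors. All function and variable names are illustrative. Because the coefficients are rounded, the predictions differ from the tabulated values by a dollar or two.

```python
from math import sqrt

# Part (b): difference in mean earnings and its standard error,
# treating the two height groups as independent samples (assumption).
mean_short, se_short = 44488.0, 265.5
mean_tall, se_tall = 49988.0, 305.4
difference = mean_tall - mean_short
se_difference = sqrt(se_short ** 2 + se_tall ** 2)
ci_low = difference - 1.96 * se_difference
ci_high = difference + 1.96 * se_difference
print(f"difference = {difference:.0f}, SE = {se_difference:.1f}, "
      f"95% CI = ({ci_low:.0f}, {ci_high:.0f})")

# Part (d): fitted line for all workers (rounded coefficients; the
# negative intercept is inferred from the tabulated predictions).
INTERCEPT = -512.7
SLOPE_PER_INCH = 707.7


def predicted_earnings(height_inches):
    """Predicted annual earnings in dollars for a given height in inches."""
    return INTERCEPT + SLOPE_PER_INCH * height_inches


for height in (65, 67, 70):
    print(f"height {height} in: predicted earnings ${predicted_earnings(height):,.0f}")

# Part (e): rescaling the regressor from inches to centimeters rescales
# the slope but leaves the intercept, R-squared, and SER unchanged.
INCHES_PER_CM = 0.394
slope_per_cm = SLOPE_PER_INCH * INCHES_PER_CM
print(f"slope per cm = {slope_per_cm:.1f} dollars")
```

The difference computed from the rounded group means is $5,500 rather than the reported $5,499, again a rounding effect; the standard error and confidence interval match the table to the nearest dollar.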