Assignment 3

pdf

School

New York University *

*We aren’t endorsed by this school

Course

MISC

Subject

Economics

Date

Feb 20, 2024

Type

pdf

Pages

9

Uploaded by AgentGoldfishMaster833

Report
ECON-UB 251 Assignment 3, Fall 2023 Cathy Chen knitr :: opts_chunk $ set ( echo = TRUE , eval = TRUE , out.width = "50%" , fig.asp = 0.5 , fig.align = "center" , comment = "" , message = FALSE , warning = FALSE ) options ( scipen = 100 ) 1. Panel Data 1.1 [ 12.5% ] Save the data file that is available in Brightspace , load it in your rmarkdown document, and answer the following questions: Make a scatter plot of density and vio with the color of the dots varying by state (NB: in R use ggplot2 ). Discuss the results and identify the state that stands out relative to the rest (show the code that you used to identify the state). Do the same scatter plot for log(density) vs log(vio) and discuss whether there could be any advantage from transforming the variables setwd ( "/Users/catherinechen/Desktop" ) # change this location to the folder where you saved the data ... mysample <- read_delim ( "/Users/catherinechen/Desktop/assignment5_part1.csv" ) library (ggplot2) ggplot (mysample) + geom_point ( aes ( x = density, y = vio, color = state)) 0 1000 2000 3000 0 3000 6000 9000 density vio Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin density_zscore <- scale (mysample $ density) vio_zscore <- scale (mysample $ vio) outliers <- mysample[ abs (density_zscore) > 3 | abs (vio_zscore) > 3 ,] print (outliers) 1
# A tibble: 38 x 18 state year stateid vio mur rap aga bur auto shall density rpcpi <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Distr~ 1977 11 1427. 27.8 58.3 376. 1680. 397. 0 11102. 15906. 2 Distr~ 1978 11 1412. 28.0 66.3 378. 1854. 474. 0 10902. 15900. 3 Distr~ 1979 11 1609. 27.4 74.5 452. 2051. 550. 0 10656 15460. 4 Distr~ 1980 11 2011. 31.5 69.1 509. 2560. 562. 0 10132. 14828. 5 Distr~ 1981 11 2275. 35.1 65.1 540. 2647. 592. 0 10109. 14721. 6 Distr~ 1982 11 2123. 30.7 66.7 578. 2341. 648. 0 10066. 15098. 7 Distr~ 1983 11 1915. 29.4 65.2 585. 2004. 635. 0 10039. 15441. 8 Distr~ 1984 11 1722. 28.1 58.7 658. 1758. 702. 0 10054. 16348. 9 Distr~ 1985 11 1625. 23.5 53.8 712. 1598. 803. 0 10072. 16776. 10 Distr~ 1986 11 1505. 31.0 52.4 668. 1728. 975. 0 10131. 17274. # i 28 more rows # i 6 more variables: rpcui <dbl>, rpcim <dbl>, pbm1019 <dbl>, pbm2029 <dbl>, # pwm1019 <dbl>, pwm2029 <dbl> ggplot (mysample) + geom_point ( aes ( x = log (density), y = log (vio), color = state)) 4 5 6 7 8 0.0 2.5 5.0 7.5 log(density) log(vio) Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin The state that stands out the most is the District of Columbia which I found to be an outlier through using the z score to determine results that are more than 3 standard deviations from the mean which turns out to be the District of Columbia. The rest of the datapoints in the scatterplot are clustered around 0 to 1000 population per square mile of land area divided by 1000 (density) and 0 to 1500 incidents per 100,000 members of the population (violent crime rate). District of Columbia on the other hand has high density and high crime rates which makes it an outlier. There looks to be an advantage by transforming the variables into log form since the distance between the large cluster of datapoints and of District of Columbia’s is less far away. By logging density and vio, the extreme values are pulled in or compressed because the outliers appear less extreme now. 1.2 [12.5%] Define the dataset as a panel with indexes state and year . Denote by vio it the violent crime rate in state i in year t and similarly for shall it . Estimate the pooled regression model log( vio it ) = α + β 1 shall it + u it Interpret the coefficient estimate of shall and discuss its significance. Which of the views about the effectiveness of the shall-carry laws discussed earlier does the estimate seem to support? library (plm) mysample.pd <- pdata.frame (mysample, index= c ( "state" , "year" ), drop.index= TRUE , row.names= TRUE ) mysamplepool <- plm ( log (vio) ~ shall, data = mysample.pd, model = "pooling" ) summary (mysamplepool) Pooling Model Call: 2
plm(formula = log(vio) ~ shall, data = mysample.pd, model = "pooling") Balanced Panel: n = 51, T = 38, N = 1938 Residuals: Min. 1st Qu. Median 3rd Qu. Max. -2.139114 -0.383608 0.038437 0.406258 1.858520 Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 6.121434 0.017676 346.315 < 0.00000000000000022 *** shall -0.314943 0.026351 -11.952 < 0.00000000000000022 *** --- Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 Total Sum of Squares: 684.78 Residual Sum of Squares: 637.73 R-Squared: 0.068716 Adj. R-Squared: 0.068235 F-statistic: 142.85 on 1 and 1936 DF, p-value: < 0.000000000000000222 A one-unit increase in shall (more fraction of the year a state has a shall-issue law in place) is associated with a 31.49% decrease in the violent crime rate, all else being equal. It is significant because the p-value is very very small. This interpretation supports the view that less guns leads to less crime in that the more strict a state is about issuing guns, the less the violent crime rate is. The view is “opponents believe that it might cause more crime given their wide availability”. 1.3 [12.5%] Estimate the panel data model with state and time fixed effects , that is, log( vio it ) = α i + λ t + β 1 shall it + u it Discuss the estimate of β 1 and its significance Did the coefficient estimate of shall change significantly relatively to the estimate for the pooled model? why or why not? mysamplete <- plm ( log (vio) ~ shall, data = mysample.pd, model = "within" , effect = "twoways" ) summary (mysamplete) Twoways effects Within Model Call: plm(formula = log(vio) ~ shall, data = mysample.pd, effect = "twoways", model = "within") Balanced Panel: n = 51, T = 38, N = 1938 Residuals: Min. 1st Qu. Median 3rd Qu. Max. -0.7235560 -0.1094477 0.0022864 0.1071006 1.2091013 Coefficients: Estimate Std. Error t-value Pr(>|t|) shall 0.028157 0.016754 1.6806 0.093 . --- Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Total Sum of Squares: 76.326 Residual Sum of Squares: 76.21 R-Squared: 0.0015253 Adj. R-Squared: -0.045995 F-statistic: 2.82454 on 1 and 1849 DF, p-value: 0.093002 A one unit increase in shall is associated with a 2.82% increase in violent crime rate, all else being equal, after accounting for both time and state fixed effects. pFtest (mysamplete, mysamplepool) F test for twoways effects data: log(vio) ~ shall F = 156.59, df1 = 87, df2 = 1849, p-value < 0.00000000000000022 alternative hypothesis: significant effects Through a F test for twoway effects, the p-value is really small so there was a significant change when added state and time fixed effects so they do matter. 1.4 [12.5%] Add the rpcpi + rpcui + rpcim + density + pbm1019 + pbm2029 + pwm1019 + pwm2029 regressors to the state/time fixed effect models. + Discuss the change in the estimate of β 1 and its significance relative to the model with only shall model2 <- plm ( log (vio) ~ shall + rpcpi + rpcui + rpcim + density + pbm1019 + pbm2029 + pwm1019 + pwm2029, summary (model2) Twoways effects Within Model Call: plm(formula = log(vio) ~ shall + rpcpi + rpcui + rpcim + density + pbm1019 + pbm2029 + pwm1019 + pwm2029, data = mysample.pd, effect = "twoways", model = "within") Balanced Panel: n = 51, T = 38, N = 1938 Residuals: Min. 1st Qu. Median 3rd Qu. Max. -0.6476782 -0.1057866 0.0049937 0.1052378 0.9846603 Coefficients: Estimate Std. Error t-value Pr(>|t|) shall -0.0316694987 0.0170833455 -1.8538 0.063924 . rpcpi -0.0000162700 0.0000056304 -2.8897 0.003902 ** rpcui -0.0013795233 0.0001478639 -9.3297 < 0.00000000000000022 *** rpcim 0.0002747859 0.0001801103 1.5257 0.127268 density -0.0003531568 0.0000709601 -4.9768 0.000000706596149376 *** pbm1019 0.0628062419 0.0407523442 1.5412 0.123448 pbm2029 0.2199863712 0.0355789390 6.1831 0.000000000772520286 *** pwm1019 -0.0857563205 0.0162239895 -5.2858 0.000000140061783484 *** pwm2029 0.1098768314 0.0139636497 7.8688 0.000000000000006058 *** --- Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 Total Sum of Squares: 76.326 Residual Sum of Squares: 66.014 4
R-Squared: 0.13511 Adj. R-Squared: 0.090005 F-statistic: 31.9536 on 9 and 1841 DF, p-value: < 0.000000000000000222 pFtest (model2, mysamplete) F test for twoways effects data: log(vio) ~ shall + rpcpi + rpcui + rpcim + density + pbm1019 + ... F = 35.542, df1 = 8, df2 = 1841, p-value < 0.00000000000000022 alternative hypothesis: significant effects Compared to the model with only shall, the coefficient of shall has decreased from 0.028 to -0.032. A one unit increase in shall is associated with a 3.17% decrease in violent crime rate, holding constant all other variables in the model and accounting for state and time fixed effects. The p-value of the F test is very small so there is a significant change when adding the other variables into the model compared to when there is only shall. pFtest (model2,mysamplepool) F test for twoways effects data: log(vio) ~ shall + rpcpi + rpcui + rpcim + density + pbm1019 + ... F = 167.83, df1 = 95, df2 = 1841, p-value < 0.00000000000000022 alternative hypothesis: significant effects Similarly, the p-value for the F test comparing model with added variables to the pooled model is very small. So again there is a significant change when adding all the other variables. 5
2. Binary Dependent Variables 2.1 [10%] Produce a Table with the percentage of each employment status in April 2009 for the 5,412 individuals that were employed in April 2008. Discuss the results. NB: there are 435 individuals that appear neither employed or unemployed in 2009 (both variables are equal to zero meaning that these workers dropped out of the labor force); filter these observations out in the analysis setwd ( "/Users/catherinechen/Desktop" ) # change this location to the folder where you saved the data ... library (readxl) mysample2 <- read_xlsx ( "/Users/catherinechen/Desktop/assignment5_part2.xlsx" ) library (dplyr) filtered_data <- mysample2 %>% filter ( ! (employed == 0 & unemployed == 0 )) employed2008 <- filtered_data %>% filter (private == 1 | government == 1 | self == 1 ) employmentpercentage <- employed2008 %>% group_by (employed) %>% summarise ( percentage = n () / nrow (employ print (employmentpercentage) # A tibble: 2 x 2 employed percentage <dbl> <dbl> 1 0 4.80 2 1 95.2 The percentage of people who were employed in 2008 and is now unemployed in 2009 is 4.8% and the percentage of people who were employed in 2008 and is still employed in 2009 is 95.20%. 2.2 [10%] Regress the Employed variable on age and the square of age using the Linear Probability Model (LPM) Plot the effect of age on the probability to be employed based on the estimated coefficients. What is the effect of increasing age by 1 year? Is there evidence of a nonlinear effect of age on the probability of employment? library (dplyr) ; library (ggplot2); library (sandwich); library (lmtest) lpm <- glm (employed ~ age + I (age ˆ 2 ), data = filtered_data, family = binomial ( link = "probit" )) coeftest (lpm) z test of coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.01560905 0.34551870 0.0452 0.964 age 0.07998691 0.01778385 4.4977 0.000006868 *** I(age^2) -0.00089708 0.00021477 -4.1770 0.000029540 *** --- Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 library (dplyr) lpm.df <- select (filtered_data, age, employed) %>% mutate ( fit = predict (lpm, type = "response" )) library (ggplot2) ggplot (lpm.df) + geom_point ( aes (age, employed)) + geom_line ( aes (age, fit, color = "tomato2" )) + geom_hli 6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
0.00 0.25 0.50 0.75 1.00 20 30 40 50 60 age employed colour tomato2 A 1 year increase in age is associated with a positive change in probability of being employed seeing as the coefficient of age is positive. Yes there is evidence of a nonlinear effect of age on the probability of employment because of the inclusion of ageˆ2 in the model which is negative. Thus, there is a downward curving relationship between age and the log-odds of employment. The z-value, -4.1770, is statistically significant and the p-value is very low. 2.3 [10%] Estimate a logit model for employed conditional on age and its square, earnwke, race1, married, female, ne_states, so_states, ce_states, we_states, educ_lths, educ_hs, educ_somecol, educ_aa, educ_bac, and educ_adv Estimate the model and discuss the worker’s characteristics that are most relevant to explain the probability of being employed in 2009 logit <- glm (employed ~ age + I (age ˆ 2 ) + earnwke + factor (race) + married + female + ne_states + so_stat summary (logit) Call: glm(formula = employed ~ age + I(age^2) + earnwke + factor(race) + married + female + ne_states + so_states + ce_states + we_states + educ_lths + educ_hs + educ_somecol + educ_aa + educ_bac + educ_adv, family = binomial(link = "logit"), data = filtered_data) Coefficients: (2 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.5582521 0.8137291 0.686 0.49269 age 0.1192797 0.0403031 2.960 0.00308 ** I(age^2) -0.0014368 0.0004857 -2.958 0.00310 ** earnwke 0.0001216 0.0001569 0.775 0.43845 factor(race)2 -0.3775007 0.2172423 -1.738 0.08226 . factor(race)3 0.0112688 0.2702360 0.042 0.96674 married 0.3087617 0.1494363 2.066 0.03881 * female 0.4326084 0.1457969 2.967 0.00301 ** ne_states 0.4371615 0.2096615 2.085 0.03706 * so_states 0.2679124 0.1838481 1.457 0.14505 ce_states 0.3895423 0.1958349 1.989 0.04669 * we_states NA NA NA NA educ_lths -1.3715746 0.3523556 -3.893 0.0000992 *** educ_hs -0.8692520 0.3105094 -2.799 0.00512 ** educ_somecol -0.4336497 0.3292246 -1.317 0.18778 educ_aa -0.3287853 0.3700409 -0.889 0.37427 educ_bac -0.4516460 0.3197963 -1.412 0.15786 educ_adv NA NA NA NA --- Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) 7
Null deviance: 1806.1 on 4406 degrees of freedom Residual deviance: 1730.2 on 4391 degrees of freedom (570 observations deleted due to missingness) AIC: 1762.2 Number of Fisher Scoring iterations: 6 The worker characteristic that is most relevant to explain the probability of being employed in 2009 is educ_lths which is if the highest level of education is less than a high schoool graduate. 2.4 [10%] Model the probability of being unemployed (instead of being employed as in the previous question); estimate the logit model with the demographic variables for the unemployed variable and discuss the characteristics of workers that significantly affected the probability of unemployment logit_unemployed <- glm (unemployed ~ age + I (age ˆ 2 ) + earnwke + factor (race) + married + female + ne_sta summary (logit_unemployed) Call: glm(formula = unemployed ~ age + I(age^2) + earnwke + factor(race) + married + female + ne_states + so_states + ce_states + we_states + educ_lths + educ_hs + educ_somecol + educ_aa + educ_bac + educ_adv, family = binomial(link = "logit"), data = filtered_data) Coefficients: (2 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) -0.5582521 0.8137291 -0.686 0.49269 age -0.1192797 0.0403031 -2.960 0.00308 ** I(age^2) 0.0014368 0.0004857 2.958 0.00310 ** earnwke -0.0001216 0.0001569 -0.775 0.43845 factor(race)2 0.3775007 0.2172423 1.738 0.08226 . factor(race)3 -0.0112688 0.2702360 -0.042 0.96674 married -0.3087617 0.1494363 -2.066 0.03881 * female -0.4326084 0.1457969 -2.967 0.00301 ** ne_states -0.4371615 0.2096615 -2.085 0.03706 * so_states -0.2679124 0.1838481 -1.457 0.14505 ce_states -0.3895423 0.1958349 -1.989 0.04669 * we_states NA NA NA NA educ_lths 1.3715746 0.3523556 3.893 0.0000992 *** educ_hs 0.8692520 0.3105094 2.799 0.00512 ** educ_somecol 0.4336497 0.3292246 1.317 0.18778 educ_aa 0.3287853 0.3700409 0.889 0.37427 educ_bac 0.4516460 0.3197963 1.412 0.15786 educ_adv NA NA NA NA --- Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1806.1 on 4406 degrees of freedom Residual deviance: 1730.2 on 4391 degrees of freedom (570 observations deleted due to missingness) AIC: 1762.2 Number of Fisher Scoring iterations: 6 8
Again, the characteristic that most significantly affected unemployment was educ_lths which is less than a high school graduate which had the lowest p-value. 2.5 [10%] What is the probability of being employed in 2009 for an individual that is 35 year-old, with weekly earnings of 865 dollars, white, married, female, living in the North-East, and with a Bachelor degree? What is the expected difference in probability for an individual with the same characteristics, except for being a single male? Discuss. modifieddata <- data.frame ( age = 35 , earnwke = 865 , race = 1 , married = 1 , female = 1 , ne_states = 1 , s prob_employed <- predict (logit, newdata = modifieddata, type = "response" ) prob_employed 1 0.978225 The probability of being employed in 2009 for an individual that is 35 year-old, with weekly earnings of 865 dollars, white, married, female, living in the North-East, and with a Bachelor degree is 97.82%. modifieddata2 <- data.frame ( age = 35 , earnwke = 865 , race = 1 , married = 0 , female = 0 , ne_states = 1 , prob_unemployed <- predict (logit, newdata = modifieddata2, type = "response" ) prob_unemployed 1 0.9553663 expected_difference <- prob_employed - prob_unemployed expected_difference 1 0.02285865 The expected difference is 0.02286 which is 2.286%. 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help