Assignment 3
pdf
keyboard_arrow_up
School
New York University *
*We aren’t endorsed by this school
Course
MISC
Subject
Economics
Date
Feb 20, 2024
Type
Pages
9
Uploaded by AgentGoldfishMaster833
ECON-UB 251
Assignment 3, Fall 2023
Cathy Chen
knitr
::
opts_chunk
$
set
(
echo
=
TRUE
,
eval
=
TRUE
,
out.width =
"50%"
,
fig.asp
=
0.5
,
fig.align =
"center"
,
comment
=
""
,
message
=
FALSE
,
warning
=
FALSE
)
options
(
scipen =
100
)
1.
Panel Data
1.1 [
12.5%
] Save the data file that is available in
Brightspace
, load it in your rmarkdown document, and
answer the following questions:
•
Make a scatter plot of
density
and
vio
with the color of the dots varying by state (NB: in
R
use
ggplot2
). Discuss the results and identify the state that stands out relative to the rest (show the code
that you used to identify the state).
•
Do the same scatter plot for
log(density)
vs
log(vio)
and discuss whether there could be any
advantage from transforming the variables
setwd
(
"/Users/catherinechen/Desktop"
)
# change this location to the folder where you saved the data ...
mysample
<-
read_delim
(
"/Users/catherinechen/Desktop/assignment5_part1.csv"
)
library
(ggplot2)
ggplot
(mysample)
+
geom_point
(
aes
(
x =
density,
y =
vio,
color =
state))
0
1000
2000
3000
0
3000
6000
9000
density
vio
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
density_zscore
<-
scale
(mysample
$
density)
vio_zscore
<-
scale
(mysample
$
vio)
outliers
<-
mysample[
abs
(density_zscore)
>
3
|
abs
(vio_zscore)
>
3
,]
print
(outliers)
1
# A tibble: 38 x 18
state
year stateid
vio
mur
rap
aga
bur
auto shall density
rpcpi
<chr>
<dbl>
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
<dbl>
<dbl>
1 Distr~
1977
11 1427.
27.8
58.3
376. 1680.
397.
0
11102. 15906.
2 Distr~
1978
11 1412.
28.0
66.3
378. 1854.
474.
0
10902. 15900.
3 Distr~
1979
11 1609.
27.4
74.5
452. 2051.
550.
0
10656
15460.
4 Distr~
1980
11 2011.
31.5
69.1
509. 2560.
562.
0
10132. 14828.
5 Distr~
1981
11 2275.
35.1
65.1
540. 2647.
592.
0
10109. 14721.
6 Distr~
1982
11 2123.
30.7
66.7
578. 2341.
648.
0
10066. 15098.
7 Distr~
1983
11 1915.
29.4
65.2
585. 2004.
635.
0
10039. 15441.
8 Distr~
1984
11 1722.
28.1
58.7
658. 1758.
702.
0
10054. 16348.
9 Distr~
1985
11 1625.
23.5
53.8
712. 1598.
803.
0
10072. 16776.
10 Distr~
1986
11 1505.
31.0
52.4
668. 1728.
975.
0
10131. 17274.
# i 28 more rows
# i 6 more variables: rpcui <dbl>, rpcim <dbl>, pbm1019 <dbl>, pbm2029 <dbl>,
#
pwm1019 <dbl>, pwm2029 <dbl>
ggplot
(mysample)
+
geom_point
(
aes
(
x =
log
(density),
y =
log
(vio),
color =
state))
4
5
6
7
8
0.0
2.5
5.0
7.5
log(density)
log(vio)
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
The state that stands out the most is the District of Columbia which I found to be an outlier through using
the z score to determine results that are more than 3 standard deviations from the mean which turns out
to be the District of Columbia.
The rest of the datapoints in the scatterplot are clustered around 0 to
1000 population per square mile of land area divided by 1000 (density) and 0 to 1500 incidents per 100,000
members of the population (violent crime rate). District of Columbia on the other hand has high density and
high crime rates which makes it an outlier.
There looks to be an advantage by transforming the variables into log form since the distance between the
large cluster of datapoints and of District of Columbia’s is less far away. By logging density and vio, the
extreme values are pulled in or compressed because the outliers appear less extreme now.
1.2 [12.5%] Define the dataset as a panel with indexes
state
and
year
. Denote by
vio
it
the violent crime
rate in state
i
in year
t
and similarly for
shall
it
. Estimate the
pooled regression model
log(
vio
it
) =
α
+
β
1
shall
it
+
u
it
•
Interpret the coefficient estimate of
shall
and discuss its significance. Which of the views about the
effectiveness of the shall-carry laws discussed earlier does the estimate seem to support?
library
(plm)
mysample.pd
<-
pdata.frame
(mysample,
index=
c
(
"state"
,
"year"
),
drop.index=
TRUE
,
row.names=
TRUE
)
mysamplepool
<-
plm
(
log
(vio)
~
shall,
data =
mysample.pd,
model =
"pooling"
)
summary
(mysamplepool)
Pooling Model
Call:
2
plm(formula = log(vio) ~ shall, data = mysample.pd, model = "pooling")
Balanced Panel: n = 51, T = 38, N = 1938
Residuals:
Min.
1st Qu.
Median
3rd Qu.
Max.
-2.139114 -0.383608
0.038437
0.406258
1.858520
Coefficients:
Estimate Std. Error t-value
Pr(>|t|)
(Intercept)
6.121434
0.017676 346.315 < 0.00000000000000022 ***
shall
-0.314943
0.026351 -11.952 < 0.00000000000000022 ***
---
Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
Total Sum of Squares:
684.78
Residual Sum of Squares: 637.73
R-Squared:
0.068716
Adj. R-Squared: 0.068235
F-statistic: 142.85 on 1 and 1936 DF, p-value: < 0.000000000000000222
A one-unit increase in shall (more fraction of the year a state has a shall-issue law in place) is associated with
a 31.49% decrease in the violent crime rate, all else being equal. It is significant because the p-value is very
very small. This interpretation supports the view that less guns leads to less crime in that the more strict a
state is about issuing guns, the less the violent crime rate is. The view is “opponents believe that it might
cause more crime given their wide availability”.
1.3 [12.5%] Estimate the panel data model with
state and time fixed effects
, that is,
log(
vio
it
) =
α
i
+
λ
t
+
β
1
shall
it
+
u
it
•
Discuss the estimate of
β
1
and its significance
•
Did the coefficient estimate of
shall
change significantly relatively to the estimate for the pooled
model? why or why not?
mysamplete
<-
plm
(
log
(vio)
~
shall,
data =
mysample.pd,
model =
"within"
,
effect =
"twoways"
)
summary
(mysamplete)
Twoways effects Within Model
Call:
plm(formula = log(vio) ~ shall, data = mysample.pd, effect = "twoways",
model = "within")
Balanced Panel: n = 51, T = 38, N = 1938
Residuals:
Min.
1st Qu.
Median
3rd Qu.
Max.
-0.7235560 -0.1094477
0.0022864
0.1071006
1.2091013
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
shall 0.028157
0.016754
1.6806
0.093 .
---
Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Total Sum of Squares:
76.326
Residual Sum of Squares: 76.21
R-Squared:
0.0015253
Adj. R-Squared: -0.045995
F-statistic: 2.82454 on 1 and 1849 DF, p-value: 0.093002
A one unit increase in shall is associated with a 2.82% increase in violent crime rate, all else being equal,
after accounting for both time and state fixed effects.
pFtest
(mysamplete, mysamplepool)
F test for twoways effects
data:
log(vio) ~ shall
F = 156.59, df1 = 87, df2 = 1849, p-value < 0.00000000000000022
alternative hypothesis: significant effects
Through a F test for twoway effects, the p-value is really small so there was a significant change when added
state and time fixed effects so they do matter.
1.4 [12.5%] Add the
rpcpi + rpcui + rpcim + density + pbm1019 + pbm2029 + pwm1019 + pwm2029
regressors to the state/time fixed effect models.
+ Discuss the change in the estimate of
β
1
and its significance relative to the model with only
shall
model2
<-
plm
(
log
(vio)
~
shall
+
rpcpi
+
rpcui
+
rpcim
+
density
+
pbm1019
+
pbm2029
+
pwm1019
+
pwm2029,
summary
(model2)
Twoways effects Within Model
Call:
plm(formula = log(vio) ~ shall + rpcpi + rpcui + rpcim + density +
pbm1019 + pbm2029 + pwm1019 + pwm2029, data = mysample.pd,
effect = "twoways", model = "within")
Balanced Panel: n = 51, T = 38, N = 1938
Residuals:
Min.
1st Qu.
Median
3rd Qu.
Max.
-0.6476782 -0.1057866
0.0049937
0.1052378
0.9846603
Coefficients:
Estimate
Std. Error t-value
Pr(>|t|)
shall
-0.0316694987
0.0170833455 -1.8538
0.063924 .
rpcpi
-0.0000162700
0.0000056304 -2.8897
0.003902 **
rpcui
-0.0013795233
0.0001478639 -9.3297 < 0.00000000000000022 ***
rpcim
0.0002747859
0.0001801103
1.5257
0.127268
density -0.0003531568
0.0000709601 -4.9768
0.000000706596149376 ***
pbm1019
0.0628062419
0.0407523442
1.5412
0.123448
pbm2029
0.2199863712
0.0355789390
6.1831
0.000000000772520286 ***
pwm1019 -0.0857563205
0.0162239895 -5.2858
0.000000140061783484 ***
pwm2029
0.1098768314
0.0139636497
7.8688
0.000000000000006058 ***
---
Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
Total Sum of Squares:
76.326
Residual Sum of Squares: 66.014
4
R-Squared:
0.13511
Adj. R-Squared: 0.090005
F-statistic: 31.9536 on 9 and 1841 DF, p-value: < 0.000000000000000222
pFtest
(model2, mysamplete)
F test for twoways effects
data:
log(vio) ~ shall + rpcpi + rpcui + rpcim + density + pbm1019 +
...
F = 35.542, df1 = 8, df2 = 1841, p-value < 0.00000000000000022
alternative hypothesis: significant effects
Compared to the model with only shall, the coefficient of shall has decreased from 0.028 to -0.032. A one unit
increase in shall is associated with a 3.17% decrease in violent crime rate, holding constant all other variables
in the model and accounting for state and time fixed effects. The p-value of the F test is very small so there
is a significant change when adding the other variables into the model compared to when there is only shall.
pFtest
(model2,mysamplepool)
F test for twoways effects
data:
log(vio) ~ shall + rpcpi + rpcui + rpcim + density + pbm1019 +
...
F = 167.83, df1 = 95, df2 = 1841, p-value < 0.00000000000000022
alternative hypothesis: significant effects
Similarly, the p-value for the F test comparing model with added variables to the pooled model is very small.
So again there is a significant change when adding all the other variables.
5
2.
Binary Dependent Variables
2.1 [10%] Produce a Table with the percentage of each employment status in April 2009 for the 5,412
individuals that were employed in April 2008. Discuss the results.
•
NB: there are 435 individuals that appear neither employed or unemployed in 2009 (both variables are
equal to zero meaning that these workers dropped out of the labor force); filter these observations out
in the analysis
setwd
(
"/Users/catherinechen/Desktop"
)
# change this location to the folder where you saved the data ...
library
(readxl)
mysample2
<-
read_xlsx
(
"/Users/catherinechen/Desktop/assignment5_part2.xlsx"
)
library
(dplyr)
filtered_data
<-
mysample2
%>%
filter
(
!
(employed
==
0
&
unemployed
==
0
))
employed2008
<-
filtered_data
%>%
filter
(private
==
1
|
government
==
1
|
self
==
1
)
employmentpercentage
<-
employed2008
%>%
group_by
(employed)
%>%
summarise
(
percentage =
n
()
/
nrow
(employ
print
(employmentpercentage)
# A tibble: 2 x 2
employed percentage
<dbl>
<dbl>
1
0
4.80
2
1
95.2
The percentage of people who were employed in 2008 and is now unemployed in 2009 is 4.8% and the
percentage of people who were employed in 2008 and is still employed in 2009 is 95.20%.
2.2 [10%] Regress the
Employed
variable on
age
and the square of
age
using the Linear Probability Model
(LPM)
•
Plot the effect of
age
on the probability to be employed based on the estimated coefficients. What is
the effect of increasing
age
by 1 year? Is there evidence of a nonlinear effect of age on the probability
of employment?
library
(dplyr) ;
library
(ggplot2);
library
(sandwich);
library
(lmtest)
lpm
<-
glm
(employed
~
age
+
I
(age
ˆ
2
),
data =
filtered_data,
family =
binomial
(
link =
"probit"
))
coeftest
(lpm)
z test of coefficients:
Estimate
Std. Error z value
Pr(>|z|)
(Intercept)
0.01560905
0.34551870
0.0452
0.964
age
0.07998691
0.01778385
4.4977 0.000006868 ***
I(age^2)
-0.00089708
0.00021477 -4.1770 0.000029540 ***
---
Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
library
(dplyr)
lpm.df
<-
select
(filtered_data, age, employed)
%>%
mutate
(
fit =
predict
(lpm,
type =
"response"
))
library
(ggplot2)
ggplot
(lpm.df)
+
geom_point
(
aes
(age, employed))
+
geom_line
(
aes
(age, fit,
color =
"tomato2"
))
+
geom_hli
6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
0.00
0.25
0.50
0.75
1.00
20
30
40
50
60
age
employed
colour
tomato2
A 1 year increase in age is associated with a positive change in probability of being employed seeing as
the coefficient of age is positive. Yes there is evidence of a nonlinear effect of age on the probability of
employment because of the inclusion of ageˆ2 in the model which is negative. Thus, there is a downward
curving relationship between age and the log-odds of employment.
The z-value, -4.1770, is statistically
significant and the p-value is very low.
2.3 [10%] Estimate a logit model for employed conditional on age and its square, earnwke, race1, married,
female, ne_states, so_states, ce_states, we_states, educ_lths, educ_hs, educ_somecol, educ_aa, educ_bac,
and educ_adv
•
Estimate the model and discuss the worker’s characteristics that are most relevant to explain
the probability of being employed in 2009
logit
<-
glm
(employed
~
age
+
I
(age
ˆ
2
)
+
earnwke
+
factor
(race)
+
married
+
female
+
ne_states
+
so_stat
summary
(logit)
Call:
glm(formula = employed ~ age + I(age^2) + earnwke + factor(race) +
married + female + ne_states + so_states + ce_states + we_states +
educ_lths + educ_hs + educ_somecol + educ_aa + educ_bac +
educ_adv, family = binomial(link = "logit"), data = filtered_data)
Coefficients: (2 not defined because of singularities)
Estimate Std. Error z value
Pr(>|z|)
(Intercept)
0.5582521
0.8137291
0.686
0.49269
age
0.1192797
0.0403031
2.960
0.00308 **
I(age^2)
-0.0014368
0.0004857
-2.958
0.00310 **
earnwke
0.0001216
0.0001569
0.775
0.43845
factor(race)2 -0.3775007
0.2172423
-1.738
0.08226 .
factor(race)3
0.0112688
0.2702360
0.042
0.96674
married
0.3087617
0.1494363
2.066
0.03881 *
female
0.4326084
0.1457969
2.967
0.00301 **
ne_states
0.4371615
0.2096615
2.085
0.03706 *
so_states
0.2679124
0.1838481
1.457
0.14505
ce_states
0.3895423
0.1958349
1.989
0.04669 *
we_states
NA
NA
NA
NA
educ_lths
-1.3715746
0.3523556
-3.893 0.0000992 ***
educ_hs
-0.8692520
0.3105094
-2.799
0.00512 **
educ_somecol
-0.4336497
0.3292246
-1.317
0.18778
educ_aa
-0.3287853
0.3700409
-0.889
0.37427
educ_bac
-0.4516460
0.3197963
-1.412
0.15786
educ_adv
NA
NA
NA
NA
---
Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
(Dispersion parameter for binomial family taken to be 1)
7
Null deviance: 1806.1
on 4406
degrees of freedom
Residual deviance: 1730.2
on 4391
degrees of freedom
(570 observations deleted due to missingness)
AIC: 1762.2
Number of Fisher Scoring iterations: 6
The worker characteristic that is most relevant to explain the probability of being employed in 2009 is
educ_lths which is if the highest level of education is less than a high schoool graduate.
2.4 [10%] Model the probability of being unemployed (instead of being employed as in the previous question);
estimate the logit model with the demographic variables for the
unemployed
variable and discuss the
characteristics of workers that significantly affected the probability of unemployment
logit_unemployed
<-
glm
(unemployed
~
age
+
I
(age
ˆ
2
)
+
earnwke
+
factor
(race)
+
married
+
female
+
ne_sta
summary
(logit_unemployed)
Call:
glm(formula = unemployed ~ age + I(age^2) + earnwke + factor(race) +
married + female + ne_states + so_states + ce_states + we_states +
educ_lths + educ_hs + educ_somecol + educ_aa + educ_bac +
educ_adv, family = binomial(link = "logit"), data = filtered_data)
Coefficients: (2 not defined because of singularities)
Estimate Std. Error z value
Pr(>|z|)
(Intercept)
-0.5582521
0.8137291
-0.686
0.49269
age
-0.1192797
0.0403031
-2.960
0.00308 **
I(age^2)
0.0014368
0.0004857
2.958
0.00310 **
earnwke
-0.0001216
0.0001569
-0.775
0.43845
factor(race)2
0.3775007
0.2172423
1.738
0.08226 .
factor(race)3 -0.0112688
0.2702360
-0.042
0.96674
married
-0.3087617
0.1494363
-2.066
0.03881 *
female
-0.4326084
0.1457969
-2.967
0.00301 **
ne_states
-0.4371615
0.2096615
-2.085
0.03706 *
so_states
-0.2679124
0.1838481
-1.457
0.14505
ce_states
-0.3895423
0.1958349
-1.989
0.04669 *
we_states
NA
NA
NA
NA
educ_lths
1.3715746
0.3523556
3.893 0.0000992 ***
educ_hs
0.8692520
0.3105094
2.799
0.00512 **
educ_somecol
0.4336497
0.3292246
1.317
0.18778
educ_aa
0.3287853
0.3700409
0.889
0.37427
educ_bac
0.4516460
0.3197963
1.412
0.15786
educ_adv
NA
NA
NA
NA
---
Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1806.1
on 4406
degrees of freedom
Residual deviance: 1730.2
on 4391
degrees of freedom
(570 observations deleted due to missingness)
AIC: 1762.2
Number of Fisher Scoring iterations: 6
8
Again, the characteristic that most significantly affected unemployment was educ_lths which is less than a
high school graduate which had the lowest p-value.
2.5 [10%] What is the probability of being employed in 2009 for an individual that is 35 year-old, with weekly
earnings of 865 dollars, white, married, female, living in the North-East, and with a Bachelor degree? What
is the expected difference in probability for an individual with the same characteristics, except for being a
single male? Discuss.
modifieddata
<-
data.frame
(
age =
35
,
earnwke =
865
,
race =
1
,
married =
1
,
female =
1
,
ne_states =
1
,
s
prob_employed
<-
predict
(logit,
newdata =
modifieddata,
type =
"response"
)
prob_employed
1
0.978225
The probability of being employed in 2009 for an individual that is 35 year-old, with weekly earnings of 865
dollars, white, married, female, living in the North-East, and with a Bachelor degree is 97.82%.
modifieddata2
<-
data.frame
(
age =
35
,
earnwke =
865
,
race =
1
,
married =
0
,
female =
0
,
ne_states =
1
,
prob_unemployed
<-
predict
(logit,
newdata =
modifieddata2,
type =
"response"
)
prob_unemployed
1
0.9553663
expected_difference
<-
prob_employed
-
prob_unemployed
expected_difference
1
0.02285865
The expected difference is 0.02286 which is 2.286%.
9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Related Documents
Recommended textbooks for you

Managerial Economics: A Problem Solving Approach
Economics
ISBN:9781337106665
Author:Luke M. Froeb, Brian T. McCann, Michael R. Ward, Mike Shor
Publisher:Cengage Learning

Managerial Economics: Applications, Strategies an...
Economics
ISBN:9781305506381
Author:James R. McGuigan, R. Charles Moyer, Frederick H.deB. Harris
Publisher:Cengage Learning



Recommended textbooks for you
- Managerial Economics: A Problem Solving ApproachEconomicsISBN:9781337106665Author:Luke M. Froeb, Brian T. McCann, Michael R. Ward, Mike ShorPublisher:Cengage LearningManagerial Economics: Applications, Strategies an...EconomicsISBN:9781305506381Author:James R. McGuigan, R. Charles Moyer, Frederick H.deB. HarrisPublisher:Cengage Learning

Managerial Economics: A Problem Solving Approach
Economics
ISBN:9781337106665
Author:Luke M. Froeb, Brian T. McCann, Michael R. Ward, Mike Shor
Publisher:Cengage Learning

Managerial Economics: Applications, Strategies an...
Economics
ISBN:9781305506381
Author:James R. McGuigan, R. Charles Moyer, Frederick H.deB. Harris
Publisher:Cengage Learning


