Question B (35 points)
Choose one of the methodologies we learned in class to calculate a causal effect under conditional ignorability.
What estimand are you targeting and why? Explain why you made your choice, and discuss the assumptions
that are needed to apply your method of choice to this dataset. State if and why you think these assumptions
hold in this dataset. In addition, choose a method to compute variance estimates (i.e., robust standard errors
or bootstrapping), and discuss the reasons behind your choice in the context of this dataset.
As covered in class, Propensity Score Matching (PSM) can be used to estimate causal effects under
conditional ignorability. The estimand targeted here is the Average Treatment Effect on the Treated
(ATT), which is suitable for this scenario because it focuses on the effect of the treatment (higher family
income) on those who received it (students from higher-income families). PSM is chosen because its matching
step is nonparametric and requires milder functional-form assumptions than fully model-based approaches.
It reduces the covariates to a single scalar value (the propensity score), allowing units with similar
probabilities of treatment to be matched.
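The matching idea described above can be sketched in a few lines. This is a minimal illustration, not the analysis actually run on the dataset: it assumes the propensity scores have already been estimated, uses 1-to-1 nearest-neighbor matching with replacement, and runs on hypothetical toy values rather than rows of ‘college.csv’. The function name `att_nearest_neighbor` is made up for this example.

```python
from typing import List, Sequence


def att_nearest_neighbor(
    treated: Sequence[int],
    scores: Sequence[float],
    outcomes: Sequence[float],
) -> float:
    """ATT via 1-to-1 nearest-neighbor matching (with replacement) on the propensity score."""
    treated_idx = [i for i, t in enumerate(treated) if t == 1]
    control_idx = [i for i, t in enumerate(treated) if t == 0]
    if not treated_idx or not control_idx:
        raise ValueError("need at least one treated and one control unit")

    diffs: List[float] = []
    for i in treated_idx:
        # Closest control unit in propensity-score distance.
        j = min(control_idx, key=lambda c: abs(scores[c] - scores[i]))
        # Matched-pair difference: treated outcome minus matched control outcome.
        diffs.append(outcomes[i] - outcomes[j])
    # Averaging over treated units only is what makes this an ATT estimate.
    return sum(diffs) / len(diffs)


# Hypothetical toy data (NOT from college.csv).
treated = [1, 1, 1, 0, 0, 0, 0]                      # 1 = higher-income household
scores = [0.80, 0.60, 0.40, 0.75, 0.55, 0.35, 0.10]  # assumed pre-estimated scores
outcomes = [1, 1, 0, 1, 0, 0, 0]                     # 1 = attended college

att = att_nearest_neighbor(treated, scores, outcomes)
print(f"Estimated ATT: {att:.3f}")
```

In practice the scores would come from a fitted model of treatment on the covariates (for example a logistic regression), and a caliper or matching without replacement could be substituted for the rule used here.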
To elaborate, the reasons for choosing PSM and the ATT are as follows:
1. Nature of the Treatment (Income): In the ‘college.csv’ dataset, ‘income’ is not randomly assigned;
instead, it varies across households. PSM is suitable for observational data where treatment assignment
is non-random and may depend on observed characteristics.
2. Focus on Treated Population: The primary interest might be in understanding the effect of being in a
higher income bracket on the likelihood of attending college. The ATT provides this specific insight,
which is policy-relevant and practical.
3. Heterogeneity of Treatment Effects: Given the diverse socio-economic backgrounds likely represented
in the ‘college.csv’ dataset, treatment effects (impact of income on college attendance) might vary
across different segments of the population. PSM helps to compare like-with-like, focusing on those
who are treated.
4. Data Richness: Assuming that ‘college.csv’ includes comprehensive data on relevant covariates (like
parental education, urban/rural living, state wages, etc.), PSM can effectively control for these
confounders.
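For reference, the estimand and the propensity score discussed in the reasons above can be written formally. The notation is an assumption introduced here, not taken from the assignment: $T_i$ is a binary indicator for higher family income, $Y_i(1)$ and $Y_i(0)$ are the potential college-attendance outcomes, and $X_i$ is the vector of observed covariates.

```latex
\begin{align*}
\tau_{\text{ATT}} &= \mathbb{E}\left[\, Y_i(1) - Y_i(0) \mid T_i = 1 \,\right], \\
e(X_i) &= \Pr\left( T_i = 1 \mid X_i \right).
\end{align*}
```

The first line averages the individual effects over treated units only, and the second is the scalar on which units are matched.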
The following assumptions must be taken into account:
1. Conditional Independence Assumption / Ignorability: This assumption states that once you control
for the covariates included in the propensity score, the assignment to treatment (family income level)
is independent of potential outcomes (college enrollment). Given the dataset includes comprehensive
variables like test scores, father’s education, and urban living, it’s plausible that these variables capture
the relevant confounders, making this assumption reasonable, although it cannot be tested directly
because it concerns unobserved confounders.
2. Overlap (Common Support) Condition: There must be an overlap in the characteristics (covariates) of
the treated and control groups; for every individual in the treatment group, there should be similar
individuals in the control group. Considering the broad range of income levels and other covariates in a
large dataset of high school seniors, this condition is likely to be met, and it can be checked by comparing
the propensity score distributions of the two groups.
3. Stable Unit Treatment Value Assumption (SUTVA): This assumes that the potential outcomes for any
individual are unaffected by the treatment status of other individuals, and that there’s only one version
of the treatment. This assumption might be reasonable in the context of the dataset, as the treatment
(family income level) is individual-specific, and its effect on college attendance is likely independent of
other individuals’ income levels.
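Of the three assumptions above, only overlap can be inspected directly in the data. The sketch below shows one simple diagnostic, assuming propensity scores are already available: it computes the min-max common-support interval and the share of treated units falling outside it. The function name `common_support` and the toy values are hypothetical, and the min-max rule is only one of several heuristics in use.

```python
from typing import Sequence, Tuple


def common_support(
    treated: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (low, high, share of treated units outside [low, high])."""
    t_scores = [s for t, s in zip(treated, scores) if t == 1]
    c_scores = [s for t, s in zip(treated, scores) if t == 0]
    if not t_scores or not c_scores:
        raise ValueError("need at least one treated and one control unit")

    # Common support: the score range covered by BOTH groups.
    low = max(min(t_scores), min(c_scores))
    high = min(max(t_scores), max(c_scores))
    # Treated units outside this range have no comparable controls.
    outside = sum(1 for s in t_scores if s < low or s > high)
    return low, high, outside / len(t_scores)


# Hypothetical toy data (NOT from college.csv).
treated = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.70, 0.50, 0.30, 0.80, 0.60, 0.40, 0.05]

low, high, share_outside = common_support(treated, scores)
print(f"Common support: [{low}, {high}]; treated outside: {share_outside:.0%}")
```

A large share of treated units outside the interval would suggest that the ATT can only be estimated credibly for a subset of the treated group.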
For variance estimation in this dataset, bootstrapping is a suitable choice. This method involves repeatedly
resampling the data with replacement and recalculating the estimator (in this case, the ATT from Propensity