DS-UA 201: Final Exam
Devyani Rastogi - dr3158 - 001
Due December 20, 2023 at 5pm

Instructions

You should submit your write-up (as a knitted .pdf along with the accompanying .rmd file) to the course website before 5pm EST on Wednesday, Dec 20th. Please upload your solutions as a .pdf file saved as Yourlastname_Yourfirstname_final.pdf. In addition, an electronic copy of your .Rmd file (saved as Yourlastname_Yourfirstname_final.Rmd) should accompany this submission.

Late finals will not be accepted, so start early and plan to finish early. Remember that exams often take longer to finish than you might expect.

This exam has 3 parts and is worth a total of 100 points. Show your work in order to receive partial credit. Also, we will penalize uncompiled .rmd files and missing pdf or rmd files by 5 points.

In general, you will receive points (partial credit is possible) when you demonstrate knowledge about the questions we have asked, you will not receive points when you demonstrate knowledge about questions we have not asked, and you will lose points when you make inaccurate statements (whether or not they relate to the question asked). Be careful, however, that you provide an answer to all parts of each question.

You may use your notes, books, and internet resources to answer the questions below. However, you are to work on the exam by yourself. You are prohibited from corresponding with any human being regarding the exam (unless following the procedures below).

The TAs and I will answer clarifying questions during the exam. We will not answer statistical or computational questions until after the exam is over. If you have a question, send email to all of us. If your question is a clarifying one, we will reply. Do not attempt to ask questions related to the exam on the discussion board.
Problem 1 (100 points)

In this problem, you will examine whether family income affects an individual’s likelihood to enroll in college by analyzing a survey of approximately 4739 high school seniors that was conducted in 1980 with a follow-up survey taken in 1986.

This dataset is based on a dataset from Rouse, Cecilia Elena. “Democratization or diversion? The effect of community colleges on educational attainment.” Journal of Business & Economic Statistics 13, no. 2 (1995): 217-224.

The dataset is college.csv and it contains the following variables:

college: Indicator for whether an individual attended college. (Outcome)
income: Is the family income above USD 25,000 per year? (Treatment)
distance: Distance from 4-year college (in 10s of miles).
score: These are achievement tests given to high school seniors in the sample in 1980.
fcollege: Is the father a college graduate?
tuition: Average state 4-year college tuition (in 1000 USD).
wage: State hourly wage in manufacturing in 1980.
urban: Does the family live in an urban area?

Question A (35 points)

Draw a DAG of the variables included in the dataset, and explain why you think arrows between variables are present or absent. You can use any tool you want to create an image of your DAG, but make sure you embed it on your compiled .pdf file. Assuming that there are no unobserved confounders, what variables should you condition on in order to estimate the effect of the treatment on the outcome, according to the DAG you drew? Explain your decision in detail. In your explanation, provide a definition of confounding.

    library(dagitty)

    # Define the DAG
    dag <- dagitty("dag {
      income -> college
      fcollege -> income
      fcollege -> college
      income -> score
      score -> college
      distance -> income
      distance -> college
      urban -> distance
      urban -> income
      urban -> college
      wage -> income
      tuition -> college
    }", layout = TRUE)

    # Plot the DAG
    plot(dag)
[Figure: plot of the DAG with nodes college, distance, fcollege, income, score, tuition, urban, wage]

1. Income → College: This arrow suggests that family income directly affects the likelihood of college enrollment, which is the primary causal effect we are interested in.
2. Fcollege → Income & College: Having a father who graduated from college (fcollege) likely influences both family income and the child’s likelihood of attending college, due to factors like higher earning potential and valuing education.
3. Income → Score: This suggests that higher family income can lead to better academic scores, possibly due to better educational resources.
4. Score → College: High school achievement scores directly affect college admission chances, because colleges apply score cut-offs.
5. Distance → Income & College: Geographic distance from college might affect family income (e.g., rural vs. urban income disparities) and the practicality of attending college.
6. Urban → Distance, Income & College: Urban living affects the distance to college, family income levels, and access to college education.
7. Wage → Income: State manufacturing wages are a component of, or an influence on, family income.
8. Tuition → College: The cost of tuition directly affects the feasibility of college enrollment.

Confounding occurs when a variable is a common cause of both the treatment and the outcome, so that the observed association between them mixes the causal effect with a non-causal association transmitted through that variable.

By the backdoor criterion, all the backdoor paths from ‘income’ (treatment) to ‘college’ (outcome) must be blocked in order for the effect of family income on college enrollment to be identifiable. A path is blocked either when it contains a non-collider that we condition on, or when it contains a collider such that neither it nor any of its descendants is conditioned on. In our case, the variables ‘fcollege’, ‘distance’, and ‘urban’ are confounders: each is a non-collider on a backdoor path and influences both the treatment and the outcome. Therefore, we must condition on these three variables to estimate the causal effect. We do not condition on ‘score’, because it is a mediator on the causal path income → score → college, and conditioning on it would remove part of the effect we want to estimate. ‘wage’ affects only the treatment and ‘tuition’ affects only the outcome in this DAG, so neither lies on a backdoor path and neither needs to be adjusted for.
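To make the definition of confounding concrete, here is a minimal base R simulation sketch. It does not use college.csv: the variables `z`, `treat`, and `y`, the sample size, and all probabilities are hypothetical values chosen for illustration, with a single binary confounder standing in for something like ‘fcollege’. The true effect of the treatment is set to 0.10, and the naive comparison is expected to land near 0.22 because of the open backdoor path.

```r
# Minimal simulation of confounding with one binary confounder.
# All names and numbers are hypothetical, not taken from college.csv.
set.seed(2023)
n <- 20000

z     <- rbinom(n, 1, 0.5)                          # confounder (think: fcollege)
treat <- rbinom(n, 1, ifelse(z == 1, 0.7, 0.3))     # z raises the chance of treatment
y     <- rbinom(n, 1, 0.2 + 0.1 * treat + 0.3 * z)  # true effect of treat is 0.10

# Naive comparison: mixes the causal effect with the backdoor path treat <- z -> y
naive <- mean(y[treat == 1]) - mean(y[treat == 0])

# Backdoor adjustment: difference within each stratum of z, averaged over P(z)
stratum_diff <- sapply(0:1, function(k) {
  mean(y[treat == 1 & z == k]) - mean(y[treat == 0 & z == k])
})
stratum_weight <- c(mean(z == 0), mean(z == 1))
adjusted <- sum(stratum_diff * stratum_weight)

print(c(naive = naive, adjusted = adjusted))
```

Conditioning on `z` blocks the backdoor path, so the adjusted estimate should sit close to the true effect while the naive one overstates it.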
Question B (35 points)

Choose one of the methodologies we learned in class to calculate a causal effect under conditional ignorability. What estimand are you targeting and why? Explain why you made your choice, and discuss the assumptions that are needed to apply your method of choice to this dataset. State if and why you think these assumptions hold in this dataset. In addition, choose a method to compute variance estimates (i.e., robust standard errors or bootstrapping), and discuss the reasons behind your choice in the context of this dataset.

As learned in class, Propensity Score Matching (PSM) can be used to estimate causal effects under conditional ignorability. The estimand targeted here is the Average Treatment Effect on the Treated (ATT), which is suitable for this scenario because it focuses on the effect of the treatment (higher family income) on those who received it (individuals from families with income above USD 25,000 per year). PSM is chosen because the matching step does not impose a functional form on the outcome model, so it requires milder assumptions than fully model-based approaches. It reduces the covariates to a single scalar value (the propensity score), so that units with similar probabilities of treatment can be matched.

To elaborate, the reasons for choosing PSM and the ATT are as follows:

1. Nature of the Treatment (Income): In the ‘college.csv’ dataset, ‘income’ is not randomly assigned; instead, it varies across households. PSM is suitable for observational data where treatment assignment could be biased and dependent on observed characteristics.
2. Focus on Treated Population: The primary interest is in the effect of being in the higher income bracket on the likelihood of attending college, among those who are in that bracket. The ATT provides this specific insight, which is policy-relevant and practical.
3. Heterogeneity of Treatment Effects: Given the diverse socio-economic backgrounds likely represented in the ‘college.csv’ dataset, treatment effects (the impact of income on college attendance) might vary across different segments of the population. PSM helps to compare like with like, focusing on those who are treated.
4. Data Richness: Assuming that ‘college.csv’ includes comprehensive data on relevant covariates (like parental education, urban/rural living, state wages, etc.), PSM can effectively control for these confounders.

The following assumptions must be taken into account:

1. Conditional Independence Assumption / Ignorability: This assumption states that once we control for the covariates included in the propensity score, the assignment to treatment (family income level) is independent of the potential outcomes (college enrollment). Given that the dataset includes variables like father’s education, urban living, and distance to college, it is plausible that these variables capture the relevant confounders, making this assumption reasonable.
2. Overlap (Balancing Condition): There must be overlap in the characteristics (covariates) of the treated and control groups; for every individual in the treatment group, there should be similar individuals in the control group. Considering the broad range of covariate values in a large dataset of high school seniors, this condition is likely to be met.
3. Stable Unit Treatment Value Assumption (SUTVA): This assumes that the potential outcomes for any individual are unaffected by the treatment status of other individuals, and that there is only one version of the treatment. This assumption might be reasonable in the context of the dataset, as the treatment (family income level) is individual-specific, and its effect on college attendance is likely independent of other individuals’ income levels.

For variance estimation in this dataset, bootstrapping is a suitable choice.
This method involves repeatedly resampling the data with replacement and recalculating the estimator (in this case, the ATT from Propensity Score Matching). It is particularly advantageous for complex estimators or when the sample size is not large enough to rely on the asymptotic properties of estimators. Bootstrapping is beneficial in the context of our dataset because:

1. Flexibility: It does not make strict assumptions about the distribution of the data or the sampling process.
2. Complex Models: PSM involves several steps, which makes an analytic standard error hard to derive; bootstrapping handles this by recomputing the estimate on each resample.
3. Robustness: Its standard errors do not rely on the error assumptions of a particular outcome model (such as homoskedasticity) being correct.

Given the observational nature of the dataset and the complexity of estimating causal effects, bootstrapping provides a more reliable way to estimate the variability of the ATT.

Question C (30 points)

Using the methodology you chose in Question B to control for the confounders you have selected in Question A, as well as the relevant R packages, provide your estimate of the causal effect of the treatment on the outcome. Using your variance estimator of choice, report standard errors and 95% confidence intervals around your estimates. Interpret your results and discuss both their statistical significance and their substantive implications. Be as specific and detailed as possible.
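The estimand and the identifying assumptions from Question B, on which the estimate in this question relies, can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the original write-up: $D_i$ is the treatment (income), $Y_i(1)$ and $Y_i(0)$ are the potential outcomes for college enrollment, and $X_i$ collects the adjustment set (fcollege, urban, distance).

```latex
\begin{gather*}
\tau_{\mathrm{ATT}} = E\left[ Y_i(1) - Y_i(0) \mid D_i = 1 \right]
  \quad \text{(estimand)} \\
\{ Y_i(1),\, Y_i(0) \} \perp\!\!\!\perp D_i \mid X_i
  \quad \text{(conditional ignorability)} \\
0 < \Pr(D_i = 1 \mid X_i = x) < 1 \ \text{ for all } x
  \quad \text{(overlap)} \\
e(X_i) = \Pr(D_i = 1 \mid X_i)
  \quad \text{(propensity score)}
\end{gather*}
```

For the ATT specifically, weaker versions of these conditions are enough: only $Y_i(0)$ needs to be independent of $D_i$ given $X_i$, and only $\Pr(D_i = 1 \mid X_i = x) < 1$ is required, since control matches are needed for treated units but not the reverse.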
    library(MatchIt)
    library(estimatr)

    # Load your data
    data <- read.csv("college.csv")  # Replace with your actual file path

    # Rename the existing 'distance' variable so it does not clash with the
    # 'distance' (propensity score) column that match.data() adds
    names(data)[names(data) == "distance"] <- "distance_college"

    # Perform Propensity Score Matching
    matched_data <- matchit(income ~ fcollege + urban + distance_college,
                            data = data, method = "nearest", estimand = "ATT")
    matched_data

    ## A matchit object
    ##  - method: 1:1 nearest neighbor matching without replacement
    ##  - distance: Propensity score
    ##              - estimated with logistic regression
    ##  - number of obs.: 4739 (original), 2730 (matched)
    ##  - target estimand: ATT
    ##  - covariates: fcollege, urban, distance_college

    # Extract matched data
    matched_set <- match.data(matched_data)

    # Bootstrapping
    set.seed(123)
    nBoot <- 2000                    # Number of bootstrap iterations
    boot_results <- rep(NA, nBoot)   # Vector to store the bootstrap results

    for (iter in 1:nBoot) {
      # Resample the matched data with replacement (matching is not repeated)
      bootstrap_sample <- matched_set[sample(nrow(matched_set), replace = TRUE), ]

      # Fit a linear probability model (plain lm) in the treatment group
      treatment_model_boot <- lm(college ~ fcollege + urban + distance_college,
                                 data = subset(bootstrap_sample, income == 1))

      # Fit a linear probability model (plain lm) in the control group
      control_model_boot <- lm(college ~ fcollege + urban + distance_college,
                               data = subset(bootstrap_sample, income == 0))

      # Predict the potential outcome under treatment for all units
      bootstrap_sample$predicted_treatment <-
        predict(treatment_model_boot, newdata = bootstrap_sample)

      # Predict the potential outcome under control for all units
      bootstrap_sample$predicted_control <-
        predict(control_model_boot, newdata = bootstrap_sample)

      # ATT for this iteration: mean predicted difference among treated units
      boot_results[iter] <-
        mean(bootstrap_sample$predicted_treatment[bootstrap_sample$income == 1]) -
        mean(bootstrap_sample$predicted_control[bootstrap_sample$income == 1])
    }

    # Standard error of the bootstrapped ATT estimates
    se_hat <- sd(boot_results)

    # 95% percentile confidence interval for the ATT
    conf_interval <- quantile(boot_results, probs = c(0.025, 0.975))

    # Output the estimated ATT, its standard error, and the confidence interval
    list(ATT = mean(boot_results), SE = se_hat, ConfInt = conf_interval)

    ## $ATT
    ## [1] 0.1101
    ##
    ## $SE
    ## [1] 0.01741
    ##
    ## $ConfInt
    ##    2.5%   97.5%
    ## 0.07502 0.14371

The analysis indicates a notable positive effect of the treatment on the outcome. The ATT of 0.1101 suggests that the treatment (family income above USD 25,000 per year) is associated with an 11.01 percentage point higher probability of the outcome (college enrollment) among the treated.
The estimate is fairly precise, as indicated by the standard error of 0.01741. The 95% confidence interval, ranging from 0.07502 to 0.14371, does not include zero, which supports the statistical significance of this effect at the 5% level.

This result is informative for understanding the socioeconomic factors influencing educational access. The finding that a family income above USD 25,000 per year increases the probability of college enrollment by 11.01 percentage points underscores the role of economic stability in educational opportunities. It highlights a potential area of intervention for educational policy, where increasing family incomes, or providing support to those below this income threshold, could have a meaningful impact on college attendance rates. This effect size is both statistically significant and practically important, suggesting that moving a family above this income threshold can have a tangible impact on educational outcomes, potentially leading to long-term benefits in workforce participation and economic mobility.
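As a quick arithmetic check on the significance claim, the reported ATT and bootstrap standard error can be turned into a normal-approximation interval and compared with the percentile interval from the output. This is a sketch that assumes the bootstrap distribution is roughly normal; the variable names are illustrative and the three input numbers are copied from the reported results.

```r
# Reported values from the bootstrap output
att <- 0.1101
se  <- 0.01741
ci_percentile <- c(lower = 0.07502, upper = 0.14371)

# Normal-approximation 95% interval: estimate +/- z * SE
z_crit <- qnorm(0.975)
ci_normal <- c(lower = att - z_crit * se, upper = att + z_crit * se)

# z statistic and two-sided p-value for the null hypothesis ATT = 0
z_stat  <- att / se
p_value <- 2 * pnorm(-abs(z_stat))

print(round(ci_normal, 4))   # approximately 0.0760 and 0.1442
print(round(z_stat, 2))      # approximately 6.32
print(p_value < 0.05)
```

The two intervals nearly coincide, which is what one would expect if the bootstrap distribution of the ATT is close to symmetric.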