Assignment4

School

Georgia State University

Course

225

Subject

Business

Date

Apr 3, 2024

Assignment 4 – Business Analytics
Instructor: Pan Li

SUBMISSION & COLLABORATION RULES

Homework Assignment 4 – due 11:59PM March 25, in the Assignments section of CANVAS

• This is an individual assignment. The GT honor code applies to this and other assignments!
• There are plenty of resources to get help on using R. You can also ask your instructor/TA/classmates for help on using R.
• On CANVAS I have posted the R script files that I have used in my PPT decks for the Experiments, Variable Selection, and Cross Validation modules.
• This assignment also has my R code to run Parts A and B. You will have to interpret the output to answer the questions in those parts.
• You should submit your assignment as a Word file with your answers.
• Please name your file as lastname_firstname_HW4.docx
• You should submit your homework in the Assignments section of CANVAS.
• You will need to install and use the following R libraries for this homework:
  o tidyverse
  o psych
  o Ecdat

PURPOSE

Understand how to apply and interpret the following concepts:
• Randomized Controlled Experiments
• Natural Experiments
• Prediction – Variable Selection
• Prediction – Cross Validation

Part A (20 points)

Part A is aimed at helping students understand Randomized Controlled Experiments (RCE) and how to interpret output correctly. You will need to interpret the coefficients of the regression used for an RCE and estimate the treatment effect.

Data set to be used for Part A: Ecdat::Star

You will need to download and run HW4.PartA.R

For questions A1 and A2, use the Star dataset in the Ecdat package. Create a new dataset mydata that only has records for the “small” and “regular.with.aide” classes. Note: We are not interested in regular-sized classes.
The R code for selecting these two types of classes is

mydata <- dplyr::filter(Ecdat::Star, classk=="small.class"|classk=="regular.with.aide")

Create a dummy variable called small which is 1 for a student in a “small” size class, and is 0 for a student in a “regular.with.aide” class. Create totalscore which is the sum of the math and the reading scores for each record.
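The two variable-creation steps above can be sketched in base R as follows. This is only an illustration: it uses a small made-up data frame in place of the filtered Ecdat::Star data so that it runs without the package, and the column names tmathssk, treadssk, and classk are assumptions that should be checked against ?Ecdat::Star.

```r
# Toy stand-in for the filtered Star data (all values are made up).
# Assumed column names: tmathssk (math score), treadssk (reading score),
# classk (class type). Verify them with ?Ecdat::Star.
mydata <- data.frame(
  tmathssk = c(473, 536, 463, 559),
  treadssk = c(447, 450, 439, 448),
  classk   = factor(c("small.class", "regular.with.aide",
                      "regular.with.aide", "small.class"))
)

# Dummy variable: 1 for a small class, 0 for a regular class with an aide.
mydata$small <- as.numeric(mydata$classk == "small.class")

# Total score: math score plus reading score for each record.
mydata$totalscore <- mydata$tmathssk + mydata$treadssk

print(mydata)
```

With the real data, the same two assignments would be applied to the mydata produced by the dplyr::filter call; dplyr::mutate is an equivalent alternative.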
Please type ?Ecdat::Star in R to get a description of the Star dataset.

Question A1 (10 points)

Run a linear regression model reg_1 using all the data in mydata, using totalscore as the response variable and small as the predictor.
• Show the output of summary(reg_1)
• What is the estimated coefficient of small?
• What is its p-value?
• Is small statistically significant?
• What is the interpretation of the coefficient of small?

Question A2 (10 points)

Please run a linear regression model reg_2 using all the data in mydata, using totalscore as the response variable and small and teacher experience as the predictors.
• Show the output of summary(reg_2)
• What are the estimated coefficients of small and teacher experience?
• What are their p-values?
• Are they statistically significant?
• What is the interpretation of the coefficients of small and teacher experience?

Part B (70 points)

Part B is aimed at helping students understand Natural Experiments and how to interpret output correctly. You will need to interpret the coefficients of the regression used for a natural experiment and estimate the treatment effect.

Data set to be used for Part B: Ecdat::Treatment

You will need to download and run HW4.PartB.R

We will use the dataset named “Treatment” (in the Ecdat package). Create a new data frame, mydata, that is a copy of Treatment for your analysis, i.e., mydata <- Ecdat::Treatment. You will use this data frame in questions B1-B3.

The National Supported Work (NSW) demonstration project, conducted in the 1970s, measured the impact of training on earnings by a randomized experiment that assigned some individuals to receive training (a treatment group) and others to receive no training (a control group). Dehejia and Wahba (1999, 2002) adapted this data by creating a different “control” sample from national surveys that allowed them to compare experimental data with methods used in non-experimental settings.
We have adapted the dataset from Dehejia and Wahba (1999) to compare different methods for determining the causal impact of training on wages. We will consider 1978 to be the post-treatment period (AFTER) and 1975 to be the pre-treatment period (BEFORE).

Please enter ?Ecdat::Treatment in RStudio to get a description of the Treatment dataset.

Description
a cross-section from 1974
number of observations : 2675
country : United States

Format
A dataframe containing :
• treat, was the individual in the treated group?
• age, age of the individual
• educ, education in years of the individual
• ethn, a factor with levels (“other”, “black”, “hispanic”)
• married, is the individual married?
• re74, the individual's real annual earnings in 1974 (pre-treatment)
• re75, the individual's real annual earnings in 1975 (pre-treatment)
• re78, the individual's real annual earnings in 1978 (post-treatment)
• u74, was the individual unemployed in 1974?
• u75, was the individual unemployed in 1975?

Use the mutate function (in the dplyr package) to create a dummy variable Treated using this code:

mydata <- mydata %>% mutate(Treated = as.numeric(treat))

So, if a record has Treated = 1 then it belongs to the treatment group; otherwise, if a record has Treated = 0 then it belongs to the control group.

Question B1: (10 points)

Use the regression model, re75 = b0 + b1*Treated, to calculate the mean value of the pre-treatment period (BEFORE) annual earnings (re75) for the control and treatment groups, respectively. Please show how you got your answers with the relevant R output.

Question B2: (10 points)

Use the regression model, re78 = b0 + b1*Treated, to calculate the mean value of the post-treatment period (AFTER) annual earnings (re78) for the control and treatment groups, respectively. Please show how you got your answers with the relevant R output.

Question B3: (10 points)

Please enter the values for A and B from Question B1 and C and D from Question B2 into the following table:
• A =
• B =
• C =
• D =

Question B4: (10 points)

Using the table in B3, what is the difference between average re78 values and average re75 values for the Control group? Please show your computation.

Question B5: (10 points)

Using the table in B3, what is the difference between average re78 values and average re75 values for the Treatment group? Please show your computation.

Question B6: (10 points)

Enter the values for C – A from Question B4 and D – B from Question B5 into the following table:
• C – A =
• D – B =

Question B7: (10 points)

What is the difference-in-difference estimate for the impact of training on wages? Please show your computation.

Diff-in-Diff = (D – B) – (C – A) =

Part C (10 points; 2 points per question – Please select the correct answer(s))

1. We have several linear regression models that are being considered for prediction and we have training data. If we focus only on how a model’s performance is measured by its Mean Square Error on the training data alone, we would always get the best linear regression model for prediction. Is this statement True or False?
A. True
B. False

2. Which of the following statements are TRUE about Best Subset Selection? (Select all Correct Answer(s))
A. If you have p predictors, there are p separate regressions possible that need to be evaluated.
B. If you have p predictors, there are 2^p separate regressions possible that need to be evaluated.
C. Best Subset selection is not practical with p (# of predictors) over 30. Instead, we have to resort to alternative efficient methods.

3. Which of the following statements are TRUE? (Select all Correct Answer(s))
A. Stepwise methods – Forward and Backward – are alternative approaches to best subset selection that involve a considerably smaller number of models compared to Best Subset Selection.
B. Both Forward and Backward stepwise selection methods involve consideration of 1 + p(p + 1)/2 models which is far more than the 2^p models being considered in best subset selection.
C. Both Forward and Backward stepwise selection methods involve consideration of 1 + p(p + 1)/2 models which is far less than the 2^p models being considered in best subset selection.

4. Both Best Subset and Forward stepwise will always select the identical set of variables for their best models with 5 variables (assume that p = 10)
A. TRUE
B. FALSE

5. Which of the following statements are TRUE, when comparing Ridge Regression and Lasso? (Select all Correct Answer(s))
A. Lasso has a very useful advantage as it produces simpler models which are more interpretable.
B. Ridge Regression will always dominate Lasso over all data sets.
C. As λ becomes very large, the penalty impact grows and more of the Lasso regression coefficients are set to zero.

*** End of Homework Assignment #4 ***