Assignment-5

School

McMaster University

Course

2B03

Subject

Statistics

Date

Apr 3, 2024

2B03 Assignment 5
Regression Modeling and Prediction (Chapters 10 & 11)
Matthew Musulin 400329990
Due Thursday December 2 2021

Instructions: You are to use R Markdown for generating your assignment output file. You begin with the R Markdown script downloaded from A2L, and need to pay attention to information provided via introductory material posted to A2L on working with R, R Markdown, and downloading data from ODESI. Having downloaded all necessary files, placed them in the same folder/directory, and added your answers to the R Markdown script, you then are to generate your output file using “Knit to PDF” and, when complete, upload both your R Markdown file and your PDF file to the appropriate folder on A2L.

1. Define the following terms in a sentence (or short paragraph) and state a formula if appropriate (this question is worth 5 marks).

i. Test of Significance
A formal procedure for comparing observed data with a hypothesis whose truth we want to assess. The hypothesis is usually a statement about the population parameters. The results of the test are expressed in terms of a probability (the p-value) that measures how well the data and the hypothesis agree.

ii. Coefficient of Determination
A measure of how well an estimated regression line fits the sample data, defined as the proportion of the total variation in the response that is explained by the model: R^2 = SSM/SST.

iii. Multiple Regression Analysis
Regression analysis in which the response variable is modelled as a function of several explanatory variables rather than just one, for example y = β0 + β1 x1 + ... + βk xk + ε.

iv. Individual Prediction Interval
An estimate of an interval in which a single future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.

2. An economist is studying the relationship between unemployment and inflation, and has collected the following data. Inflation appears in columns, unemployment in rows (this question is worth 5 marks).
                          Inflation
Unemployment   Abated   Unchanged   Accelerated   Total
Lower               5           5            10      20
Unchanged           5          35            20      60
Higher             20           0             0      20
Total              30          40            30     100

The data in the table above summarize the relationship between unemployment and inflation based on 100 months of data. For instance, for 35 months, inflation and unemployment were unchanged, while for 10 months inflation had accelerated and unemployment was lower.
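As a cross-check on the test of independence for this table, the Pearson statistic can be computed by hand from the observed and expected counts. The following is a minimal sketch, not part of the original submission; the object names (observed, expected, chi_sq, deg_free, p_value) are chosen here for illustration. Under independence, each expected count is the row total times the column total divided by the grand total.

```r
# Observed counts: rows = unemployment (Lower, Unchanged, Higher),
# columns = inflation (Abated, Unchanged, Accelerated)
observed <- matrix(c( 5,  5, 10,
                      5, 35, 20,
                     20,  0,  0),
                   nrow = 3, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic with (rows - 1) * (columns - 1) degrees of freedom
chi_sq   <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value  <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

print(expected)
print(c(statistic = chi_sq, df = deg_free, p.value = p_value))
```

Because the continuity correction in chisq.test() is only applied to 2 x 2 tables, the hand computation for this 3 x 3 table should agree with chisq.test() whether correct is TRUE or FALSE.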
Using the data in the table above, conduct an appropriate hypothesis test of independence between inflation and unemployment at the 5% level of significance.

    row1 = c(5, 5, 10)
    row2 = c(5, 35, 20)
    row3 = c(20, 0, 0)
    Matrix = matrix(c(row1, row2, row3), nrow = 3, byrow = TRUE)
    chisq.test(Matrix, correct = TRUE)

    ##
    ##  Pearson's Chi-squared test
    ##
    ## data:  Matrix
    ## X-squared = 65.278, df = 4, p-value = 2.249e-13

H0: unemployment and inflation are independent. HA: they are dependent. Because the p-value 2.249e-13 < 0.05, we reject H0. There is sufficient evidence of a dependence between unemployment and inflation, and the result is statistically significant at the 5% level.

3. Consider the following dataset on the final grade received in a particular course (grade) and attendance (attend, number of times present when work was handed back during the semester out of a maximum of six times). Note that R has the ability to read datafiles directly from a URL, so here (unlike the odesi data that you manually retrieved) you do not have to manually download the data providing you are connected to the internet (this question is worth 5 marks).

    course <- read.table("https://socialsciences.mcmaster.ca/racinej/files/attend.RData")
    attach(course)

i. Run a regression of grade on attend using the R command lm(). What is the impact of a 1 unit increase of attend on the expected grade based on your model?

    attach(course)

    ## The following objects are masked from course (pos = 3):
    ##
    ##     attend, grade

    lm(attend~grade)

    ##
    ## Call:
    ## lm(formula = attend ~ grade)
    ##
    ## Coefficients:
    ## (Intercept)        grade
    ##    -0.53515      0.05971

Note that the call above, lm(attend~grade), regresses attend on grade, which is the reverse of what the question asks. The reported slope of 0.05971 is therefore the estimated change in expected attend for a one-point increase in grade, not the impact of attendance on the grade. The regression requested is lm(grade~attend); its slope on attend is the estimated change in the expected grade for one additional attendance. Because attend and grade are positively associated in this sample, that slope is also positive, so higher attendance goes with a higher expected grade.

ii. In class we distinguished between correlation and causation and cautioned against inferring causation from statistical correlation. Do your results suggest that an individual who increased their attendance by 1 unit would also experience an increase in their expected grade? Why or why not?
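Explain the roles of confounders in this context (e.g. along the lines of Sir R. A. Fisher’s concerns).

A correlation between two variables does not by itself mean that a change in one variable causes a change in the other. Causation means that one event is the result of the occurrence of the other. In this case, the results show only that higher attendance is associated with a higher expected grade. They do not establish that an individual who increased their attendance by 1 unit would experience a higher grade, because a confounder may be responsible for the association: for example, motivation or ability could raise both attendance and grades. This is in line with Fisher’s concern that an observed association can arise from a common cause acting on both variables rather than from one variable causing the other.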
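The question asks for grade regressed on attend, while the call shown above is lm(attend~grade); the two regressions are not interchangeable. The following is a small sketch on simulated data (the variables attend_sim and grade_sim and the numbers used to generate them are hypothetical and are not the course data). It shows that the two slopes differ, share the sign of the correlation, and multiply to the squared correlation.

```r
# Simulated, hypothetical data for illustration only (not the course data)
set.seed(1)
n_obs      <- 200
attend_sim <- sample(0:6, n_obs, replace = TRUE)
grade_sim  <- 50 + 4 * attend_sim + rnorm(n_obs, sd = 10)

# Regression of grade on attend (what the question asks for)
fit_grade_on_attend <- lm(grade_sim ~ attend_sim)
# Regression of attend on grade (the direction used in the call above)
fit_attend_on_grade <- lm(attend_sim ~ grade_sim)

slope_grade_on_attend <- unname(coef(fit_grade_on_attend)[2])
slope_attend_on_grade <- unname(coef(fit_attend_on_grade)[2])
r_xy <- cor(grade_sim, attend_sim)

# The slopes differ, but their product equals the squared correlation
print(c(grade_on_attend = slope_grade_on_attend,
        attend_on_grade = slope_attend_on_grade,
        product         = slope_grade_on_attend * slope_attend_on_grade,
        r_squared       = r_xy^2))
```

Neither slope says anything about causation on its own; both simply summarize the same linear association from different directions.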
4. This question requires you to download data obtained from Statistics Canada. If you are working on campus go to www.odesi.ca (off campus users must first sign into the McMaster library via libaccess at library.mcmaster.ca/libaccess, search for odesi via the library search facilities, then select odesi from these search results). Next, select the “Find data” field in odesi and search for “Labour Force Survey June, 2021”, then scroll down and select the Labour Force Survey, June 2021 [Canada]. Next click on the “Explore & Download” icon, then click on the download icon (i.e., the diskette icon, square, along the upper right of the browser pane), then click on “Select Data Format”, then scroll down and select “Comma Separated Value file” (csv) which, after a brief pause, will download the data to your hard drive (you may have to extract the file from a zip archive depending on which operating system you are using). Finally, make sure that you place this csv file in the same directory as your R code file (this file ought to have the name LFS-71M0001-E-2021-June_F1.csv), and in RStudio select the menu item Session -> Set Working Directory -> To Source File Location. There will be another file with (almost) the same name but with the extension .pdf that is the pdf documentation that describes the variables in this data set. Note that it would be prudent to retain this file as we will use it in future assignments (this question is worth 10 marks).

Next, open RStudio, make sure this csv file and your R Markdown script are in the same directory (in RStudio open the Files tab (lower right pane by default) and refresh the file listing if necessary). Then read the file as follows:

    lfp <- read.csv("LFS-71M0001-E-2021-June_F1.csv")

This data set contains some interesting variables on the labour force status of a random subset of Canadians. We will focus on the variable HRLYEARN (hourly earnings) described on page 22 of the pdf file LFS-71M0001-E-2021-June.pdf.
We will also consider other variables so that we can conduct multiple regression analysis.

i. Following assignment 1, consider hourly earnings and highest educational attainment for people in the survey and consider both high school graduates (EDUC==2) and those holding a bachelors degree (EDUC==5). To construct these subsets we can use the R command subset as follows (the ampersand & is the logical operator “and”; see ?subset for details on the subset command):

    hs <- subset(lfp, FTPTMAIN==1 & EDUC==2 & HRLYEARN > 0)$HRLYEARN
    ba <- subset(lfp, FTPTMAIN==1 & EDUC==5 & HRLYEARN > 0)$HRLYEARN

These commands simply tell R to take a subset of the data frame lfp for full-time workers having either a high school diploma or university bachelors degree for those reporting positive earnings, and then retain only the variable HRLYEARN and store these in the variables named hs (hourly earnings for high-school graduates) or ba (hourly earnings for university graduates). Conduct a t-test of the hypothesis that the average wage is equal for the two groups using the R function t.test() (see ?t.test() for details).

    t.test(hs,ba)

    ##
    ##  Welch Two Sample t-test
    ##
    ## data:  hs and ba
    ## t = -52.007, df = 13783, p-value < 2.2e-16
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -11.81520 -10.95693
    ## sample estimates:
    ## mean of x mean of y
    ##  24.87665  36.26271

H0: the mean hourly earnings of the two groups are equal. HA: they are not equal. Because the p-value is less than 2.2e-16, which is far below 0.05, we reject H0. The sample means are 24.88 for high-school graduates and 36.26 for bachelors degree holders, and the 95% confidence interval for the difference in means (hs minus ba) of (-11.82, -10.96) does not contain 0.
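The pieces of the Welch output above are tied together by simple arithmetic, which can be checked without the raw data. The sketch below is an illustration added here, using only the numbers printed by t.test(); the object names are hypothetical. It recovers the standard error of the difference from the reported means and t statistic, then rebuilds the 95% confidence interval.

```r
# Values copied from the t.test() output above
mean_hs  <- 24.87665
mean_ba  <- 36.26271
t_stat   <- -52.007
welch_df <- 13783

# t = (difference in means) / (standard error), so SE = difference / t
diff_means <- mean_hs - mean_ba
std_error  <- diff_means / t_stat

# 95% confidence interval: difference +/- t critical value * SE
half_width <- qt(0.975, df = welch_df) * std_error
conf_int   <- diff_means + c(-1, 1) * half_width

print(c(difference = diff_means, std_error = std_error))
print(conf_int)
```

Up to rounding in the printed t statistic, the interval rebuilt this way should match the one reported by t.test().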
ii. Using the PDF that accompanies the data titled "Canada Labour Statistics Division, Statistics Canada Labour Force Survey, June 2021 [Canada], Study Documentation", present a description of each variable that we select below (i.e., HRLYEARN, EDUC, SEX, AGE_12, MARSTAT, UNION; see pages 15 and on for details).

    foo <- subset(lfp, FTPTMAIN==1 & HRLYEARN > 0, select = c(HRLYEARN,EDUC,SEX,AGE_12,MARSTAT,UNIO

HRLYEARN: Usual hourly wages, employees only.
EDUC: Highest educational attainment.
SEX: Sex of respondent.
AGE_12: Five-year age group of respondent.
MARSTAT: Marital status of respondent.
UNION: Union status, employees only.

iii. Estimate the multivariate linear regression model and produce a model summary via summary(model).

    model <- lm(HRLYEARN~EDUC+AGE_12+factor(SEX)+factor(MARSTAT)+factor(UNION), data=foo)
    summary(model)

    ##
    ## Call:
    ## lm(formula = HRLYEARN ~ EDUC + AGE_12 + factor(SEX) + factor(MARSTAT) +
    ##     factor(UNION), data = foo)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -36.754  -8.336  -1.973   6.206  80.860
    ##
    ## Coefficients:
    ##                  Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)      18.84178    0.33855  55.655  < 2e-16 ***
    ## EDUC              3.65616    0.04863  75.187  < 2e-16 ***
    ## AGE_12            0.62894    0.02961  21.244  < 2e-16 ***
    ## factor(SEX)2     -4.41012    0.13528 -32.599  < 2e-16 ***
    ## factor(MARSTAT)2 -1.36427    0.18953  -7.198 6.22e-13 ***
    ## factor(MARSTAT)3 -3.36602    0.66500  -5.062 4.18e-07 ***
    ## factor(MARSTAT)4 -1.31162    0.43317  -3.028 0.002464 **
    ## factor(MARSTAT)5 -1.16748    0.32739  -3.566 0.000363 ***
    ## factor(MARSTAT)6 -4.79729    0.18546 -25.867  < 2e-16 ***
    ## factor(UNION)2    0.43059    0.49255   0.874 0.382011
    ## factor(UNION)3   -3.28675    0.14345 -22.912  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 12.65 on 36208 degrees of freedom
    ## Multiple R-squared:  0.2184, Adjusted R-squared:  0.2182
    ## F-statistic:  1012 on 10 and 36208 DF,  p-value: < 2.2e-16

What is the coefficient of determination?
Interpret this result.

The coefficient of determination is the measure of how well an estimated regression line fits the sample data. This measure, known as the “goodness of fit,” takes a value between 0.0 and 1.0. In our case, R^2 = 0.2184, so the model explains about 21.84% of the sample variation in hourly earnings. This is low: most of the variation in hourly earnings is not accounted for by these explanatory variables.

Interpret the coefficient estimates. Note that factor(SEX)2 is the dummy variable that represents the difference in wages between group 2 and the base group, group 1, other things equal (i.e., 2=Female, 1=Male). Also, note that AGE_12 contains five year ranges, so the coefficient is the change in expected wages for an additional five years of age, other things equal.

HRLYEARN is the dependent variable and EDUC, AGE_12, SEX, MARSTAT, and UNION are independent variables. HRLYEARN enters the model in levels, so each coefficient is a change in expected hourly earnings measured in the units of HRLYEARN (dollars per hour), other things equal. It is not a percentage change.

Independent Variable Interpretation:

EDUC (3.65616): a one-category increase in EDUC is associated with an increase of 3.66 in expected hourly earnings.
AGE_12 (0.62894): moving up one five-year age group is associated with an increase of 0.63 in expected hourly earnings.
factor(SEX)2 (-4.41012): expected hourly earnings for women are 4.41 lower than for men (the base group).
factor(MARSTAT)2 (-1.36427): expected hourly earnings for MARSTAT category 2 are 1.36 lower than for the base category, MARSTAT 1.
factor(MARSTAT)3 (-3.36602): expected hourly earnings for MARSTAT category 3 are 3.37 lower than for the base category.
factor(MARSTAT)4 (-1.31162): expected hourly earnings for MARSTAT category 4 are 1.31 lower than for the base category.
factor(MARSTAT)5 (-1.16748): expected hourly earnings for MARSTAT category 5 are 1.17 lower than for the base category.
factor(MARSTAT)6 (-4.79729): expected hourly earnings for MARSTAT category 6 are 4.80 lower than for the base category.
factor(UNION)2 (0.43059): expected hourly earnings for workers who are covered but not unionized are 0.43 higher than for unionized workers (the base group, UNION 1).
factor(UNION)3 (-3.28675): expected hourly earnings for non-unionized workers are 3.29 lower than for unionized workers.

What is the impact of either being covered but not unionized or being non-unionized relative to unionized workers? Are they both significant? Does this make sense?

Significance of factor(UNION)2:
H0: β_UNION2 = 0
HA: β_UNION2 ≠ 0
The p-value of factor(UNION)2 is 0.382011, which is greater than 0.05. Therefore we fail to reject H0. This means factor(UNION)2 is not significant: there is no statistically significant difference in expected HRLYEARN between workers who are covered but not unionized and unionized workers, other things equal.

Significance of factor(UNION)3:
H0: β_UNION3 = 0
HA: β_UNION3 ≠ 0
The p-value of factor(UNION)3 is less than 2e-16. Therefore we reject H0. This means factor(UNION)3 is significant. Being non-unionized has a negative effect on the dependent variable HRLYEARN relative to being unionized, because the coefficient value is -3.28675.

This makes sense: workers covered by a union contract are paid under the same agreement as union members, so their wages should be similar, while non-unionized workers are not covered by such an agreement.

Conduct a test of significance for the variable EDUC. Show all steps and interpret the results.

Significance of EDUC:
H0: β_EDUC = 0
HA: β_EDUC ≠ 0
Test statistic: t = Estimate / Std. Error = 3.65616 / 0.04863, reported in the summary as 75.187, with 36208 degrees of freedom.
The p-value of EDUC is less than 2e-16, which is below 0.05. Therefore we reject H0. This means EDUC is significant. EDUC has a positive effect on the dependent variable HRLYEARN because the coefficient value is 3.65616.
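The significance tests above can be re-derived from the estimates and standard errors printed by summary(model). The following is a minimal sketch using only those reported numbers (the object names are chosen here for illustration, and the results differ slightly from the summary because the inputs are rounded). It computes the t statistics and two-sided p-values for EDUC and factor(UNION)2, and checks the adjusted R-squared from R-squared, the sample size and the number of regressors.

```r
# Values copied from the summary(model) output
est_educ   <- 3.65616;  se_educ   <- 0.04863
est_union2 <- 0.43059;  se_union2 <- 0.49255
resid_df   <- 36208      # residual degrees of freedom
r_squared  <- 0.2184
k          <- 10         # number of slope coefficients (from the F-statistic line)

# t statistic = estimate / standard error; two-sided p-value from the t distribution
t_educ   <- est_educ / se_educ
t_union2 <- est_union2 / se_union2
p_educ   <- 2 * pt(-abs(t_educ), df = resid_df)
p_union2 <- 2 * pt(-abs(t_union2), df = resid_df)

# Critical value for a two-sided test at the 5% level
t_crit <- qt(0.975, df = resid_df)

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1), with n = resid_df + k + 1
n_obs  <- resid_df + k + 1
adj_r2 <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

print(c(t_educ = t_educ, t_union2 = t_union2, t_crit = t_crit))
print(c(p_educ = p_educ, p_union2 = p_union2, adj_r2 = adj_r2))
```

The decision rule is the same either way: reject H0 when the absolute t statistic exceeds t_crit, or equivalently when the p-value is below 0.05.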