Statistical Analysis of Armspan and Test Scores in Iowa

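Stats 101A HW 5
Ian Zhang
UID: 205702810
2023-05-05

Question 4

rse <- 2.418    # residual standard error
df <- 33        # residual degrees of freedom
r2 <- 0.7254
fstat <- 87.17
# RSS = df * rse^2, because rse = sqrt(RSS / df)
RSS <- df * rse^2
RSS
## [1] 192.9419
# F = (SSreg / 1) / (RSS / df) with one predictor, so SSreg = F * RSS / df
SSreg <- fstat * (RSS / df)
SSreg
## [1] 509.6589
# the regression mean square divides by the regression df (1), not the residual df
meanSSreg <- SSreg / 1
meanSSreg
## [1] 509.6589
totalSS <- SSreg + RSS
totalSS
## [1] 702.6008
# magnitude of r; its sign matches the sign of the slope
r <- sqrt(SSreg / totalSS)
r
## [1] 0.8516977

Question 1

arm <- read.csv("armspans2022_gender.csv")
mean(arm$is.female)
## [1] 0.3478261
m1 <- lm(armspan ~ is.female, data = arm)
summary(m1)
##
## Call:
## lm(formula = armspan ~ is.female, data = arm)
##
## Residuals: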

##     Min      1Q  Median      3Q     Max
## -9.7586 -2.0248  0.2414  2.2414  8.2414
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  69.7586     0.7399  94.284  < 2e-16 ***
## is.female    -7.7338     1.2408  -6.233 1.68e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.984 on 43 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4746, Adjusted R-squared:  0.4624
## F-statistic: 38.85 on 1 and 43 DF,  p-value: 1.676e-07
plot(armspan ~ is.female, data = arm)

[Figure: scatterplot of armspan (y-axis) against is.female (x-axis)]

b) The intercept is 69.7586, which is the estimated mean armspan of males (is.female = 0).

c) The slope is -7.7338, which is the difference between the estimated mean armspans of females and males. On average, female armspans are 7.7338 shorter than male armspans.

d) The t statistic and p-value for the slope test whether mean armspan differs between males and females. The null hypothesis is that the slope = 0, which means that there is no difference. The alternative is that the slope != 0. The p-value for the slope is 1.68e-07, which is far below 0.05, so we reject the null hypothesis and conclude that there is a significant difference between the mean armspans of males and females.

Question 2

iowa <- read.delim("iowatest.txt")
# indicator: 1 for Iowa City, 0 for every other city
temp <- ifelse(iowa$City == "Iowa City", 1, 0)
iowa$is.iowa <- temp

m2 <- lm(Test ~ is.iowa, data = iowa)
plot(Test ~ is.iowa, data = iowa)
abline(m2)

[Figure: scatterplot of Test (y-axis) against is.iowa (x-axis) with the fitted line from m2]

summary(m2)
##
## Call:
## lm(formula = Test ~ is.iowa, data = iowa)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.353  -9.353  -0.353   7.647  31.647
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   49.353      1.347  36.626  < 2e-16 ***
## is.iowa       14.705      3.769   3.902 0.000152 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.51 on 131 degrees of freedom
## Multiple R-squared:  0.1041, Adjusted R-squared:  0.09727
## F-statistic: 15.22 on 1 and 131 DF,  p-value: 0.000152

The mean test score for all the cities other than Iowa City is 49.353 (the intercept), while the mean test score for Iowa City is 49.353 + 14.705 = 64.058. From these estimates alone, Iowa City has a higher mean test score than the other cities. Additionally, the p-value for the slope (0.000152) is less than 0.05, so we can reject the null hypothesis that the difference in means is 0 and conclude that Iowa City has a higher mean test score than the other cities.
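The interpretation above relies on a property of regression on a 0/1 indicator: the intercept is the mean of the group coded 0, and the slope is the difference between the two group means. The following is a minimal sketch on simulated data that checks this, and also checks that the slope's t test agrees with a pooled two-sample t test. The variables `group` and `score`, the group sizes, and the seed are illustrative assumptions; this is not the iowatest.txt data.

```r
# Simulated stand-in data; NOT the iowatest.txt data.
set.seed(101)
group <- rep(c(0, 1), times = c(40, 10))        # hypothetical 0/1 indicator
score <- 50 + 15 * group + rnorm(50, sd = 12)   # hypothetical scores

fit <- lm(score ~ group)
group_means <- tapply(score, group, mean)

# Intercept = mean of group 0; intercept + slope = mean of group 1.
intercept_matches <- isTRUE(all.equal(unname(coef(fit)[1]),
                                      unname(group_means["0"])))
sum_matches <- isTRUE(all.equal(sum(coef(fit)),
                                unname(group_means["1"])))

# The slope's t statistic equals the pooled (equal-variance) two-sample t statistic.
slope_t <- summary(fit)$coefficients["group", "t value"]
pooled  <- t.test(score[group == 1], score[group == 0], var.equal = TRUE)
t_matches <- isTRUE(all.equal(slope_t, unname(pooled$statistic)))

print(c(intercept = intercept_matches,
        group1_mean = sum_matches,
        t_test = t_matches))
```

The pooled test is used here because the regression model assumes one common error variance for both groups; the default Welch version of `t.test` would generally give a slightly different statistic.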

Question 3

m3 <- lm(Test ~ Poverty, data = iowa)
plot(Test ~ Poverty, data = iowa)
abline(m3)

[Figure: scatterplot of Test (y-axis) against Poverty (x-axis) with the fitted line from m3]

summary(m3)
##
## Call:
## lm(formula = Test ~ Poverty, data = iowa)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -27.2812  -6.2097   0.5058   4.8252  22.3610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.60578    1.61325   46.25   <2e-16 ***
## Poverty     -0.53578    0.03262  -16.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.766 on 131 degrees of freedom
## Multiple R-squared:  0.6731, Adjusted R-squared:  0.6707
## F-statistic: 269.8 on 1 and 131 DF,  p-value: < 2.2e-16

From the scatterplot and regression line, we can see that there is a strong negative linear association between test scores and poverty: as poverty increases, test scores decrease. The line appears to be a good fit to the data (R-squared = 0.6731). In the summary, the p-value for the slope is < 2e-16, so we reject the null hypothesis that the slope is 0 and conclude that there is a linear association.
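As a check on the slope test, the t statistic and two-sided p-value can be recomputed from the rounded estimate and standard error printed in summary(m3). This is a sketch under the assumption that the printed values are accurate to the digits shown; because they are rounded, the result is close to, but not exactly, the printed -16.43. The 95% confidence level for the interval is also an assumption, since the write-up does not ask for an interval.

```r
# Values copied from the summary(m3) output above (rounded as printed).
estimate  <- -0.53578
std_error <- 0.03262
df_resid  <- 131

# t statistic for H0: slope = 0, and its two-sided p-value.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope (the level is an assumption).
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

print(round(t_value, 1))
print(p_value < 2e-16)
print(ci)
```

An interval that lies entirely below 0 tells the same story as the test: the data are not consistent with a slope of 0.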

Question 4

plot(m3)

[Figure: Residuals vs Fitted for lm(Test ~ Poverty); observations 70, 43, and 47 are labeled]

[Figure: Normal Q-Q (standardized residuals vs theoretical quantiles) for lm(Test ~ Poverty); observations 70, 47, and 43 are labeled]

[Figure: Scale-Location (standardized residuals vs fitted values) for lm(Test ~ Poverty); observations 70, 47, and 43 are labeled]

[Figure: Residuals vs Leverage with Cook's distance for lm(Test ~ Poverty); observations 47, 90, and 7 are labeled]

In the residuals vs fitted plot, there is no clear pattern or trend; the points look like they are scattered randomly. There is also no fan shape, which supports the constant variance condition. In the normal Q-Q plot, the points follow the straight reference line, so the normal distribution condition does not appear to be violated. In the scale-location plot, the points are randomly scattered with no clear trend, the red line is roughly horizontal, and the values are spread evenly around the line, so the constant variance condition does not appear to be violated. Overall, based on these three residual plots, neither the constant variance nor the normal distribution condition appears to be violated, so the model appears to be valid.

Question 5

leverage <- hatvalues(m3)
leverage[leverage == max(leverage)]
##         27
## 0.04997855
# row 27 has the largest leverage
# flag high leverage points: leverage > 4 / n, with n = 133
highlev <- leverage > (4 / 133)
standardRes <- rstandard(m3)
# a bad leverage point is a high leverage point with |standardized residual| > 2
standardRes[highlev] < -2
##     7    27    46    64    67    89   109   120   126
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
standardRes[highlev] > 2
##     7    27    46    64    67    89   109   120   126
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

None of the high leverage points has a standardized residual below -2 or above 2, so there are no bad leverage points in this data.

Question 6

The F test in Question 3 tests whether there is a significant linear relationship between poverty and test scores. The null hypothesis is that the slope of the regression line is 0, and the alternative hypothesis is that it != 0. If the slope is 0, there is no linear association, whereas if it is not 0, there is a linear relationship between the two. The p-value for this test is < 2.2e-16, which is less than 0.05, so we have significant evidence to reject the null hypothesis. This means that we can conclude there is a linear association between poverty and test scores.
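With a single predictor, the F test and the slope's t test are the same test: F equals the square of the t statistic, and F can also be recovered from R-squared and the residual degrees of freedom. The sketch below checks both identities using the rounded values printed in summary(m3), so agreement with the printed 269.8 is only approximate. The variable names are illustrative.

```r
# Values copied from the summary(m3) output (rounded as printed).
t_slope   <- -16.43
r_squared <- 0.6731
df_resid  <- 131
f_printed <- 269.8

# With one predictor, F = t^2.
f_from_t <- t_slope^2

# Equivalently, F = [R^2 / (1 - R^2)] * (residual df / number of predictors).
f_from_r2 <- (r_squared / (1 - r_squared)) * (df_resid / 1)

# Upper-tail p-value of the printed F statistic on 1 and 131 df.
p_value <- pf(f_printed, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(c(f_from_t = f_from_t, f_from_r2 = f_from_r2), 1))
print(p_value < 2.2e-16)
```

Both recomputed values land within rounding error of the printed F statistic, which is why the F test and the slope's t test lead to the same conclusion here.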
