---
title: "Week 5 Lab Key"
author: "STAT 311 -- Winter 2024"
date: "Tuesday, February 6, 2024"
output: openintro::lab_report
---

# Intro to Linear Regression

We will be considering the Human Freedom Index (HFI) report in Lab 3. Below are the key facts about this report.

* attempts to summarize the idea of "freedom" through different variables for various countries
* serves as a rough objective measure for the relationship between different types of freedom and other social and economic circumstances.

The four types of freedom were:

* Political
* Religious
* Economic
* Personal

In this lab, we will be analyzing data from the HFI reports. Our aim is to graphically and numerically summarize relationships in the data to determine which variables can tell us a story about freedom.

&nbsp;

***

# Getting Started

We will start by loading the data and the necessary packages. In this lab, we're using the `hfi` data set, which is part of the `openintro` package.

```{r load-packages, message = FALSE, warning = FALSE}
#install.packages("statsr")
#install.packages("broom")
library(tidyverse)
library(openintro)
library(statsr)
library(broom)
data(hfi)
```

&nbsp;

***

## Exercise 1

##### What are the dimensions of the `hfi` data set?
&nbsp;

```{r code-chunk-label}
dim(hfi)
```

The `hfi` data set has `r dim(hfi)[1]` rows and `r dim(hfi)[2]` columns.

&nbsp;

***

## Exercise 2

#### The data set spans many years, but we are only interested in data from the year 2016. Filter the `hfi` data frame for the year 2016, select the six variables, and assign the result to a data frame named `hfi_2016`.

&nbsp;

If we look at the data set in our environment in the panel on the right side of the RStudio window, *we can see that there is a variable called `year`*. This is what we will condition on.

```{r}
# list of columns we want to include in our data set
columns <- c(
  "year",
  "countries",
  "region",
  "pf_expression_control",
  "hf_score",
  "pf_score"
)
```

```{r}
hfi_2016 <- hfi %>%
  # selecting observations from 2016
  filter(year == "2016") %>%
  # removing all columns not listed above
  select(all_of(columns))
```

```{r}
hfi_2016
```

**Note:** In the above code chunk, we used the `select` function, which lets us choose which variables to keep for the operations we perform with `%>%`. The `all_of` function tells R that we want to keep every column named in the `columns` vector.

Filtering the data narrowed our data set down to `r dim(hfi_2016)[1]` rows and `r dim(hfi_2016)[2]` columns. We narrowed the `r dim(hfi)[2]` variables in the original data down to six. Below are these variables and their descriptions:

* `year`: the year
* `countries`: name of country
* `region`: region where the country is located
* `pf_expression_control`: political pressures and controls on media content
* `hf_score`: human freedom score
* `pf_score`: personal freedom score

You can find a full list of the variables and their descriptions by entering `?hfi` in the console.

&nbsp;

***

## Exercise 3

#### What type of plot would you use to display the relationship between the personal freedom score, `pf_score`, and `pf_expression_control`? Plot this relationship using the variable `pf_expression_control` as the predictor. Does the relationship look linear? If you knew a country's `pf_expression_control`, or its score out of 10 (with 0 indicating the most political pressure and control on media content), would you be comfortable using a linear model to predict the personal freedom score?

&nbsp;

The `pf_score` variable and the `pf_expression_control` variable are both numeric (quantitative). Therefore, a scatterplot is a good option for visualizing the relationship between these two variables. When we plot this relationship, we will put `pf_expression_control` on the x-axis because it is the predictor.

```{r, message = FALSE, warning = FALSE, out.width = "90%", out.height = "90%", fig.align='center'}
ggplot(hfi_2016) +
  aes(x = pf_expression_control, y = pf_score, ymin = 0) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Personal Freedom vs. Political Controls on Media Content") +
  xlab("Political Pressures and Controls on Media Content") +
  ylab("Personal Freedom (score)")
```

The data appear linear: the points are tightly grouped around a straight trend line, and there is no evidence of curvature or other patterns. Because the data appear linear, we could be comfortable using a linear model to make predictions about the personal freedom score `pf_score`.

***
&nbsp;

If the relationship looks linear, we can quantify the strength and direction of the linear association using the correlation coefficient. Recall that the correlation takes values between -1 (perfect negative) and +1 (perfect positive). A value of 0 indicates no linear association.

```{r}
R = cor(hfi_2016$pf_expression_control, hfi_2016$pf_score)

# another way
r <- hfi_2016 %>%
  summarize(
    R = cor(pf_expression_control, pf_score)
  )
r
```

The correlation coefficient for this data is `r R`.

&nbsp;

***

# Sum of Squared Residuals

When we describe the distribution of a single variable, we discuss characteristics such as center, spread, and shape. It's useful to be able to describe the relationship of two numerical variables, such as `pf_expression_control` and `pf_score` above.

&nbsp;

***

## Exercise 4

#### Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship, as well as any unusual observations.

&nbsp;

**Form:** When we talk about the form of a relationship between two variables, we are talking about whether the relationship is linear. In this case, the answer is yes, there appears to be a linear relationship between `pf_expression_control` and `pf_score`. This conjecture is backed by the high magnitude of the correlation coefficient, which can be seen above.

**Direction:** This is a positive linear relationship. That means that as `pf_expression_control` increases, we expect `pf_score` to increase as well. This is backed by the positive value of the correlation coefficient.
**Strength:** This is a strong linear relationship. This is evident from the tight grouping of the data points around the trend line in the plot. We can also infer this from the correlation coefficient, whose magnitude is close to 1.

***

&nbsp;

Remember that, up until this point, we have mostly used the mean and standard deviation to summarize a *single* variable. We can also summarize the relationship between *multiple* variables. If we want to summarize the relationship between two variables, we can find the line that best follows their association. The `plot_ss` function is useful for this. To see how this function works, follow the steps below.

1. Run the following command in your console.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)

2. Click two points on the plot to define a line.
3. The line you specified will be shown in black and the residuals in blue. Recall the formula for the residuals $e_i$: $e_i = y_i - \hat{y}_i$.

Note that the `plot_ss` function should only be run in the console. It will not work if you try to run it in your Markdown document. The `plot_ss` function returns three values: (1) the slope of the line, (2) the intercept of the line, and (3) the sum of squares.

&nbsp;

**Visualizing the Squared Residuals:** Suppose we want to select the line that minimizes the sum of squared residuals, which is given by the equation

$$ SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. $$

**Recall:** $\hat{y}_i$ is the predicted value of $y$ at point $i$, given the regression equation. We can visualize these squared residuals by rerunning the `plot_ss` command with the added argument `showSquares = TRUE`.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)

The sum of the areas of the boxes equals the sum of squared residuals.

&nbsp;

***
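To make the $SSR$ formula concrete, here is a minimal sketch that computes the sum of squared residuals for a hand-picked line. The intercept `b0` and slope `b1` below are made-up values standing in for a line you might choose in `plot_ss`.

```{r}
# hypothetical intercept and slope for a hand-picked line (made-up values)
b0 <- 4.0
b1 <- 0.5

# sum of squared residuals for that line: SSR = sum((y_i - yhat_i)^2)
hfi_2016 %>%
  mutate(
    yhat = b0 + b1 * pf_expression_control,  # predicted values
    e = pf_score - yhat                      # residuals
  ) %>%
  summarize(SSR = sum(e^2))
```

&nbsp;

***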
## Exercise 5

#### Using `plot_ss`, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got?

&nbsp;

Run the command below 5 times in the console, making note of the sum of squares in the output each time you run it.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)

Remember that you will have to select two points on the plot before any output will display.

***

&nbsp;

# The Linear Model

You may have noticed that the above method is very inefficient if our goal is to find the correct least squares line. **Note:** The least squares line is the line that minimizes the sum of squared residuals. We can use the `lm` (**l**inear **m**odel) function in R to fit a regression line to the data.

```{r}
# fit a model and give it the name "model"
model <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
```

This function returns an `lm` object with various pieces of information about the model we have fit.

**Note (1):** Giving the `lm` object a name is a requirement if we want to use the model to create summaries and plots.

**Note (2):**

* The first argument of the `lm` function is used to specify the variables in our model, while also specifying which variable is the dependent variable. This argument takes the form `y ~ x`, where $y$ is the response variable (dependent) and $x$ is the explanatory variable / predictor (independent). So $y =$ `pf_score` and $x =$ `pf_expression_control`.
* The second argument indicates the data set that contains the specified variables. In our case, the `pf_score` and `pf_expression_control` variables are contained in both `hfi` and `hfi_2016`. To use the variables we have specified, we must reference one of those two data sets in the `data = ` argument.

**Note (3):** We cannot pipe the data directly into `lm` because the data is not the first argument. For the default piping behavior to work, the data must be the first argument of the function.
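As an aside (not required for this lab), the magrittr pipe does offer a workaround: when `.` appears as an argument on the right-hand side, the piped data is substituted there instead of being inserted as the first argument. A minimal sketch:

```{r}
# equivalent fit using the magrittr `.` placeholder;
# `.` stands for the piped-in data frame `hfi_2016`
model_piped <- hfi_2016 %>%
  lm(pf_score ~ pf_expression_control, data = .)
```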
We can access the information in the linear model using the `summary` function. (The OpenIntro lab online says we should use the `tidy` function, but this is not the correct function.) The code below displays a numerical summary of the model we fit above.

```{r}
summary(model)
```

We can use the values in this numerical summary to write our regression equation:

$$\hat{y} = 4.2838 + 0.5419 x$$

where $x =$ `pf_expression_control` and $\hat{y} =$ `pf_score`.

**Interpretation of the intercept ($4.2838$):** When `pf_expression_control` is 0, we expect the mean `pf_score` to be 4.28.

**Interpretation of the slope ($0.5419$):** For every one-unit increase in the value of `pf_expression_control`, we expect the mean personal freedom score to increase by 0.5419 units.

We can determine how well the model fits the data using the $R^2$ value. $R^2$ *represents the proportion of variability in the response variable that is explained by the explanatory variable.* We can use the `glance` function to access this information.

```{r}
glance(model)
```

You may have noticed that there are actually two $R^2$ values in this output: `r.squared` and `adj.r.squared`. The second value, `adj.r.squared`, is the Adjusted $R^2$. This is not the $R^2$ value we will be using. We will instead use the standard $R^2$ value, `r.squared`.

**Interpretation of $R^2$:** 71.4134% of the variability in `pf_score` is explained by `pf_expression_control`.
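If we only need the $R^2$ value itself rather than the full output, we can extract it directly. Here is a quick sketch of two equivalent ways.

```{r}
# R^2 from the base R summary object
summary(model)$r.squared

# R^2 from the broom::glance output
glance(model)$r.squared
```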
&nbsp;

***

## Exercise 6

#### Fit a new model that uses `pf_expression_control` to predict `hf_score`, or the total human freedom score. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?

&nbsp;

```{r}
# fit the model (naming it `model_hf`)
model_hf <- lm(hf_score ~ pf_expression_control, data = hfi_2016)

# print a summary
summary(model_hf)
```

Based on the above output, the equation for the regression line is

$$ \hat{y} = ? + ? x $$

where $x =$ `pf_expression_control` and $\hat{y} =$ the predicted `hf_score`.

**Interpretation of the Slope:** For every one-unit increase in the value of `pf_expression_control`, we expect the mean human freedom score to increase by ? units.

***

&nbsp;

# Prediction and Prediction Errors

Let's start by making a scatterplot that includes the least squares line for our first `model`.

```{r, warning = FALSE, message = FALSE, out.width = "90%", out.height = "90%", fig.align='center'}
ggplot(data = hfi_2016) +
  aes(x = pf_expression_control, y = pf_score) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Political Pressures and Controls on Media Content") +
  ylab("Personal Freedom (score)")
```

The trend line is added by `geom_smooth(method = "lm", se = FALSE)`, which we are familiar with from earlier lab assignments.

When predictions are made for values of $x$ that are beyond the range of the observed data, this is called **extrapolation**. Extrapolation can be badly misleading, because there is no guarantee that the linear trend continues outside the range of the data. Predictions made within the range of the data are more reliable, and we can use them to compute residuals as well.
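Rather than plugging numbers into the regression equation by hand, we can also ask R for a fitted value with the built-in `predict` function. A minimal sketch for our first `model`, using an arbitrary in-range value of 5:

```{r}
# predicted pf_score for a country with pf_expression_control = 5
predict(model, newdata = data.frame(pf_expression_control = 5))
```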
&nbsp;

***

## Exercise 7

#### If someone saw the least squares regression line and not the actual data, how would they predict a country's personal freedom score for one with a 3 rating for `pf_expression_control`? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

&nbsp;

Recall that the equation for predicting $y$ based on $x$ is

$$\hat{y} = 4.2838 + 0.5419 x.$$

We can use R to calculate this value when $x = 3$.

```{r}
# the value of pf_expression_control
x <- 3

# regression equation
4.2838 + 0.5419 * x
```

Therefore, the predicted value of `pf_score` when `pf_expression_control` $= 3$ is 5.9095.

Recall that the residual equation for the $i^{th}$ observation is

$$ e_i = y_i - \hat{y}_i. $$

This means that we determine whether something is an overestimate or an underestimate as follows.

* If $y_i > \hat{y}_i$, this value will be positive, meaning that the predicted value is smaller than the actual value (an underestimate).
* If $y_i < \hat{y}_i$, this value will be negative, meaning that the predicted value is larger than the actual value (an overestimate).

We still don't know the observed value of `pf_score` when the value of `pf_expression_control` is 3. We can find this out using piping.

```{r}
# true value of `pf_score` with `pf_expression_control` = 3
hfi_2016 %>%
  filter(pf_expression_control == 3) %>%
  select(countries, pf_expression_control, pf_score)
```

From the above output, we can see that the true value of `pf_score` is 5.4656. Now we can find the residual.

$$ e_i = y_i - \hat{y}_i = 5.4656 - 5.9095 = \textbf{-0.4439} $$

This means the residual for predicting `pf_score` when `pf_expression_control` $= 3$ is -0.4439. Judging by our criteria above, our prediction was an overestimate.

***

&nbsp;

# Model Diagnostics

Whenever we fit a model, we need to make sure it's reliable, which requires us to check for three conditions:
1. **Linearity:** We want our data to follow (roughly) a straight line when plotted on a scatterplot. To an extent, we can check linearity by checking the value of $R^2$ for the model. We can also verify this condition with a plot of the *residuals* vs. the fitted (predicted) values, checking that the residuals show no pattern around the red $y = 0$ line.
2. **Nearly normal residuals:** We want our residual terms to follow a normal distribution.
3. **Constant variability:** We don't want the variability of the residuals to change across observations. The spread of the residuals around the 0 line should be roughly constant (no pattern).

We can check the three requirements above using the `augment` function.

```{r, out.width = "90%", out.height = "90%", fig.align='center'}
model.aug <- augment(model)
#model.aug
```

We will examine these three requirements in the next three exercises.

***

&nbsp;

**Linearity:** We want a lack of pattern in the residuals around the red $y = 0$ line.

&nbsp;

***

## Exercise 8

#### Plot the residuals vs. the fitted (predicted) values. What does this indicate about the linearity of the relationship between the two variables?

&nbsp;

```{r}
# check for linearity and constant variability
ggplot(data = model.aug) +
  aes(x = .fitted, y = .resid) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggtitle("Residuals vs. Fitted") +
  xlab("Fitted Values") +
  ylab("Residuals")
```
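For comparison (not required by the lab), base R can produce a similar residuals-vs-fitted diagnostic directly from the `lm` object:

```{r}
# base R residuals-vs-fitted diagnostic plot for the same model
plot(model, which = 1)
```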
***

&nbsp;

**Nearly normal residuals:** We want our residual terms to follow a normal distribution.

&nbsp;

***

## Exercise 9

#### Make a histogram of the residuals. Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?

&nbsp;

```{r, out.width = "90%", out.height = "90%", fig.align='center'}
ggplot(data = model.aug) +
  aes(x = .resid) +
  geom_histogram(binwidth = 0.4) +
  ggtitle("Frequency of Residuals") +
  xlab("Residuals")
```

***

&nbsp;

**Constant variability:** We check for constant variability using the residuals vs. fitted plot. A lack of pattern in the residuals implies constant variance.

&nbsp;

***

## Exercise 10

#### Refer back to our residuals vs. fitted plot. Does the constant variability condition appear to be violated? Why or why not?

&nbsp;

```{r}
# check for linearity and constant variability
ggplot(data = model.aug) +
  aes(x = .fitted, y = .resid) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggtitle("Residuals vs. Fitted") +
  xlab("Fitted Values") +
  ylab("Residuals")
```
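As an optional extra check of the nearly normal residuals condition from Exercise 9 (not required by the lab), a normal Q-Q plot of the residuals is a common companion to the histogram. A minimal sketch:

```{r}
# optional: normal Q-Q plot of the residuals
ggplot(data = model.aug) +
  aes(sample = .resid) +
  geom_qq() +
  geom_qq_line() +
  ggtitle("Normal Q-Q Plot of Residuals")
```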