---
title: "Week 5 Lab"
author: "STAT 311 -- Winter 2024"
date: "Thursday, February 1, 2024"
output: openintro::lab_report
---

# Intro to Linear Regression

We will be considering the Human Freedom Index (HFI) report in Lab 3. Below are the key facts about this report.

* attempts to summarize the idea of "freedom" through different variables for various countries
* serves as a rough objective measure for the relationship between different types of freedom and other social and economic circumstances

The four types of freedom were:

* Political
* Religious
* Economic
* Personal

In this lab, we will be analyzing data from the HFI reports. Our aim is to graphically and numerically summarize relationships in the data to determine which variables can tell us a story about freedom.

&nbsp;

***

# Getting Started

We will start by loading the data and the necessary packages. In this lab, we're using the `hfi` data set, which is part of the `openintro` package.

```{r load-packages, message = FALSE, warning = FALSE}
# install.packages("statsr")
# install.packages("broom")
library(tidyverse)
library(openintro)
library(statsr)
library(broom)

data(hfi)
```

&nbsp;

***

## Exercise 1

#### What are the dimensions of the `hfi` data set?
&nbsp;

```{r code-chunk-label}
# dimensions of the full data set
dim(hfi)
```

&nbsp;

***

## Exercise 2

#### The data set spans a lot of years, but we are only interested in data from the year 2016. Filter the `hfi` data frame for the year 2016, select the six variables, and assign the result to a data frame named `hfi_2016`.

&nbsp;

If we look at the data set in our environment in the panel on the right side of the RStudio window, *we can see that there is a variable called `year`*. This is what we will condition on.

```{r}
# list of columns we want to include in our data set
columns <- c(
  "year",
  "countries",
  "region",
  "pf_expression_control",
  "hf_score",
  "pf_score"
)
```

```{r}
# create the data frame hfi_2016
hfi_2016 <- hfi %>%
  # selecting observations from 2016
  filter(year == 2016) %>%
  # removing all columns not listed above
  select(all_of(columns))
```

```{r}
hfi_2016
```

**Note:** In the above code chunk, we utilized the `select` function, which allows us to choose which variables we want to use for whatever operations we are performing with `%>%`. The `all_of` function tells R that we want to choose all the items in the `columns` vector.

Filtering the data narrowed our data set down to `r dim(hfi_2016)[1]` rows and `r dim(hfi_2016)[2]` columns. We narrowed the `r dim(hfi)[2]` variables in the original data down to six. Below are these variables and their descriptions:

* `year`: the year
* `countries`: name of country
* `region`: region where the country is located
* `pf_expression_control`: political pressures and controls on media content
* `hf_score`: human freedom score
* `pf_score`: personal freedom score

You can find a full list of the variables and their descriptions by entering `?hfi` in the console.

&nbsp;

***

## Exercise 3

#### What type of plot would you use to display the relationship between the personal freedom score, `pf_score`, and `pf_expression_control`? Plot this relationship using the variable `pf_expression_control` as the predictor. Does the relationship look linear? If you knew a country's `pf_expression_control`, or its score out of 10 (with 0 indicating the most political pressure and control on media content), would you be comfortable using a linear model to predict the personal freedom score?

&nbsp;

The `pf_score` variable and the `pf_expression_control` variable are both numeric (quantitative). Therefore, a scatterplot is a good option for visualizing the relationship between these two variables. When we plot this relationship, we will put `pf_expression_control` on the x-axis because it is the predictor.

```{r, message = FALSE, warning = FALSE, out.width = "90%", out.height = "90%", fig.align='center'}
# scatterplot of pf_score against pf_expression_control
ggplot(hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point()
```

***

&nbsp;

If the relationship looks linear, we can quantify the strength and direction of the linear association using the correlation coefficient. Recall that the correlation takes values between -1 (perfect negative) and +1 (perfect positive). A value of 0 indicates no linear association.
```{r}
# calculate correlation coefficient r
# (use = "complete.obs" drops rows with missing values, in case any remain)
hfi_2016 %>%
  summarise(r = cor(pf_expression_control, pf_score, use = "complete.obs"))
```

&nbsp;

***

# Sum of Squared Residuals

When we describe the distribution of a single variable, we discuss characteristics such as center, spread, and shape. It's also useful to be able to describe the relationship of two numerical variables, such as `pf_expression_control` and `pf_score` above.

&nbsp;

***

## Exercise 4

#### Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship, as well as any unusual observations.

&nbsp;

**Form:** From the scatterplot, there appears to be a linear relationship between `pf_expression_control` and `pf_score`.

**Direction:** The relationship is positive: countries with higher `pf_expression_control` scores tend to have higher personal freedom scores.

**Strength:**

***

&nbsp;

Remember that, up until this point, we have mostly used the mean and standard deviation to summarize a *single* variable. We can also summarize the relationship between *multiple* variables. If we want to summarize the relationship between two variables, we can find the line that best follows their association. The `plot_ss` function is useful for this. To see how this function works, follow the steps below.

1. Run the following command in your console.
    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)

2. Click two points on the plot to define a line.

3. The line you specified will be shown in black and the residuals in blue. Recall the formula for the residuals $e_i$: $e_i = y_i - \hat{y}_i$.

Note that the `plot_ss` function should only be run in the console. It will not work if you try to run it in your Markdown document.

The `plot_ss` function returns three values: (1) the slope of the line, (2) the intercept of the line, and (3) the sum of squares.

&nbsp;

**Visualizing the Squared Residuals:**

Suppose we want to select the line that minimizes the sum of squared residuals, which is given by the equation

$$ SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. $$

**Recall:** $\hat{y}_i$ is the predicted value of $y$ at point $i$, given the regression equation.

We can visualize these squared residuals by rerunning the `plot_ss` command with the added argument `showSquares = TRUE`.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)

Summing the areas of the boxes gives the sum of squared residuals.

&nbsp;

***

## Exercise 5

#### Using `plot_ss`, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got?

&nbsp;

Run the below command 5 times in the console, making note of the sum of squares from the output each time you run the command.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)

Remember that you will have to select two points on the plot before any output will display.

***

&nbsp;
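The sum of squared residuals defined above can also be computed directly, without `plot_ss`. Below is a minimal sketch on made-up numbers; the data values and the candidate intercept and slope are illustrative only, not taken from `hfi_2016`:

```{r}
# toy data (illustrative values only, not from hfi_2016)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# a candidate line chosen by eye: intercept 0, slope 2
y_hat <- 0 + 2 * x

# residuals e_i = y_i - y_hat_i, and their sum of squares
e <- y - y_hat
ssr <- sum(e^2)
ssr
```

A smaller `ssr` means the candidate line follows the points more closely; the least squares line is, by definition, the line that makes this quantity as small as possible.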
# The Linear Model

You may have noticed that the above method is very inefficient if our goal is to get the correct least squares line.

**Note:** The least squares line is the line that minimizes the sum of squared residuals. The formula of the least squares line is:

$$\hat{y} = \beta_0 + \beta_1 x$$

where $\hat{y}$ is the predicted response $y$, and $x$ is the explanatory variable. $\beta_0$ is the intercept, and $\beta_1$ is the slope of the line.

We can use the `lm` (**l**inear **m**odel) function in R to fit a regression line to the data.

```{r}
# fit a model and give it a name (here we call it "model")
model <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
```

This function returns an `lm` object with various pieces of information about the model we have fit.

**Note (1):** Giving the `lm` object a name is a requirement if we want to use the model to create summaries and plots.

**Note (2):**

* The first argument to the `lm` function is used to specify the variables in our model, while also specifying which variable is the dependent variable. This argument takes the form `y ~ x`, where $y$ is the response variable (dependent) and $x$ is the explanatory variable / predictor (independent). So $y =$ `pf_score` and $x =$ `pf_expression_control`.
* The second argument indicates the data set that contains the specified variables. In our case, the `pf_score` and `pf_expression_control` variables are contained in both `hfi` and `hfi_2016`. To use the variables we have specified, we must reference one of those two data sets in the `data = ` argument.

**Note (3):** We cannot pipe the data directly into `lm` because the data is not the first argument. In order for `%>%` to work by default, the data must be the first argument of the function.

We can access the information from the linear model using the `summary` function. (The OpenIntro lab online says we should use the `tidy` function, but this is not the correct function.) The below code will display a numerical summary of the model we fit above.
```{r}
# summary(model)
```

We can use the values in this numerical summary to write our regression equation:
$$\hat{y} = 4.2838 + 0.5419x$$

where $x =$ `pf_expression_control` and $\hat{y} =$ `pf_score`.

**Interpretation of intercept ($4.2838$):** For a country with a `pf_expression_control` score of 0, the model predicts a personal freedom score of about 4.2838.

**Interpretation of the slope ($0.5419$):** For each additional point of `pf_expression_control`, the model predicts an increase of about 0.5419 in the personal freedom score, on average.

We can determine how well the model fits the data using the $R^2$ value. $R^2$ *represents the proportion of variability in the response variable that is explained by the explanatory variable.* We can use the `glance` function to access this information.

```{r}
# glance(model)
```

You may have noticed that there are actually two $R^2$ values in this output: `r.squared` and `adj.r.squared`. The second value, `adj.r.squared`, is the Adjusted $R^2$ value. This is not the $R^2$ value we will be using. We will instead be using the standard $R^2$ value, `r.squared`.

**Interpretation of $R^2$:**

&nbsp;

***
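One way to connect `r.squared` back to the correlation coefficient from earlier: for simple linear regression (a single predictor), $R^2$ is exactly the square of $r$. A quick sketch on toy data (the values are illustrative only, not from `hfi_2016`):

```{r}
# toy data (illustrative only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.3, 2.9, 4.2, 4.8, 6.1)

fit <- lm(y ~ x)

# R^2 from the model matches the squared correlation
r_squared <- summary(fit)$r.squared
all.equal(r_squared, cor(x, y)^2)
```

This equivalence only holds for simple (one-predictor) regression; with multiple predictors, $R^2$ is no longer the square of a single pairwise correlation.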