---
title: "Week 5 Lab"
author: "STAT 311 -- Winter 2024"
date: "Thursday, February 1, 2024"
output: openintro::lab_report
---
# Intro to Linear Regression
We will be considering the Human Freedom Index (HFI) report in Lab 3. Below are the key facts about this report.

* It attempts to summarize the idea of "freedom" through different variables for various countries.
* It serves as a rough objective measure for the relationship between different types of freedom and other social and economic circumstances. The four types of freedom are:
    * Political
    * Religious
    * Economic
    * Personal

In this lab, we will be analyzing data from the HFI reports. Our aim is to graphically and numerically summarize relationships in data to determine which variables can tell us a story about freedom.
***
# Getting Started
We will start by loading the data and necessary packages. In this lab, we're using the `hfi` data set, which is part of the `openintro` package.
```{r load-packages, message = FALSE, warning = FALSE}
# install.packages("statsr")
# install.packages("broom")
library(tidyverse)
library(openintro)
library(statsr)
library(broom)
data(hfi)
```
***
## Exercise 1
#### What are the dimensions of the `hfi` data set?
```{r code-chunk-label}
```
***
## Exercise 2
#### The data set spans a lot of years, but we are only interested in data from year 2016. Filter the `hfi` data frame for year 2016, select the six variables, and assign the result to a data frame named `hfi_2016`.
If we look at the data set in our environment in the panel on the right side of the
RStudio window, *we can see that there is a variable called `year`*. This is what we will condition on.
```{r}
# list of columns we want to include in our data set
columns <- c(
  "year",
  "countries",
  "region",
  "pf_expression_control",
  "hf_score",
  "pf_score"
)
```
```{r}
# create the data frame hfi_2016
hfi_2016 <- hfi %>%
  # keep only the observations from 2016
  filter(year == 2016) %>%
  # remove all columns not listed above
  select(all_of(columns))
```
```{r}
hfi_2016
```
**Note:** In the above code chunk, we used the `select` function, which lets us choose which variables to keep for the operations we chain together with `%>%`. The `all_of` function tells R that we want every variable named in the `columns` vector.
Filtering the data narrowed our data set down to `r dim(hfi_2016)[1]` rows and `r dim(hfi_2016)[2]` columns.
We narrowed the `r dim(hfi)[2]` variables in the data down to six. Below are these variables and their descriptions:
* `year`: the year
* `countries`: name of country
* `region`: region where the country is located
* `pf_expression_control`: political pressures and controls on media content
* `hf_score`: human freedom score
* `pf_score`: personal freedom score
You can find a full list of the variables and their descriptions by entering `?hfi`
in the console.
***
## Exercise 3
#### What type of plot would you use to display the relationship between the personal freedom score, `pf_score`, and `pf_expression_control`? Plot this relationship using the variable `pf_expression_control` as the predictor. Does the relationship look linear? If you knew a country's `pf_expression_control`, its score out of 10 for political pressures and controls on media content (with 0 indicating the most pressure), would you be comfortable using a linear model to predict the personal freedom score?
The `pf_score` variable and the `pf_expression_control` variable are both numeric (quantitative). Therefore, a scatterplot is a good option for visualizing the relationship between these two variables.
When we plot this relationship, we will put `pf_expression_control` on the x-axis because it is the predictor.
```{r, message = FALSE, warning = FALSE, out.width = "90%", out.height = "90%", fig.align='center'}
```
***
If the relationship looks linear, we can quantify the strength and direction of the
linear association using the correlation coefficient. Recall that the correlation takes values between -1 (perfect negative) and +1 (perfect positive). A value of 0 indicates no linear association.
```{r}
# calculate the correlation coefficient r
```
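
As a minimal sketch of what the correlation coefficient measures, the example below computes $r$ for a small made-up data set. The vectors `x` and `y` are hypothetical and are not HFI values. It computes $r$ once with base R's `cor` function and once from the definition, as the sum of products of standardized values divided by $n - 1$.

```r
# Hypothetical toy data, for illustration only (not from the hfi data set)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Correlation via the built-in function
r_builtin <- cor(x, y)

# Same quantity from the definition: sum of products of z-scores over (n - 1)
n <- length(x)
r_manual <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)

# The two values should agree up to floating-point error
all.equal(r_builtin, r_manual)
```

The same `cor` call can be applied to two columns of a data frame. Note that if either column contains missing values, `cor` returns `NA` unless an option such as `use = "complete.obs"` is supplied.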
***
# Sum of Squared Residuals
When we describe the distribution of a single variable, we discuss characteristics such as center, spread, and shape. It's useful to be able to describe the relationship of two numerical variables, such as `pf_expression_control` and `pf_score` above.
***
## Exercise 4
#### Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship, as well as any unusual observations.
**Form:** From the scatterplot, there appears to be a linear relationship between `pf_expression_control` and `pf_score`.

**Direction:**

**Strength:**

***
Remember that, up until this point, we have mostly used mean and standard deviation
to summarize a *single* variable. We can also summarize the relationship between *multiple* variables. If we want to summarize the relationship between two variables, we can find the line that best follows their association. The `plot_ss` function is useful for this.
To see how this function works, follow the below steps.
1. Run the following command in your console: `plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)`
2. Click two points on the plot to define a line.
3. The line you specified will be shown in black and the residuals in blue.

Recall the formula for residuals $e_i$, $e_i = y_i - \hat{y}_i$. Note that the `plot_ss` function should only be run in the console. It will not work if you try to run it in your Markdown document.
The `plot_ss` function returns three values: (1) the slope of the line, (2) the intercept of the line, and (3) the sum of squares.
**Visualizing the Squared Residuals:** Suppose we want to select the line that minimizes the sum of squared residuals, which is given by the equation
$$ SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. $$
**Recall:** $\hat{y}_i$ is the predicted value of $y$ at point $i$, given the regression equation.
We can visualize these squared residuals by rerunning the `plot_ss` command with the added argument `showSquares = TRUE`.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)

The sum of the areas of these squares is the sum of squared residuals.
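
To make the sum of squared residuals concrete, here is a minimal sketch that evaluates it for two candidate lines on made-up data. The helper `ssr` and the vectors `x` and `y` are hypothetical names introduced for this illustration. They are not part of `statsr` or the HFI data.

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for the line y-hat = intercept + slope * x
ssr <- function(intercept, slope, x, y) {
  y_hat <- intercept + slope * x  # predicted values
  e <- y - y_hat                  # residuals e_i = y_i - y-hat_i
  sum(e^2)
}

ssr(0, 2, x, y)    # a line that follows the points closely
ssr(1, 1.5, x, y)  # a poorer line gives a larger sum of squares
```

Trying different intercepts and slopes here mirrors clicking different pairs of points in `plot_ss`: the better the line tracks the data, the smaller the sum.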
***
## Exercise 5
#### Using `plot_ss`, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got?
Run the command below five times in the console, making note of the sum of squares from the output each time you run it.

    plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)

Remember that you will have to select two points on the plot before any output is displayed.
***
# The Linear Model
You may have noticed that the above method is very inefficient if our goal is to get the correct least squares line.
**Note:** The least squares line is the line that minimizes the sum of squared residuals.
The formula of the least squares line is:
$$\hat{y} = \beta_0 + \beta_1x$$
where $\hat{y}$ is the predicted response $y$, and $x$ is the explanatory variable. $\beta_0$ is the intercept, and $\beta_1$ is the slope of the line.
We can use the `lm` (**l**inear **m**odel) function in R to fit a regression line on the data.
```{r}
# fit a model and give it a name
```
This function returns an `lm` object with various pieces of information about the model we have fit.
**Note (1):** Giving the `lm` object a name is a requirement if we want to use the model to create summaries and plots.
**Note (2):**

* The first argument in the `lm` function specifies the variables in our model and which of them is the dependent variable. This argument takes the form `y ~ x`, where $y$ is the response (dependent) variable and $x$ is the explanatory variable / predictor (independent). So $y =$ `pf_score` and $x =$ `pf_expression_control`.
* The second argument indicates the data set that contains the specified variables. In our case, the `pf_score` and `pf_expression_control` variables are contained in both `hfi` and `hfi_2016`. To use the variables we have specified, we must reference one of those two data sets in the `data = ` argument.

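**Note (3):** We cannot pipe a data frame straight into `lm` with `%>%`, because the pipe passes the data as the first argument and the first argument of `lm` is the formula, not the data. For a bare pipe to work, the data must be the first argument of the function. (One workaround is the `.` placeholder, as in `data = .`.)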
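
The sketch below illustrates the `lm(y ~ x, data = ...)` call pattern on a small made-up data frame and checks the fitted coefficients against the standard closed-form least squares estimates (slope $= r \cdot s_y / s_x$, intercept $= \bar{y} - \text{slope} \cdot \bar{x}$). The names `toy` and `toy_model` are hypothetical and the data are not HFI values.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit the least squares line: the formula comes first, the data second
toy_model <- lm(y ~ x, data = toy)
coef(toy_model)

# Closed-form least squares estimates, for comparison
slope <- cor(toy$x, toy$y) * sd(toy$y) / sd(toy$x)
intercept <- mean(toy$y) - slope * mean(toy$x)

# The fitted coefficients should match the closed-form values
all.equal(unname(coef(toy_model)), c(intercept, slope))
```

Storing the fit under a name (here `toy_model`) is what lets later calls such as `summary` refer back to it.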
We can access the information from the linear model using the `summary` function. (The OpenIntro lab online says we should use the `tidy` function, but this is not the correct function.)
The below code will display a numerical summary for the model we fit above.
```{r}
# summary(model)
```
We can use the values in this numerical summary to write our regression equation:
<!-- write regression equation here -->
where $x =$ `pf_expression_control` and $\hat{y} =$ `pf_score`.
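
As a worked illustration of how such an equation is used, suppose the fitted intercept and slope are the values quoted in the interpretation prompts in this section, $4.2838$ and $0.5419$. The input $x = 5$ is a hypothetical value chosen only for the arithmetic. The predicted personal freedom score would then be

$$\hat{y} = 4.2838 + 0.5419 \times 5 = 4.2838 + 2.7095 = 6.9933.$$

In other words, under these assumed coefficients, a country with a `pf_expression_control` score of 5 would be predicted to have a `pf_score` of about 6.99.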
**Interpretation of intercept ($4.2838$):**

**Interpretation of the slope ($0.5419$):**

We can determine how well the model fits the data using the $R^2$ value. $R^2$ *represents the proportion of variability in the response variable that is explained by the explanatory variable.*
We can use the `glance` function to access this information.
```{r}
# glance(model)
```
You may have noticed that there are actually two $R^2$ values in this output: `r.squared` and `adj.r.squared`. The second value, `adj.r.squared`, is the Adjusted $R^2$ value. This is not the $R^2$ value we will be using. We will instead be using the standard $R^2$ value, `r.squared`.
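
To show where the standard $R^2$ comes from, the sketch below computes it by hand on made-up data as one minus the ratio of the residual sum of squares to the total sum of squares, then compares it with the value stored in the model summary. For a model with a single predictor, it also equals the squared correlation. The names `toy`, `toy_model`, `ss_res`, and `ss_tot` are hypothetical.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
toy_model <- lm(y ~ x, data = toy)

# Variability left unexplained by the line, and total variability in y
ss_res <- sum(resid(toy_model)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)

# Proportion of variability in y explained by x
r2_manual <- 1 - ss_res / ss_tot

# Compare with the value stored in the model summary and with r squared
r2_summary <- summary(toy_model)$r.squared
r2_cor <- cor(toy$x, toy$y)^2
all.equal(r2_manual, r2_summary)
all.equal(r2_manual, r2_cor)
```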
**Interpretation of $R^2$:**
***