Lab05_Simple_Regression (docx, 8 pages)
Truman State University · Statistics 210 · uploaded by PresidentKnowledge8934, Feb 20, 2024
You can copy-and-paste all of the text from this document into a blank RMarkdown file and be ready to start. Remember to change the author to your name, and submit your homework as a .pdf file (as needed, knit to .html first, then print to a .pdf).
---
title: 'RLab 5 -- Regression and Correlation'
author: "(you)"
date: "March 01, 2022"
output: pdf_document
---
# Introduction
This lab will introduce you to
- Looking at a bunch of scatterplots to think about linear regression
- Using `pairs`, without `ggplot` or `dplyr`
- Using `ggplot` and `dplyr` to make panels of scatterplots
- Using the `lm` command to make simple linear regression models
- Understanding the output
- Thinking about data science v. statistics interpretations of linear models
We'll be using the `Salaries` data set that's part of the `carData` package,
so the first thing to do will be to load the `carData` package using the
`library` command. We'll also need tidyverse. The `head` command will remind
you of the data set.
```{r}
library(carData)
library(tidyverse)
head(Salaries)
```
# Scatterplots to Explore Data Sets
## Scatterplots with the `pairs` Command
Suppose you've never seen a particular data set before, and you'd like to quickly get a sense for which variables are related. The `pairs` command helps you do that by creating a _grid_ of scatter plots where each variable in a data frame is plotted against each other variable. The `pairs` command is part of base R and not part of `ggplot`. Its syntax is simple:
`pairs(your data frame)`
Let's look at the `Salaries` data frame:
```{r}
pairs(Salaries)
```
Tracing vertically and horizontally from a particular graph to the variable names on the diagonal tells you which variables are on the vertical and horizontal axes of the graphs themselves. You can see some graphs are more useful than others, depending on what
kind of variables you started with.
If there are variables you know you don't want to graph, you could use the tidyverse `select` command to create a new data frame with only those variables before you use `pairs`.
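As a quick illustration (a sketch, not part of the assignment), here is one way to drop the unwanted columns first. Base-R column selection is used below; the tidyverse `select` verb would accomplish the same subsetting. The column names come from `head(Salaries)` above.

```r
# Keep only the numeric columns, then hand the smaller data frame to pairs
numeric_cols <- c("yrs.since.phd", "yrs.service", "salary")
pairs(Salaries[, numeric_cols])
```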
### Question 1 (25 points)
a) In a sentence, describe the relationship between `yrs.service` and `yrs.since.phd`. (5 points)
b) Why do you think there is a sharp edge on one side of the scatter plot of `yrs.service` and `yrs.since.phd`? (5 points)
c) Which variable seems to have a stronger relationship (graphically) to `salary`:
`yrs.service` or `yrs.since.phd`? Explain in a sentence. (5 points)
d) Which graph appears to be least helpful? What variables are being graphed in that graph, and what kind of variables are they? (10 points)
**Type Answers Below**
## Scatterplots with `ggplot`
We've seen how to do something similar to what the `pairs` command does using `ggplot` and the `pivot_longer` command. If we know we always want the same "y" variable, but different "x" variables, we can put all the variables we want
as "x" and then use faceting to create several graphs. In this case, let's
make scatter plots of `salary` versus `yrs.since.phd` and `yrs.service`, then box
plots of `salary` versus `rank`, `discipline` and `sex`.
```{r}
Salaries %>%
pivot_longer(c(yrs.since.phd, yrs.service), names_to="xvar", values_to="xvalue") %>%
ggplot() +
geom_point(mapping=aes(x=xvalue, y=salary)) +
facet_wrap(facets="xvar")
```
```{r}
Salaries %>%
pivot_longer(c(rank, discipline, sex),names_to="xvar", values_to="xvalue") %>%
ggplot() +
geom_boxplot(mapping=aes(x=xvalue, y=salary)) +
facet_wrap(facets="xvar")
```
This last one doesn't work so well with factor variables, since each panel's
x-axis includes the levels of _all_ the factor variables, but we can still get the idea.
It might be better, however, to make individual graphs.
Let's focus just on `yrs.since.phd` and `salary` right now. We can use
`stat_smooth` to add a line of best fit, as was discussed in the lecture. The last
option, `method="lm"`, ensures that we get a line and not a curve.
```{r}
Salaries %>%
ggplot() + geom_point(mapping=aes(x=yrs.since.phd, y=salary)) +
stat_smooth(mapping=aes(x=yrs.since.phd, y=salary), method="lm")
```
It seems that salary is related to yrs.since.phd (and to yrs.service). Is that
relationship different for male and female faculty members? We can add a
`color` element to the mapping in `geom_point` and `stat_smooth` to find out:
```{r}
Salaries %>% ggplot() + geom_point(mapping=aes(x=yrs.since.phd, y=salary, color=sex)) +
stat_smooth(mapping=aes(x=yrs.since.phd, y=salary, color=sex), method="lm")
```
### Question 2 (20 points)
You might note in the last plot that there are no women in the oldest segment of faculty: no female faculty member received her PhD more than 40 years ago. Modify the code from the last code chunk to limit the graph to only those faculty members for whom it has been less than 40 years since they received their PhD. Then answer these two questions: (a) After accounting for years since PhD, does it appear that there is a difference in pay between male and female faculty members? (10 points) (b) Do the gray
uncertainty regions for the two lines overlap, and what does that tell you about the statistical significance of any male/female difference in salaries? (10 points)
**Code and answer below:**
```{r}
```
# The Linear Correlation Coefficient
You should remember from Basic Statistics that the _linear correlation
coefficient_ (denoted by _r_) is a statistic that ranges from -1 to 1 and
gives information on how two numeric variables are linearly related. Pictures
speak louder than words:
```{r}
# You don't have to pay too much attention to this code--it just creates the
# pictures you should look at (in the knit document).
df = data.frame(x=c(-2:2, rnorm(30), -2:2), y=c(2:(-2), rnorm(30), -2:2),
r = c(rep("r=-1", 5), rep("r=0", 30), rep("r=1", 5)))
ggplot(data=df) +
geom_point(mapping=aes(x=x, y=y)) + facet_wrap(facets="r")
```
The `cor` function measures the linear correlation between two numeric variables. It has
the form
`cor(var1, var2)`
Here are two ways to use the `cor` command to calculate the linear correlation
between `salary` and `yrs.since.phd`. Note that the output is a bit different,
depending on whether you give two vectors or one data frame, but you get the
same information.
```{r}
cor(Salaries$salary, Salaries$yrs.since.phd)
Salaries %>% select(salary, yrs.since.phd) %>% cor()
```
You can actually calculate the linear correlation between as many _numeric_
variables as you like. You'll get a table of correlations.
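As a sanity check on the endpoints of the scale, `cor` can be run on made-up data where the answer is known in advance (a sketch, not part of the assignment):

```r
x <- 1:10
cor(x, 2 * x + 3)   # perfect increasing linear relationship: r = 1
cor(x, -x)          # perfect decreasing linear relationship: r = -1
```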
### Question 3 (25 points)
Alter the code from the last code block so that the `select` command selects _all three_ numeric variables. Run the code. Then comment on this question: In the graphing section, you gave your opinion about whether it looked like salaries were more strongly related to `yrs.since.phd` or `yrs.service`. Do the linear correlation coefficients agree with the conclusion you made "by eye" from the graphs? Briefly explain. **Answer below**
```{r}
```
# Regression
## The `lm` command
The `lm` command (short for linear model) creates a predictive model that uses
one or more explanatory variables to predict one response variable. We'll
first look at the simplest case where there is one explanatory variable and
one response variable. In this case, the `lm` command has the format:
`lm(response_var ~ explanatory_var, data=data_frame)`
where you replace each of those with your actual variable names and data frame
name. The formula `response_var ~ explanatory_var` is a shorthand for saying
that you want the family of models where `response_var = b0 + b1 *
explanatory_var`.
Let's look at the model of `salary` as a function of `yrs.since.phd`. First
we'll use the `lm` command to create the model, then we'll see what the result
looks like.
```{r}
salary_mod <- lm(salary ~ yrs.since.phd, data=Salaries)
summary(salary_mod)
coef(salary_mod)
```
You can see that the `lm` command creates a lot of information that we would
be very interested in if this were a linear regression class. However, the
`coef` command pulls out just the coefficients of the fitted model. R is
telling us that the best linear model of `salary` as a function of
`yrs.since.phd` is `salary = 91718.6854 + 985.3421 * yrs.since.phd`.
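To see what that fitted equation means in practice, the coefficients can be plugged in by hand. This is just a sketch using the rounded values printed above; `predict(salary_mod, ...)` would do the same thing at full precision.

```r
# Predicted salary for someone 10 years past the PhD, using the
# intercept and slope reported by coef(salary_mod)
b0 <- 91718.6854   # intercept
b1 <- 985.3421     # slope (dollars per additional year since PhD)
b0 + b1 * 10       # roughly 101572
```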
## Visualizing Predictions and Residuals
Each predictive model we might consider has its own set of details (that we
would be interested in if this were a different sort of class). However, one
way that we can think about _all_ predictive models in a unified way is to
consider their _predictions_ and _residuals_. Our collection of tidyverse
packages contains the package `modelr` which provides a nice way to visualize
predictions and residuals using familiar notation. Modelr might not be loaded when you load the general tidyverse packages, so
let's load it:
```{r}
# install.packages("modelr", repos='https://mirror.las.iastate.edu/CRAN/') # if needed
library(modelr)
```
The two new commands we care about are
`add_predictions(model_name)` and `add_residuals(model_name)`
They can be added into a set of piped commands, and they add a new column to a
data frame, either containing the model's predictions or its residuals. They
work well when you're trying to visualize your model.
The first code block just shows you how `add_predictions` works. Note the new
`pred` column. **Note that I would have to use the `lm` command to create
`salary_mod` if I hadn't already done so!**
```{r}
head(Salaries %>% add_predictions(salary_mod))
```
The next code block adds the line of best fit to the scatter plot of `salary`
versus `yrs.since.phd`. We've already done this using the `stat_smooth`
command, but the advantage of _this_ method is that it works for _any_
predictive model we might choose.
```{r}
Salaries %>% add_predictions(salary_mod) %>%
ggplot() +
geom_point(mapping=aes(x=yrs.since.phd, y=salary)) + geom_line(mapping=aes(x=yrs.since.phd, y=pred), color="red")
```
Based on this graph, it looks like the simple linear regression line may not
capture everything happening in the data set. Large values of `yrs.since.phd`
seem to actually have _lower_ salaries.
One way to assess how well your model is doing is to graph its _residuals_.
Remember that the residual is the difference between the _actual_ response
variable and the _predicted_ value of the response variable. **If the model fits the data well, residuals should show no pattern and be
vertically centered symmetrically around 0.**
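A single residual can be computed by hand to make the definition concrete. This is a sketch with a hypothetical faculty member, reusing the rounded model coefficients from earlier:

```r
# residual = actual - predicted
# Hypothetical example: a faculty member 19 years past the PhD
# earning $139,750, under the fitted line from coef(salary_mod)
actual    <- 139750
predicted <- 91718.6854 + 985.3421 * 19
actual - predicted   # positive: this person earns more than the model predicts
```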
Let's use the `add_residuals` command to visualize residuals in our case. The
new column added is called `resid`.
```{r}
Salaries %>% add_residuals(salary_mod) %>%
ggplot() +
geom_point(mapping = aes(x=yrs.since.phd, y=resid))
```
There is still a significant pattern in residuals! That means that our simple
linear model is not a great fit to the data.
## Using `lm` with Categorical (Factor) Variables
The `lm` command can also be used with categorical predictor variables. For
example, if I wanted to predict `salary` based on `rank`, I could do that:
```{r}
salary_mod2 <- lm(salary ~ rank, data=Salaries)
coef(salary_mod2)
```
These coefficients have the following meaning:
- Average base salary (salary for Assistant Professor) = 80775.99
- Average for Associate Professor = 80775.99 + 13100.45 = 93876.44
- Average for Full Professor = 80775.99 + 45996.12 = 126772.11
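The arithmetic above can be checked directly from the printed coefficients (a sketch using the rounded values): with dummy coding, the intercept is the baseline group's mean and the other coefficients are offsets from it.

```r
# Dummy-coded coefficients, as reported by coef(salary_mod2)
baseline   <- 80775.99   # intercept = AsstProf mean
assoc_bump <- 13100.45   # offset for AssocProf
prof_bump  <- 45996.12   # offset for Prof
baseline + assoc_bump    # AssocProf mean, about 93876
baseline + prof_bump     # Prof mean, about 126772
```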
We could graph them by putting a big red dot on the predicted salary for each
group.
```{r}
Salaries %>% add_predictions(salary_mod2) %>%
ggplot() +
geom_point(mapping = aes(x=rank, y=salary)) +
geom_point(mapping = aes(x=rank, y=pred), color="red", size=3)
```
### Question 4 (30 points)
Repeat the commands that created a linear model, showed coefficients, graphed predictions and graphed residuals, but this time to model `salary` as a function of `yrs.service`. Briefly explain whether the linear fit is a good fit in this example, referring to specific aspects of your output. **Answers and code below.**
```{r}
```