Worksheet 9 - Regression Continued
Lecture and Tutorial Learning Goals:
By the end of the week, you will be able to:
- Recognize situations where a simple regression analysis would be appropriate for making predictions.
- Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
- Interpret the output of a k-nn regression.
- In a dataset with two variables, perform k-nearest neighbour regression in R using tidymodels to predict the values for a test dataset.
- Execute cross-validation in R to choose the number of neighbours.
- Evaluate k-nn regression prediction accuracy in R using a test data set and an appropriate metric (e.g., root mean square prediction error).
- In a dataset with > 2 variables, perform k-nn regression in R using tidymodels to predict the values for a test dataset.
- In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
- Describe advantages and disadvantages of the k-nearest neighbour regression approach.
- Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
- Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.
This worksheet covers parts of the Regression II chapter of the online textbook. You should read this chapter before attempting the worksheet.
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)
source("tests.R")
source('cleanup.R')
Warm-up Questions
Here are some warm-up questions on the topic of multiple regression to get you
thinking before we jump into data analysis. The course readings should help you
answer these.
Question 1.0
Multiple Choice:
{points: 1}
In multivariate k-nn regression with one outcome/target variable and two
predictor variables, the predictions take the form of what shape?
A. a flat plane
B. a wiggly/flexible plane
C. a straight line
D. a wiggly/flexible line
E. a 4D hyperplane
F. a 4D wiggly/flexible hyperplane
Save the letter of the answer you think is correct to a variable named answer1.0. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer1.0 <- "B"
### END SOLUTION
test_1.0()
Question 1.1
Multiple Choice:
{points: 1}
In simple linear regression with one outcome/target variable and one predictor
variable, the predictions take the form of what shape?
A. a flat plane
B. a wiggly/flexible plane
C. a straight line
D. a wiggly/flexible line
E. a 4D hyperplane
F. a 4D wiggly/flexible hyperplane
Save the letter of the answer you think is correct to a variable named answer1.1. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer1.1 <- "C"
### END SOLUTION
test_1.1()
Question 1.2
Multiple Choice:
{points: 1}
In multiple linear regression with one outcome/target variable and two predictor
variables, the predictions take the form of what shape?
A. a flat plane
B. a wiggly/flexible plane
C. a straight line
D. a wiggly/flexible line
E. a 4D hyperplane
F. a 4D wiggly/flexible hyperplane
Save the letter of the answer you think is correct to a variable named answer1.2. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer1.2 <- "A"
### END SOLUTION
test_1.2()
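To build intuition for these shapes, here is a small illustrative sketch (the data here is made up for this example and is not part of the worksheet): a linear regression with two predictors produces exactly three coefficients, so every prediction lies on the flat plane they define.
# Illustrative sketch with made-up data: predictions from a two-predictor
# linear model satisfy pred = b0 + b1 * x1 + b2 * x2, i.e. a flat plane.
set.seed(1)
demo <- tibble(x1 = runif(50), x2 = runif(50)) |>
    mutate(y = 2 + 3 * x1 - 1 * x2 + rnorm(50, sd = 0.1))
demo_fit <- lm(y ~ x1 + x2, data = demo)
coef(demo_fit)  # an intercept plus one slope per predictor; nothing "wiggly"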
Understanding Simple Linear Regression
Consider this small and simple dataset:
simple_data <- tibble(X = c(1, 2, 3, 6, 7, 7),
                      Y = c(1, 1, 3, 5, 7, 6))

options(repr.plot.width = 5, repr.plot.height = 5)
base <- ggplot(simple_data, aes(x = X, y = Y)) +
    geom_point(size = 2) +
    scale_x_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    scale_y_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    theme(text = element_text(size = 20))
base
Now consider these three potential lines we could fit for the same dataset:
options(repr.plot.height = 3.5, repr.plot.width = 10)
line_a <- base +
    ggtitle("Line A") +
    geom_abline(intercept = -0.897, slope = 0.9834, color = "blue") +
    theme(text = element_text(size = 20))
line_b <- base +
    ggtitle("Line B") +
    geom_abline(intercept = 0.1022, slope = 0.9804, color = "purple") +
    theme(text = element_text(size = 20))
line_c <- base +
    ggtitle("Line C") +
    geom_abline(intercept = -0.2347, slope = 0.9164, color = "green") +
    theme(text = element_text(size = 20))
plot_grid(line_a, line_b, line_c, ncol = 3)
Question 2.0
{points: 1}
Use the graph below titled "Line A" to roughly calculate the average squared vertical distance between the points and the blue line. Read values of the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.0.
We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.
# run this code
options(repr.plot.width = 9, repr.plot.height = 9)
line_a
### BEGIN SOLUTION
answer2.0 <- ((0 - 1)^2 + (1 - 1)^2 + (2 - 3)^2 + (5 - 5)^2 + (6 - 6)^2 + (6 - 7)^2) / 6
### END SOLUTION
answer2.0
test_2.0()
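As a sanity check (not required by the worksheet), you can also compute this quantity exactly from Line A's stated intercept and slope instead of reading points off the graph; your eyeballed answer above should land close to this value.
# Exact average squared vertical distance for Line A,
# using its equation rather than graph-reading.
line_a_check <- mean(((-0.897 + 0.9834 * simple_data$X) - simple_data$Y)^2)
line_a_check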
Question 2.1
{points: 1}
Use the graph titled "Line B" to roughly calculate the average squared vertical distance between the points and the purple line. Read values of the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.1.
We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.
options(repr.plot.width = 9, repr.plot.height = 9)
line_b
### BEGIN SOLUTION
answer2.1 <- ((1 - 1)^2 + (2 - 1)^2 + (3 - 3)^2 + (6 - 5)^2 + (7 - 7)^2 + (6 - 7)^2) / 6
### END SOLUTION
answer2.1
test_2.1()
Question 2.2
{points: 1}
Use the graph titled "Line C" to roughly calculate the average squared vertical distance between the points and the green line. Read values of the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.2.
We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.
options(repr.plot.width = 9, repr.plot.height = 9)
line_c
### BEGIN SOLUTION
answer2.2 <- ((0.75 - 1)^2 + (1.5 - 1)^2 + (2.5 - 3)^2 + (5.25 - 5)^2 + (6.25 - 7)^2 + (6.25 - 6)^2) / 6
### END SOLUTION
answer2.2
test_2.2()
Question 2.3
{points: 1}
Based on your calculations above, which line would linear regression by ordinary least squares choose given our small and simple dataset? Line A, B or C? Assign the letter that corresponds to the line to a variable named answer2.3. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer2.3 <- "C"
### END SOLUTION
test_2.3()
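If you want to verify this with R rather than by hand (a quick check, not part of the graded answer), base R's lm() fits ordinary least squares directly; its coefficients should be closest to the candidate line with the smallest average squared vertical distance.
# Fit OLS on the toy data and compare to the three candidate lines.
ols_check <- lm(Y ~ X, data = simple_data)
coef(ols_check)  # compare with the (intercept, slope) pairs of Lines A, B, C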
Marathon Training Revisited with Linear Regression!
Source: https://media.giphy.com/media/BDagLpxFIm3SM/giphy.gif
Remember our question from last week: what features predict whether athletes will perform better than others? Specifically, we are interested in marathon runners, and we are looking at how the maximum distance ran per week during training predicts the time it takes a runner to finish the race.
This time around, however, we will analyze the data using simple linear regression rather than k-nn regression. In the end, we will compare our results to what we found last week with k-nn regression.
Question 3.0
{points: 1}
Load the marathon data and assign it to an object called marathon.
### BEGIN SOLUTION
marathon <- read_csv('data/marathon.csv')
### END SOLUTION
marathon
test_3.0()
Question 3.1
{points: 1}
Similar to what we have done for the last few weeks, we will first split the dataset into the training and testing datasets, using 75% of the original data as the training data. Remember, we will be putting the test dataset away in a 'lock box' that we will come back to later, after we choose our final model. In the strata argument of the initial_split function, place the variable we are trying to predict. Assign your split dataset to an object named marathon_split.
Assign your training dataset to an object named marathon_training and your testing dataset to an object named marathon_testing.
set.seed(2000) # DO NOT CHANGE THIS
### BEGIN SOLUTION
marathon_split <- initial_split(marathon, prop = 0.75, strata = time_hrs)
marathon_training <- training(marathon_split)
marathon_testing <- testing(marathon_split)
### END SOLUTION
test_3.1()
Question 3.2
{points: 1}
Using only the observations in the training dataset, create a scatterplot to assess the relationship between race time (time_hrs) and maximum distance ran per week during training (max). Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called marathon_eda. Remember to do whatever is necessary to make this an effective visualization.
options(repr.plot.height = 8, repr.plot.width = 7)
### BEGIN SOLUTION
marathon_eda <- marathon_training |>
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.25, size = 2) +
    xlab("Maximum Distance Ran per \n Week During Training (miles)") +
    ylab("Race Time (hours)") +
    theme(text = element_text(size = 20))
### END SOLUTION
marathon_eda
test_3.2()
Question 3.3
{points: 1}
Now that we have our training data, the next step is to build a linear regression model specification. Thankfully, building other model specifications is quite straightforward since we will still go through the same procedure (indicate the function, the engine and the mode).
Instead of using the nearest_neighbor function, we will be using the linear_reg function to let tidymodels know we want to perform a linear regression. In the set_engine function, we have typically set "kknn" there for k-nn. Since we are doing a linear regression here, set "lm" as the engine. Finally, instead of setting "classification" as the mode, set "regression" as the mode.
Assign your answer to an object named lm_spec.
### BEGIN SOLUTION
lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")
lm_spec
### END SOLUTION
test_3.3()
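For contrast, here is last week's k-nn specification written in the same three-step pattern (a sketch; the neighbors value of 5 is just an illustrative choice, not prescribed by this worksheet). Only the model function, engine and tuning arguments change:
# k-nn regression spec, for comparison with lm_spec above.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("regression")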
Question 3.3.1
{points: 1}
After we have created our linear regression model specification, the next step is to create a recipe, establish a workflow analysis and fit our simple linear regression model.
First, create a recipe with the variables of interest (race time and max weekly training distance) using the training dataset and assign your answer to an object named lm_recipe.
Then, create a workflow analysis with our model specification and recipe. Remember to fit in the training dataset as well. Assign your answer to an object named lm_fit.
#... <- recipe(... ~ ..., data = ...)
#... <- workflow() |>
# add_recipe(...) |>
# add_model(...) |>
# fit(...)
### BEGIN SOLUTION
lm_recipe <- recipe(time_hrs ~ max, data = marathon_training)

lm_fit <- workflow() |>
    add_recipe(lm_recipe) |>
    add_model(lm_spec) |>
    fit(data = marathon_training)
### END SOLUTION
lm_fit
test_3.3.1()
Question 3.4
{points: 1}
Now, let's visualize the model predictions as a straight line overlaid on the training data. Use the predict and bind_cols functions on lm_fit to create predictions for the marathon_training data. Name the resulting data frame marathon_preds.
Next, create a scatterplot with the marathon time (y-axis) against the maximum distance run per week (x-axis) from marathon_preds. Use an alpha value of 0.4 to avoid overplotting. Plot the predictions as a blue line over the data points. Assign your plot to a variable called lm_predictions. Remember the fundamentals of effective visualizations such as having human-readable axis titles.
options(repr.plot.width = 8, repr.plot.height = 7)

# marathon_preds <- ... |>
#    predict(...) |>
#    bind_cols(...)
#
# lm_predictions <- marathon_preds |>
#    ...(aes(x = ..., y = ...)) +
#    geom_point(... = 0.4) +
#    geom_line(
#        mapping = aes(x = ..., y = ...),
#        color = "blue") +
#    xlab("...") +
#    ylab("...") +
#    theme(text = ...(size = 20))
### BEGIN SOLUTION
marathon_preds <- lm_fit |>
    predict(marathon_training) |>
    bind_cols(marathon_training)

lm_predictions <- marathon_preds |>
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.4) +
    geom_line(mapping = aes(x = max, y = .pred), color = "blue") +
    xlab("Maximum Distance Ran per \n Week During Training (mi)") +
    ylab("Race Time (hours)") +
    theme(text = element_text(size = 20))
### END SOLUTION
lm_predictions
test_3.4()
Question 3.5
{points: 1}
Great! We can now see the line of best fit on the graph. Now let's calculate the RMSPE using the test data. To get to this point, first, use the lm_fit to make predictions on the test data. Remember to bind the appropriate columns for the test data. Afterwards, collect the metrics and store them in an object called lm_test_results.
From lm_test_results, extract the RMSPE and return a single numerical value. Assign your answer to an object named lm_rmspe.
#... <- lm_fit |>
#    predict(...) |>
#    bind_cols(...) |>
#    metrics(truth = ..., estimate = ...)

#... <- lm_test_results |>
#    filter(...) |>
#    select(...) |>
#    ...
### BEGIN SOLUTION
lm_test_results <- lm_fit |>
    predict(marathon_testing) |>
    bind_cols(marathon_testing) |>
    metrics(truth = time_hrs, estimate = .pred)

lm_rmspe <- lm_test_results |>
    filter(.metric == 'rmse') |>
    select(.estimate) |>
    pull()
### END SOLUTION
lm_rmspe
test_3.5()
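An aside: since we only need the RMSPE here, yardstick's rmse() metric function (loaded with tidymodels) can compute it in one step. This is an alternative sketch, not the pattern the worksheet tests expect.
# Same number as lm_rmspe, computed with the single-metric function.
lm_fit |>
    predict(marathon_testing) |>
    bind_cols(marathon_testing) |>
    rmse(truth = time_hrs, estimate = .pred)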
Question 3.5.1
{points: 1}
Now, let's visualize the model predictions as a straight line overlaid on the test data. First, create a scatterplot to assess the relationship between race time (time_hrs) and maximum distance ran per week during training (max) on the testing data. Use an alpha value of 0.4 to avoid overplotting. Then add a line to the plot corresponding to the predictions from the fit linear regression model. Remember to do whatever is necessary to make this an effective visualization.
Assign the plot to an object called lm_predictions_test.
options(repr.plot.width = 8, repr.plot.height = 7)

# test_preds <- ...
# lm_predictions_test <- ...
### BEGIN SOLUTION
test_preds <- lm_fit |>
    predict(marathon_testing) |>
    bind_cols(marathon_testing)

lm_predictions_test <- test_preds |>
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.4) +
    geom_line(mapping = aes(x = max, y = .pred), color = "blue") +
    xlab("Maximum Distance Ran per \n Week During Training (mi)") +
    ylab("Race Time (hours)") +
    theme(text = element_text(size = 20))
### END SOLUTION
lm_predictions_test
test_3.5.1()
Question 3.6
{points: 1}
Compare the test RMSPE of k-nn regression (0.606 from last worksheet) to that of simple linear regression; which is greater?
A. k-nn regression has a greater RMSPE
B. Simple linear regression has a greater RMSPE
C. Neither, they are identical
Save the letter of the answer you think is correct to a variable named answer3.6. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer3.6 <- "A"
### END SOLUTION
test_3.6()
Question 3.7
Multiple Choice:
{points: 1}
Which model does a better job of predicting on the test dataset?
A. k-nn regression
B. Simple linear regression
C. Neither, they are identical
Save the letter of the answer you think is correct to a variable named answer3.7. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer3.7 <- "B"
### END SOLUTION
test_3.7()
Given that the linear regression model is a straight line, we can write our model as a mathematical equation. We can get the two numbers we need for this from the coefficients, (Intercept) and max.
# run this cell
lm_fit
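If you prefer the coefficients as a tidy data frame rather than reading them off the printout, one optional sketch (using workflows' extract_fit_parsnip() and broom's tidy(), both loaded with tidymodels) is:
# Pull the fitted intercept and slope out of the workflow object.
lm_fit |>
    extract_fit_parsnip() |>
    tidy()  # rows: (Intercept) and max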
Question 3.8.1
Multiple Choice:
{points: 1}
Which of the following mathematical equations represents the model based on the numbers output in the cell above?
A. …
B. …
C. …
D. …
Save the letter of the answer you think is correct to a variable named answer3.8.1. Make sure you put quotations around the letter and pay attention to case.
### BEGIN SOLUTION
answer3.8.1 <- "A"
### END SOLUTION
test_3.8.1()

source('cleanup.R')