Lab 17 MATH 4322
Bagging, Random Forest and Boosting
Spring 2022
• We will apply bagging, random forests and boosting to the Boston data, using the randomForest package.
• Note: The exact results obtained in this lab may depend on the version of R and the version of the randomForest package installed on your computer. Give the results from your computer.
• You can use the R Markdown script given, or write down your answers and scan them as a PDF file to upload in Blackboard, similar to your homework.
• Possible points: 10.
Question 1: For any data set that has p predictors, how many predictors does bagging require us to consider at each split in a tree?
mtry = p (the number of predictors).
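For the ISLR2 Boston data this is easy to check directly (a quick sketch; the variable name p below is ours, and medv is the response, so the remaining 12 columns are predictors):

p <- ncol(ISLR2::Boston) - 1  # 12 predictors, so bagging uses mtry = 12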
First, we call the data and create training/testing sets.
library(ISLR2)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
boston.test = Boston[-train, "medv"]
Bagging
We perform bagging as follows:
library(randomForest)
set.seed(10)
bag.boston = randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = ncol(Boston) - 1, importance = TRUE)
bag.boston
Question 2: What is the MSE based on the training set?
The MSE based on the training set is 11.22857, the average squared distance between the actual and the predicted values. (This is the "Mean of squared residuals" that randomForest() prints, computed on the out-of-bag observations.)
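One way to pull this number out of the fitted object (a sketch; for regression fits, randomForest() stores the running out-of-bag MSE in the mse component):

bag.boston$mse[bag.boston$ntree]  # OOB mean of squared residuals after the last tree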
How well does this bagged model perform on the test set?
Question 3: What is the formula to determine the MSE?
MSE = (1/n) * sum((y_i - yhat_i)^2), the average squared distance between the observed values y_i and the predictions yhat_i.
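In R this is a one-liner; here is a minimal helper sketch (the function name mse is ours, not part of the lab):

mse <- function(actual, predicted) mean((actual - predicted)^2)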
Run the following in R.
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)

[Plot: boston.test versus yhat.bag, with the line y = x from abline(0, 1).]

mean((yhat.bag - boston.test)^2)
Question 4: What is the MSE of the test data set?
mean((yhat.bag - boston.test)^2) = 23.56386.
We could change the number of trees grown by randomForest() using the ntree argument:
bag.boston = randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = ncol(Boston) - 1, ntree = 25)
bag.boston
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)
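To see how the test error behaves as the forest grows, one could loop over a few values of ntree (a sketch; the particular tree counts below are our choice, not part of the lab):

for (nt in c(25, 100, 500)) {
  # refit the bagged model with nt trees and report its test MSE
  fit <- randomForest(medv ~ ., data = Boston, subset = train,
                      mtry = ncol(Boston) - 1, ntree = nt)
  pred <- predict(fit, newdata = Boston[-train, ])
  cat("ntree =", nt, "test MSE =", mean((pred - boston.test)^2), "\n")
}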
Question 5: What method do we use to get the different trees?
The bootstrap method: each tree is grown on a bootstrap sample of the training data.
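As a quick illustration (our own toy example, not part of the lab), a bootstrap sample draws n observations with replacement, so some rows repeat and others are left out entirely:

boot.idx <- sample(train, size = length(train), replace = TRUE)  # one bootstrap sample of the training indices
length(unique(boot.idx)) / length(train)  # roughly 0.63: about 63% of rows appear at least once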
Random Forests
Question 6: For building a random forest of regression trees, what should mtry (the number of predictors to consider at each split) be?
mtry = p/3 for regression and mtry = sqrt(p) for classification.
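For the Boston data this works out as follows (a quick check; p = 12 since medv is the response):

p <- ncol(Boston) - 1  # 12 predictors
p / 3                  # mtry = 4 for this regression forest
sqrt(p)                # about 3.46; usually rounded down for classification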
Type and run the following in R:
set.seed(10)
rf.boston = randomForest(medv ~ ., data = Boston, subset = train,
                         mtry = (ncol(Boston) - 1)/3, importance = TRUE)
yhat.rf = predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)
Question 7: Compare the MSE of the test data to the MSE of the bagging model.
mean((yhat.rf - boston.test)^2) = 19.1759. We get a lower test MSE with the random forest.
Question 8: Using the importance() function, what are the two most important variables?
The two most important variables are lstat and rm.
importance(rf.boston)
##             %IncMSE IncNodePurity
## crim    15.48571304    1197.64717
## zn       3.34978057     169.00931
## indus    6.93488857     870.60348
## chas     0.05746934      61.05778
## nox     12.97835448    1179.66670
## rm      30.67206810    6612.55554
## age     13.52685213     760.41982
## dis     10.94707995     899.17273
## rad      4.60598124     129.80949
## tax      9.20624202     556.89248
## ptratio  6.99867017    1044.02812
## lstat   26.41637352    5483.83696
varImpPlot(rf.boston)

[Plot: variable importance for rf.boston, with variables ranked by %IncMSE (left panel) and IncNodePurity (right panel); rm and lstat rank highest in both panels.]
Boosting
Run the following in R:
library(gbm)
## Loaded gbm 2.1.8
set.seed(1)
boost.boston = gbm(medv ~ ., data = Boston[train, ],
                   distribution = "gaussian", n.trees = 5000,
                   interaction.depth = 4)
summary(boost.boston)
Question 9: What are the two most important variables with the boosted trees?
With the boosted trees, the two most important variables are again rm and lstat. We can produce partial dependence plots for these two variables. The plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables.
plot(boost.boston, i = "rm")

[Plot: partial dependence of medv on rm.]
plot(boost.boston, i = "lstat")

[Plot: partial dependence of medv on lstat.]
Notice that the house prices are increasing with rm and decreasing with lstat.
We will use the boosted model to predict medv on the test set:
yhat.boost = predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
Question 10: Compare this MSE to the MSEs of the random forest and bagging models.
mean((yhat.boost - boston.test)^2) = 18.84709. Here, MSE(boosting) <= MSE(random forest) <= MSE(bagging), so the boosted model has the lowest test MSE.
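A minimal sketch to line the three test MSEs up side by side (note that yhat.bag was last refit with ntree = 25, so recompute the 500-tree bagged predictions first if you want to reproduce the value reported above):

data.frame(model = c("bagging", "random forest", "boosting"),
           test.MSE = c(mean((yhat.bag - boston.test)^2),
                        mean((yhat.rf - boston.test)^2),
                        mean((yhat.boost - boston.test)^2)))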