Lab 17 MATH 4322: Bagging, Random Forest and Boosting (Spring 2022)

We will apply bagging, random forests, and boosting to the Boston data, using the randomForest package. Note: the exact results obtained in this lab may depend on the versions of R and of the randomForest package installed on your computer. Give the results from your computer. You can use the R Markdown script given, or write down your answers and scan them as a PDF file to upload in Blackboard, similar to your homework. Possible points: 10.

Question 1: For any data set that has p predictors, how many predictors does bagging require us to consider at each split in a tree?

mtry = p, the full set of predictors.

First, we call the data and create the training/testing sets.

```r
library(ISLR2)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
boston.test = Boston[-train, "medv"]
```

Bagging

We perform bagging as follows:

```r
library(randomForest)
set.seed(10)
bag.boston = randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = ncol(Boston) - 1, importance = TRUE)
bag.boston
```

Question 2: What is the MSE based on the training set?

The MSE based on the training set is 11.22857, the average squared distance between the actual and the predicted values. (The randomForest printout reports this as the "Mean of squared residuals"; it is computed from the out-of-bag predictions.)
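As a quick check, this error can also be read directly from the fitted object: for regression forests, randomForest stores the running out-of-bag MSE (one value per tree grown) in the mse component. A minimal sketch, assuming the bag.boston fit from above:

```r
# The last entry of bag.boston$mse matches the "Mean of squared
# residuals" shown in the printout above.
tail(bag.boston$mse, 1)

# The running OOB error also shows how the ensemble stabilizes as more
# trees are added.
plot(bag.boston$mse, type = "l",
     xlab = "Number of trees", ylab = "OOB MSE")
```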
How well does this bagged model perform on the test set?

Question 3: What is the formula to determine the MSE?

The MSE is the average squared distance between the observed and predicted responses:

MSE = (1/n) * sum((y_i - yhat_i)^2), for i = 1, ..., n.

Run the following in R.

```r
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
```

[Scatter plot of boston.test against yhat.bag with the line y = x overlaid; the points cluster around the line.]

```r
mean((yhat.bag - boston.test)^2)
```

Question 4: What is the MSE of the test data set?

mean((yhat.bag - boston.test)^2) = 23.56386.
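To connect the code back to the formula in Question 3: mean((yhat.bag - boston.test)^2) is exactly the sum of squared prediction errors divided by the number of test observations. A minimal sketch, reusing the objects defined above:

```r
# Compute the test MSE "by hand", matching the formula in Question 3.
n <- length(boston.test)              # number of test observations
sq_err <- (yhat.bag - boston.test)^2  # squared error for each house
sum(sq_err) / n                       # identical to mean(sq_err)
```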
We could change the number of trees grown by randomForest() using the ntree argument:

```r
bag.boston = randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = ncol(Boston) - 1, ntree = 25)
bag.boston
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)
```

Question 5: What method do we use to get the different trees?

The bootstrap method: each tree is grown on a sample of the training observations drawn with replacement.

Random Forests

Question 6: For building a random forest of regression trees, what should mtry (the number of predictors to consider at each split) be?

mtry = p/3 for regression and mtry = sqrt(p) for classification. (A small sketch that tunes mtry over its full range appears at the end of this section.)

Type and run the following in R:

```r
set.seed(10)
rf.boston = randomForest(medv ~ ., data = Boston, subset = train,
                         mtry = (ncol(Boston) - 1)/3, importance = TRUE)
yhat.rf = predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)
```

Question 7: Compare the MSE of the test data to the MSE of the bagging.

mean((yhat.rf - boston.test)^2) = 19.1759. We get a lower test MSE with the random forest than with bagging.

Question 8: Using the importance() function, what are the two most important variables?

The two most important variables are lstat and rm.

```r
importance(rf.boston)
```

##              %IncMSE IncNodePurity
## crim     15.48571304    1197.64717
## zn        3.34978057     169.00931
## indus     6.93488857     870.60348
## chas      0.05746934      61.05778
## nox      12.97835448    1179.66670
## rm       30.67206810    6612.55554
## age      13.52685213     760.41982
## dis      10.94707995     899.17273
## rad       4.60598124     129.80949
## tax       9.20624202     556.89248
## ptratio   6.99867017    1044.02812
## lstat    26.41637352    5483.83696

Here %IncMSE is the increase in out-of-bag MSE when a variable is permuted, and IncNodePurity is the total decrease in node RSS from splits on that variable; lstat and rm dominate on both measures.

```r
varImpPlot(rf.boston)
```

[varImpPlot(rf.boston): two dotcharts ranking the variables by %IncMSE and by IncNodePurity; lstat and rm rank highest in both panels.]
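As promised above, here is the mtry-tuning sketch. To see why mtry ≈ p/3 is a reasonable default for regression, one can fit a forest for every value of mtry and compare test errors. This is not part of the lab, and the loop variable and result-vector names are ours; it reuses the train/test split from above:

```r
# Test MSE for each possible mtry, from 1 (maximally random splits)
# up to p (which is just bagging).
p <- ncol(Boston) - 1        # number of predictors (12 in ISLR2's Boston)
test.mse <- numeric(p)
for (m in 1:p) {
  set.seed(10)
  fit  <- randomForest(medv ~ ., data = Boston, subset = train, mtry = m)
  pred <- predict(fit, newdata = Boston[-train, ])
  test.mse[m] <- mean((pred - boston.test)^2)
}
plot(1:p, test.mse, type = "b", xlab = "mtry", ylab = "Test MSE")
```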
Boosting

Run the following in R:

```r
library(gbm)
## Loaded gbm 2.1.8
set.seed(1)
boost.boston = gbm(medv ~ ., data = Boston[train, ],
                   distribution = "gaussian", n.trees = 5000,
                   interaction.depth = 4)
summary(boost.boston)
```

Question 9: What are the two most important variables with the boosted trees?

With the boosted trees, the two most important variables are rm and lstat again.

We can produce partial dependence plots for these two variables. The plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables.

```r
plot(boost.boston, i = "rm")
```

[Partial dependence of medv on rm: the fitted response rises from roughly 20 to 35 as rm increases from 4 to 8.]

```r
plot(boost.boston, i = "lstat")
```
[Partial dependence of medv on lstat: the fitted response falls from roughly 30 to 20 as lstat increases from 10 to 30.]

Notice that the house prices are increasing with rm and decreasing with lstat.

We will use the boosted model to predict medv on the test set:

```r
yhat.boost = predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
```

Question 10: Compare this MSE to the MSE of the random forest and bagging models.

mean((yhat.boost - boston.test)^2) = 18.84709. Here

MSE_boost (18.85) <= MSE_rf (19.18) <= MSE_bag (23.56),

so boosting achieves the lowest test MSE of the three methods.
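gbm() also exposes a shrinkage (learning-rate) argument, which defaults to 0.1. As a sketch of one possible follow-up experiment, not required by the lab, one could refit with a different shrinkage and compare test MSEs (the object names boost.boston2 and yhat.boost2 are ours):

```r
# Refit the boosted model with a larger learning rate and compare.
set.seed(1)
boost.boston2 <- gbm(medv ~ ., data = Boston[train, ],
                     distribution = "gaussian", n.trees = 5000,
                     interaction.depth = 4, shrinkage = 0.2)
yhat.boost2 <- predict(boost.boston2, newdata = Boston[-train, ],
                       n.trees = 5000)
mean((yhat.boost2 - boston.test)^2)   # compare with 18.84709 above
```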