tutorial_regression2

Course: DSCI100 (Statistics), University of British Columbia

Date: Feb 20, 2024

Tutorial 9: Regression Continued

Lecture and Tutorial Learning Goals:

By the end of the week, you will be able to:

- Recognize situations where a simple regression analysis would be appropriate for making predictions.
- Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
- Interpret the output of a k-nn regression.
- In a dataset with two variables, perform k-nearest neighbour regression in R using tidymodels to predict the values for a test dataset.
- Using R, execute cross-validation to choose the number of neighbours.
- Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error).
- In a dataset with > 2 variables, perform k-nn regression in R using tidymodels to predict the values for a test dataset.
- In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
- Describe advantages and disadvantages of the k-nearest neighbour regression approach.
- Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
- Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

This tutorial covers parts of the Regression II chapter of the online textbook. You should read this chapter before attempting the worksheet.

    ### Run this cell before continuing.
    library(tidyverse)
    library(repr)
    library(tidymodels)
    library(GGally)
    library(ISLR)
    options(repr.matrix.max.rows = 6)
    source("tests.R")
    source("cleanup.R")

Predicting credit card balance

Source: https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif

In this worksheet we will work with a simulated data set that contains information we can use to build a model that predicts customer credit card balance. A bank might use such information to predict which customers would be the most profitable to lend to (for example, customers who carry a balance but do not default).

Specifically, we wish to build a model to predict credit card balance (Balance column) based on income (Income column) and credit rating (Rating column).

We access this data set by reading it from an R data package that we loaded at the beginning of the worksheet, ISLR. Loading that package gives access to a variety of data sets, including the Credit data set that we will be working with. We will rename this data set credit_original to avoid confusion later in the worksheet.

    credit_original <- Credit
    credit_original

Question 1.1 {points: 1}

Select only the columns of data we are interested in using for our prediction (both the predictors and the response variable) and use the as_tibble function to convert the result to a tibble (it is currently a base R data frame).

Name the modified data frame credit (using a lowercase c).

Note: We could alternatively leave the other variables in and use our recipe formula below to specify our predictors and response. But for this worksheet, let's select the relevant columns first.

    ### BEGIN SOLUTION
    credit <- credit_original |>
        select(Balance, Income, Rating) |>
        as_tibble()
    ### END SOLUTION
    credit

    test_1.1()

Question 1.2 {points: 1}

Before we perform exploratory data analysis, we should create our training and testing data sets. First, split the credit data set. Use 60% of the data for training and set the variable we want to predict as the strata argument. Assign your answer to an object called credit_split.

Assign your training data set to an object called credit_training and your testing data set to an object called credit_testing.

    set.seed(2000)
    ### BEGIN SOLUTION
    credit_split <- initial_split(credit, prop = 0.6, strata = Balance)
    credit_training <- training(credit_split)
    credit_testing <- testing(credit_split)
    ### END SOLUTION

    test_1.2()

Question 1.3 {points: 1}

Using only the observations in the training data set, use the ggpairs function from the GGally package to create a pairplot (also called a "scatter plot matrix") of all the columns we are interested in including in our model. Since we have not covered how to create these in the textbook, we have provided you with most of the code below and you just need to provide suitable options for the size of the plot.

The lower left corner of the pairplot contains a scatter plot of each pair of columns that you are plotting, the diagonal contains smoothed histograms of each individual column, and the upper right corner contains the correlation coefficient (a quantitative measure of the relation between two variables).

Name the plot object credit_pairplot.

    # options(...)
    # credit_pairplot <- credit_training |>
    #     ggpairs(
    #         lower = list(continuous = wrap('points', alpha = 0.4)),
    #         diag = list(continuous = "barDiag")
    #     ) +
    #     theme(text = element_text(size = 20))

    ### BEGIN SOLUTION
    options(repr.plot.height = 8, repr.plot.width = 9)
    credit_pairplot <- credit_training |>
        ggpairs(mapping = aes(alpha = 0.4)) +
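As a sanity check on the stratified split described in Question 1.2, the following is a minimal sketch, not part of the worksheet's graded solution. It rebuilds the split from scratch and compares row counts and the distribution of Balance across the two sets. The names credit_demo, demo_split, demo_training, and demo_testing are illustrative and chosen only to avoid overwriting the worksheet's objects. It assumes the tidymodels and ISLR packages are installed.

```r
library(tidymodels)  # attaches rsample (initial_split), dplyr, and tibble
library(ISLR)        # provides the Credit data set

# Illustrative copy of the data, so the worksheet's objects are untouched
credit_demo <- Credit |>
    select(Balance, Income, Rating) |>
    as_tibble()

# Same seed, proportion, and strata variable as in Question 1.2
set.seed(2000)
demo_split <- initial_split(credit_demo, prop = 0.6, strata = Balance)
demo_training <- training(demo_split)
demo_testing <- testing(demo_split)

# Row counts: training and testing sets should partition the full data set
n_total <- nrow(credit_demo)
n_train <- nrow(demo_training)
n_test <- nrow(demo_testing)
c(total = n_total, train = n_train, test = n_test)

# With stratification on Balance, the two summaries should look broadly similar
summary(demo_training$Balance)
summary(demo_testing$Balance)
```

If the split behaves as intended, the training set holds roughly 60% of the rows and the two Balance summaries are close to each other, which is the purpose of passing the response variable as strata.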
