Building Regression Trees and Random Forests for Crime Prediction

Question 10.1 Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too). Answer 10.1 The US Crime file contains information from 47 states taken during the 1960s. The dataset contains the following datapoints: Variable Description M percentage of males aged 14—-24 in total state population So indicator variable for a southern state Ed mean years of schooling of the population aged 25 years or over Po1 per capita expenditure on police protection in 1960 Po2 per capita expenditure on police protection in 1959 LF labour force participation rate of civilian urban males in the age-group 14-24 M.F number of males per 100 females Pop state population in 1960 in hundred thousands NW percentage of nonwhites in the population U1 unemployment rate of urban males 14-24 U2 unemployment rate of urban males 35-39 Wealth wealth: median value of transferable assets or family income Ineq income inequality: percentage of families earning below half the median income Prob probability of imprisonment: ratio of number of commitments to number of offenses Time average time in months served by offenders in state prisons before their first release Crime crime rate: number of offenses per 100,000 population in 1960 Previously we have built a linear regression model using selected features from the dataset to predict the crime rate of a new city as well as one using Principal Component Analysis (PCA) on the dataset. Now we will look to use regression trees and random forest models to see if it improves the quality. We will begin by doing some basic exploratory data analysis, such as checking for outliers with a boxplot (Fig. 1), visualizing the distribution (Fig. 4) as well as a visualization of how each feature interacts with our response (Fig. 6). For the most part we see that each factor stays within its expected range, bar a few outliers. We see that number of males per 100 females, state population in 1960 in hundred thousands and percentage of nonwhites in the population have the most outliers. When looking at the density plots and how each feature interacts with our response, we see that most are within a normal range and close to a normal distribution except for So. When looking at the data this becomes apparent as it is actually a binary indicator variable for whether the state is a southern state or not. After visualizing the data we will begin to build our regression trees and random forests. Classification and Regression Trees (CART) models work differently than typical "math" based models. Instead of trying to fit a line or similiar, these models look to make decisions on how to split the data to reach a decision/prediction. Each split the model creates is called a branch, and each node is refered to as a leaf. Each leaf gets a simplified regression model on all the data points that are present. Yy=q Training a base model we see that only a few predictors were used, that there are 5 branches and 6 end leaves, and returns a Mean Absolute Error (MAE) of 171.93, which is only slighty worse than that of our initial regression model. var n dev yval splits.cutleft splits.cutright Po1l 35 571010497 911.9714 <10.75 >10.75 Pol 25 1466071.04 752.7200 <7.05 >7.05 Pop 13 501357.23 612.5385 <225 >22.5 leaf 8 84351.88 503.3750 leaf 5 169138.80 787.2000 LF 12 43250292 904.5833 <0.58 >0.58 leaf 7 133283.71 1017.4286 leaf 5 85287.20 746.6000 M.F 10 202494490 1310.1000 <96.75 >96.75 leaf 5 658284.80 1084.2000 leaf 5 856352.00 1536.0000

It is not surprising that Po1 is the primary split for the data, as it has both a strong correlation to Crime and was found to be statistically significant in our linear regression model. To test the fit and see if these are the correct number of branches, we can run 10 fold cross validation while pruning the tree to see if the model improves. Branches MAE 6 188.3099 5 201.8580 4 214.4362 3 232.0362 2 273.5210 Unsurpringly with the low number of datapoints to fit the model, and with the small number of branches to begin with, pruning the tree does not improve the model. Next we can look to train a Random Forest. Random Forests as the name suggest, is a collection of Trees created at random as opposed to one single tree. We lose explainability, but gain a better overall estimate of the data. We can loop through a large number of Trees and see at which point we begin to lose quality. The number of trees with the lowest MAE will be used to train the final Random Forest Model, which in this case is 12. Trees MAE 12 249.0137 23 260.8363 32 266.8360 89 267.6831 79 268.0012 107 268.5253 Looking at the MAE against our testing set we find 201.09, which is slightly worse than the base Regression Tree. Looking at the models Increase in Node Purity we see the following features having the most importance. Feature Node Purity Po1 1277314.89 Prob 556052.98 Po2 531698.20 Ineq 444998.31 Pop 359573.44 Looking over the features and thier importance between the two models, we can see that police spending has the most impact, whether it be for the current year (Po1) or the previous (Po2). Question 10.2 Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use. Answer 10.2 One area where logistic regression could be used is in beer brewing and distritbution. A concern amongst a lot of breweries is the shelf-life of thier product. Logistic regression can be used to predict the likelihood of beer spoilage over time, considering factors like temperature, packaging, and storage conditions. Question 10.3 1. Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine- learning-databases/statlog/german / (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link="logit") in your gim function call. 2. Because the model gives a result between 0 and 1, it requires setting a threshold probability to separate between “good” and “bad” answers. In this data set, they estimate that incorrectly identifying a bad customer as good, is 5 times worse than incorrectly classifying a good customer as bad. Determine a good threshold probability based on your model. Answer 10.3 Logisitc regression uses a similiar algorithim to linear regression, but with a sigmoid activation function to return a probability between 0-1 instead of a continous response.

In logistic regression, the sigmoid function maps the linear combination z to the range [0, 1], representing the probability of the binary outcome being in the positive class (usually class 1). The function's S-shaped curve ensures that the probability remains between 0 and 1, making it suitable for binary classification tasks. o(z) = 1 l—e =z Where: o(2) is the sigmoid function. zis the linear combination of predictor variables and their associated coefficients z=Po+ fiz1 + Pazz + ...+ Bpzy Bo, B1, B2, - . . , Bp are the coefficients z1,%2,...,x, are the predictor variables. The logistic regression model can be expressed as follows: PY =1|X) = l—e =z The German Credit Dataset contains 1000 credit applications and thier outcome of either good (1) or bad (2). Variable Name Role Type Demographic Description Attribute1 Feature Categorical Status of existing checking account Attribute2 Feature Integer Duration months Attribute3 Feature Categorical Credit history Attribute4 Feature Categorical Purpose AttributeS Feature Integer Credit amount Attribute6 Feature Categorical Savings account/bonds Attribute7 Feature Categorical Other Present employment since o Attribute8 Feature Integer Installment rate in percentage of disposable income Attribute9 Feature Categorical Marital Status Attribute10 Feature Categorical Other debtors / guarantors Attribute11 Feature Integer Present residence since Attribute12 Feature Categorical Property Attribute13 Feature Integer Age Attribute14 Feature Categorical Other installment plans Attribute15 Feature Categorical Other Housing Attribute16 Feature Integer Number of existing credits at this bank Attribute17 Feature Categorical Occupation Job Attribute18 Feature Integer Number of people being liable to provide maintenance for Attribute19 Feature Binary Telephone Attribute20 Feature Binary Other foreign worker class Target Binary 1 =Good, 2 = Bad The dataset is accompanied by a cost matrix, where the cost of incorrectly classifying a customer as good when they are bad is 5 times worse than to classify a customer as bad when they are good. This will need to be considered when setting a threshold for prediction and evaluating the model. The classes are heavily skewed towards the good results at a more than 2:1 ratio. To counteract this we will downsample so that the number in each class remains even to avoid any potential bias in the training set. We will begin by performing Exploratory Data Analysis on the dataset, checking for outliers where the variable type is integer, and looking at distributions where it is either binary or categorical. We can now train a base model using all the features to identify those that are statistically important to predicting the class. After training the model against the whole dataset, we recieve an accuracy of 74%. Looking into the model we find the following features have the most impact. Variable Pr(>z) V5 V8 0.01840 * 0.01099 * V1A11 5.46e-12 *** V1A12 1.70e-07 *** V3A30 0.01488 * V3A31 0.01316 * V4A41 0.00770 ** V6A61 0.02699 * V7A72 0.03043 * V7A73 0.03958 *

Your preview ends here