Module 4 Resources

Course: 2024 First Spring GBUS 3302 for Nicholaus Hutchinson

Description

Use the Table of Contents to view the readings and other resources for this module.

Table of contents

Evaluation of Predictive Models
Data Partition for Model Evaluation
Process of Making a Validation Column in JMP
Measures
Confusion Matrix
Sensitivity, Specificity, and False Positive and Negative
ROC Curve
Lift Chart
Logistic Regression
Logit Function
Simple Logistic Regression in JMP
Logistic Regression Example
Multiple Logistic Regression
Multiple Logistic Regression in JMP
JMP Output
Odds Ratios

Evaluation of Predictive Models

While inferential statistics uses t-statistics and p-values to assess the statistical significance of factors in a model, data mining models used in predictive analytics generally do not use statistical tests or p-values. There are several ways to assess the reliability of data mining models. These include measuring the error in predictions, the false positive rate, the false negative rate, the overall error, lift charts and ROC curves, and the confusion matrix. We will discuss each of these measures in this book.

Data Partition for Model Evaluation

With the exception of logistic regression, none of the classification models we will discuss have standard errors associated with them, so we need another method of checking the reliability of the prediction. The standard method is to divide the data into different sets of randomly selected cases. If only one modeling method is used, we need only a training set and a validation set. The training set is used to build a model, and the validation set is used to validate the model; this means checking the true performance of the model on data that were not used for training.

Often you must fit several models to the data. Then you should divide the data into training, validation, and test sets. Suppose we use five different models on the training set. In that case, we use the validation set to select the best of the five models, and the test set to check the true performance of the final model.

To make a training, validation, or test set, randomly assign cases (rows) to each set. If you don't have software such as JMP available, you could simply do this in Excel by using the rand() function to generate uniform random numbers in a new column (call it rand) and then use an IF statement to create a so-called validation column that specifies whether a row belongs to training, validation, or test; a minimal sketch of this idea appears below. JMP provides a utility to create such a validation column within the Predictive Modeling platform.

Training: data used to build the various models
Validation: data used to evaluate the predictive performance of the models
Test: data used to evaluate the final model selected

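The following Python sketch (an illustration added to this reading, not a JMP feature) mimics the Excel rand()-plus-IF approach described above; the toy data frame, the fixed seed, and the 60/20/20 split are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # fixed seed so the split is reproducible

# Toy data frame standing in for a real data set (assumption for the example)
df = pd.DataFrame({"x": rng.normal(size=1000)})

# One uniform random number per row, like =RAND() in a new Excel column
df["rand"] = rng.uniform(size=len(df))

# Nested IF: roughly 60% training, 20% validation, 20% test
df["Validation"] = np.where(df["rand"] < 0.60, "Training",
                   np.where(df["rand"] < 0.80, "Validation", "Test"))

print(df["Validation"].value_counts(normalize=True))
```
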
Process of Making a Validation Column in JMP

JMP provides a utility to create a so-called validation column within the Predictive Modeling platform.

Example: West Roxbury Housing Data

Partitioning the data (using the Make Validation Column utility): use Analyze > Predictive Modeling > Make Validation Column. This option in JMP allows you to select the sizes of the training set and validation set. It allows purely random selection of the subsamples or stratification by a variable. For instance, if you want to make sure that you have the same percentage of bedrooms in the training and validation sets, you can stratify by BEDROOMS. As an example, the Make Validation Column platform shows a 50/50 distribution of training and validation. We then obtain the validation column; note that you can give this column any name, but the default name is Validation.

[Figure: Predictions for the first 10 observations; highlighted rows are the validation set]

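For the stratified option, a rough Python analogue (assuming scikit-learn is available; the tiny table and its values only echo the West Roxbury example and are invented here) might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for the West Roxbury table; values are not real data
df = pd.DataFrame({
    "TOTAL_VALUE": [344.2, 412.6, 330.1, 498.6, 331.5, 337.4, 359.4, 320.4],
    "BEDROOMS":    [3, 4, 3, 4, 3, 4, 3, 4],
})

# 50/50 split stratified by BEDROOMS, analogous to Make Validation Column
train_idx, valid_idx = train_test_split(
    df.index, test_size=0.5, stratify=df["BEDROOMS"], random_state=1
)

df["Validation"] = "Training"                   # default label for every row
df.loc[valid_idx, "Validation"] = "Validation"  # overwrite the holdout rows
print(df)
```
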
Watch the following video on how to create a validation column in JMP: Creating a Validation Column (Holdout Sample).

Download the one-page summary of how to make a validation column in JMP.

Measures

In the following chapters, we will cover the various measures used to evaluate models:

confusion matrix
sensitivity and specificity
false positive and false negative
receiver operating characteristic (ROC) curve
lift chart
error rate

Confusion Matrix

Some of the metrics used to evaluate models are derived from the so-called confusion matrix. Please note that the name does not indicate that the matrix is confusing. Instead, the name comes from the fact that the matrix describes the uncertainty (confusion) in the data. It summarizes the correct and incorrect classifications produced by the classifier applied to the data.

Consider the example of predicting the graduation of students within 4 years. The data are coded as 0 (graduated) and 1 (not graduated). Note that the assignment of 0 or 1 is arbitrary, but it is customary to assign 1 to the class that is predicted. Variables used for prediction are demographic characteristics, high school GPA, ACT scores, and grade point average in college at various stages. We provide details about the modeling later. Here we just want to introduce how models are evaluated and therefore assume that we have built a specific model. These constructed data are used to explain the various metrics that are available to evaluate a model.

The matrix shows the outcome of the predictions for 3000 students: 2689 students were correctly classified by the model as having graduated within 4 years, while 201 students were correctly classified as having not graduated in 4 years; 25 were classified by the model as having not graduated but actually graduated; 85 were classified as graduating but actually did not graduate. The two latter cases are misclassifications. Arranged as a table:

                             Predicted 0 (graduated)   Predicted 1 (not graduated)
Actual 0 (graduated)                  2689                        25
Actual 1 (not graduated)                85                       201

This matrix is called the confusion matrix for the model used. We will derive some of the measures from this matrix.

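To make the numbers concrete, here is a small Python sketch (an illustration added to this text) that stores the graduation example as a 2x2 array and computes the overall accuracy:

```python
import numpy as np

# Confusion matrix from the graduation example
# rows: actual class (0 = graduated, 1 = not graduated)
# columns: predicted class (0 = graduated, 1 = not graduated)
confusion = np.array([
    [2689,  25],   # actual 0: 2689 correct, 25 wrongly predicted as not graduating
    [  85, 201],   # actual 1: 85 missed, 201 correctly predicted
])

total = confusion.sum()        # 3000 students in all
correct = np.trace(confusion)  # 2689 + 201 = 2890 correct classifications
print(total, correct, correct / total)  # overall accuracy of about 96%
```
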
Sensitivity, Specificity, and False Positive and Negative

Sensitivity and Specificity

Sensitivity and specificity are two measures that use the rows of the matrix to calculate the percentage of actual classes predicted correctly; in other words, how many of the students who did drop out were identified correctly by the model, and how many of those who graduated were identified correctly. The objective is to identify students who will drop out, so 1 is the class we want to predict. (Of course, when we develop a model to predict 1 (no graduation), we also implicitly obtain a model to predict 0 (graduation).) The percentage of correct predictions for this class is called sensitivity. It is 201/(201+85) = 70%. The specificity, predicting graduation correctly, is 2689/(2689+25) = 99%. Note that if our goal were to predict 0 (graduation), then the terms would be reversed; the terms sensitivity and specificity are relative to the objective of the predictions.

False Positive and False Negative

We can also look at the columns rather than the rows and compute the percentages of incorrect predictions. These are called the false positive and false negative rates. The false positive rate is 25/(25+201) = 11% and the false negative rate is 85/(85+2689) = 3%.

In summary, we can measure performance by the percentage of correct classifications (sensitivity and specificity) or by the percentage of false results a model generates. It is important to remember the difference between the two sets of metrics. One set (sensitivity and specificity) is based on the true classes; that means we obtain an assessment of what percentage of the true classes the method identifies correctly. The other set (false positive and false negative) is based on the method: this set of metrics assesses, for example, what percentage of the 0's (graduates) the model predicts are really 0's. A short sketch computing all four rates follows.

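The sketch below (added for illustration) reproduces the four rates with the counts from the confusion matrix; note that the false positive and false negative rates follow this text's column-based definitions, which differ from some other conventions.

```python
# Counts from the confusion matrix above
tn, fp = 2689, 25   # actual 0 row: predicted 0, predicted 1
fn, tp = 85, 201    # actual 1 row: predicted 0, predicted 1

sensitivity = tp / (tp + fn)  # 201/286   ~ 0.70: actual 1's predicted correctly
specificity = tn / (tn + fp)  # 2689/2714 ~ 0.99: actual 0's predicted correctly

# Column-based rates, as defined in this text
false_positive_rate = fp / (fp + tp)  # 25/226  ~ 0.11 of predicted 1's are wrong
false_negative_rate = fn / (fn + tn)  # 85/2774 ~ 0.03 of predicted 0's are wrong

print(sensitivity, specificity, false_positive_rate, false_negative_rate)
```
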
ROC Curve

The ROC curve is a popular measure for comparing models. Classification models basically predict the probability that an observation belongs to class 1. The probabilities range from 0 to 1. A cutoff value is used to decide whether the probability of belonging to class 1 is large enough to warrant this classification. The default value is 0.5: when the predicted probability that an observation belongs to class 1 exceeds 0.5, it is classified as belonging to class 1, otherwise to class 0. But the cutoff value can be changed.

The ROC curve shows what happens to the sensitivity (probability of classifying individuals in class 1 correctly) as the cutoff value is decreased from 1 to 0, plotted against 1 - specificity (percentage of individuals from class 0 identified incorrectly). The straight diagonal line represents a default model that assigns cases at random to class 0 or class 1, i.e., the diagonal represents flipping a coin to classify a member. The higher the curve rises toward the upper left corner, the better the model is able to classify cases correctly. The area under the curve (AUC) is standardized to be 1 for a model that classifies all cases correctly; hence, the closer the AUC is to 1, the better the model. Basically, the ROC curve shows the correct specification of class 1 members versus the incorrect specification of class 0 members, and we want the curve to be in the upper left corner.

The figure below shows the ROC curves for three different models based on different information: pre-college, first semester, and second semester. The first-semester ROC is quite a bit better than the pre-college model. The second-semester model is slightly better than the first-semester model because it uses more information. The ROC curve enables us to quickly assess how much better one model is than other models.

[Figure: ROC curves for the pre-college, first-semester, and second-semester models]

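As an added illustration (assuming scikit-learn; the class labels and probabilities are simulated, not the student data from the figure), an ROC curve and its AUC can be computed like this:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(seed=7)

# Simulated actual classes and predicted class-1 probabilities
y_true = rng.integers(0, 2, size=500)
y_prob = 0.3 * y_true + 0.7 * rng.uniform(size=500)  # informative but noisy

# Sensitivity (tpr) versus 1 - specificity (fpr) as the cutoff sweeps from 1 to 0
fpr, tpr, cutoffs = roc_curve(y_true, y_prob)
print("AUC:", roc_auc_score(y_true, y_prob))  # the closer to 1, the better
```
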
Lift Chart

The lift chart is another method of evaluating the gain in prediction compared to flipping a coin to classify individual cases. To obtain the lift chart, one needs to rank the data by their predicted probability for class 1. The example below shows 24 students, of whom 12 are class 1 (e.g., dropped out of college) and the other 12 are class 0, as shown in column 3; the cumulative correct classification of class 1 is in column 4. Note that the data are sorted by the probability for class 1 provided by the model. The probabilities range from a high of 1 to a low of 0.32, and the cases are numbered accordingly from rank 1 to rank 24: case 1 was predicted to be of class 1 with probability 1, and case 24 was predicted to be of class 1 with probability 0.32. The portion of the data covered as we go down row by row is shown in column 5.

Case  Probability  Actual  Cumulative Correct 1  Portion of Data  Cumulative Expected (Random)  Lift
  1      1.00         1              1                0.04                  0.50                2.00
  2      1.00         1              2                0.08                  1.00                2.00
  3      0.99         1              3                0.13                  1.50                2.00
  4      0.99         1              4                0.17                  2.00                2.00
  5      0.99         1              5                0.21                  2.50                2.00
  6      0.99         1              6                0.25                  3.00                2.00
  7      0.98         1              7                0.29                  3.50                2.00
  8      0.91         0              7                0.33                  4.00                1.75
  9      0.89         1              8                0.38                  4.50                1.78
 10      0.79         1              9                0.42                  5.00                1.80
 11      0.73         0              9                0.46                  5.50                1.64
 12      0.70         1             10                0.50                  6.00                1.67
 13      0.69         0             10                0.54                  6.50                1.54
 14      0.65         1             11                0.58                  7.00                1.57
 15      0.61         1             12                0.63                  7.50                1.60
 16      0.58         0             12                0.67                  8.00                1.50
 17      0.55         0             12                0.71                  8.50                1.41
 18      0.51         0             12                0.75                  9.00                1.33
 19      0.48         0             12                0.79                  9.50                1.26
 20      0.45         0             12                0.83                 10.00                1.20
 21      0.41         0             12                0.88                 10.50                1.14
 22      0.38         0             12                0.92                 11.00                1.09
 23      0.35         0             12                0.96                 11.50                1.04
 24      0.32         0             12                1.00                 12.00                1.00

Twelve of the 24 students dropped out of college. If we do not use any model for prediction, then the expected number of dropouts we would observe in the first 12 rows is 6, namely half of them. In the first two rows, we would expect 12 x (2/24) = 1 dropout under random assignment. This number is in column 6. The lift is calculated as the ratio of the cumulative correct classifications to the cumulative number expected by random assignment. Thus the lift chart measures how much better we predict using the model versus flipping a coin to assign classes. The lift for the first seven rows is 2 and drops to 1.75 in the 8th row (7/4 = 1.75); a code sketch of this computation appears at the end of this chapter.

How do we recognize a good lift curve? In this example, the lift is 2 as we go through the first 30% of the data and then declines, as shown in the figure below. This is a feature of a good model, namely a lift curve that is much larger than 1 at the beginning and falls steeply to 1 at the end.

Error Rate

The error rate is obtained simply by dividing the sum of false predictions by the total number of cases.

RMSE

The RMSE stands for the root mean squared error and is the square root of the average squared error. RMSE is reported as RASE in Fit Model and some other JMP platforms.

A final note on these metrics: they are computed using the validation data set. Why would we compute the confusion matrix with validation data and not with training data? The training data are not useful for obtaining a reliable estimate of the misclassification rate because of the inherent overfitting of models.

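Here is a small Python sketch of the lift computation, using the 24 ranked cases from the table above (the code itself is an added illustration):

```python
import numpy as np
import pandas as pd

# Actual classes of the 24 cases, already sorted by predicted class-1 probability
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
          0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

df = pd.DataFrame({"actual": actual})
n, n_class1 = len(df), sum(actual)  # 24 cases, 12 of them class 1

df["cum_correct"] = df["actual"].cumsum()              # column 4 of the table
df["cum_random"] = n_class1 * np.arange(1, n + 1) / n  # column 6: coin flipping
df["lift"] = df["cum_correct"] / df["cum_random"]      # e.g., row 8: 7/4 = 1.75

print(df.round(2))
```
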
Logistic Regression

We will now turn our attention to logistic regression, a method used to predict variables that have only two outcomes. While logistic regression can be used to predict variables that have more than two outcomes, we will restrict our discussion to just two outcomes, since these are the most common problems. Examples are defaulting on a loan or not, winning an election or not, committing fraud or not, an item being defective or not, a customer responding to an ad or not, a student graduating or not.

Logistic regression is a method used for explanation as well as prediction. The reason is that it is a statistical method that has been used for a long time, and it is more intuitive than other data mining methods. It basically models the logarithm of the odds as a function of the predictors. The method is widely used, particularly where a structured model is useful to explain (profiling) or to predict.

Logistic regression extends the idea of linear regression to situations where the outcome variable is categorical. We focus on binary classification, i.e., Y = 0 or Y = 1. However, since a binary outcome is not normally distributed, a straightforward regression does not work. One could think of building a linear regression model for the probability of Y = 1. This also does not work, because p takes only values between 0 and 1, and a regression would not guarantee that. For that reason we first have to transform the probability in some way to make linear regression work. The following section motivates how that is achieved by using the so-called logit function.

Step 1: Logistic Response Function

p = probability of belonging to class 1
q = number of predictors

We need to relate p to the predictors with a function that guarantees 0 <= p <= 1. A standard linear function, as shown below, does not constrain the probability:

p = β0 + β1x1 + β2x2 + ... + βqxq   (equation 10.1 in the textbook)

The Logit

The goal is to find a function of the predictor variables that relates them to a 0/1 outcome. The logit function is one that achieves this goal. The logit function is defined as the (natural) logarithm of the probability p of Y = 1 divided by the probability 1 - p of Y = 0. The ratio p/(1-p) is called the odds of Y = 1 versus Y = 0, so the logit is the (natural) logarithm of the odds of obtaining Y = 1. To summarize:

Instead of the probability p = Pr(Y = 1) as the outcome variable (as in linear regression), we use a function of p called the logit.
The logit can be modeled as a linear function of the predictors.
The logit can be mapped back to a probability, which, in turn, can be mapped to a class.

Logit function: log(p/(1-p))

Odds

More generally, the odds of an event are defined as p/(1-p), where p is the probability of the event. For instance, the Powerball uses a 5/69 (white balls) plus 1/26 (Powerballs) matrix from which winning numbers are chosen, resulting in odds of 1 in 292,201,338 of winning a jackpot per play. The odds of an event are defined as:

Odds = probability of event / (1 - probability of event)

Given the odds of an event, the probability of the event can be computed by:
probability of event = odds / (1 + odds)

(Equations 10.3 and 10.4 in the textbook)

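These two formulas translate directly into code; the following helpers (an added sketch) apply them to the Powerball odds quoted above:

```python
def odds_from_prob(p):
    """Odds = probability / (1 - probability)."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)

# Powerball example: odds of 1 in 292,201,338 per play
odds_jackpot = 1 / 292_201_338
print(prob_from_odds(odds_jackpot))  # ~3.4e-9; for tiny odds, p is almost the odds
print(odds_from_prob(0.5))           # a 50% probability is even odds, i.e., 1.0
```
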
Logit Function

The logit function can take on values between minus infinity and plus infinity, making it a useful variable for a regression. Hence, the logit is modeled using a linear regression on the predictors. Note that the regression may also use nonlinear terms, in the same way that ordinary linear regression can be extended to nonlinear terms.

Regression for the logit:

log(Odds) = β0 + β1x1 + β2x2 + ... + βqxq

The odds can then be expressed as an exponential function of the predictors:

Odds = Exp(β0 + β1x1 + β2x2 + ... + βqxq)

Finally, when we have a model for the odds, we can compute the probabilities by using the formula odds = p/(1-p) and solving for p:

p = Exp(β0 + β1x1 + β2x2 + ... + βqxq) / [1 + Exp(β0 + β1x1 + β2x2 + ... + βqxq)]

In summary, rather than using a linear regression model for p directly, we use a linear regression model for log(Odds) and then use this regression to obtain p. This is necessary because regression requires a variable that is not restricted to the interval (0, 1), as the probability p is. This transformation allows us to use well-known regression methods to obtain a model for the probability of a binary variable.

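A minimal Python sketch of this back-transformation (the coefficients and predictor values are hypothetical, chosen only to show the mapping):

```python
import numpy as np

def logistic_probability(x, beta):
    """Map predictor values to p via p = Exp(z) / (1 + Exp(z))."""
    z = beta[0] + np.dot(beta[1:], x)  # the linear predictor: log of the odds
    odds = np.exp(z)                   # Odds = Exp(b0 + b1*x1 + ... + bq*xq)
    return odds / (1 + odds)           # always strictly between 0 and 1

# Hypothetical coefficients and predictor values
beta = np.array([-1.5, 0.8, 0.3])
print(logistic_probability(np.array([2.0, -1.0]), beta))  # ~0.45
```
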
Simple Logistic Regression in JMP

Watch the following video on performing a simple logistic regression in JMP using only one predictor: Simple Logistic Regression.

Review the one-page summary overview of how to conduct a simple logistic regression in JMP.

Logistic Regression Example

This chapter provides an overview of logistic regression using the example of predicting alcohol use among drivers in fatal crashes in Louisiana. First, we exclude all rows in the data file that do not have a known BAC value. This is done by selecting all rows with Known = No, as shown in Figure 1; select the rows and then Exclude, and the rows will be excluded from any other operation in JMP. If we had only one predictor, then Fit Y by X in the Analyze platform would provide a simple logistic regression.

Instructions

1. From an open JMP® data table, select Analyze > Fit Y by X.
2. Click on a categorical variable from Select Columns.
3. Click Y, Response (nominal variables have red bars, ordinal variables have green bars) to select the variable called BAC>=0.08.
4. Click on a nominal or continuous variable.
5. Click on X, Factor (continuous variables have blue triangles).
6. Click OK to run the analysis.

By default, JMP will provide the following results: the logistic plot, with curves of cumulative predicted (fitted) probabilities; the whole model test for model significance; and parameter estimates for the fitted model. A rough code analogue appears below.

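The JMP steps above have no single code equivalent, but as a rough analogue (assuming the statsmodels library; the data are simulated, and the variable name NUM_VEH merely echoes the crash example), a simple logistic regression can be fit like this:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)

# Simulated stand-in for the crash data: one continuous predictor
num_veh = rng.integers(1, 5, size=300)
p_true = 1 / (1 + np.exp(-(1.0 - 0.8 * num_veh)))  # model used only to simulate
bac_high = rng.binomial(1, p_true)                 # 1 if driver BAC >= 0.08

X = sm.add_constant(pd.DataFrame({"NUM_VEH": num_veh}))
model = sm.Logit(bac_high, X).fit(disp=0)
print(model.summary())  # parameter estimates and significance, like JMP's report
```
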
Multiple Logistic Regression

If we have multiple predictors, we need to run a multiple logistic regression. Instead of Fit Y by X, we use Analyze > Fit Model.

From an open JMP® data table, select Analyze > Fit Model. Click on a categorical variable from Select Columns, and click Y (nominal variables have red bars, ordinal variables have green bars); we again select the dependent variable BAC>=0.08. Choose explanatory variables from Select Columns, and click Add; we select the variables DR_SEX 2 and NUM_VEH, as shown in the figure below.

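A hedged code analogue with both kinds of predictors (again simulated data, using statsmodels' formula interface; DR_SEX and NUM_VEH are illustrative names, not the real file's contents):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=4)

# Simulated data echoing the example's variables
df = pd.DataFrame({
    "DR_SEX":  rng.choice(["M", "F"], size=400),
    "NUM_VEH": rng.integers(1, 5, size=400),
})
z = 0.5 + 0.7 * (df["DR_SEX"] == "M") - 0.6 * df["NUM_VEH"]  # simulation only
df["BAC_HIGH"] = rng.binomial(1, 1 / (1 + np.exp(-z)))

# Multiple logistic regression: a categorical and a continuous predictor together
fit = smf.logit("BAC_HIGH ~ C(DR_SEX) + NUM_VEH", data=df).fit(disp=0)
print(fit.params)
```
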
Multiple Logistic Regression in JMP

Watch the following video on how to perform a multiple logistic regression in JMP: Multiple Logistic Regression.

Review the one-page summary overview of how to conduct a multiple logistic regression in JMP.

JMP Output

When you have more variables and use the Fit Y by X platform, you obtain several different results, one for each predictor. The figure below shows the output for four different predictors. This includes a mosaic plot for each nominal variable and a scatter plot for each continuous variable, which can be used for data preparation instead of creating these plots yourself in the Graph menu.

Odds Ratios

Odds ratios are, as the name indicates, the ratio of two odds. For instance, the odds of winning the Powerball jackpot are 1 in 292,201,338 (the game has 5/69 white balls plus 1/26 Powerballs). The odds of winning the Louisiana Lottery are 1 in 3,838,380. The odds ratio is 292,201,338/3,838,380 = 76.1. Hence, the odds of a player winning the Louisiana Lottery are 76 times those of winning the Powerball. Why, then, are more people playing the Powerball than the Louisiana Lottery?

The logistic regression output allows the display of odds ratios, as shown in the figure below. The dependent variable is the driver having a BAC>=0.08. This output is obtained by clicking the red triangle and selecting Odds Ratios.

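As an added sketch, both ideas in this chapter reduce to a few lines; the regression comment is hedged, since it describes the general exp(coefficient) relationship rather than JMP's actual report for these data:

```python
import math

# Lottery example: the ratio of the two sets of jackpot odds
odds_ratio = 292_201_338 / 3_838_380
print(odds_ratio)  # ~76.1: Louisiana Lottery odds are about 76x better

# For a fitted logistic regression, each term's odds ratio is exp(coefficient);
# e.g., a coefficient of 0.7 corresponds to an odds ratio of about 2
print(math.exp(0.7))  # ~2.01
```
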