2/7/24, 2:39 PM
Module 4 Resources | LSU Moodle 4.1
https://moodle.lsu.edu/mod/book/tool/print/index.php?id=2193524
1/23
Module 4 Resources

Site: Welcome to LSU Moodle!
Course: 2024 First Spring GBUS 3302 for Nicholaus Hutchinson
Book: Module 4 Resources
Printed by: Madelyn McDaniel
Date: Wednesday, February 7, 2024, 2:39 PM
Description
Use the Table of Contents to view the readings and other resources for this module.
Table of contents
Evaluation of Predictive Models
Data Partition for Model Evaluation
Process of Making a Validation Column in JMP
Measures
Confusion Matrix
Sensitivity, Specificity, and False Positive and Negative
ROC Curve
Lift Chart
Logistic Regression
Logit Function
Simple Logistic Regression in JMP
Logistic Regression Example
Multiple Logistic Regression
Multiple Logistic Regression in JMP
JMP Output
Odds Ratios
Evaluation of Predictive Models
While inferential statistics uses t-statistics and p-values to assess the statistical significance of factors in a model, predictive analytics with data mining models generally does not use statistical tests or p-values. There are several ways to assess the reliability of data mining models, including measuring the error in predictions, the false positive rate, the false negative rate, the overall error rate, lift curves, ROC curves, and the confusion matrix. We will discuss each of these measures in this book.
Data Partition for Model Evaluation
With the exception of logistic regression, none of the classification models we will discuss have standard errors associated with them, so we need another method of checking the reliability of the predictions. The standard method is to divide the data into different randomly selected subsets. If only one modeling method is used, we need only a training set and a validation set. The training set is used to build a model, and the validation set is used to validate the model; that is, to check the true performance of the model on data that were not used for training.

Often you must fit several models to the data. Then you should divide the data into training, validation, and test sets. Suppose we fit five different models on the training set. In that case, we use the validation set to select the best of the five models, and the test set to check the true performance of the final model.

To make a training, validation, or test set, randomly assign cases (rows) to the sets. If you do not have software such as JMP available, you can do this in Excel by using the RAND() function to generate uniform random numbers in a new column (call it rand) and then using an IF statement to create a so-called validation column that specifies whether a row belongs to training, validation, or test. JMP provides a utility to create such a validation column within the Predictive Modeling platform.
Training: data used to build the various models
Validation: data used to evaluate the predictive performance of the models
Test: data used to evaluate the final model selected
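The Excel-style approach described above can be sketched in Python; the function name, split fractions, and seed here are illustrative, not part of the course materials:

```python
import random

def make_validation_column(n_rows, fractions=(0.5, 0.3, 0.2), seed=42):
    """Assign each row to Training, Validation, or Test using uniform
    random numbers, mirroring the Excel RAND()-plus-IF approach."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n_rows):
        r = rng.random()  # uniform random number, like RAND() in Excel
        if r < fractions[0]:
            labels.append("Training")
        elif r < fractions[0] + fractions[1]:
            labels.append("Validation")
        else:
            labels.append("Test")
    return labels

column = make_validation_column(1000)
print(column[:5])
```

Because the assignment is random, the realized fractions only approximate the requested ones; stratification (as JMP offers) would additionally balance a chosen variable across the sets.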
Process of Making a Validation Column in JMP
JMP provides a utility to create a so-called validation column within the Predictive Modeling platform. Example: West Roxbury Housing Data.

Partitioning the data (using the Make Validation Column utility): Use Analyze > Predictive Modeling > Make Validation Column. This option in JMP allows you to select the sizes of the training and validation sets. It allows purely random selection of the subsamples or stratification by a variable. For instance, if you want to make sure that you have the same percentage of bedrooms in the training and validation sets, you can stratify by BEDROOMS. The Make Validation Column platform shows as an example a 50/50 split between training and validation. We then obtain the validation column shown below. Note that you can give this column any name, but the default name is Validation.
Predictions for the First 10 Observations (Highlighted rows are the validation set)
Watch the following video on how to create a validation column in JMP
Creating a Validation Column (Holdout Sample)
Download the one-page summary of how to make a validation column in JMP
Measures
In the following chapters, we will cover the various measures used to evaluate models:
confusion matrix
sensitivity and specificity
false positive and false negative rates
receiver operating characteristic (ROC) curve
lift chart
error rate
Confusion Matrix
Some of the metrics used to evaluate models are derived from the so-called confusion matrix. Please note that the name does not indicate that the matrix is confusing. Instead, the name comes from the fact that the matrix describes where the classifier confuses one class with another: it summarizes the correct and incorrect classifications produced by the classifier on the data.

Consider the example of predicting the graduation of students within 4 years. The data are coded as 0 (graduated) and 1 (not graduated). Note that the assignment of 0 or 1 is arbitrary, but it is customary to assign 1 to the class that is to be predicted. The variables used for prediction are demographic characteristics, high school GPA, ACT score, and grade point average in college at various stages. We provide details about the modeling later. Here we just want to introduce how models are evaluated and therefore assume that we have built a specific model. These (constructed) data are used to explain the various metrics that are available to evaluate a model. The matrix shows the outcome of the predictions for 3000 students, of which 2689 students were correctly classified by the model as having graduated within 4 years, while 201 students were correctly classified as having not graduated in 4 years; 25 were classified by the model as having not graduated but actually graduated; 85 were classified as graduating but actually did not graduate. The two latter cases are misclassifications. This matrix is called the confusion matrix for the model used. We will derive some of the measures from this matrix.
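Building a confusion matrix amounts to counting (actual, predicted) pairs. A minimal Python sketch, using a tiny made-up data set (not the 3000-student example from the text):

```python
def confusion_matrix(actual, predicted):
    """Count correct and incorrect classifications for a 0/1 outcome.
    Keys are (actual_class, predicted_class) pairs."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

# Hypothetical actual classes and model predictions for 8 cases
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 1, 0, 0]
cm = confusion_matrix(actual, predicted)
print(cm)  # {(0, 0): 4, (0, 1): 1, (1, 0): 1, (1, 1): 2}
```

The diagonal entries (0,0) and (1,1) are the correct classifications; the off-diagonal entries are the misclassifications discussed above.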
Sensitivity, Specificity, and False Positive and Negative

Sensitivity and Specificity

Sensitivity and specificity are two measures that use the rows of the confusion matrix to calculate the percentage of each actual class predicted correctly. In other words, how many of the students who dropped out were identified correctly by the model, and how many of those who graduated were identified correctly. The objective is to identify students who will drop out, so 1 is the class we want to predict. (Of course, when we develop a model to predict 1 (no graduation), we also implicitly obtain a model to predict 0 (graduation).) The percentage of correct predictions for class 1 is called sensitivity. It is 201/(201+85) = 70%. The specificity, predicting graduation correctly, is 2689/(2689+25) = 99%. Note that if our goal had been to predict 0 (graduation), then the terms would be reversed. So the terms sensitivity and specificity are relative to the objective of the predictions.

False Positive and False Negative

We can also look at the columns rather than the rows and compute the percentages of incorrect predictions. These are called the false positive and false negative rates. The false positive rate is 25/(25+201) = 11% and the false negative rate is 85/(85+2689) = 3%.

In summary, we can measure performance by the percentage of correct classifications (sensitivity and specificity) or by the percentage of false results a model generates. It is important to remember the difference between the two sets of metrics. One set (sensitivity and specificity) is based on the true classes; that is, we obtain an assessment of what percentage of each true class the method correctly identifies. The other set (false positive and false negative) is based on the method's predictions. Hence this set of metrics assesses, for example, what percentage of the 0's (graduates) the model predicts are really 0's.
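These four measures can be computed directly from the four counts in the graduation confusion matrix; the Python below follows the column-based false positive/negative definitions used in this text:

```python
# Counts from the graduation confusion matrix in the text
tn = 2689  # actual 0 (graduated), predicted 0
fp = 25    # actual 0 (graduated), predicted 1
fn = 85    # actual 1 (did not graduate), predicted 0
tp = 201   # actual 1 (did not graduate), predicted 1

# Row-based measures (fraction of each true class identified correctly)
sensitivity = tp / (tp + fn)          # 201/286
specificity = tn / (tn + fp)          # 2689/2714

# Column-based rates, as defined in this text (based on predictions)
false_positive_rate = fp / (fp + tp)  # 25/226
false_negative_rate = fn / (fn + tn)  # 85/2774

print(round(sensitivity, 2), round(specificity, 2),
      round(false_positive_rate, 2), round(false_negative_rate, 2))
# 0.7 0.99 0.11 0.03
```

Note that the text's false positive/negative rates condition on the model's prediction, not on the true class, which differs from some other textbooks' usage of the same terms.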
ROC Curve
The ROC curve is a popular measure for comparing models. Classification models basically predict the probability that an observation belongs to class 1. The probabilities range from 0 to 1. A cutoff value is used to decide whether the probability of belonging to class 1 is large enough to warrant this classification. The default value is 0.5: when the predicted probability that an observation belongs to class 1 exceeds 0.5, it is classified as belonging to class 1, otherwise to class 0. But the cutoff value can be changed. The ROC curve shows what happens to the sensitivity (the probability of classifying individuals in class 1 correctly) as the cutoff value is decreased from 1 to 0, plotted against 1 - specificity (the percentage of individuals from class 0 identified incorrectly). The straight line is a baseline model that assigns cases at random to class 0 or class 1; i.e., the diagonal represents flipping a coin (random assignment) to classify a member. The higher the curve rises toward the upper left corner, the better the model is able to classify cases correctly. The area under the curve (AUC) is standardized to be 1 for a model that classifies all cases correctly. Hence, the closer the AUC is to 1, the better the model.

Basically, the ROC curve shows the correct classification of class 1 members versus the incorrect classification of class 0 members. We want the curve to be in the upper left corner.

The figure below shows the ROC curve for three different models based on different information from pre-college, first semester, and second semester. The first semester ROC is quite a bit better than the pre-college model. The second semester model is slightly better than the first semester model because the model uses more information. The ROC enables us to quickly assess how much better one model is versus other models.
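One way to see what the AUC measures: it equals the fraction of (class 1, class 0) pairs in which the class 1 case receives the higher predicted probability. A small Python sketch with hypothetical probabilities and labels (not the course data):

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs where the positive case gets the
    higher predicted probability (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   1,   0]
print(auc(scores, labels))  # 0.75
```

A random coin-flip model scores 0.5 on this measure, matching the diagonal line in the ROC plot; a perfect model scores 1.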
Lift Chart
The lift chart is another method of evaluating the gain in prediction compared to flipping a coin to classify individual cases. To obtain the lift chart, one needs to rank the data by their predicted probability for class 1.

The example below shows 24 students, of whom 12 have class 1 (e.g., dropped out of college) and the other 12 have class 0, shown in column 3; the cumulative count of correct class 1 classifications is in column 4. Note that the data are sorted by the probability for class 1 provided by the model. The probabilities range from a high of 1 to a low of 0.32. The cases were numbered accordingly from rank 1 to rank 24. Case 1 was predicted to be of class 1 with probability 1, and case 24 was predicted to be of class 1 with probability 0.32. The portion of the data covered as we go down row by row is shown in column 5.
Case  Probability  Actual  Cumulative Correct 1  Portion of Data  Cumulative Expected Random Assignment  Lift
1     1.00         1       1                     0.04             0.50                                   2.00
2     1.00         1       2                     0.08             1.00                                   2.00
3     0.99         1       3                     0.13             1.50                                   2.00
4     0.99         1       4                     0.17             2.00                                   2.00
5     0.99         1       5                     0.21             2.50                                   2.00
6     0.99         1       6                     0.25             3.00                                   2.00
7     0.98         1       7                     0.29             3.50                                   2.00
8     0.91         0       7                     0.33             4.00                                   1.75
9     0.89         1       8                     0.38             4.50                                   1.78
10    0.79         1       9                     0.42             5.00                                   1.80
11    0.73         0       9                     0.46             5.50                                   1.64
12    0.70         1       10                    0.50             6.00                                   1.67
13    0.69         0       10                    0.54             6.50                                   1.54
14    0.65         1       11                    0.58             7.00                                   1.57
15    0.61         1       12                    0.63             7.50                                   1.60
16    0.58         0       12                    0.67             8.00                                   1.50
17    0.55         0       12                    0.71             8.50                                   1.41
18    0.51         0       12                    0.75             9.00                                   1.33
19    0.48         0       12                    0.79             9.50                                   1.26
20    0.45         0       12                    0.83             10.00                                  1.20
21    0.41         0       12                    0.88             10.50                                  1.14
22    0.38         0       12                    0.92             11.00                                  1.09
23    0.35         0       12                    0.96             11.50                                  1.04
24    0.32         0       12                    1.00             12.00                                  1.00
Twelve of the 24 students dropped out of college. If we do not use any model for prediction, then the expected number of dropouts we would observe in the first 12 rows is 6, namely half of them. In the first two rows, we would expect 12 x (2/24) = 1 for a random assignment. This number is in column 6. The lift is calculated as the ratio of the cumulative correct classifications to the cumulative number expected by random assignment. Thus the lift chart measures how much better we predict using the model versus flipping a coin to assign classes. The lift for the first seven rows is 2 and drops to 1.75 in the 8th row (7/4 = 1.75).

How do we recognize a good lift curve? In this example, the lift is 2 as we go through roughly the first 30% of the data, and then it declines, as shown in the figure below. This is a feature of a good model, namely, a lift curve that is much larger than 1 at the beginning and falls steeply to 1 at the end.
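The lift calculation above can be reproduced in Python from the Actual column of the 24-student table (the helper function name is illustrative):

```python
def cumulative_lift(sorted_actual):
    """Lift at each row for cases already sorted by descending
    predicted probability of class 1: cumulative correct class 1
    divided by the count expected under random assignment."""
    n = len(sorted_actual)
    total_pos = sum(sorted_actual)
    lifts, cum_correct = [], 0
    for k, y in enumerate(sorted_actual, start=1):
        cum_correct += y
        expected_random = total_pos * k / n  # e.g. 12 * (2/24) = 1
        lifts.append(cum_correct / expected_random)
    return lifts

# Actual classes from the table, already sorted by probability
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
          0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lifts = cumulative_lift(actual)
print(lifts[0], lifts[7], lifts[-1])  # 2.0 1.75 1.0
```

The output reproduces the table: lift 2.0 over the first seven rows, 1.75 at row 8, and 1.0 once all 24 cases are included.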
Error Rate
The error rate is obtained simply by dividing the number of false predictions by the total number of cases. In the graduation example, the error rate is (25 + 85)/3000 = 3.7%.
RMSE
RMSE stands for root mean squared error and is the square root of the average squared error. RMSE is reported as RASE in Fit Model and some other JMP platforms.
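Both measures are short computations; the RMSE example below uses hypothetical actual classes and predicted probabilities, not course data:

```python
import math

# Error rate from the graduation confusion matrix: misclassified / total
error_rate = (25 + 85) / 3000
print(round(error_rate, 4))  # 0.0367

def rmse(actual, predicted):
    """Square root of the average squared error between actual
    outcomes and predicted values."""
    sq = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical 0/1 outcomes and predicted probabilities
print(round(rmse([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1]), 3))
```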
A final note on these metrics: they are computed using the validation data set. Why would we compute the confusion matrix with validation data and not with training data? The training data are not useful for obtaining a reliable estimate of the misclassification rate because of the inherent over-fitting of models.
Logistic Regression
We will now turn our attention to logistic regression, a method used to predict variables that have only two outcomes. While logistic regression can be used to predict variables that have more than two outcomes, we will restrict our discussion to just two outcomes, which are the most common problems. Examples are defaulting on a loan or not, winning an election or not, committing fraud or not, an item being defective or not, a customer responding to an ad or not, or a student graduating or not.

Logistic regression is used for explaining as well as for prediction. The reason is that it is a statistical method that has been used for a long time, and it is more intuitive than other data mining methods. It basically models the logarithm of the odds as a function of the predictors. The method is widely used, particularly where a structured model is useful for explaining (= profiling) or for predicting. Logistic regression extends the idea of linear regression to situations where the outcome variable is categorical.

We focus on binary classification, i.e., Y = 0 or Y = 1. However, since a binary outcome is not normally distributed, a straightforward regression would not work. One could think of building a linear regression model for the probability of Y = 1. This also does not work, because p takes only values between 0 and 1, and a regression would not guarantee that. For that reason we first have to transform the probability in some way to make linear regression work. The following section shows how that is achieved by using the so-called logit function.
Step 1: Logistic Response Function

p = probability of belonging to class 1
q = number of predictors

We need to relate p to the predictors with a function that guarantees 0 <= p <= 1. A standard linear function (as shown below) does not constrain the probability:

p = β0 + β1x1 + β2x2 + ... + βqxq (equation 10.1 in textbook)
The Logit

The goal is to find a function of the predictor variables that relates them to a 0/1 outcome. The logit function is one that achieves that goal. The logit function is defined as the (natural) logarithm of the probability p of Y = 1 divided by the probability 1 - p of Y = 0. The ratio p/(1-p) is called the odds of Y = 1 versus Y = 0. So the logit is the (natural) logarithm of the odds of obtaining Y = 1. To summarize:

Instead of the probability p = Pr(Y=1) as the outcome variable (as in linear regression), we use a function of p called the logit
The logit can be modeled as a linear function of the predictors
The logit can be mapped back to a probability, which, in turn, can be mapped to a class

Logit function: log(p/(1-p))

Odds

More generally, the odds of an event are defined as p/(1-p), where p is the probability of the event. For instance, the Powerball uses a 5/69 (white balls) plus 1/26 (Powerballs) matrix from which winning numbers are chosen, resulting in odds of 1 in 292,201,338 of winning a jackpot per play.

The odds of an event are defined as:

Odds = probability of event / (1 - probability of event)

Given the odds of an event, the probability of the event can be computed by:
probability of event = odds / (1 + odds)
Equations 10.3 and 10.4 in textbook
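The two conversion formulas above are each one line of Python; the function names are illustrative:

```python
def odds_from_probability(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """p = odds / (1 + odds)."""
    return odds / (1 + odds)

print(odds_from_probability(0.75))  # 3.0 (i.e., odds of 3 to 1)
print(probability_from_odds(3.0))   # 0.75
```

Applying one function and then the other returns the starting value, which is why the logistic model can move freely between odds and probabilities.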
Logit Function
The logit function can take on values between minus infinity and plus infinity, making it a useful variable for a regression. Hence, the logit is modeled using a linear regression on the predictors. Note that the regression may also use nonlinear terms, in the same way that ordinary linear regression can be extended to nonlinear terms.
Regression for the logit:

log(Odds) = β0 + β1x1 + β2x2 + ... + βqxq

The odds can also be expressed as an exponential function of the predictors, i.e., the odds relate to the predictors by:

Odds = Exp(β0 + β1x1 + β2x2 + ... + βqxq)

Finally, when we have a model for the odds, we can compute the probabilities by using the formula odds = p/(1-p) and solving for p:

p = Exp(β0 + β1x1 + β2x2 + ... + βqxq) / [1 + Exp(β0 + β1x1 + β2x2 + ... + βqxq)]
In summary, rather than using a linear regression model for p directly, we use a linear regression model for log(Odds) and then use this regression to obtain p. This was necessary because regression requires a variable that is not restricted to the interval (0,1), as the probability p is. This transformation allows us to use well-known regression methods to obtain a model for the probability of a binary variable.
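The chain from linear predictor to probability can be sketched in Python; the coefficients and predictor values below are hypothetical, not estimates from the course data:

```python
import math

def predicted_probability(intercept, coefs, x):
    """Compute the logit (linear predictor), then map it back to a
    probability via p = exp(logit) / (1 + exp(logit))."""
    logit = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return math.exp(logit) / (1 + math.exp(logit))

# Hypothetical coefficients b0 = -1.5, b1 = 0.8, b2 = 0.3
p = predicted_probability(-1.5, [0.8, 0.3], [2.0, 1.0])
print(round(p, 3))  # 0.599

# Check: the logit of p recovers the linear predictor -1.5 + 1.6 + 0.3
print(round(math.log(p / (1 - p)), 3))  # 0.4
```

The second print confirms the round trip: applying log(p/(1-p)) to the predicted probability returns the linear predictor, exactly as the equations above require.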
Simple Logistic Regression in JMP
Watch the following video on performing a simple logistic regression in JMP using only one predictor.
Simple Logistic Regression
Review the one-page summary overview of how to conduct a simple logistic regression in JMP.
Logistic Regression Example
This chapter provides an overview of logistic regression using the example of predicting alcohol use among drivers in fatal crashes in Louisiana. First, we exclude all rows in the data file that do not have a known BAC value. This is done by selecting all rows with Known=No, as shown in Figure 1, and then choosing Rows and Exclude; the rows will then be excluded from any other operation in JMP.

If we had only one predictor, then Fit Y by X in the Analyze platform would provide a simple logistic regression.

Instructions
1. From an open JMP® data table, select Analyze > Fit Y by X.
2. Click on a categorical variable from Select Columns.
3. Click Y, Response (nominal variables have red bars, ordinal variables have green bars) to select the variable called BAC>=0.08.
4. Click on a nominal or continuous variable.
5. Click on X, Factor (continuous variables have blue triangles).
6. Click OK to run the analysis.

By default, JMP will provide the following results:
The logistic plot, with curves of cumulative predicted (fitted) probabilities
The whole model test for model significance
Parameter estimates for the fitted model
Multiple Logistic Regression
If we have multiple predictors, we need to run a multiple regression. Instead of Fit Y by X we use Analyze > Fit Model. From an open JMP® data table, select Analyze > Fit Model. Click on a categorical variable from Select Columns, and click Y (nominal variables have red bars, ordinal variables have green bars). We again select the dependent variable BAC>=0.08. Choose explanatory variables from Select Columns, and click Add. We select the variables DR_SEX 2 and NUM_VEH, as shown in the figure below.
Multiple Logistic Regression in JMP
Watch the following video on how to perform a multiple logistic regression in JMP.
Multiple Logistic Regression
Review the one-page summary overview of how to conduct a multiple logistic regression in JMP.
JMP Output
When you have more variables and use the Fit Y by X platform, you obtain several different results, one for each predictor. The figure below shows the output for four different predictors. This includes a mosaic plot for each nominal variable and a scatter plot for each continuous variable, which can be used for data preparation instead of creating the plots yourself in the Graph menu.
Odds Ratios

Odds ratios are, as the name indicates, the ratio of two odds. For instance, the odds of winning the Powerball jackpot are 1 in 292,201,338 (the game has a 5/69 white-ball matrix plus a 1/26 Powerball matrix). The odds of winning the Louisiana Lottery are 1 in 3,838,380. The odds ratio is 292,201,338/3,838,380 = 76.1. Hence, the odds of a player winning the Louisiana Lottery are 76 times those of winning the Powerball. Why, then, do more people play the Powerball than the Louisiana Lottery?

The logistic regression output allows display of odds ratios, as shown in the figure below. The dependent variable is the driver having a BAC>=0.08. This output is obtained by clicking the red triangle and selecting Odds Ratios.
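The lottery comparison above is a one-line calculation; treating each "1 in N" chance as odds of 1/N, the odds ratio is the ratio of the two N values:

```python
# "1 in N" chances from the text
powerball_n = 292_201_338
louisiana_n = 3_838_380

# Odds ratio: (1/louisiana_n) / (1/powerball_n) = powerball_n / louisiana_n
ratio = powerball_n / louisiana_n
print(round(ratio, 1))  # 76.1
```

The same arithmetic underlies the odds ratios JMP reports: each one compares the odds of the outcome at two values of a predictor.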