Question 10.1

Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don't just stop when you have a good model, but interpret it too).
Answer 10.1

The US Crime file contains information from 47 states taken during the 1960s. The dataset contains the following datapoints:
Variable  Description
M         percentage of males aged 14-24 in total state population
So        indicator variable for a southern state
Ed        mean years of schooling of the population aged 25 years or over
Po1       per capita expenditure on police protection in 1960
Po2       per capita expenditure on police protection in 1959
LF        labour force participation rate of civilian urban males in the age-group 14-24
M.F       number of males per 100 females
Pop       state population in 1960 in hundred thousands
NW        percentage of nonwhites in the population
U1        unemployment rate of urban males 14-24
U2        unemployment rate of urban males 35-39
Wealth    median value of transferable assets or family income
Ineq      income inequality: percentage of families earning below half the median income
Prob      probability of imprisonment: ratio of number of commitments to number of offenses
Time      average time in months served by offenders in state prisons before their first release
Crime     crime rate: number of offenses per 100,000 population in 1960
Previously we built a linear regression model using selected features from the dataset to predict the crime rate of a new city, as well as one using Principal Component Analysis (PCA) on the dataset. Now we will use regression trees and random forest models to see whether they improve the quality.

We begin with some basic exploratory data analysis: checking for outliers with boxplots (Fig. 1), visualizing the distributions (Fig. 4), and visualizing how each feature interacts with our response (Fig. 6). For the most part each factor stays within its expected range, bar a few outliers. The number of males per 100 females (M.F), the state population (Pop), and the percentage of nonwhites in the population (NW) have the most outliers. Looking at the density plots and at how each feature interacts with the response, most features stay within a normal range and are close to a normal distribution, except for So. Looking at the data this becomes apparent, as So is actually a binary indicator variable for whether the state is a southern state or not. After visualizing the data we can begin to build our regression trees and random forests.
Classification and Regression Tree (CART) models work differently than typical "math"-based models. Instead of trying to fit a line or a similar function, these models make a sequence of decisions on how to split the data to reach a decision/prediction. Each split the model creates is called a branch, and each terminal node is referred to as a leaf. Each leaf gets a simplified regression model fit on the data points that fall into it; in the basic regression tree used here, that simplified model is just the mean response of the leaf's data points.
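As one possible minimal sketch (using the rpart package named in the question; the tree-package version actually used for the results below is in the appendix code), a regression tree for this data could be grown and inspected like so:

library(rpart)
crime <- read.delim("http://www.statsci.org/data/general/uscrime.txt")
cart_fit <- rpart(Crime ~ ., data = crime, method = "anova")  # regression tree over all 15 predictors
printcp(cart_fit)                # complexity table: how much each extra split reduces the error
plot(cart_fit); text(cart_fit)   # branches show split rules; each leaf predicts the mean Crime of its cases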
Training a base model, we see that only a few predictors were used, that there are 5 branches and 6 end leaves, and that the model returns a Mean Absolute Error (MAE) of 171.93 on the test set, which is only slightly worse than that of our initial regression model.
var   n   dev         yval       splits.cutleft  splits.cutright
Po1   35  5710104.97  911.9714   <10.75          >10.75
Po1   25  1466071.04  752.7200   <7.05           >7.05
Pop   13  501357.23   612.5385   <22.5           >22.5
leaf  8   84351.88    503.3750
leaf  5   169138.80   787.2000
LF    12  432502.92   904.5833   <0.58           >0.58
leaf  7   133283.71   1017.4286
leaf  5   85287.20    746.6000
M.F   10  2024944.90  1310.1000  <96.75          >96.75
leaf  5   658284.80   1084.2000
leaf  5   856352.00   1536.0000
It is not surprising that Po1 is the primary split for the data, as it has both a strong correlation to Crime and was found to be statistically significant in our linear regression model. To test the fit and see if this is the correct number of branches, we can run 10-fold cross-validation while pruning the tree to see if the model improves.
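A minimal sketch of that pruning step, assuming the tree_model fit and the train/test split defined in the appendix code (prune.tree from the tree package collapses the fitted tree back to the requested number of leaves):

# Prune the fitted tree to k leaves and compare MAE on held-out data
for (k in 2:6) {
  pruned <- prune.tree(tree_model, best = k)   # keep only the k best terminal nodes
  preds  <- predict(pruned, testing_set)
  cat(k, "leaves, MAE =", mean(abs(preds - testing_set$Crime)), "\n")
}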
Branches  MAE
6         188.3099
5         201.8580
4         214.4362
3         232.0362
2         273.5210
Unsurprisingly, with the low number of datapoints available to fit the model, and with the small number of branches to begin with, pruning the tree does not improve the model.
Next we can look to train a Random Forest. A Random Forest, as the name suggests, is a collection of trees created at random, as opposed to one single tree. We lose explainability, but gain a better overall estimate of the data. We can loop through a large number of trees and see at which point we begin to lose quality. The number of trees with the lowest MAE will be used to train the final Random Forest model, which in this case is 12.
Trees  MAE
12     249.0137
23     260.8363
32     266.8360
89     267.6831
79     268.0012
107    268.5253
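A minimal sketch of fitting the final forest at that tree count and pulling out its variable importance, assuming the same training_set/testing_set split as in the appendix code (importance is reported as IncNodePurity, the total decrease in node impurity attributable to each feature):

library(randomForest)
set.seed(1234)
rf <- randomForest(Crime ~ ., data = training_set, ntree = 12, importance = TRUE)
preds <- predict(rf, testing_set)
mean(abs(preds - testing_set$Crime))   # test-set MAE
importance(rf)                         # per-feature importance (%IncMSE and IncNodePurity)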
Looking at the MAE against our testing set we find 201.09, which is slightly worse than the base regression tree. Looking at the model's Increase in Node Purity, we see the following features have the most importance:
Feature  Node Purity
Po1      1277314.89
Prob     556052.98
Po2      531698.20
Ineq     444998.31
Pop      359573.44
Looking over the features and their importance between the two models, we can see that police spending has the most impact, whether it be for the current year (Po1) or the previous year (Po2).
Question 10.2

Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.
Answer 10.2

One area where logistic regression could be used is in beer brewing and distribution. A concern amongst a lot of breweries is the shelf-life of their product. Logistic regression can be used to predict the likelihood of beer spoilage over time, considering factors like temperature, packaging, and storage conditions.
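A minimal sketch of what such a model could look like in R; the data frame and column names here (beer_batches, new_batch, spoiled, storage_temp, days_in_storage, packaging, light_exposure) are hypothetical placeholders for whatever a brewery actually records:

# Hypothetical example: probability that a batch has spoiled by the time it is sampled
spoilage.model <- glm(spoiled ~ storage_temp + days_in_storage + packaging + light_exposure,
                      data = beer_batches, family = binomial(link = "logit"))
summary(spoilage.model)
predict(spoilage.model, newdata = new_batch, type = "response")  # predicted spoilage probability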
Question 10.3

1. Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/ (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link="logit") in your glm function call.
2. Because the model gives a result between 0 and 1, it requires setting a threshold probability to separate between "good" and "bad" answers. In this data set, they estimate that incorrectly identifying a bad customer as good is 5 times worse than incorrectly classifying a good customer as bad. Determine a good threshold probability based on your model.
Answer
10.3
Logisitc
regression
uses
a
similiar
algorithim
to
linear
regression,
but
with
a
sigmoid
activation
function
to
return
a
probability
between
0-1
instead
of
a
continous
response.
In
logistic
regression,
the
sigmoid
function
maps
the
linear
combination
z
to
the
range
[0,
1],
representing
the
probability
of
the
binary
outcome
being
in
the
positive
class
(usually
class
1).
The
function's
S-shaped
curve
ensures
that
the
probability
remains
between
0
and
1,
making
it
suitable
for
binary
classification
tasks.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:
- $\sigma(z)$ is the sigmoid function.
- $z$ is the linear combination of the predictor variables and their associated coefficients: $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$
- $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ are the coefficients.
- $x_1, x_2, \dots, x_p$ are the predictor variables.

The logistic regression model can be expressed as follows:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-z}}$$
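As a small illustration, R's built-in plogis() is exactly this sigmoid, so a linear predictor can be turned into a probability either by hand or with it (a standalone sketch, not tied to the credit model below):

z <- -1.5 + 0.8 * 2.0    # example linear combination beta0 + beta1*x1
1 / (1 + exp(-z))        # sigmoid computed by hand
plogis(z)                # same value: the logistic CDF built into R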
The German Credit dataset contains 1000 credit applications and their outcome of either good (1) or bad (2).
Variable Name  Role     Type         Description
Attribute1     Feature  Categorical  Status of existing checking account
Attribute2     Feature  Integer      Duration in months
Attribute3     Feature  Categorical  Credit history
Attribute4     Feature  Categorical  Purpose
Attribute5     Feature  Integer      Credit amount
Attribute6     Feature  Categorical  Savings account/bonds
Attribute7     Feature  Categorical  Present employment since
Attribute8     Feature  Integer      Installment rate in percentage of disposable income
Attribute9     Feature  Categorical  Marital status
Attribute10    Feature  Categorical  Other debtors / guarantors
Attribute11    Feature  Integer      Present residence since
Attribute12    Feature  Categorical  Property
Attribute13    Feature  Integer      Age
Attribute14    Feature  Categorical  Other installment plans
Attribute15    Feature  Categorical  Housing
Attribute16    Feature  Integer      Number of existing credits at this bank
Attribute17    Feature  Categorical  Job
Attribute18    Feature  Integer      Number of people being liable to provide maintenance for
Attribute19    Feature  Binary       Telephone
Attribute20    Feature  Binary       Foreign worker
class          Target   Binary       1 = Good, 2 = Bad
The dataset is accompanied by a cost matrix, where the cost of incorrectly classifying a customer as good when they are bad is 5 times worse than classifying a customer as bad when they are good. This will need to be considered when setting a threshold for prediction and when evaluating the model.
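In other words, each false negative (a bad applicant approved) costs 5 and each false positive (a good applicant rejected) costs 1. A minimal sketch of scoring any set of predictions against that matrix, where pred and actual are assumed to be 0/1 vectors with 1 = bad:

# Total cost of a set of predictions under the dataset's 5:1 cost matrix
fn <- sum(pred == 0 & actual == 1)   # bad applicant classified as good: cost 5 each
fp <- sum(pred == 1 & actual == 0)   # good applicant classified as bad: cost 1 each
total_cost <- 5 * fn + 1 * fp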
The classes are heavily skewed towards the good results, at more than a 2:1 ratio. To counteract this we will downsample so that the number in each class is even, to avoid any potential bias in the training set.
We will begin by performing exploratory data analysis on the dataset, checking for outliers where the variable type is integer, and looking at distributions where it is either binary or categorical.
We can now train a base model using all the features to identify those that are statistically important to predicting the class. After training the model against the whole dataset, we receive an accuracy of 74%. Looking into the model, we find the following features have the most impact:
Variable  Pr(>|z|)
V5        0.01840 *
V8        0.01099 *
V1A11     5.46e-12 ***
V1A12     1.70e-07 ***
V3A30     0.01488 *
V3A31     0.01316 *
V4A41     0.00770 **
V6A61     0.02699 *
V7A72     0.03043 *
V7A73     0.03958 *
V14A141   0.04293 *
V14A142   0.04505 *
V17A171   0.03535 *
V20A201   0.00667 **
Since we now have a reduced set of parameters, we can use these to train a better model. We will use a 70/30% train/test split to avoid overfitting.
This model returns a slightly higher accuracy at 76%. However, it incorrectly classifies 25 "bad" applications as "good", which comes at a high cost. To minimize this, we will loop through various thresholds, applying the cost matrix so that each type of incorrect response is weighed correctly.
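A minimal sketch of that sweep, condensed from the fuller loop in the appendix code (preds are the model's predicted probabilities on the test set and true_values the actual classes):

# Pick the probability cutoff that minimises cost = 5*FN + 1*FP
thresholds <- seq(0.01, 0.99, by = 0.01)
costs <- sapply(thresholds, function(t) {
  pred <- ifelse(preds >= t, 1, 0)
  fn <- sum(pred == 0 & true_values == "1")   # bad applicant let through
  fp <- sum(pred == 1 & true_values == "0")   # good applicant rejected
  5 * fn + 1 * fp
})
thresholds[which.min(costs)]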
Doing this, we find that a threshold of 0.18 returns the lowest cost. This causes a decrease in accuracy to 68%, but gives the best results in terms of misclassifying a "bad" application as "good": only 4 "bad" applications are classified as "good".
Analyzing the final model returns the following formula:

$$\hat{Y} = \begin{cases} 1 & \text{if } \dfrac{1}{1+e^{-z}} > 0.18 \\ 0 & \text{otherwise} \end{cases}$$

where

$$\begin{aligned}
z ={}& -5.4276285631 + 0.0002035011\,V5 + 0.2101376376\,V8 + 2.0302865296\,V1A11 \\
&+ 1.2778372848\,V1A12 + 1.5598222459\,V3A30 + 1.2389246022\,V3A31 \\
&- 1.6316162897\,V4A41 + 0.5649107523\,V6A61 + 1.0242411208\,V7A72 \\
&+ 0.4603303719\,V7A73 + 0.9845414604\,V14A141 + 0.6207414093\,V14A142 \\
&- 0.0503063358\,V17A171 + 2.2325916676\,V20A201
\end{aligned}$$
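In practice one would not evaluate z by hand; a minimal sketch of scoring a new applicant with the fitted model and the cost-based cutoff (credit.model is the model built in the appendix code, and new_applicant is a placeholder for a row encoded the same way as the training data):

p <- predict(credit.model, newdata = new_applicant, type = "response")  # P(bad | X)
ifelse(p > 0.18, "bad credit risk", "good credit risk")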
Appendix

- Code
- Graphs
  - Crime
    o Figure 1 - Boxplots for all features
    o Figure 2 - Boxplot for M.F to examine outliers
    o Figure 3 - Boxplot for Pop to examine outliers
    o Figure 4 - Boxplot for NW to examine outliers
    o Figure 5 - Density plots for all features
    o Figure 6 - Density plot for So to examine multiple peaks
    o Figure 7 - Scatterplot to see interaction of features with response
    o Figure 8 - Scatterplot for So
    o Figure 9 - Regression Tree
    o Figure 10 - MAE vs Number of Branches
    o Figure 11 - MAE vs Number of Trees
    o Figure 12 - Error vs Number of Trees
  - Credit Data
    o Figure 13 - Exploratory Analysis of Features
    o Figure 14 - Histogram of Class Distribution
    o Figure 15 - Confusion Matrix Base Model
    o Figure 16 - Confusion Matrix Improved Model
    o Figure 17 - ROC Curve
    o Figure 18 - Confusion Matrix with ROC threshold
    o Figure 19 - Confusion Matrix with cost threshold
    o Figure 20 - Cost vs Threshold
    o Figure 21 - Residuals vs Fitted
    o Figure 22 - Q-Q Residuals
    o Figure 23 - Scale-Location
    o Figure 24 - Residuals vs Leverage
Code

install.packages("tree")
library(tree)
install.packages("randomForest")
library(randomForest)

Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Question 10.1

data <- read.delim("http://www.statsci.org/data/general/uscrime.txt")
head(data)

A data.frame: 6 x 16
[first six rows of the uscrime data: M, So, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, Wealth, Ineq, Prob, Time, Crime]
# Visual check for outliers
par(mfrow = c(2, 2))
for (name in names(data)) {
  boxplot(data[[name]], main=name)
}
par(mfrow = c(1, 1))
boxplot(data[["M.F"]], main="M.F.")
boxplot(data[["Pop"]], main="Pop")
boxplot(data[["NW"]], main="NW")

par(mfrow = c(2, 2))
for (name in names(data)) {
  density_data <- density(data[[name]])
  plot(density_data, main=paste(name, "Density Plot"), xlab=name, ylab="Density")
}
par(mfrow = c(1, 1))
density_data <- density(data[["So"]])
plot(density_data, main="So Density Plot", xlab="So", ylab="Density")

par(mfrow = c(2,2))
for (name in names(data)) {
  plot(data[[name]], data$Crime, xlab=name, ylab="Crime")
}
par(mfrow = c(1, 1))
plot(data[["So"]], data$Crime, xlab="So", ylab="Crime")

install.packages("caret")
library(caret)
library(ggplot2)
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
also installing the dependencies 'listenv', 'parallelly', 'future', 'globals', 'shape', 'future.apply', 'numDeriv', 'progressr', ...
Loading required package: ggplot2
Loading required package: lattice
#create train/test split
set.seed(1234) #for reproducibility
train_indices <- createDataPartition(data$Crime, times=1, p=.7, list=FALSE)
training_set <- data[train_indices,]
testing_set <- data[-train_indices,]

tree_model <- tree(Crime ~ ., data = training_set)
summary(tree_model)

Regression tree:
tree(formula = Crime ~ ., data = training_set)
Variables actually used in tree construction:
[1] "Po1" "Pop" "LF"  "M.F"
Number of terminal nodes:  6
Residual mean deviance:  68510 = 1987000 / 29
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -607.00  -112.40    12.57     0.00   117.90   589.80

#quality of fit
preds <- predict(tree_model, testing_set[,1:15])
MAE <- mean(abs(preds - testing_set$Crime))
MAE
171.935714285714
plot(tree_model)
text(tree_model)
print(tree_model$frame)
   var     n        dev      yval splits.cutleft splits.cutright
1  Po1    35 5710104.97  911.9714         <10.75          >10.75
2  Po1    25 1466071.04  752.7200          <7.05           >7.05
4  Pop    13  501357.23  612.5385          <22.5           >22.5
8  <leaf>  8   84351.88  503.3750
9  <leaf>  5  169138.80  787.2000
5  LF     12  432502.92  904.5833          <0.58           >0.58
10 <leaf>  7  133283.71 1017.4286
11 <leaf>  5   85287.20  746.6000
3  M.F    10 2024944.90 1310.1000         <96.75          >96.75
6  <leaf>  5  658284.80 1084.2000
7  <leaf>  5  856352.00 1536.0000

# Create a function for cross-validation
cross_val_prune <- function(data, folds, branches) {
  fold_size <- nrow(data) %/% folds
  avg_accuracy <- list()
  for (k in branches) {
    accuracy_list <- list()
    for (i in 1:folds) {
      start <- (i - 1) * fold_size + 1
      end <- ifelse(i == folds, nrow(data), i * fold_size)
      val_data <- data[start:end, ]
      train_data <- data[-c(start:end), ]
      # Train model
      prune.tree_model <- prune.tree(tree_model, best = k)
      # Make predictions
      preds <- predict(prune.tree_model, val_data[,1:15])
      MAE <- mean(abs(preds - val_data$Crime))
      # Add to list
      accuracy_list[[as.character(i)]] <- MAE
    }
    avg_accuracy[[as.character(k)]] <- mean(unlist(accuracy_list))
  }
  return(avg_accuracy)
}

result <- cross_val_prune(training_set, folds = 10, branches = 2:6)
# Print the results
accuracy_df = data.frame(Branches = names(result), MAE = unlist(result))
accuracy_df <- accuracy_df[order(accuracy_df$MAE), ]
# Create a line plot
ggplot(accuracy_df, aes(x = Branches, y = MAE)) +
  geom_point() +
  labs(x = "Number of Branches", y = "Mean Absolute Error") +
  ggtitle("MAE vs. Number of Trees")
# Create a function for cross-validation
cross_val <- function(data, folds, numtrees) {
  fold_size <- nrow(data) %/% folds
  avg_accuracy <- list()
  for (k in numtrees) {
    accuracy_list <- list()
    for (i in 1:folds) {
      start <- (i - 1) * fold_size + 1
      end <- ifelse(i == folds, nrow(data), i * fold_size)
      val_data <- data[start:end, ]
      train_data <- data[-c(start:end), ]
      # Train model
      rf_classifier <- randomForest(Crime ~ ., data = train_data, ntree = k)
      # Make predictions
      preds <- predict(rf_classifier, val_data[,1:15])
      MAE <- mean(abs(preds - val_data$Crime))
      # Add to list
      accuracy_list[[as.character(i)]] <- MAE
    }
    avg_accuracy[[as.character(k)]] <- mean(unlist(accuracy_list))
  }
  return(avg_accuracy)
}

result <- cross_val(training_set, folds = 10, numtrees = 10:500)
# Print the results
accuracy_df = data.frame(Trees = names(result), MAE = unlist(result))
accuracy_df <- head(accuracy_df[order(accuracy_df$MAE), ])
head(accuracy_df)

A data.frame: 6 x 2 (Trees <chr>, MAE <dbl>)

# Create a line plot
ggplot(accuracy_df, aes(x = Trees, y = MAE)) +
  geom_point() +
  labs(x = "Number of Trees", y = "Mean Absolute Error") +
  ggtitle("MAE vs. Number of Trees")
rf_classifier <- randomForest(Crime ~ ., data = training_set, ntree = 12)

#quality of fit
preds <- predict(rf_classifier, testing_set[,1:15])
MAE <- mean(abs(preds - testing_set$Crime))
MAE
201.094560185185

rf_classifier$importance

A matrix: 15 x 1 of type dbl
        IncNodePurity
M           202046.89
So               0.00
Ed          346167.78
Po1        1277314.89
Po2         531698.20
LF          216288.30
M.F          83758.37
Pop         359573.44
NW          168214.81
U1          263844.81
U2          238374.61
Wealth      367466.09
Ineq        444998.31
Prob        556052.98
Time        182935.93
plot(rf_classifier)
Question 10.3

credit.data <- read.delim("germancredit.txt", sep=" ", header=F)
head(credit.data)

A data.frame: 6 x 21
[first six rows of the raw German credit data: columns V1-V21]

# Identify categorical columns (excluding the target variable 'class')
categorical_cols <- names(credit.data)[sapply(credit.data, is.character) & names(credit.data) != "V21"]
categorical_cols

'V1' 'V3' 'V4' 'V6' 'V7' 'V9' 'V10' 'V12' 'V14' 'V15' 'V17' 'V19' 'V20'

# Create dummy variables for categorical columns
dummy_data <- dummyVars(V21 ~ ., data = credit.data[, c("V21", categorical_cols)])
dummy_data <- predict(dummy_data, newdata = credit.data)
# Combine the dummy variables with the original data
credit.data_encoded <- cbind(credit.data[, -which(names(credit.data) %in% categorical_cols)], dummy_data)
head(credit.data_encoded)

A data.frame: 6 x 62
[first six rows of the encoded data: the integer columns V2, V5, V8, V11, V13, V16, V18 and the target V21, plus one 0/1 dummy column per categorical level (V1A11, V1A12, ..., V20A201, V20A202)]

summary(credit.data_encoded)

[summary(credit.data_encoded): min / quartile / mean / max summaries for each of the 62 encoded columns]
# Recode the target: 0 = good, 1 = bad
credit.data_encoded$V21[credit.data_encoded$V21==1] <- 0
credit.data_encoded$V21[credit.data_encoded$V21==2] <- 1

ggplot(credit.data_encoded, aes(x = V21)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Histogram", x = "Values", y = "Frequency")
install.packages("rattle")
library(rattle)
library(dplyr)
library(pROC)
install.packages("tidymodels")
library(tidymodels)
smaller_class_size <- min(table(credit.data_encoded$V21))
balanced_data <- credit.data_encoded %>%
  group_by(V21) %>%
  sample_n(size = smaller_class_size) %>%
  ungroup()

credit.model <- glm(V21 ~ ., data = balanced_data, family = binomial(link="logit"))
summary(credit.model)

Call:
glm(formula = V21 ~ ., family = binomial(link = "logit"), data = balanced_data)

Coefficients: (13 not defined because of singularities)
[full coefficient table for the base model over all encoded features; the terms significant at the 0.05 level are V5, V8, V1A11, V1A12, V3A30, V3A31, V4A41, V6A61, V7A72, V7A73, V14A141, V14A142, V17A171 and V20A201, as summarized in the table in the answer above]

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 831.78 on 599 degrees of freedom
Residual deviance: 584.39 on 551 degrees of freedom
AIC: 682.39

Number of Fisher Scoring iterations: 5
preds <- predict(credit.model, newdata=credit.data_encoded, type="response")
threshold <- 0.5
predicted_classes <- ifelse(preds >= threshold, 1, 0)
true_values <- as.factor(credit.data_encoded$V21)
confusion_matrix <- confusionMatrix(as.factor(predicted_classes), true_values)
cm_data <- as.data.frame(as.table(confusion_matrix))
confusion_matrix

ggplot(data = cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%d", Freq)), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "darkred") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "Actual", y = "Predicted", fill = "Frequency") +
  theme(legend.position = "right")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 505  65
         1 195 235

               Accuracy : 0.74
                 95% CI : (0.7116, 0.7669)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.002908

                  Kappa : 0.4492

 Mcnemar's Test P-Value : 1.242e-15

            Sensitivity : 0.7214
            Specificity : 0.7833
         Pos Pred Value : 0.8860
         Neg Pred Value : 0.5465
             Prevalence : 0.7000
         Detection Rate : 0.5850
   Detection Prevalence : 0.5700
      Balanced Accuracy : 0.7524

       'Positive' Class : 0

[Confusion matrix heat map for the base model]
#create train/test split
set.seed(1234) #for reproducibility
train_indices <- createDataPartition(credit.data_encoded$V21, times=1, p=.7, list=FALSE)
training_set <- balanced_data[train_indices, ]
testing_set <- balanced_data[-train_indices, ]

credit.model <- glm(V21 ~ V5 + V8 + V1A11 + V1A12 + V3A30 + V3A31 + V4A41 + V6A61 + V7A72 +
                      V7A73 + V14A141 + V14A142 + V17A171 + V20A201,
                    data = training_set, family = binomial(link = "logit"))
summary(credit.model)

Call:
glm(formula = V21 ~ V5 + V8 + V1A11 + V1A12 + V3A30 + V3A31 +
    V4A41 + V6A61 + V7A72 + V7A73 + V14A141 + V14A142 + V17A171 +
    V20A201, family = binomial(link = "logit"), data = training_set)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.428e+00  1.025e+00  -5.294 1.20e-07 ***
V5           2.035e-04  5.125e-05   3.971 7.16e-05 ***
V8           2.101e-01  1.122e-01   1.872  0.06114 .
V1A11        2.030e+00  2.996e-01   6.776 1.24e-11 ***
V1A12        1.278e+00  2.903e-01   4.402 1.07e-05 ***
V3A30        1.560e+00  7.658e-01   2.037  0.04168 *
V3A31        1.239e+00  5.862e-01   2.114  0.03455 *
V4A41       -1.632e+00  4.773e-01  -3.418  0.00063 ***
V6A61        5.649e-01  2.553e-01   2.213  0.02690 *
V7A72        1.024e+00  3.253e-01   3.149  0.00164 **
V7A73        4.603e-01  2.721e-01   1.691  0.09074 .
V14A141      9.845e-01  3.523e-01   2.794  0.00520 **
V14A142      6.207e-01  6.421e-01   0.967  0.33369
V17A171     -5.031e-02  7.116e-01  -0.071  0.94364
V20A201      2.233e+00  8.677e-01   2.573  0.01008 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 576.69 on 415 degrees of freedom
Residual deviance: 438.45 on 401 degrees of freedom
  (284 observations deleted due to missingness)
AIC: 468.45

Number of Fisher Scoring iterations: 5
preds <- predict(credit.model, newdata=testing_set, type="response")
threshold <- 0.5
predicted_classes <- ifelse(preds >= threshold, 1, 0)
true_values <- as.factor(testing_set$V21)
confusion_matrix <- confusionMatrix(as.factor(predicted_classes), true_values)
cm_data <- as.data.frame(as.table(confusion_matrix))
confusion_matrix

ggplot(data = cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%d", Freq)), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "darkred") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "Actual", y = "Predicted", fill = "Frequency") +
  theme(legend.position = "right")

roc_obj <- roc(response = true_values, predictor = preds)
# Plot the ROC curve with AUC
plot.roc(roc_obj, print.auc = TRUE, auc.polygon = TRUE, grid = TRUE, legacy.axes = TRUE)
preds <- predict(credit.model, newdata=testing_set, type="response")
threshold <- 0.806
predicted_classes <- ifelse(preds >= threshold, 1, 0)
true_values <- as.factor(testing_set$V21)
confusion_matrix <- confusionMatrix(as.factor(predicted_classes), true_values)
cm_data <- as.data.frame(as.table(confusion_matrix))
confusion_matrix

ggplot(data = cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%d", Freq)), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "darkred") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "Actual", y = "Predicted", fill = "Frequency") +
  theme(legend.position = "right")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 84 68
         1  7 25

               Accuracy : 0.5924
                 95% CI : (0.5177, 0.6641)
    No Information Rate : 0.5054
    P-Value [Acc > NIR] : 0.01098

                  Kappa : 0.1905

 Mcnemar's Test P-Value : 4.262e-12

            Sensitivity : 0.9231
            Specificity : 0.2688
         Pos Pred Value : 0.5526
         Neg Pred Value : 0.7813
             Prevalence : 0.4946
         Detection Rate : 0.4565
   Detection Prevalence : 0.8261
      Balanced Accuracy : 0.5959

       'Positive' Class : 0

[Confusion matrix heat map for this threshold]
# Define the cost matrix
cost_matrix = matrix(c(0, 5, 1, 0), nrow = 2)

# Compute the total cost for different thresholds
thresholds <- seq(0, 1, by = 0.01)
total_costs <- numeric(length(thresholds))

for (i in 1:length(thresholds)) {
  # Apply the threshold to predicted probabilities
  thresholded_predictions <- ifelse(preds >= thresholds[i], 1, 0)
  # Create a confusion matrix
  confusion_matrix <- confusionMatrix(as.factor(thresholded_predictions), true_values)
  fn <- confusion_matrix$table[1, 2]
  fp <- confusion_matrix$table[2, 1]
  # Calculate the total cost based on the cost matrix
  total_costs[i] <- fn*5 + fp*1
}

# Find the threshold with the lowest total cost
best_threshold <- thresholds[which.min(total_costs)]
cat("Best Threshold:", best_threshold, "\n")

# Use the best threshold to classify your predictions
classified_predictions <- ifelse(preds >= best_threshold, 1, 0)

# Evaluate your model using the best threshold and cost-sensitive metrics
confusion_matrix <- confusionMatrix(as.factor(classified_predictions), true_values)
cm_data <- as.data.frame(as.table(confusion_matrix))

ggplot(data = cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%d", Freq)), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "darkred") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "Actual", y = "Predicted", fill = "Frequency") +
  theme(legend.position = "right")

confusion_matrix
Warning message (repeated for several thresholds) in confusionMatrix.default(as.factor(thresholded_predictions), true_values):
"Levels are not in the same order for reference and data. Refactoring data to match."

Best Threshold: 0.18

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 37  4
         1 54 89

               Accuracy : 0.6848
                 95% CI : (0.6123, 0.7512)
    No Information Rate : 0.5054
    P-Value [Acc > NIR] : 6.222e-07

                  Kappa : 0.3657

 Mcnemar's Test P-Value : 1.243e-10

            Sensitivity : 0.4066
            Specificity : 0.9570
         Pos Pred Value : 0.9024
         Neg Pred Value : 0.6224
             Prevalence : 0.4946
         Detection Rate : 0.2011
   Detection Prevalence : 0.2228
      Balanced Accuracy : 0.6818

       'Positive' Class : 0

[Confusion matrix heat map for the cost-based threshold]
total_costs <- data.frame(Cost = total_costs, Threshold = thresholds)

plot <- ggplot(total_costs, aes(Threshold, Cost)) +
  geom_line() +
  geom_point(data = total_costs[total_costs$Cost == min(total_costs$Cost), ],
             aes(Threshold, Cost), color = "red", size = 3) +
  labs(
    title = "Total Costs vs Threshold",
    x = "Threshold",
    y = "Cost"
  )
plot
plot(credit.model)
data.frame(credit.model$coefficients)
A data.frame: 15 x 1
            credit.model.coefficients
(Intercept)            -5.4276285631
V5                      0.0002035011
V8                      0.2101376376
V1A11                   2.0302865296
V1A12                   1.2778372848
V3A30                   1.5598222459
V3A31                   1.2389246022
V4A41                  -1.6316162897
V6A61                   0.5649107523
V7A72                   1.0242411208
V7A73                   0.4603303719
V14A141                 0.9845414604
V14A142                 0.6207414093
V17A171                -0.0503063358
V20A201                 2.2325916676
[Figures 1-24 appear here as rendered plots: the boxplots and density plots for each crime feature, the feature-vs-Crime scatterplots, the regression tree diagram, the MAE vs. number of branches and MAE/error vs. number of trees curves, the class-distribution histogram, the confusion matrix heat maps, the ROC curve (AUC: 0.806), the cost vs. threshold curve, and the glm diagnostic plots (Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage), as listed in the Appendix above.]