Project 2 - Binary Classification Comparative Methods

For this project we're going to attempt a binary classification of a dataset using multiple methods and compare results.

Our goals for this project will be to introduce you to several of the most common classification techniques, how to perform them and tweak parameters to optimize outcomes, how to produce and interpret results, and how to compare performance. You will be asked to analyze your findings and provide explanations for observed performance.
DEFINITIONS

Binary Classification: In this case a complex dataset has an added 'target' label with one of two options. Your learning algorithm will try to assign one of these labels to the data.

Supervised Learning: This data is fully supervised, which means it's been fully labeled and we can trust the veracity of the labeling.
Submission Details
Project is due May 17th at 12:00 pm (Wednesday Noon). To submit the project, please save the
notebook as a pdf file and submit the assignment via Gradescope. In addition, make sure that all
figures are legible and su
ff
iciently large. For best pdf results, we recommend downloading Latex
and print the notebook using Latex.
Loading Essentials and Helper Functions
In [ ]:
#Here are a set of libraries we imported to complete this assignment.
#Feel free to use these or equivalent libraries for your implementation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # this is used to plot the graph
import matplotlib
import os
import time

#Sklearn classes
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn import metrics
from sklearn.svm import SVC #SVM classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
import sklearn.metrics.cluster as smc
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, Normalizer, MinMaxScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from matplotlib import pyplot
import itertools

%matplotlib inline

#Sets random seed
import random
random.seed(42)
In [ ]:
# Helper function allowing you to export a graph
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
In [ ]:
# Helper function that allows you to draw nicely formatted confusion matrices
def draw_confusion_matrix(y, yhat, classes):
    '''
    Draws a confusion matrix for the given target and predictions
    Adapted from scikit-learn and discussion example.
    '''
    plt.cla()
    plt.clf()
    matrix = confusion_matrix(y, yhat)
    plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.YlOrBr)
    plt.title("Confusion Matrix")
    plt.colorbar()
    num_classes = len(classes)
    plt.xticks(np.arange(num_classes), classes, rotation=90)
    plt.yticks(np.arange(num_classes), classes)
    fmt = 'd'
    thresh = matrix.max() / 2.
    for i, j in itertools.product(range(matrix.shape[0]), range(matrix.shape[1])):
        plt.text(j, i, format(matrix[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if matrix[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()
In [ ]:
def heatmap(data, row_labels, col_labels, figsize=(20, 12), cmap="YlGn",
            cbar_kw={}, cbarlabel="", valfmt="{x:.2f}",
            textcolors=("black", "white"), threshold=None):
    """
    Create a heatmap from a numpy array and two lists of labels. Taken from matplotlib example.

    Parameters
    ----------
    data
        A 2D numpy array of shape (M, N).
    row_labels
        A list or array of length M with the labels for the rows.
    col_labels
        A list or array of length N with the labels for the columns.
    cmap
        A string that specifies the colormap to use. Look at matplotlib docs for information.
        Optional.
    cbar_kw
        A dictionary with arguments to `matplotlib.Figure.colorbar`. Optional.
    cbarlabel
        The label for the colorbar. Optional.
    valfmt
        The format of the annotations inside the heatmap. This should either
        use the string format method, e.g. "${x:.2f}", or be a
        `matplotlib.ticker.Formatter`. Optional.
    textcolors
        A pair of colors. The first is used for values below a threshold,
        the second for those above. Optional.
    threshold
        Value in data units according to which the colors from textcolors are
        applied. If None (the default) uses the middle of the colormap as
        the separation point. Optional.
    """
    plt.figure(figsize=figsize)
    ax = plt.gca()

    # Plot the heatmap
    im = ax.imshow(data, cmap=cmap)

    # Create colorbar
    cbar = ax.figure.colorbar(im, ax=ax, **cbar_kw)
    cbar.ax.set_ylabel(cbarlabel, rotation=-90, va="bottom")

    # Show all ticks and label them with the respective list entries.
    ax.set_xticks(np.arange(data.shape[1]), labels=col_labels)
    ax.set_yticks(np.arange(data.shape[0]), labels=row_labels)

    # Let the horizontal axes labeling appear on top.
    ax.tick_params(top=True, bottom=False, labeltop=True, labelbottom=False)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=-30, ha="right", rotation_mode="anchor")

    # Turn spines off and create white grid.
    ax.spines[:].set_visible(False)
    ax.set_xticks(np.arange(data.shape[1] + 1) - .5, minor=True)
    ax.set_yticks(np.arange(data.shape[0] + 1) - .5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.tick_params(which="minor", bottom=False, left=False)

    # Normalize the threshold to the images color range.
    if threshold is not None:
        threshold = im.norm(threshold)
    else:
        threshold = im.norm(data.max()) / 2.

    # Set default alignment to center, but allow it to be
    # overwritten by textkw.
    kw = dict(horizontalalignment="center", verticalalignment="center")

    # Get the formatter in case a string is supplied
    if isinstance(valfmt, str):
        valfmt = matplotlib.ticker.StrMethodFormatter(valfmt)

    # Loop over the data and create a `Text` for each "pixel".
    # Change the text's color depending on the data.
    texts = []
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            kw.update(color=textcolors[int(im.norm(data[i, j]) > threshold)])
            text = im.axes.text(j, i, valfmt(data[i, j], None), **kw)
            texts.append(text)
In [ ]:
def make_meshgrid(x, y, h=0.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = plt.contourf(xx, yy, Z, **params)
    return out

def draw_contour(x, y, clf, class_labels=["Negative", "Positive"]):
    """
    Draws a contour line for the predictor

    Assumption that x has only two features. This function only plots the first two columns.
    """
    X0, X1 = x[:, 0], x[:, 1]
    xx0, xx1 = make_meshgrid(X0, X1)
    plt.figure(figsize=(10, 6))
    plot_contours(clf, xx0, xx1, cmap="PiYG", alpha=0.8)
    scatter = plt.scatter(X0, X1, c=y, cmap="PiYG", s=30, edgecolors="k")
    plt.legend(handles=scatter.legend_elements()[0], labels=class_labels)
    plt.xlim(xx0.min(), xx0.max())
    plt.ylim(xx1.min(), xx1.max())
Example Project

In this part, we will go over how to perform a binary classification task using a variety of models. We will provide examples of how to train and evaluate these models.

Dataset Description

Healthcare is an important industry that uses machine learning to aid doctors in diagnosing many different kinds of illnesses and diseases. For this example project, we will be using the Breast Cancer Wisconsin Dataset to determine whether a mass found in a body is benign or malignant.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
Feature Information:
Column 1: ID number
Column 2: Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)
Due to the statistical nature of the test, we are not able to get exact measurements of the previous values.
Instead, the dataset contains the mean and standard error of the real-valued features.
Columns 3-12 present the mean of the measured values
Columns 13-22 present the standard error of the measured values
Load and Analyze the dataset
In [ ]:
#Load Data
data = pd.read_csv('datasets/breast_cancer_data.csv')
Always look at your dataset after loading it. Use information from .describe and .info to learn more about the dataset.

In [ ]:
data.head(5)

Out[ ]:
         id  diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  ...
0    842302          M        17.99         10.38          122.80     1001.0          0.11840  ...
1    842517          M        20.57         17.77          132.90     1326.0          0.08474  ...
2  84300903          M        19.69         21.25          130.00     1203.0          0.10960  ...
3  84348301          M        11.42         20.38           77.58      386.1          0.14250  ...
4  84358402          M        20.29         14.34          135.10     1297.0          0.10030  ...

5 rows × 22 columns

In [ ]:
data.describe()

Out[ ]:
                 id  radius_mean  texture_mean  perimeter_mean    area_mean  smoothness_mean  ...
count  5.690000e+02   569.000000    569.000000      569.000000   569.000000       569.000000  ...
mean   3.037183e+07    14.127292     19.289649       91.969033   654.889104         0.096360  ...
std    1.250206e+08     3.524049      4.301036       24.298981   351.914129         0.014064  ...
min    8.670000e+03     6.981000      9.710000       43.790000   143.500000         0.052630  ...
25%    8.692180e+05    11.700000     16.170000       75.170000   420.300000         0.086370  ...
50%    9.060240e+05    13.370000     18.840000       86.240000   551.100000         0.095870  ...
75%    8.813129e+06    15.780000     21.800000      104.100000   782.700000         0.105300  ...
max    9.113205e+08    28.110000     39.280000      188.500000  2501.000000         0.163400  ...

8 rows × 21 columns

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   id                      569 non-null    int64
 1   diagnosis               569 non-null    object
 2   radius_mean             569 non-null    float64
 3   texture_mean            569 non-null    float64
 4   perimeter_mean          569 non-null    float64
 5   area_mean               569 non-null    float64
 6   smoothness_mean         569 non-null    float64
 7   compactness_mean        569 non-null    float64
 8   concavity_mean          569 non-null    float64
 9   concave points_mean     569 non-null    float64
 10  symmetry_mean           569 non-null    float64
 11  fractal_dimension_mean  569 non-null    float64
 12  radius_se               569 non-null    float64
 13  texture_se              569 non-null    float64
 14  perimeter_se            569 non-null    float64
 15  area_se                 569 non-null    float64
 16  smoothness_se           569 non-null    float64
 17  compactness_se          569 non-null    float64
 18  concavity_se            569 non-null    float64
 19  concave points_se       569 non-null    float64
 20  symmetry_se             569 non-null    float64
 21  fractal_dimension_se    569 non-null    float64
dtypes: float64(20), int64(1), object(1)
memory usage: 97.9+ KB
While .info shows that every column has 569 non-null values out of 569 entries, it is good to explicitly check for nulls.
In [ ]:
data.isnull().sum()

Out[ ]:
id                        0
diagnosis                 0
radius_mean               0
texture_mean              0
perimeter_mean            0
area_mean                 0
smoothness_mean           0
compactness_mean          0
concavity_mean            0
concave points_mean       0
symmetry_mean             0
fractal_dimension_mean    0
radius_se                 0
texture_se                0
perimeter_se              0
area_se                   0
smoothness_se             0
compactness_se            0
concavity_se              0
concave points_se         0
symmetry_se               0
fractal_dimension_se      0
dtype: int64

Awesome! No need for imputation!

While we are looking at the dataset, we shall remove the "id" column.

In [ ]:
data = data.drop(["id"], axis=1)

Looking at the target labels

For this project, we wish to classify the diagnosis column.

In [ ]:
data["diagnosis"]

Out[ ]:
0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

We need to transform this column into a numerical column so that we may use it in our models. To do this, we will employ the LabelEncoder to automatically transform all the target labels.

In [ ]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis'])
print(le.classes_)

['B' 'M']

In [ ]:
data['diagnosis']

Out[ ]:
0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64

Let's look at a histogram of the full dataset.

It's always good to get a global view of your datasets by looking at their histograms. You might see some interesting trends.

In [ ]:
data.hist(figsize=(20, 15))

Out[ ]:
array([[<Axes: title={'center': 'diagnosis'}>,
        <Axes: title={'center': 'radius_mean'}>,
        <Axes: title={'center': 'texture_mean'}>,
        <Axes: title={'center': 'perimeter_mean'}>,
        <Axes: title={'center': 'area_mean'}>],
       [<Axes: title={'center': 'smoothness_mean'}>,
        <Axes: title={'center': 'compactness_mean'}>,
        <Axes: title={'center': 'concavity_mean'}>,
        <Axes: title={'center': 'concave points_mean'}>,
        <Axes: title={'center': 'symmetry_mean'}>],
       [<Axes: title={'center': 'fractal_dimension_mean'}>,
        <Axes: title={'center': 'radius_se'}>,
        <Axes: title={'center': 'texture_se'}>,
        <Axes: title={'center': 'perimeter_se'}>,
        <Axes: title={'center': 'area_se'}>],
       [<Axes: title={'center': 'smoothness_se'}>,
        <Axes: title={'center': 'compactness_se'}>,
        <Axes: title={'center': 'concavity_se'}>,
        <Axes: title={'center': 'concave points_se'}>,
        <Axes: title={'center': 'symmetry_se'}>],
       [<Axes: title={'center': 'fractal_dimension_se'}>, <Axes: >,
        <Axes: >, <Axes: >, <Axes: >]], dtype=object)
From the histograms, we can see some interesting trends. Possible observations:

Many of the _se columns are heavily skewed towards low values and have large tails.
Many of the _mean columns look more Gaussian in shape.
There is a large disparity between the ranges of certain features. For example, radius_mean runs from about 7 to 28 while smoothness_mean lies in the range [0.05, 0.15]. This indicates we will have to normalize or standardize the features if the models are sensitive to feature scale.
Looking at the correlation matrix to get an idea about which features are important

In [ ]:
correlations = data.corr()
columns = list(data)

#Creates the heatmap
heatmap(correlations.values, columns, columns, figsize=(20, 12), cmap="hsv")

In [ ]:
#Let's specifically look at the correlations of our target feature
correlations["diagnosis"].sort_values(ascending=False)

Out[ ]:
diagnosis 1.000000
concave points_mean 0.776614
perimeter_mean 0.742636
radius_mean 0.730029
area_mean 0.708984
concavity_mean 0.696360
compactness_mean 0.596534
radius_se 0.567134
perimeter_se 0.556141
area_se 0.548236
texture_mean 0.415185
concave points_se 0.408042
smoothness_mean 0.358560
symmetry_mean 0.330499
compactness_se 0.292999
concavity_se 0.253730
fractal_dimension_se 0.077972
symmetry_se -0.006522
texture_se -0.008303
fractal_dimension_mean -0.012838
smoothness_se -0.067016
Name: diagnosis, dtype: float64
We can see that there is a lot of correlation between the features and the target label. Thus, we can expect to learn something from the data.

When doing classification, check if classes are heavily imbalanced.

It is important that the dataset does not prefer one class over any others. Otherwise, it may bias the model to not learn the minority classes well.

Let's use a histogram and count the number of elements in each class.

In [ ]:
data['diagnosis'].hist(bins=2, figsize=(5, 5))
data['diagnosis'].value_counts()

Out[ ]:
0    357
1    212
Name: diagnosis, dtype: int64
There is a bit of an imbalance which is something to keep in mind if we find that our models do not perform
well on the minority classes. For our purposes, this imbalance is not big enough to be an issue so we will
not perform balancing techniques for this dataset.
Since the dataset is small though, we want to be careful when making training and testing splits to ensure
that there is enough of each class for both splits. We will show how to perform this shortly.
Setting up the data

Before starting any model training, we have to split up the target labels from our features.

In [ ]:
y = data["diagnosis"]
x = data.drop(["diagnosis"], axis=1)

Now, we also split the data into training and testing data. To ensure that there is not an imbalance of classes in the training and testing set, we will use the stratify parameter in train_test_split to perform stratified sampling on the data (recall from lecture how stratified sampling is performed).

Note that we named the input feature data raw to indicate that there has been no pre-processing on it, such as standardization. Shortly, we will show the effect that pre-processing has on the performance of the model.

In [ ]:
train_raw, test_raw, target, target_test = train_test_split(x, y, test_size=0.2, stratify=y)

Let us quickly test that the splits are somewhat balanced.

In [ ]:
#Training classes
target.hist(bins=2, figsize=(5, 5))
target.value_counts()

Out[ ]:
0    285
1    170
Name: diagnosis, dtype: int64

In [ ]:
#Testing classes
target_test.hist(bins=2, figsize=(5, 5))
target_test.value_counts()

Out[ ]:
0    72
1    42
Name: diagnosis, dtype: int64
We can see that the class balance is about the same as before the split. In fact, if a classifier just guessed class 0 every time, it would have an accuracy of $100 \cdot \frac{72}{72+42} = 63.15\%$. We can consider this the baseline accuracy to compare against.
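As a quick sanity check, this baseline can be computed directly. Here is a minimal sketch using scikit-learn's DummyClassifier (assuming the train_raw/test_raw split defined above):

In [ ]:
from sklearn.dummy import DummyClassifier

#Majority-class baseline: always predicts the most frequent class in the training set (0 = benign)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train_raw, target)
print("Baseline accuracy:", baseline.score(test_raw, target_test)) #~72/114 = 0.6315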
Models for Classification: KNN

For our first model, we will use KNN classification. This is a model we have seen many times throughout the course and it will be interesting to see how well it performs.
Simple KNN classification with K = 3

Let us try KNN on the raw data with simply 3 nearest neighbors. We use the sklearn metrics library to calculate the measures of interest. In this case, we focus on accuracy.

In [ ]:
# k-Nearest Neighbors algorithm
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_raw, target)
predicted = knn.predict(test_raw)

In [ ]:
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy:    0.877193

We can see that there is already a huge improvement in accuracy in comparison to the baseline of 63.15%. Let's see the effect that standardizing the input features has on the KNN performance.

Effect of pre-processing on KNN

In [ ]:
#Since all features are real-valued, we only have one pipeline
pipeline = Pipeline([('scaler', StandardScaler())])

#Transform raw data
train = pipeline.fit_transform(train_raw)
test = pipeline.transform(test_raw) #Note that there is no fit call

In [ ]:
# k-Nearest Neighbors algorithm
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train, target)
testing_result = knn.predict(test)
predicted = knn.predict(test)

In [ ]:
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy:    0.921053

We can see that with pre-processing we were able to get a much better classification accuracy.

Here we only used StandardScaler. Let's see if other pre-processing techniques could also have worked. As such, let's look at MinMaxScaler and Normalizer:

In [ ]:
preprocessors = [StandardScaler(), MinMaxScaler(), Normalizer()]
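Before comparing them, it helps to recall what each transform actually does (standard scikit-learn semantics, stated here for reference): StandardScaler standardizes each feature as $z = \frac{x - \mu}{\sigma}$; MinMaxScaler rescales each feature to $[0, 1]$ via $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$; Normalizer instead rescales each sample (row) to unit norm, $x' = \frac{x}{\|x\|_2}$, which discards each sample's overall magnitude.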
In [ ]:
for pre in preprocessors:
    pipeline = Pipeline([('preprocessor', pre)])

    #Transform raw data
    train = pipeline.fit_transform(train_raw)
    test = pipeline.transform(test_raw) #Note that there is no fit call

    # k-Nearest Neighbors algorithm
    knn = KNeighborsClassifier(n_neighbors=7)
    knn.fit(train, target)
    testing_result = knn.predict(test)
    predicted = knn.predict(test)
    print(pre)
    print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

StandardScaler()
Accuracy:    0.903509
MinMaxScaler()
Accuracy:    0.912281
Normalizer()
Accuracy:    0.885965

We can see that MinMaxScaler performed about as well as StandardScaler, while Normalizer did not improve the model as much.

Visualizing decision boundaries for KNN

It's always nice to see the decision boundaries a model decides upon. Let's see how the decision boundary changes as a function of k when only using the two features most correlated with the target labels: concave points_mean and perimeter_mean.
In [ ]:
#Extract first two features and use the StandardScaler
train_2 = StandardScaler().fit_transform(train_raw[['concave points_mean', 'perimeter_mean']])

k_r = [1, 3, 5, 7]
for k in k_r:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_2, target)
    draw_contour(train_2, target, knn, class_labels=['Benign', 'Malignant'])
    plt.title(f"K = {k}")
We can see that as k gets larger, the decision boundary gets smoother.
Models for Classification: Logistic Regression

While KNN is a very powerful model, it does come with a few issues:

It requires storing the full training dataset.
Prediction is done by comparing a new sample with all samples in the training set, which is time-consuming.

These issues arise because KNN is a non-parametric model, which means that it does not summarize the data into a finite set of parameters.

Let us now look at Logistic Regression, which is an example of a parametric model.
Simple Logistic Regression

First, let us see how logistic regression performs without any regularization.

In [ ]:
log_reg = LogisticRegression(penalty="l2", max_iter=1000, solver="lbfgs", C=10**30)
#C is chosen to be high to remove regularization
#We could have chosen penalty = "none" since lbfgs supports it but this option is not po...
log_reg.fit(train_raw, target)
testing_result = log_reg.predict(test_raw)
predicted = log_reg.predict(test_raw)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy:    0.964912

/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

We can see that Logistic Regression is actually performing much better than any of the KNN models we tried. We can also see the parameters that the model learned.

In [ ]:
#Parameters for each feature
log_reg.coef_

Out[ ]:
array([[ 9.49657288e+00,  5.29884067e-01, -1.70890677e+00,
         1.80971480e-02,  4.25364740e+01,  2.49869264e+01,
         7.58303604e+01,  6.71108523e+01,  2.97123432e+01,
         1.51697880e-01, -3.38079100e+01, -2.88171687e+00,
         8.24165745e-01,  4.32658867e-01, -2.05081068e+00,
        -4.98690890e+01, -6.82370605e+01, -7.81039978e+00,
        -1.70256855e+01, -1.11729757e+01]])

In [ ]:
#Intercept term
log_reg.intercept_

Out[ ]:
array([-17.051348])

In [ ]:
print("Number of Features in data:", train_raw.shape[1])
print("Number of Parameters:", len(log_reg.coef_[0]))

Number of Features in data: 20
Number of Parameters: 20

Since we are using Logistic Regression, where we model the log odds with a linear function, it makes sense that we have a parameter/coefficient for each input feature.
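Concretely, logistic regression models the log odds as a linear function, $\log\frac{P(y=1\mid x)}{1 - P(y=1\mid x)} = w^T x + b$, so with 20 input features the model learns a 20-entry weight vector $w$ (log_reg.coef_) plus a single intercept $b$ (log_reg.intercept_).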
Parameters for Logistic Regression

In Sci-kit Learn, the following are just some of the parameters we can pass into Logistic Regression:

penalty: {'l1', 'l2', 'elasticnet', 'none'}, default="l2"
    Specifies the type of regularization to use. Not all penalties work for each solver.
C: positive float, default=1
    Inverse of the regularization strength. You can treat C as $\frac{1}{\lambda}$ as shown in lecture. Thus, as C gets smaller, the regularization strength increases.
solver: {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
    Algorithm to use in the optimization problem. Each algorithm solves logistic regression using different iterative methods that are based on the gradient. Read the sci-kit learn documentation for more information.
max_iter: int, default=100
    Maximum number of iterations taken for the solvers to converge.
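For the "l2" penalty, scikit-learn's documented objective makes the role of C explicit: it minimizes $\frac{1}{2}\|w\|_2^2 + C\sum_i \log\left(1 + e^{-y_i(w^T x_i + b)}\right)$, so a smaller C weights the regularizer more heavily relative to the data-fit term.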
Each parameter has a different effect on the model. Let's look at how the choice of max_iter affects the model performance on the raw data and the standardized dataset.

In [ ]:
#Since all features are real-valued, we only have one pipeline
preprocesser = Pipeline([('scaler', StandardScaler())])

#Transform raw data
train = preprocesser.fit_transform(train_raw)
test = preprocesser.transform(test_raw) #Note that there is no fit call

In [ ]:
log_reg = LogisticRegression(penalty="l2", max_iter=1000, solver="lbfgs", C=0.01)

#Train raw is the data before preprocessing
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)
print("%-12s %f" % ('Raw Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

#Train is the data after preprocessing (using StandardScaler)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Preprocessed Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

Raw Data Accuracy: 0.938596
Preprocessed Data Accuracy: 0.947368

We see that the accuracies are pretty close to each other. Let's see what happens when we decrease the max_iter.
In [ ]:
log_reg = LogisticRegression(penalty="l2", max_iter=70, solver="lbfgs", C=0.01)

#Train raw is the data before preprocessing
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)
print("%-12s %f" % ('Raw Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

Raw Data Accuracy: 0.921053

/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Oops! The model did not converge. It seems that the scale of the features strongly affects the convergence speed of the iterative algorithm. As suggested, we can fix this issue by increasing max_iter, re-scaling the data, or using a different solver.

In [ ]:
#Train is the data after preprocessing (using StandardScaler)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Preprocessed Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

Preprocessed Data Accuracy: 0.947368

Cross Validation for Logistic Regression

Let us do a little experiment using cross validation to see how each term affects the logistic regression. We will perform this example on the standardized data.
In [ ]:
#You may even do Cross validation for classification
from sklearn.model_selection import GridSearchCV

#Note that this is a list of dicts
#Each dict describes the combination of parameters to check
parameters = [
    {"penalty": ["l2"],
     "C": [0.01, 1, 100],
     "solver": ["lbfgs", "liblinear"]}, #These solvers support penalty = "l2"
    {"penalty": ["none"],
     "C": [1], #Specified to prevent error message
     "solver": ["lbfgs", "newton-cg"]}, #These solvers support penalty = "none"
]

#instantiate model
#Implementing cross validation
k = 3
kf = KFold(n_splits=k, random_state=None)
log_reg = LogisticRegression(penalty="none", max_iter=1000, solver="lbfgs") #will change parameters during CV
grid = GridSearchCV(log_reg, parameters, cv=kf, scoring="accuracy")
grid.fit(train, target)
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/scipy/optimize/_linesearch.py:457: LineSearchWarning: The line search algorithm did not converge
  warn('The line search algorithm did not converge', LineSearchWarning)
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/scipy/optimize/_linesearch.py:306: LineSearchWarning: The line search algorithm did not converge
  warn('The line search algorithm did not converge', LineSearchWarning)
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/utils/optimize.py:210: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations.
  warnings.warn(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
Out[ ]:
GridSearchCV
  estimator: LogisticRegression

In [ ]:
#Put results into Dataframe
res = pd.DataFrame(grid.cv_results_)
res
Out[ ]:
   mean_fit_time  std_fit_time  mean_score_time  std_score_time param_C param_penalty param_solver                                               params
0       0.001854      0.000687         0.000315        0.000036    0.01            l2        lbfgs      {'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}
1       0.000816      0.000111         0.000239        0.000015    0.01            l2    liblinear  {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
2       0.002188      0.000165         0.000252        0.000028       1            l2        lbfgs         {'C': 1, 'penalty': 'l2', 'solver': 'lbfgs'}
3       0.000956      0.000034         0.000232        0.000019       1            l2    liblinear     {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
4       0.007632      0.000695         0.000230        0.000003     100            l2        lbfgs       {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
5       0.001523      0.000059         0.000219        0.000009     100            l2    liblinear   {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
6       0.025101      0.019992         0.000318        0.000037       1          none        lbfgs       {'C': 1, 'penalty': 'none', 'solver': 'lbfgs'}
7       2.844657      4.010496         0.000411        0.000169       1          none    newton-cg   {'C': 1, 'penalty': 'none', 'solver': 'newton-cg'}

In [ ]:
#Extract the columns that specify the score and the parameters for each row
res[["rank_test_score", "param_C", "param_penalty", "param_solver", "mean_test_score"]]

Out[ ]:
   rank_test_score param_C param_penalty param_solver  mean_test_score
0                7    0.01            l2        lbfgs         0.916463
1                4    0.01            l2    liblinear         0.934051
2                2       1            l2        lbfgs         0.956024
3                1       1            l2    liblinear         0.958232
4                6     100            l2        lbfgs         0.934007
5                3     100            l2    liblinear         0.936200
6                5       1          none        lbfgs         0.934007
7                8       1          none    newton-cg         0.820001
We can see that the choice of these parameters can strongly affect the performance of the classifier. Let's check the performance of the best parameters on the test set.

In [ ]:
predicted = grid.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy:    0.938596

Note that this test accuracy is not as good as some of the other logistic regression examples we've shown.

Speedtest between KNN and Logistic Regression

Let's see how long KNN and Logistic Regression take to perform training and testing.

In [ ]:
scaler = StandardScaler()
train = scaler.fit_transform(train_raw)
test = scaler.transform(test_raw) #use transform (not fit_transform) on the test set

log_reg = LogisticRegression(penalty="none", max_iter=1000)
knn = KNeighborsClassifier(n_neighbors=3)

t0 = time.time()
knn.fit(train, target)
t1 = time.time()
print("KNN Training Time : ", t1 - t0)

t0 = time.time()
log_reg.fit(train, target)
t1 = time.time()
print("Logistic Regression Training Time : ", t1 - t0)

/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
KNN Training Time :  0.00044083595275878906
Logistic Regression Training Time :  0.023360013961791992

In [ ]:
t0 = time.time()
knn.predict(test)
t1 = time.time()
print("KNN Testing Time : ", t1 - t0)

t0 = time.time()
log_reg.predict(test)
t1 = time.time()
print("Logistic Regression Testing Time : ", t1 - t0)

KNN Testing Time :  0.01816701889038086
Logistic Regression Testing Time :  0.00018095970153808594

This simple test shows that Logistic Regression is slower than KNN at training time but much faster at testing time.

Visualizing decision boundaries for Logistic Regression

Now, let's look at the decision boundary produced by Logistic Regression. Same as for KNN, we use the two features most correlated with the target labels: concave points_mean and perimeter_mean. This way, we can visualize the 2D decision boundary.
In [ ]:
#Extract first two features and use the StandardScaler
train_2 = StandardScaler().fit_transform(train_raw[['concave points_mean', 'perimeter_mean']])

Cs = [0.001, 0.1, 1000]
for C in Cs:
    log_reg = LogisticRegression(penalty="l2", max_iter=1000, solver="lbfgs", C=C)
    log_reg.fit(train_2, target)
    draw_contour(train_2, target, log_reg, class_labels=['Benign', 'Malignant'])
    plt.title(f"C = {C}")
We can see that as the regularization strength changes, the decision boundary moves as well. Additionally, we can clearly see that the decision boundary is a line since this is a linear model.
Models for Classification: SVM
We now discuss another type of linear classification model known as Support Vector Machines (SVM). Where Logistic Regression was motivated by probability theory, SVM is motivated by geometric arguments. Specifically, SVM finds a separating hyperplane that maximizes the margin (i.e., the distance to each class). The hyperplane is used to classify points by designating every sample on one side of the hyperplane as the positive class and every sample on the other side as the negative class.

The hyperplane is determined by a few sample points known as support vectors that uniquely characterize the hyperplane.

[Figure: an SVM separating hyperplane with its margin and support vectors]

Note that it may not always be possible to find a hyperplane that completely separates the classes. Thus, we use what is known as Soft-Margin SVM, which aims to maximize the margin while minimizing the penalty for samples that fall on the wrong side.

All Sci-kit learn implementations of SVM that we use are soft-margin SVM.
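In the standard formulation, soft-margin SVM solves $\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, where each slack variable $\xi_i$ measures how far sample $i$ falls on the wrong side of its margin and C trades off margin width against violations.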
Simple SVM classification

In [ ]:
svm = SVC()
svm.fit(train, target)
predicted = svm.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy:    0.921053

Parameters for SVM

In Sci-kit Learn, the following are just some of the parameters we can pass into SVC:

C: positive float, default=1
    Inverse of the regularization strength. You can treat C as $\frac{1}{\lambda}$ as shown in lecture. Thus, as C gets smaller, the regularization strength increases. SVC only uses the L2 regularization.
kernel: {'linear', 'poly', 'rbf', 'sigmoid'}, default='rbf'
    Specifies the kernel type to be used in the algorithm. A kernel specifies a mapping into a higher dimension space to allow for non-linear decision boundaries.
degree: int, default=3
    Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
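As a concrete example, the default 'rbf' kernel computes $K(x, x') = \exp(-\gamma\|x - x'\|^2)$, which implicitly maps the data into an infinite-dimensional space and therefore allows highly non-linear decision boundaries.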
Visualizing decision boundaries for SVM

Now, let's look at the decision boundary produced by SVM with different kernels. Same as for KNN and Logistic Regression, we use the two features most correlated with the target labels: concave points_mean and perimeter_mean. This way, we can visualize the 2D decision boundary.
In [ ]:
#Extract first two features and use the StandardScaler
train_2 = StandardScaler().fit_transform(train_raw[['concave points_mean', 'perimeter_mean']])

kernel = ['linear', 'poly', 'rbf', 'sigmoid']
for ker in kernel:
    svm = SVC(kernel=ker) #will change parameters during CV
    svm.fit(train_2, target)
    draw_contour(train_2, target, svm, class_labels=['Benign', 'Malignant'])
    plt.title(f"Kernel = {ker}")
We can see that the decision boundary is not always linear because we are using non-linear kernels.
Important Measures for Classifications
Now that we have gone over a few models for binary classification, let's explore the different ways we can
measure the performance of these models.
Here are just some of the most important measures of interest. We use the convention to refer to the class
labeled as $1$ as the positive class.
Accuracy: The percentage of predictions that are correct. Use metrics.accuracy_score.

Precision: $\frac{\text{Number of labels correctly classified as positive}}{\text{Number of labels classified as positive}}$. Percentage of predictions that are correctly positive among all the predictions that were classified as positive. Use metrics.precision_score.

Recall: $\frac{\text{Number of labels correctly classified as positive}}{\text{Number of labels where the true class is positive}}$. Percentage of predictions that are correctly positive among all the labels where the true class is positive. Also known as the probability of detecting when a class is positive. Use metrics.recall_score.

F1 Score: Harmonic mean of the precision and recall. Highest value is $1$ when both precision and recall are $1$, i.e. perfect. Lowest value is $0$ when either precision or recall is zero. Provides an aggregate score to analyze both precision and recall. Use metrics.f1_score.

We can calculate these measures by using a confusion matrix as well.
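In terms of the confusion-matrix counts (true positives TP, true negatives TN, false positives FP, false negatives FN): $\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$, $\text{Precision} = \frac{TP}{TP+FP}$, $\text{Recall} = \frac{TP}{TP+FN}$, and $\text{F1} = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$.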
In [ ]:
#Example classifier
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)

In [ ]:
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("%-12s %f" % ('Precision:', metrics.precision_score(target_test, predicted, labels=None, pos_label=1)))
print("%-12s %f" % ('Recall:', metrics.recall_score(target_test, predicted, labels=None, pos_label=1)))
print("%-12s %f" % ('F1 Score:', metrics.f1_score(target_test, predicted, labels=None, pos_label=1)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))

#Draws confusion matrix
draw_confusion_matrix(target_test, predicted, ['Benign', 'Malignant'])

Accuracy:    0.947368
Precision:   0.909091
Recall:      0.952381
F1 Score:    0.930233
Confusion Matrix: 
 [[68  4]
 [ 2 40]]
TODO: Using classification methods to classify heart disease

Now that you have seen some examples of the classifiers that Sci-kit learn has to offer, let's try to apply them to a new dataset.

Background: The Dataset

For this exercise we will be using a subset of the UCI Heart Disease dataset, leveraging the fourteen most commonly used attributes. All identifying information about the patient has been scrubbed. You will be asked to classify whether a patient is suffering from heart disease based on a host of potential medical factors.
The dataset includes 14 columns. The information provided by each column is as follows:

age: Age in years
sex: (1 = male; 0 = female)
cp: Chest pain type (0 = asymptomatic; 1 = atypical angina; 2 = non-anginal pain; 3 = typical angina)
trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
chol: Cholesterol in mg/dl
fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: Resting electrocardiographic results (0 = showing probable or definite left ventricular hypertrophy by Estes' criteria; 1 = normal; 2 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV))
thalach: Maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: Depression induced by exercise relative to rest
slope: The slope of the peak exercise ST segment (0 = downsloping; 1 = flat; 2 = upsloping)
ca: Number of major vessels (0-3) colored by fluoroscopy
thal: 1 = normal; 2 = fixed defect; 7 = reversable defect
sick: Indicates the presence of heart disease (True = Disease; False = No disease)
[25 pts] Part 1. Load the Data and Analyze
Let's first load our dataset so we'll be able to work with it. (correct the relative path if your notebook is in a
different directory than the csv file.)
[5 pts] Looking at the data
Now that our data is loaded, let's take a closer look at the dataset we're working with. Use the head method,
the describe method, and the info method to display some of the rows so we can visualize the types of data
fields we'll be working with.
In [ ]:
data = pd.read_csv('datasets/heartdisease.csv')

In [ ]:
data.head()

Out[ ]:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal   sick
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1  False
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2  False
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2  False
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2  False
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2  False

In [ ]:
data.describe()

Out[ ]:
              age         sex          cp    trestbps        chol         fbs     restecg     thalach  ...
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  ...
mean    54.366337    0.683168    0.966997  131.623762  246.264026    0.148515    0.528053  149.646865  ...
std      9.082101    0.466011    1.032052   17.538143   51.830751    0.356198    0.525860   22.905161  ...
min     29.000000    0.000000    0.000000   94.000000  126.000000    0.000000    0.000000   71.000000  ...
25%     47.500000    0.000000    0.000000  120.000000  211.000000    0.000000    0.000000  133.500000  ...
50%     55.000000    1.000000    1.000000  130.000000  240.000000    0.000000    1.000000  153.000000  ...
75%     61.000000    1.000000    2.000000  140.000000  274.500000    0.000000    1.000000  166.000000  ...
max     77.000000    1.000000    3.000000  200.000000  564.000000    1.000000    2.000000  202.000000  ...

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  sick      303 non-null    bool
dtypes: bool(1), float64(1), int64(12)
memory usage: 31.2 KB
Sometimes data will be stored in different formats (e.g., string, date, boolean), but many learning methods work strictly on numeric inputs. Additionally, some numerical features can represent categorical features which need to be pre-processed. Are there any columns that need to be transformed, and why?

All the columns in our dataframe are numeric (either int or float); however, our target variable 'sick' is a boolean and may need to be modified. Additionally, several of the numerical features represent categorical features which may need to be pre-processed/encoded, including sex, fbs, restecg, cp, thal, and slope.
Determine if we're dealing with any null values. If so, report on which columns.

In [ ]:
data.isnull().sum()

Out[ ]:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
sick        0
dtype: int64

There are no null values in any column.
[5 pts] Transform target label into numerical value
Before we begin our analysis, we need to fix the field(s) that will be problematic. Specifically, convert our
boolean "sick" variable into a binary numeric target variable (values of either '0' or '1') using the label
encoder from scikit-learn
, place this new array into a new column of the DataFrame named "target", and
then drop the original "sick" column from the dataframe. Afterward, use .head to print the first 5 rows
In [ ]:
data['target'] = le.fit_transform(data['sick'])
data = data.drop('sick', axis=1)
data.head()

Out[ ]:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       0
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       0
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       0
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       0
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       0

[5 pts] Plotting histogram of data

Now that we have a feel for the data-types for each of the variables, plot histograms of each field.

In [ ]:
data.hist(figsize=(20, 15))

Out[ ]:
array([[<Axes: title={'center': 'age'}>, <Axes: title={'center': 'sex'}>,
        <Axes: title={'center': 'cp'}>,
        <Axes: title={'center': 'trestbps'}>],
       [<Axes: title={'center': 'chol'}>,
        <Axes: title={'center': 'fbs'}>,
        <Axes: title={'center': 'restecg'}>,
        <Axes: title={'center': 'thalach'}>],
       [<Axes: title={'center': 'exang'}>,
        <Axes: title={'center': 'oldpeak'}>,
        <Axes: title={'center': 'slope'}>,
        <Axes: title={'center': 'ca'}>],
       [<Axes: title={'center': 'thal'}>,
        <Axes: title={'center': 'target'}>, <Axes: >, <Axes: >]],
      dtype=object)
[5 pts] Looking at class balance

We also want to make sure we are dealing with a balanced dataset. In this case, we want to confirm whether or not we have an equitable number of sick and healthy individuals to ensure that our classifier will have a sufficiently balanced dataset to adequately classify the two. Plot a histogram specifically of the sick target, and conduct a count of the number of sick and healthy individuals and report on the results:

In [ ]:
data['target'].hist(bins=2, figsize=(5, 5))
data['target'].value_counts()

Out[ ]:
0    165
1    138
Name: target, dtype: int64
There are about 30 more healthy (0) targets than there are sick (1), but overall the data set is well-balanced.

Balanced datasets are important to ensure that classifiers train adequately and don't overfit; however, arbitrary balancing of a dataset might introduce its own issues.

Discuss some of the problems that might arise by artificially balancing a dataset.

If we artificially balance a data set, we may reduce the accuracy of our model. Specifically, if we remove training points corresponding to the majority class, we are training our model on a smaller sample and thus it may not generalize well. On the other hand, if we artificially insert data, our guesses of the correct labels may be noisy or inaccurate and thus will reduce the accuracy of our model.
[5 pts] Looking at Data Correlation

Now that we have our dataframe prepared let's start analyzing our data. For this next question let's look at the correlations of our variables to our target value. First, use the heatmap function to plot the correlations of the data.

In [ ]:
correlations = data.corr()
columns = list(data)
heatmap(correlations.values, columns, columns, figsize=(20, 12), cmap="hsv")
Next, show the correlation to the "target" feature only and sort them in descending order.

In [ ]:
correlations["target"].sort_values(ascending=False)

Out[ ]:
target      1.000000
exang       0.436757
oldpeak     0.430696
ca          0.391724
thal        0.344029
sex         0.280937
age         0.225439
trestbps    0.144931
chol        0.085239
fbs         0.028046
restecg    -0.137230
slope      -0.345877
thalach    -0.421741
cp         -0.433798
Name: target, dtype: float64

From the heatmap values and the description of the features, why do you think some variables correlate more highly than others? (This question is just to get you thinking and there is no perfect answer since we have no medical background.)

Some variables, such as exercise induced angina, may tell us more about whether a patient has heart disease than other factors, such as cholesterol. There is probably some science behind why some of these features are more related and thus have a higher coefficient than others.
[25 pts] Part 2. Prepare the Data and run a KNN Model
Before running our various learning methods, we need to do some additional prep to finalize our data.
Specifically you'll have to cut the classification target from the data that will be used to classify, and then
you'll have to divide the dataset into training and testing cohorts.
Specifically, we're going to ask you to prepare 2 batches of data. The first batch will simply be the raw
numeric data that hasn't gone through any additional pre-processing. The second batch will be data that you
will pipeline using pre-processing methods. We will then feed both of these datasets into a classifier to
showcase just how important this step can be!
[2 pts] Separate target labels from data
Save the label column as a separate array and then make a new dataframe without the target.
[5 pts] Balanced Train Test Split
Now, create your 'Raw' unprocessed training data by dividing your dataframe into training and testing
cohorts, with your training cohort consisting of 60% of your total dataframe. To ensure that the train and test
sets have balanced classes, use the stratify parameter of train_test_split. Output the resulting shapes of
your training and testing samples to confirm that your split was successful. Additionally, output the class
counts for the training and testing cohorts to confirm that there is no artificial class imbalance.
Note: Use random_state=0 to ensure that the same train/test split happens every time for ease of grading.
[5 pts] KNN on raw data
Now, let's try a classification model on this data. We'll first use KNN since it is the one we are most familiar
with.
One thing we noted in class was that because KNN relies on Euclidean distance, it is highly sensitive to the
relative magnitude of different features. Let's see that in action! Implement a K-Nearest Neighbor algorithm
on our data and report the results. For this initial implementation, simply use the default settings. Refer to
the KNN Documentation
for details on implementation. Report on the test accuracy of the resulting
model and print out the confusion matrix.
Recall that accuracy can be calculated easily using metrics.accuracy_score and that we have a helper
function to draw the confusion matrix.
In [ ]:
y = data['target']
x = data.drop('target', axis=1)
In [ ]:
# stratify on y keeps the class balance in both cohorts;
# random_state=0 as specified in the instructions above
train_raw, test_raw, target, target_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)
Accuracy:    0.606557
Confusion Matrix:
[[24  9]
 [15 13]]
[5 pts] KNN on preprocessed data
Now let's implement a pipeline to preprocess the data. For the pipeline, use StandardScaler on the
numerical features and one-hot encoding on the categorical features. For reference on how to make a
pipeline, please look at project 1.
For reference, the categorical features are ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'].
Now use the pipeline to transform the data and then apply the same KNN classifier with this new
training/testing data. Report the test accuracy. Discuss the implications of the different results you
are obtaining.
Note: Remember to use fit_transform on the training data and transform on the testing data.
Accuracy: 0.836066
The accuracy improved significantly, jumping from roughly 60% on the raw data to roughly 84% on the preprocessed data.
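To see why scaling mattered so much, here is a minimal sketch (assuming the train_raw split from above) comparing the spread of the numeric features before and after standardization. On this dataset chol is typically in the hundreds while oldpeak is only a few units, so unscaled Euclidean distances are dominated by the large-scale features:
# Raw feature spreads: large-scale features dominate Euclidean distance.
print(train_raw[['trestbps', 'chol', 'thalach', 'oldpeak']].std())

# After StandardScaler, every numeric feature has (roughly) unit variance,
# so each contributes comparably to the distance computation.
scaled = StandardScaler().fit_transform(
    train_raw[['trestbps', 'chol', 'thalach', 'oldpeak']])
print(scaled.std(axis=0))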
[8 pts] KNN Parameter optimization for n_neighbors
The KNN algorithm includes an n_neighbors attribute that specifies how many neighbors to consult when
making a prediction. (The default value is 5, which is what your previous model used.) Let's now try n
values of 1, 2, 3, 5, 7, 9, 10, 20, and 50. Run your model for each value and report the test accuracy for
each. (Hint: leverage Python's ability to loop through the array and generate results without needing
to manually code each iteration.)
In [ ]:
# k-Nearest Neighbors algorithm on the raw (unscaled) data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_raw, target)
predicted = knn.predict(test_raw)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
In [ ]:
features_num
=
[
'trestbps'
,
'chol'
,
'thalach'
,
'oldpeak'
]
features_cat
=
[
'sex'
,
'cp'
,
'fbs'
,
'restecg'
,
'exang'
,
'slope'
,
'ca'
,
'thal'
]
pipeline
=
ColumnTransformer
(
[(
"num"
,
StandardScaler
(),
features_num
),
(
"cat"
,
OneHotEncoder
(),
features_cat
)
]
)
In [ ]:
train
=
pipeline
.
fit_transform
(
train_raw
)
test
=
pipeline
.
transform
(
test_raw
)
knn
=
KNeighborsClassifier
(
n_neighbors
=
3
)
knn
.
fit
(
train
,
target
)
predicted
=
knn
.
predict
(
test
)
print
(
"
%-12s
%f
"
%
(
'Accuracy:'
,
metrics
.
accuracy_score
(
target_test
,
predicted
)))
Accuracy for k=1: 0.8032786885245902
Accuracy for k=2: 0.7377049180327869
Accuracy for k=3: 0.8360655737704918
Accuracy for k=5: 0.8032786885245902
Accuracy for k=7: 0.7704918032786885
Accuracy for k=9: 0.7868852459016393
Accuracy for k=10: 0.7540983606557377
Accuracy for k=20: 0.7704918032786885
Accuracy for k=50: 0.7540983606557377
Comment on which value of n the KNN model performed best with. Did the model perform strictly
better or strictly worse as the value of n increased?
The value k=3 performed the best, with an accuracy of 83.6%. The accuracy neither strictly increased nor
strictly decreased with increasing k; it went up and down several times.
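Plotting the accuracies reported above (values copied, rounded, from the loop output) makes the non-monotonic behavior easy to see:
# Accuracy vs. k, using the numbers printed by the loop above.
ks = [1, 2, 3, 5, 7, 9, 10, 20, 50]
accs = [0.8033, 0.7377, 0.8361, 0.8033, 0.7705,
        0.7869, 0.7541, 0.7705, 0.7541]
plt.plot(ks, accs, marker='o')
plt.xlabel('n_neighbors (k)')
plt.ylabel('Test accuracy')
plt.title('KNN test accuracy vs. k')
plt.show()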
So we have a model that seems to work well. But let's see if we can do better! To do so we'll employ
Logistic Regression and SVM to improve upon the model and compare the results.
For the rest of the project, you will only be using the transformed data and not the raw data. DO NOT
USE THE RAW DATA ANYMORE.
[20 pts] Part 3. Additional Learning Methods: Logistic
Regression
Let's now try Logistic Regression. Recall that Logistic regression is a statistical model that in its basic form
uses a logistic function to model a binary dependent variable.
[5 pts] Run the default Logistic Regression
Implement a Logistic Regression classifier. Review the Logistic Regression Documentation for how to
implement the model. Use the default settings. Report on the test accuracy and print out the confusion
matrix.
In [ ]:
k_r = [1, 2, 3, 5, 7, 9, 10, 20, 50]
for k in k_r:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train, target)
    predicted = knn.predict(test)
    # sep='' reconstructed from the truncated line; it matches the
    # "Accuracy for k=1: ..." output format shown above
    print('Accuracy for k=', k, ': ',
          metrics.accuracy_score(target_test, predicted), sep='')
In [ ]:
log_reg = LogisticRegression()
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
Accuracy:    0.852459
Confusion Matrix:
[[30  3]
 [ 6 22]]
[5 pts] Compare Logistic Regression and KNN
In your own words, describe the key differences between Logistic Regression and KNN. When would you
use one over the other?
Logistic regression leverages the sigmoid function and probability to make predictions on test data. KNN,
on the other hand, uses the distance to the closest points in the training data to make predictions. In KNN
there is no real training phase or loss function, since there are no parameters to learn. Fitting a logistic
regression therefore takes much longer than "fitting" KNN, but making test predictions is much faster.
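A rough way to check this fit/predict asymmetry is to time both models on our transformed data (a sketch only; on a dataset this small the absolute numbers are tiny and noisy, but the pattern becomes pronounced on larger data):
# Illustrative timing of fit vs. predict for the two models.
for name, model in [('KNN', KNeighborsClassifier()),
                    ('LogReg', LogisticRegression(max_iter=1000))]:
    t0 = time.time()
    model.fit(train, target)
    t1 = time.time()
    model.predict(test)
    t2 = time.time()
    print("%-8s fit: %.4fs  predict: %.4fs" % (name, t1 - t0, t2 - t1))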
[5 pts] Tweaking the Logistic Regression
What are some parameters we can change that will affect the performance of Logistic Regression?
We can change parameters such as C, the inverse of the regularization strength; max_iter, the
maximum number of iterations the solver will take to converge; penalty, the type of norm used
for regularization; and solver, the algorithm used to solve the optimization problem for the
parameters.
Implement Logistic Regression with solver= 'liblinear', max_iter= 1000, penalty = 'l2', and C=1.
Report on the test accuracy and print out the confusion matrix.
Accuracy:    0.852459
Confusion Matrix:
[[30  3]
 [ 6 22]]
Now, Implement Logistic Regression with solver= 'liblinear', max_iter= 1000, penalty = 'l2', and
C=0.0001. Report on the test accuracy and print out the confusion matrix.
Accuracy:    0.754098
Confusion Matrix:
[[31  2]
 [13 15]]
In [ ]:
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l2', C=1)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
In [ ]:
# C=0.0001 completed from the truncated line, per the instructions above
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l2', C=0.0001)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
Did the accuracy drop or improve? Why?
The accuracy dropped. This is because a low value of C corresponds to very strong regularization, so the
parameters were forced to be very small and the model likely underfitted the data.
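We can verify this shrinkage directly by comparing coefficient magnitudes at the two C values (a quick sketch using the fitted models' coef_ attribute):
# Mean absolute coefficient size under weak (C=1) vs. strong (C=0.0001)
# regularization; strong regularization forces the weights toward zero.
for C in [1, 0.0001]:
    m = LogisticRegression(solver='liblinear', max_iter=1000,
                           penalty='l2', C=C).fit(train, target)
    print("C=%g  mean |coef| = %.4f" % (C, np.abs(m.coef_).mean()))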
[5 pts] Trying out different penalties
Now, Implement Logistic Regression with solver= 'liblinear', max_iter= 1000, penalty = 'l1', and C=1.
Report on the test accuracy and print out the confusion matrix.
Accuracy:    0.868852
Confusion Matrix:
[[31  2]
 [ 6 22]]
Describe what the purpose of a penalty term is and how the change from L2 to L1 affected the
model.
The purpose of a penalty term is to regularize the model parameters and ensure that the model does not
overfit the data. The difference is that L2 penalizes the squared magnitude of the weights, shrinking large
parameter values heavily but rarely to exactly zero, whereas L1 penalizes the absolute value and tends to
drive weak weights exactly to zero, yielding a sparser model. Using L1 regularization in
this case actually increased the accuracy of the model slightly.
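A quick sketch makes the sparsity difference concrete: counting exactly-zero coefficients under each penalty (L1 typically zeroes some out; L2 typically does not):
# Count exact zeros in the learned weight vector under each penalty.
for pen in ['l1', 'l2']:
    m = LogisticRegression(solver='liblinear', max_iter=1000,
                           penalty=pen, C=1).fit(train, target)
    zeros = int(np.sum(m.coef_ == 0))
    print("penalty=%s  zero coefficients: %d of %d" % (pen, zeros, m.coef_.size))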
[20 pts] Part 4. Additional Learning Methods: SVM (Support
Vector Machine)
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane.
In other words, given labeled training data (supervised learning), the algorithm outputs an optimal
hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the
plane into two parts, each corresponding to one of the two classes.
Recall that scikit-learn uses a soft-margin SVM to account for datasets that are not separable.
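The softness of the margin is controlled by the C parameter: smaller C tolerates more margin violations. A minimal sketch (using our transformed training data from Part 2) shows how the support-vector count typically grows as C shrinks:
# Smaller C -> softer margin -> usually more support vectors.
for C in [0.01, 1, 100]:
    m = SVC(kernel='linear', C=C).fit(train, target)
    print("C=%g  support vectors per class: %s" % (C, m.n_support_))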
[5 pts] Run default SVM classifier
Implement a Support Vector Machine classifier on your pipelined data. Review the SVM Documentation
for
how to implement a model. For this implementation you can simply use the default settings. Report on the
test accuracy and print out the confusion matrix.
In [ ]:
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l1', C=1)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
In [ ]:
svm = SVC()
svm.fit(train, target)
predicted = svm.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
Accuracy:    0.803279
Confusion Matrix:
[[28  5]
 [ 7 21]]
Print out the number of support vectors that SVC has determined. Look at the documentation for
how to get this.
[69 68]
You may find that there are quite a few support vectors. This is due in part to the small number of samples
in the training set and the choice of kernel.
[5 pts] Use a Linear SVM
Rerun your SVM, but now modify your model parameter kernel to equal 'linear'. Report on the test
accuracy and print out the confusion matrix. Also, print out the number of support vectors.
Accuracy:    0.852459
Confusion Matrix:
[[30  3]
 [ 6 22]]
[44 46]
You will notice that the number of support vectors has decreased significantly.
[5 pts] Compare default SVM and Linear SVM
Explain what the new results you've achieved mean. Read the documentation to understand what
you've changed about your model and explain why changing that input parameter might impact the results
in the manner you've observed.
By default, the kernel is 'rbf', or radial basis function, which allows the decision boundary to be non-linear.
However, our data is actually better suited to a linear separator. When we use a linear kernel, the decision
boundary becomes linear, which fits the data better, resulting in higher accuracy and fewer
support vectors.
[5 pts] Compare SVM and Logistic Regression
Both logistic regression and linear SVM are trying to classify data points using a linear decision boundary
but achieve it in different ways. In your own words, explain the difference between the ways that Logistic
Regression and Linear SVM find the boundary?
The loss functions for logistic regression and SVM are based on different principles, so the two algorithms
find the boundary differently. Logistic regression models the probability that a data point is positively
classified using the sigmoid function, and is penalized when its predicted probabilities disagree with the true labels.
In [ ]:
print(svm.n_support_)
In [ ]:
svm = SVC(kernel='linear')
svm.fit(train, target)
predicted = svm.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))
print(svm.n_support_)
The loss function for SVM, on the other hand, is based on geometry and aims to maximize the margin between
the separator and the data; points that fall too close to the separator, or on the wrong side of it, are penalized.
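The two objectives can be compared directly through their loss curves. As a sketch, plotting the logistic loss and the hinge loss as functions of the signed margin m = y * f(x) shows that logistic loss penalizes every point a little, even confidently correct ones, while hinge loss is exactly zero once a point clears the margin (m >= 1):
# Logistic loss vs. hinge loss as a function of the signed margin.
margins = np.linspace(-2, 3, 200)
logistic_loss = np.log(1 + np.exp(-margins))
hinge_loss = np.maximum(0, 1 - margins)
plt.plot(margins, logistic_loss, label='logistic loss')
plt.plot(margins, hinge_loss, label='hinge loss')
plt.xlabel('margin  y * f(x)')
plt.ylabel('loss')
plt.legend()
plt.show()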
[10 pts] Part 5: Cross Validation and Model Selection
You've sampled a number of different classification techniques and have seen their performance on the
dataset. Before we draw any conclusions on which model is best, we want to ensure that our results are not
an artifact of the particular random train-test split of the data. To do so, we will
conduct a K-Fold Cross-Validation with GridSearch to determine which model performs best and assess its
performance on the test set.
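As a small sketch of the idea (before the full grid search below), the per-fold scores for a single model show how much one split can mislead; cross-validation averages this variation out:
# Per-fold accuracies for one model under 3-fold cross-validation.
scores = cross_val_score(LogisticRegression(solver='liblinear'),
                         train, target, cv=3, scoring='accuracy')
print("fold accuracies:", scores, " mean: %.4f" % scores.mean())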
[10 pts] Model Selection
Run a GridSearchCV with 3-Fold Cross Validation. You will be running each classification model with
different parameters.
KNN:
n_neighbors = [1,3,5,7]
metric = ["euclidean","manhattan"] #Different Distance functions
Logistic Regression:
penalty = ["l1","l2"]
solver = ["liblinear"]
C = [0.0001,0.1,10]
SVM:
kernel = ["linear","rbf"]
C = [0.0001,0.1,10]
Make sure to train and test your model on the transformed data and not on the raw data.
After using GridSearchCV, put the results into a pandas Dataframe and print out the whole table.
In [ ]:
parametersKNN = [{"n_neighbors": [1, 3, 5, 7],
                  "metric": ["euclidean", "manhattan"]}]
parametersLR = [{"penalty": ["l1", "l2"],
                 "solver": ["liblinear"],
                 "C": [0.0001, 0.1, 10]}]
parametersSVM = [{"kernel": ['linear', 'rbf'],
                  "C": [0.0001, 0.1, 10]}]  # note: the key must be the string "C"
   param_n_neighbors param_metric  mean_test_score
7                  7    manhattan         0.830658
3                  7    euclidean         0.830556
2                  5    euclidean         0.822479
6                  5    manhattan         0.814198
5                  3    manhattan         0.805916
1                  3    euclidean         0.785185
4                  1    manhattan         0.777058
0                  1    euclidean         0.772942
  param_penalty param_solver param_C  mean_test_score
3            l2    liblinear     0.1         0.834877
5            l2    liblinear      10         0.826749
4            l1    liblinear      10         0.826698
1            l2    liblinear  0.0001         0.805813
2            l1    liblinear     0.1         0.789249
0            l1    liblinear  0.0001         0.545473
k = 3
kf = KFold(n_splits=k, random_state=None)

KNN = KNeighborsClassifier()
gridKNN = GridSearchCV(KNN, parametersKNN, cv=kf, scoring='accuracy')
gridKNN.fit(train, target)
# ascending=False completed from the truncated line; the table above is sorted descending
resKNN = pd.DataFrame(gridKNN.cv_results_).sort_values(by=["mean_test_score"],
                                                       ascending=False)
resKNN[["param_n_neighbors", "param_metric", "mean_test_score"]]
Out[ ]:
In [ ]:
parametersLR = [{"penalty": ["l1", "l2"],
                 "solver": ["liblinear"],
                 "C": [0.0001, 0.1, 10]}]
LR = LogisticRegression()
gridLR = GridSearchCV(LR, parametersLR, cv=kf, scoring='accuracy')
gridLR.fit(train, target)
resLR = pd.DataFrame(gridLR.cv_results_).sort_values(by=["mean_test_score"],
                                                     ascending=False)
resLR[["param_penalty", "param_solver", "param_C", "mean_test_score"]]
Out[ ]:
In [ ]:
parametersSVM = [{"kernel": ['linear', 'rbf'],
                  "C": [0.0001, 0.1, 10]}]
SVM = SVC()
gridSVM = GridSearchCV(SVM, parametersSVM, cv=kf, scoring='accuracy')
gridSVM.fit(train, target)
  param_C param_kernel  mean_test_score
2     0.1       linear         0.830761
3     0.1          rbf         0.814095
4      10       linear         0.801749
5      10          rbf         0.764506
0  0.0001       linear         0.545473
1  0.0001          rbf         0.545473
What was the best model and what was its score?
The best model was logistic regression with the following parameters: penalty='l2', C=0.1, and
solver='liblinear'. This model had a mean_test_score of 0.834877.
Using the best model you have, report the test accuracy and print out the confusion matrix
Accuracy:    0.836066
Confusion Matrix:
[[30  3]
 [ 7 21]]
# note: resLR is reused here to hold the SVM grid-search results
resLR = pd.DataFrame(gridSVM.cv_results_).sort_values(by=["mean_test_score"],
                                                      ascending=False)
resLR[["param_C", "param_kernel", "mean_test_score"]]
Out[ ]:
In [ ]:
best_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear")
best_model.fit(train, target)
predicted = best_model.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix: \n", metrics.confusion_matrix(target_test, predicted))