Week 12 Computer Tutorial - Predictive mining using Neural networks
Dr Richi Nayak, r.nayak@qut.edu.au
Topics:
1. Loading the Pre-processed Dataset
2. Building the first neural network model
3. Finding optimal hyperparameters with GridSearchCV
4. Feature selection
5. Comparison and finding the best performing model using ROC curve
6. Ensemble Modeling
Part 1 - Reflective exercises
In this tutorial, you will be reflecting on the concepts of predictive modelling using Neural
networks.
Exercise 1: Neural Network - Basics
1. A feed-forward neural network is said to be fully connected when
a. all nodes are connected to each other.
b. all nodes at the same layer are connected to each other.
c. all nodes at one layer are connected to all nodes in the next higher layer.
d. all hidden layer nodes are connected to all output layer nodes.
Which one is true?
2. The values input into a feed-forward neural network
a. may be categorical or numeric.
b. must be either all categorical or all numeric, but not both.
c. must be numeric.
d. must be categorical.
Which one is true?
3. Neural network training is accomplished by repeatedly passing the training data
through the network while
a. individual network weights are modified.
b. training instance attribute values are modified.
c. the ordering of the training instances is modified.
d. individual network nodes have the coefficients on their corresponding functional parameters modified.
Which one is true?
4. Epochs represent the total number of
a. input layer nodes.
b. passes of the training data through the network.
c. network nodes.
d. passes of the test data through the network.
5. What happens when a neural network is over-trained?
6. Why are neural networks called universal approximators?
7. Which statement is true about neural network and linear regression models?
a. Both models require input attributes to be numeric.
b. Both models require numeric attributes to range between 0 and 1.
c. The output of both models is a categorical attribute value.
d. Both techniques build models whose output is determined by a linear sum of weighted input attribute values.
e. More than one of a, b, c or d is true.
8. This supervised learning technique can process both numeric and categorical output
attributes.
a. linear regression
b. decision tree
c. logistic regression
d. neural network learning
9. Compare classification algorithms (Linear Regression, DT & ANN).
Exercise 2: Predictive mining using Neural networks
1. Consider the following network with three input nodes, three links and one output node.
Calculate the output of f(x), assuming the node has a linear activation function. What will
the values be if the logistic (sigmoid) and ReLU functions are applied on the output node?
(A generic computation sketch follows this exercise.)
2. Consider the neural network shown below in (a) and the following table in (b) of the sample
data. Compute values for nodes D, E, and F using the logistic function as the activation
function.
3. The following table consists of training data from an employee database. The data have
been generalized. For a given row entry, the count represents the number of data tuples
having the values for the department, status, age, and salary given in that row. Design a
multilayer feed-forward neural network for the given data. Label the nodes in the input and
output layers.
4. Neural network training is accomplished by repeatedly passing the training data
through the network while
a. individual network weights are modified.
b. training instance attribute values are modified.
c. the ordering of the training instances is modified.
d. individual network nodes have the coefficients on their corresponding functional parameters modified.
5. What happens when a neural network is over-trained?
6. Why are neural networks called universal approximators?
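For questions 1 and 2, the arithmetic is the same regardless of the figures: form the weighted
sum of the inputs and pass it through the activation function. A generic sketch with made-up
inputs and weights (the actual values must be read off the figures):

import numpy as np

x = np.array([1.0, 0.5, -0.2])   # example inputs from the figure
w = np.array([0.3, -0.6, 0.9])   # example link weights from the figure

net = np.dot(w, x)               # weighted sum arriving at the output node

linear_out = net                       # linear activation: f(x) = x
sigmoid_out = 1 / (1 + np.exp(-net))   # logistic (sigmoid) activation
relu_out = max(0.0, net)               # ReLU activation

print(linear_out, sigmoid_out, relu_out)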
Exercise 3: Comparison of algorithms
1. Discuss how you would control overfitting in algorithms such as decision tree, neural network,
and logistic regression.
2. Compare the 3 classification algorithms: decision tree, neural network, and logistic
regression. Identify specific features of the dataset/learning task, such as the presence of missing
values, a large dataset, the need for speed, and many others. Comment on the performance
of each algorithm in terms of how well it will work if that feature is present in the
data/problem.
3. Given the two data sets shown in the following figures, explain how well or how badly the three
different classification algorithms (DT, Logistic Regression and NN) will perform on these
data sets.
Part 2 - Practical exercises
These practical notes contain the instructions for neural network modelling in Python. The objective
is to build a neural network to classify the lapsing donors based on their responses to the
greeting card mailing campaign conducted by the national veterans' organisation. We will use
the Veteran dataset to predict the TARGETB variable.
With its exotic-sounding name, a neural network model is often regarded as a mysterious yet
powerful predictive tool. Perhaps surprisingly, the most typical form of neural network is, in fact,
a natural extension of the regression model. This form of neural network is called a multilayer
perceptron, which is the subject of our practical today. A single node neural network, called a
perceptron, with a linear activation function can be considered a linear regression.
At the end of this practical, we will have built a series of predictive models including decision
tree, logistic regression and neural network. We will compare all the models to comprehend the
strengths and weaknesses of each modelling method.
In the financial and health domains, the performance of a predictive model is crucial. To achieve
better performance, multiple models can be combined to achieve a higher predictive
performance than the individual models. This approach is called ensemble modeling and it will be
covered in the last part of this practical.
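To see the regression connection concretely: a single neuron with a linear (identity) activation
computes exactly the same quantity as a linear regression prediction. A minimal sketch with
made-up weights, purely for illustration:

import numpy as np

x = np.array([2.0, 1.0])     # input attribute values
w = np.array([0.5, -0.3])    # connection weights
b = 0.1                      # bias term

# perceptron with linear activation: no transformation of the weighted sum
output = np.dot(w, x) + b

# identical to a linear regression prediction: y = w1*x1 + w2*x2 + b
print(output)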
1. Loading the Pre-processed Dataset
We will reuse the code for data preprocessing developed in previous practicals.
In [1]:
# libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from dm_tools import data_prep

# set the random seed - consistent
rs = 10

# load the data
df, X, y, X_train, X_test, y_train, y_test = data_prep()

# standardise the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)

2. Building the first neural network model

Let us start by importing a neural network model from the library. In sklearn, a neural network
classifier is implemented in MLPClassifier, short for the multilayer perceptron classifier.

In [2]:
from sklearn.neural_network import MLPClassifier

Let's train our first MLPClassifier. Initiate the model without any additional parameters (other than
the random state for consistency), fit it to the training data and test its performance on the test
data.

In [3]:
model_1 = MLPClassifier(random_state=rs)
model_1.fit(X_train, y_train)

print("Train accuracy:", model_1.score(X_train, y_train))
print("Test accuracy:", model_1.score(X_test, y_test))

y_pred = model_1.predict(X_test)
print(classification_report(y_test, y_pred))
print(model_1)

Train accuracy: 0.8952802359882006
Test accuracy: 0.5350997935306263

              precision    recall  f1-score   support

           0       0.53      0.55      0.54      1453
           1       0.54      0.52      0.53      1453

    accuracy                           0.54      2906
   macro avg       0.54      0.54      0.53      2906
weighted avg       0.54      0.54      0.53      2906

MLPClassifier(random_state=10)

C:\Users\bsthi\anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

This default neural network performed with high accuracy on the training dataset. However, the
test accuracy is much lower (53%), indicating overfitting to the training data. You should also
notice a convergence warning.
In sklearn, if a neural network does not achieve convergence before the maximum number of
iterations, it raises a convergence warning. This first neural network raised such a warning. If the
warning appears, the max_iter hyperparameter of the neural network should be increased.
However, if the network fails to reach convergence even with a very large number of iterations,
this might indicate a problem with the error computation.
The following code demonstrates MLP classifier training with max_iter set to 700.
In [4]:
model_2 = MLPClassifier(max_iter=700, random_state=rs)
model_2.fit(X_train, y_train)

print("Train accuracy:", model_2.score(X_train, y_train))
print("Test accuracy:", model_2.score(X_test, y_test))

y_pred = model_2.predict(X_test)
print(classification_report(y_test, y_pred))
print(model_2)

Train accuracy: 0.9960176991150442
Test accuracy: 0.534755677907777

              precision    recall  f1-score   support

           0       0.53      0.53      0.53      1453
           1       0.53      0.54      0.54      1453

    accuracy                           0.53      2906
   macro avg       0.53      0.53      0.53      2906
weighted avg       0.53      0.53      0.53      2906

MLPClassifier(max_iter=700, random_state=10)

After setting the maximum number of iterations to 700, the neural network performed with much
higher accuracy on the training dataset (99%). However, the test accuracy remains the same
(53%). This clearly indicates overfitting to the training data. Next, we will use GridSearch tuning
and dimensionality reduction techniques to control the overfitting of the model to the training data.

Solver and activation function

Finding the best combination of weights in a neural network is a significant search problem. The
algorithm used to find this optimal weight set is called the solver in sklearn, and the most common
one is gradient descent. Gradient descent starts with a set of randomly generated weights. In
each iteration of gradient descent, predictions are made on X_train and the error value (cost) is
computed. The weight set is then altered to reduce this error value. Each iteration is called an
epoch. To stop the gradient descent iterations, a strategy such as maximum iterations, a minimum
error threshold, or convergence reached (the error has not improved over a certain number of
epochs) is used. A combination of maximum iterations and convergence is the most commonly
used criterion; a minimal sketch of this loop appears below.

Observe the MLPClassifier object hyperparameters printed out above. You should see that the
solver hyperparameter is set by default to adam (which stands for adaptive moment estimation).
Adam is an extension of gradient descent, designed to speed up the training process and be
more computationally efficient. Adam is the solver algorithm of choice for many deep neural
networks because of its efficiency, and we will use adam instead of plain gradient descent here.

A great explanation of adam
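To make the gradient descent loop described above concrete, here is a minimal sketch for a
single linear neuron in plain numpy. This is an illustration only, not sklearn's implementation: the
learning rate lr, the tolerance tol and the function name are hypothetical, and adam adds
adaptive per-weight step sizes on top of this basic update.

import numpy as np

def gradient_descent(X, y, lr=0.01, max_iter=200, tol=1e-4):
    rng = np.random.default_rng(10)
    w = rng.normal(size=X.shape[1])        # randomly generated starting weights
    prev_cost = np.inf
    for epoch in range(max_iter):          # each pass over the data is an epoch
        error = X @ w - y                  # prediction error on the training data
        cost = (error ** 2).mean()         # the error value (cost)
        if prev_cost - cost < tol:         # stop: convergence reached
            break
        prev_cost = cost
        grad = 2 * (X.T @ error) / len(y)  # direction of steepest cost increase
        w -= lr * grad                     # alter the weights to reduce the cost
    return w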
Another important hyperparameter to observe is activation, which refers to the activation
function used in the hidden layers of the neural network. There are a number of options,
including:
- identity: no transformation
- tanh: the hyperbolic tangent function
- sigmoid (called logistic in sklearn): commonly used in logistic regression
- relu (rectified linear unit): the default option in sklearn
The identity function turns the neural network into a linear model, so it is not commonly
used. In the past, the tanh and sigmoid functions were very popular. However, recent research
suggests that the relu function can produce similarly accurate results at a much lower training
time. Therefore, we will use relu as the activation function in this practical.
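For reference, the four options are simple element-wise transformations. A minimal numpy
sketch, purely for illustration (sklearn applies these internally to the hidden layer outputs):

import numpy as np

def identity(x):
    return x                        # no transformation

def tanh(x):
    return np.tanh(x)               # squashes values into (-1, 1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))     # squashes values into (0, 1)

def relu(x):
    return np.maximum(0, x)         # zero for negatives, unchanged for positives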
3. Finding optimal hyperparameters with GridSearchCV

Next, we will find the optimal hyperparameters using GridSearchCV. Neural networks are harder to
tune than decision trees or regression models because of the relatively large number of
hyperparameters and the slow training process. In this practical, we will focus on tuning two
hyperparameters:
1. hidden_layer_sizes: a tuple in which the i-th element is the number of neurons in the
i-th hidden layer (see the short example below).
2. alpha: the L2 regularisation parameter, which penalises large network weights.
Let us start by tuning the hidden layer sizes. There is no official guideline on how many neurons
each layer should have, but for most data mining tasks a single hidden layer with no more
neurons than the number of input variables and no fewer than the number of output neurons
(one, for this binary classification task) is sufficient.
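As an illustration of the tuple notation (the layer sizes here are hypothetical, not the values we
tune below):

from sklearn.neural_network import MLPClassifier

# one hidden layer with 5 neurons
mlp_a = MLPClassifier(hidden_layer_sizes=(5,), random_state=10)

# two hidden layers: 30 neurons in the first, 10 in the second
mlp_b = MLPClassifier(hidden_layer_sizes=(30, 10), random_state=10)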
Deep Learning

In the past decade, deep learning models have become highly popular. Deep
learning is the process of building very complex neural networks (up to hundreds of
layers and thousands of neurons, hence deep). Deep neural networks are typically
used for complex tasks, like image recognition, Siri-like voice assistants, machine
translation and self-driving cars.

See how many input features we have by printing out the shape of the training data.

In [5]:
print(X_train.shape)

(6780, 85)

With 85 features, we will start tuning with one hidden layer of 5 to 85 neurons, in increments of 20.
This might take a bit of time.
Tips: Setting max_iter to a higher value (say 700) can complete the GridSearchCV without
the convergence warning. However, the process will be expensive and take more time to
complete.
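If you want to try this, the higher max_iter goes on the estimator handed to GridSearchCV. A
sketch only (the grid search below keeps the default of 200 iterations):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

params = {'hidden_layer_sizes': [(x,) for x in range(5, 86, 20)]}

# each candidate model may now train for up to 700 iterations
cv_long = GridSearchCV(param_grid=params,
                       estimator=MLPClassifier(max_iter=700, random_state=10),
                       return_train_score=True, cv=10, n_jobs=-1)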
In [6]:
params = {'hidden_layer_sizes': [(x,) for x in range(5, 86, 20)]}

cv_1 = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                    return_train_score=True, cv=10, n_jobs=-1)
cv_1.fit(X_train, y_train)

C:\Users\bsthi\anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

Out[6]:
GridSearchCV(cv=10, estimator=MLPClassifier(random_state=10), n_jobs=-1,
             param_grid={'hidden_layer_sizes': [(5,), (25,), (45,), (65,), (85,)]},
             return_train_score=True)

In [7]:
result_set = cv_1.cv_results_
print(result_set)

{'mean_fit_time': array([ 5.53638704,  8.25172894, 10.54289422, 11.23771729, 12.78879881]),
 'std_fit_time': array([0.20275097, 0.89427211, 1.1670461 , 0.4234314 , 2.09495097]),
 'mean_score_time': array([0.00199392, 0.00189514, 0.00268989, 0.00409091, 0.00269284]),
 'std_score_time': array([0.00109221, 0.00029912, 0.00077612, 0.00104152, 0.00109638]),
 'param_hidden_layer_sizes': masked_array(data=[(5,), (25,), (45,), (65,), (85,)], mask=[False, False, False, False, False], fill_value='?', dtype=object),
 'params': [{'hidden_layer_sizes': (5,)}, {'hidden_layer_sizes': (25,)}, {'hidden_layer_sizes': (45,)}, {'hidden_layer_sizes': (65,)}, {'hidden_layer_sizes': (85,)}],
 'split0_test_score': array([0.55014749, 0.54129794, 0.51474926, 0.52654867, 0.53687316]),
 'split1_test_score': array([0.55899705, 0.59734513, 0.58702065, 0.57079646, 0.57669617]),
 'split2_test_score': array([0.55457227, 0.54277286, 0.55899705, 0.55309735, 0.55014749]),
 'split3_test_score': array([0.54129794, 0.54867257, 0.52212389, 0.51474926, 0.51769912]),
 'split4_test_score': array([0.54719764, 0.54277286, 0.53982301, 0.53834808, 0.54129794]),
 'split5_test_score': array([0.53982301, 0.50737463, 0.51327434, 0.49115044, 0.53982301]),
 'split6_test_score': array([0.54719764, 0.52064897, 0.53982301, 0.55014749, 0.5       ]),
 'split7_test_score': array([0.55309735, 0.57964602, 0.52949853, 0.50589971, 0.52064897]),
 'split8_test_score': array([0.52064897, 0.5280236 , 0.51179941, 0.51327434, 0.49262537]),
 'split9_test_score': array([0.53834808, 0.52212389, 0.5280236 , 0.50442478, 0.52212389]),
 'mean_test_score': array([0.54513274, 0.54306785, 0.53451327, 0.52684366, 0.52979351]),
 'std_test_score': array([0.0103287 , 0.02600057, 0.02234112, 0.0241708 , 0.02335605]),
 'rank_test_score': array([1, 2, 3, 5, 4]),
 'split0_train_score': array([0.63569322, 0.73549656, 0.79531301, 0.84152737, 0.89233038]),
 'split1_train_score': array([0.63176008, 0.7359882 , 0.78629957, 0.84791872, 0.87708948]),
 'split2_train_score': array([0.63552933, 0.7258276 , 0.7928548 , 0.84398558, 0.88888889]),
 'split3_train_score': array([0.63339889, 0.75008194, 0.80481809, 0.84824648, 0.88905277]),
 'split4_train_score': array([0.63765978, 0.74434612, 0.8064569 , 0.85414618, 0.88757784]),
 'split5_train_score': array([0.64323173, 0.73566044, 0.80268764, 0.84955752, 0.88872501]),
 'split6_train_score': array([0.64044576, 0.74385447, 0.80694854, 0.85283514, 0.88659456]),
 'split7_train_score': array([0.63225172, 0.74287119, 0.80072108, 0.85365454, 0.88757784]),
 'split8_train_score': array([0.63929859, 0.74451   , 0.80350705, 0.84873812, 0.89708292]),
 'split9_train_score': array([0.6374959 , 0.74139626, 0.80612914, 0.84103573, 0.88643068]),
 'mean_train_score': array([0.6366765 , 0.74000328, 0.80057358, 0.84816454, 0.88813504]),
 'std_train_score': array([0.00349953, 0.00649607, 0.00654393, 0.00449897, 0.00476789])}

As usual, let us plot the train and test scores of split 0.

In [8]:
import matplotlib.pyplot as plt

train_result = result_set['split0_train_score']
test_result = result_set['split0_test_score']
print("Total number of models: ", len(test_result))

# plot hidden layer hyperparameter values vs training and test accuracy score
plt.plot(range(0, len(train_result)), train_result, 'b',
         range(0, len(test_result)), test_result, 'r')
plt.xlabel('Hyperparameter Hidden_layers\nBlue = training acc. Red = test acc.')
plt.xticks(range(0, len(train_result)), range(5, 86, 20))
plt.ylabel('score')
plt.show()

Total number of models:  5

Now, plot the mean train and test scores of all runs.

In [9]:
### Enter your code
train_result = result_set['mean_train_score']
test_result = result_set['mean_test_score']
print("Total number of models: ", len(test_result))

# plot hidden layer hyperparameter values vs training and test accuracy score
plt.plot(range(0, len(train_result)), train_result, 'b',
         range(0, len(test_result)), test_result, 'r')
plt.xlabel('Hyperparameter Hidden_layers\nBlue = training acc. Red = test acc.')
plt.xticks(range(0, len(train_result)), range(5, 86, 20))
plt.ylabel('score')
plt.show()

Total number of models:  5
In [10]:
print("Train accuracy:", cv_1.score(X_train, y_train))
print("Test accuracy:", cv_1.score(X_test, y_test))

y_pred = cv_1.predict(X_test)
print(classification_report(y_test, y_pred))
print(cv_1.best_params_)

Train accuracy: 0.6368731563421829
Test accuracy: 0.5633172746042671

              precision    recall  f1-score   support

           0       0.56      0.59      0.57      1453
           1       0.57      0.54      0.55      1453

    accuracy                           0.56      2906
   macro avg       0.56      0.56      0.56      2906
weighted avg       0.56      0.56      0.56      2906

{'hidden_layer_sizes': (5,)}

This GridSearchCV returns 5 as the optimal number of neurons in the hidden layer. Based on
the performance of the previous decision tree and regression models, less complex models
(smaller trees, smaller feature sets) seem to generalise better on this dataset. We should
therefore attempt to tune the model with a lower number of neurons in the hidden layer.

In [11]:
# new parameters
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)]}

cv_2 = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                    cv=10, n_jobs=-1)
cv_2.fit(X_train, y_train)

print("Train accuracy:", cv_2.score(X_train, y_train))
print("Test accuracy:", cv_2.score(X_test, y_test))

y_pred = cv_2.predict(X_test)
print(classification_report(y_test, y_pred))
print(cv_2.best_params_)

Train accuracy: 0.6122418879056047
Test accuracy: 0.569511355815554

              precision    recall  f1-score   support

           0       0.56      0.61      0.58      1453
           1       0.57      0.53      0.55      1453

    accuracy                           0.57      2906
   macro avg       0.57      0.57      0.57      2906
weighted avg       0.57      0.57      0.57      2906

{'hidden_layer_sizes': (3,)}

We now have the optimal value for the neuron count in the hidden layer. Next, we will tune the
second hyperparameter, alpha, which sets the strength of the L2 regularisation applied to the
network weights. A larger alpha penalises large weights more heavily, producing a simpler
model that may underfit; a smaller alpha applies less regularisation, allowing a more flexible
model that may overfit the training data.
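Concretely, alpha scales an L2 penalty term that is added to the cost the solver minimises. A
minimal sketch of the idea, in my own notation rather than sklearn's internal code:

import numpy as np

def penalised_cost(data_cost, weight_matrices, alpha=0.0001):
    # sum of squared weights across all layers of the network
    l2 = sum(np.sum(W ** 2) for W in weight_matrices)
    # larger alpha -> heavier penalty on large weights -> simpler model
    return data_cost + 0.5 * alpha * l2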
The default value for alpha is 0.0001, so we will try alpha values around this number.

In [12]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)],
          'alpha': [0.01, 0.001, 0.0001, 0.00001]}

cv_3 = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                    cv=10, n_jobs=-1)
cv_3.fit(X_train, y_train)

print("Train accuracy:", cv_3.score(X_train, y_train))
print("Test accuracy:", cv_3.score(X_test, y_test))

y_pred = cv_3.predict(X_test)
print(classification_report(y_test, y_pred))
print(cv_3.best_params_)

Train accuracy: 0.6100294985250737
Test accuracy: 0.5667584308327598

              precision    recall  f1-score   support

           0       0.56      0.60      0.58      1453
           1       0.57      0.53      0.55      1453

    accuracy                           0.57      2906
   macro avg       0.57      0.57      0.57      2906
weighted avg       0.57      0.57      0.57      2906

{'alpha': 1e-05, 'hidden_layer_sizes': (3,)}

The GridSearch returned a hidden layer of 3 neurons and an alpha value of 0.00001 as the optimal
hyperparameters. However, this model is not better than the previous model cv_2.

4. Dimensionality reduction

Next, we will try to improve the performance of the model through the dimensionality reduction and
transformation techniques covered in the regression modelling practical.

4.1. Recursive Feature Elimination

We will first try to reduce the feature set size using RFE. RFE needs a base elimination model of
a type that assigns a weight or feature importance to each feature (such as regression or
decision tree). Neural networks provide neither, so we will use logistic regression as the base
elimination model.

In [13]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfe = RFECV(estimator=LogisticRegression(random_state=rs), cv=10)
rfe.fit(X_train, y_train)
print(rfe.n_features_)

40

The RFE with logistic regression selected 40 features as the best set of features. With these
selected features, tune an MLPClassifier model.

In [14]:
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# step = int((X_train_rfe.shape[1] + 5)/5);
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)],
          'alpha': [0.01, 0.001, 0.0001, 0.00001]}

rfe_cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                      cv=10, n_jobs=-1)
rfe_cv.fit(X_train_rfe, y_train)

print("Train accuracy:", rfe_cv.score(X_train_rfe, y_train))
print("Test accuracy:", rfe_cv.score(X_test_rfe, y_test))

y_pred = rfe_cv.predict(X_test_rfe)
print(classification_report(y_test, y_pred))
print(rfe_cv.best_params_)

Train accuracy: 0.6128318584070797
Test accuracy: 0.5684790089470062

              precision    recall  f1-score   support

           0       0.57      0.57      0.57      1453
           1       0.57      0.57      0.57      1453

    accuracy                           0.57      2906
   macro avg       0.57      0.57      0.57      2906
weighted avg       0.57      0.57      0.57      2906

{'alpha': 0.001, 'hidden_layer_sizes': (7,)}

The RFE-selected feature set showed improvement over the original dataset. However,
compared with the previous best model (cv_2), this model performed slightly worse. This is an
indication that elimination with logistic regression did not produce a feature set suitable for a
neural network.

4.2. Selecting using decision tree

Lastly, we will use the decision tree model to perform feature selection for neural network
modelling. To start, we need a decision tree tuned with GridSearchCV, so we load the one saved
in a previous practical.
In [15]:
import pickle

with open('DT.pickle', 'rb') as f:
    dt_best, roc_index_dt_cv, fpr_dt_cv, tpr_dt_cv = pickle.load(f)

print(dt_best.best_params_)

{'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 10}

In [16]:
from dm_tools import analyse_feature_importance

analyse_feature_importance(dt_best.best_estimator_, X.columns)

GiftCnt36 : 0.2786889679489785
DemMedHomeValue : 0.16527761052028647
GiftAvgLast : 0.12080196710144986
GiftCntAll : 0.06936170003476945
GiftTimeLast : 0.06852490725467034
StatusCatStarAll : 0.039760012572680054
DemAge : 0.0395320434572485
PromCnt36 : 0.034873357233237874
GiftCntCardAll : 0.030230871689451603
GiftTimeFirst : 0.027615727171569734
DemPctVeterans : 0.025370100839886917
PromCnt12 : 0.021248703587791955
DemCluster_21 : 0.016907242527362424
PromCntCard36 : 0.016068732710812456
PromCntAll : 0.01598797726061107
PromCntCard12 : 0.015718616757198205
DemMedIncome : 0.014031461331994594
DemCluster_48 : 0.0
StatusCat96NK_L : 0.0
StatusCat96NK_A : 0.0

In [17]:
from sklearn.feature_selection import SelectFromModel

selectmodel = SelectFromModel(dt_best.best_estimator_, prefit=True)
X_train_sel_model = selectmodel.transform(X_train)
X_test_sel_model = selectmodel.transform(X_test)

print(X_train_sel_model.shape)

(6780, 17)

The decision tree model identifies a set of 17 variables as important features. Proceed to tune an
MLPClassifier with this modified dataset.

In [18]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)],
          'alpha': [0.01, 0.001, 0.0001, 0.00001]}

cv_sel_model = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                            cv=10, n_jobs=-1)
cv_sel_model.fit(X_train_sel_model, y_train)

print("Train accuracy:", cv_sel_model.score(X_train_sel_model, y_train))
print("Test accuracy:", cv_sel_model.score(X_test_sel_model, y_test))

y_pred = cv_sel_model.predict(X_test_sel_model)
print(classification_report(y_test, y_pred))
print(cv_sel_model.best_params_)

Train accuracy: 0.6028023598820059
Test accuracy: 0.5615966964900206

              precision    recall  f1-score   support

           0       0.56      0.58      0.57      1453
           1       0.56      0.54      0.55      1453

    accuracy                           0.56      2906
   macro avg       0.56      0.56      0.56      2906
weighted avg       0.56      0.56      0.56      2906

{'alpha': 1e-05, 'hidden_layer_sizes': (5,)}

The neural network trained with the decision tree selected variables did not manage to
improve model performance. Therefore, we will keep the previous best model (cv_2) as the
best performing neural network.

5. Comparing the models to find the best performing model

A total of seven models have been built:
1. Default neural network (`model_1`)
2. Neural network with relu (`model_2`)
3. Neural network + grid search (`cv_1`)
4. Neural network + grid search (`cv_2`)
5. Neural network + grid search (`cv_3`)
6. Neural network + feature selection + grid search (`rfe_cv`)
7. Neural network + feature selection using DT + grid search (`cv_sel_model`)

Now, let us use the ROC curve to compare these models and identify the best performing neural
network model, considering both true and false positives.
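As a reminder of what the curve plots: each point on an ROC curve is a (false positive rate,
true positive rate) pair obtained at one probability threshold, and the ROC index is the area
under that curve. A toy computation with made-up labels and scores:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)                         # FPR/TPR pairs, one per threshold
print(roc_auc_score(y_true, y_score))   # area under the curve: 0.75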
In [19]:
from sklearn.metrics import roc_auc_score

y_pred_proba_nn_1 = model_1.predict_proba(X_test)
y_pred_proba_nn_2 = model_2.predict_proba(X_test)
y_pred_proba_cv_1 = cv_1.predict_proba(X_test)
y_pred_proba_cv_2 = cv_2.predict_proba(X_test)
y_pred_proba_cv_3 = cv_3.predict_proba(X_test)
y_pred_proba_rfe_cv = rfe_cv.predict_proba(X_test_rfe)
y_pred_proba_cv_sel_model = cv_sel_model.predict_proba(X_test_sel_model)

roc_index_nn_1 = roc_auc_score(y_test, y_pred_proba_nn_1[:, 1])
roc_index_nn_2 = roc_auc_score(y_test, y_pred_proba_nn_2[:, 1])
roc_index_cv_1 = roc_auc_score(y_test, y_pred_proba_cv_1[:, 1])
roc_index_cv_2 = roc_auc_score(y_test, y_pred_proba_cv_2[:, 1])
roc_index_cv_3 = roc_auc_score(y_test, y_pred_proba_cv_3[:, 1])
roc_index_rfe_cv = roc_auc_score(y_test, y_pred_proba_rfe_cv[:, 1])
roc_index_cv_sel_model = roc_auc_score(y_test, y_pred_proba_cv_sel_model[:, 1])

print("ROC index on test for NN_default:", roc_index_nn_1)
print("ROC index on test for NN with relu:", roc_index_nn_2)
print("ROC index on test for NN with gridsearch 1:", roc_index_cv_1)
print("ROC index on test for NN with gridsearch 2:", roc_index_cv_2)
print("ROC index on test for NN with gridsearch 3:", roc_index_cv_3)
print("ROC index on test for NN with feature selection and gridsearch:", roc_index_rfe_cv)
print("ROC index on test for NN with feature selection (model selection) and gridsearch:", roc_index_cv_sel_model)

from sklearn.metrics import roc_curve

fpr_nn_1, tpr_nn_1, thresholds_nn_1 = roc_curve(y_test, y_pred_proba_nn_1[:, 1])
fpr_nn_2, tpr_nn_2, thresholds_nn_2 = roc_curve(y_test, y_pred_proba_nn_2[:, 1])
fpr_cv_1, tpr_cv_1, thresholds_cv_1 = roc_curve(y_test, y_pred_proba_cv_1[:, 1])
fpr_cv_2, tpr_cv_2, thresholds_cv_2 = roc_curve(y_test, y_pred_proba_cv_2[:, 1])
fpr_cv_3, tpr_cv_3, thresholds_cv_3 = roc_curve(y_test, y_pred_proba_cv_3[:, 1])
fpr_rfe_cv, tpr_rfe_cv, thresholds_rfe_cv = roc_curve(y_test, y_pred_proba_rfe_cv[:, 1])
fpr_cv_sel_model, tpr_cv_sel_model, thresholds_cv_sel_model = roc_curve(y_test, y_pred_proba_cv_sel_model[:, 1])

ROC index on test for NN_default: 0.5496717757455563
ROC index on test for NN with relu: 0.5431669720998726
ROC index on test for NN with gridsearch 1: 0.587214008655704
ROC index on test for NN with gridsearch 2: 0.5897658640144107
ROC index on test for NN with gridsearch 3: 0.5894830876526199
ROC index on test for NN with feature selection and gridsearch: 0.5883865595495283
ROC index on test for NN with feature selection (model selection) and gridsearch: 0.5939625115277549
In [20]:
import matplotlib.pyplot as plt

plt.plot(fpr_nn_1, tpr_nn_1, label='NN_default {:.3f}'.format(roc_index_nn_1), color='red')
plt.plot(fpr_nn_2, tpr_nn_2, label='NN with relu {:.3f}'.format(roc_index_nn_2), color='green')
plt.plot(fpr_cv_1, tpr_cv_1, label='NN cv_1 {:.3f}'.format(roc_index_cv_1), color='b')
plt.plot(fpr_cv_2, tpr_cv_2, label='NN cv_2 {:.3f}'.format(roc_index_cv_2), color='y')
plt.plot(fpr_cv_3, tpr_cv_3, label='NN cv_3 {:.3f}'.format(roc_index_cv_3), color='c')
plt.plot(fpr_rfe_cv, tpr_rfe_cv, label='NN rfe_cv {:.3f}'.format(roc_index_rfe_cv), color='m')
plt.plot(fpr_cv_sel_model, tpr_cv_sel_model, label='NN with cv_sel_model {:.3f}'.format(roc_index_cv_sel_model), color='k')
plt.plot([0, 1], [0, 1], color='navy', lw=0.5, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Based on the ROC curve, the neural network model with decision tree selected features, with an
ROC index of 0.594, performs marginally better than the other models.

Now, your task is to compare the best performing neural network with the best performing
decision tree and logistic regression. Identify which model performs the best on this dataset.
In [21]:
### Enter your code
import pickle

with open('DT.pickle', 'rb') as f:
    dt_best, roc_index_dt_cv, fpr_dt_cv, tpr_dt_cv = pickle.load(f)

with open('LR.pickle', 'rb') as f:
    lr_best, roc_index_lr_cv, fpr_lr_cv, tpr_lr_cv = pickle.load(f)

plt.plot(fpr_dt_cv, tpr_dt_cv, label='DT {:.3f}'.format(roc_index_dt_cv), color='red')
plt.plot(fpr_lr_cv, tpr_lr_cv, label='LR {:.3f}'.format(roc_index_lr_cv), color='green')
plt.plot(fpr_cv_sel_model, tpr_cv_sel_model, label='NN with cv_sel_model {:.3f}'.format(roc_index_cv_sel_model), color='b')
plt.plot([0, 1], [0, 1], color='navy', lw=0.5, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
6. Ensemble Modeling

Ensemble modeling is a supervised learning technique that combines the predictions from multiple
models to produce a stronger model. Ensemble models are typically more accurate and more
robust than the individual models they are built on. Typically, the individual models are of
different classes (e.g. combining a decision tree and a logistic regression) or they are trained on
different subsets of the data (e.g. combining 2 decision trees, each trained on one half of the
training data).

There are three major techniques in ensemble modelling (a short sketch of each follows this list):
1. Bagging: predictions from each model are combined through a voting/averaging process.
An example of a bagging model is Random Forest, which consists of many simple decision
trees trained on random subsets of the data.
2. Boosting: models are combined through sequential learning. The first model is learned and
evaluated on the training data. A second model is created focusing on correcting the errors
of the first model. This process continues until a set limit is reached (commonly
the number of models or convergence of the accuracy improvement). Well-known boosting
models are Gradient Boosting and AdaBoost.
3. Stacking: stacking is similar to bagging, but instead of using a simple voting/averaging process, a
stacking ensemble builds another model to assign weights to the predictions of each individual
model.
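Each of the three techniques has a ready-made implementation in sklearn.ensemble. A sketch
of how they would be initialised, with illustrative default settings only (none of these models is
tuned or used in this practical):

from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# bagging: many simple trees, each trained on a random subset of the data
bagging = RandomForestClassifier(n_estimators=100, random_state=10)

# boosting: trees added sequentially, each correcting the previous errors
boosting = GradientBoostingClassifier(n_estimators=100, random_state=10)

# stacking: a final model learns how to weight the base models' predictions
stacking = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=10)),
                ('lr', LogisticRegression(random_state=10))],
    final_estimator=LogisticRegression())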
A great introduction to ensemble learning and why it works very well

In this practical, we will build a simple voting-based bagging model to combine the best model
from each family of decision tree, logistic regression, and neural network models. Start by
importing the VotingClassifier from sklearn.ensemble. After that, initialise the voting
classifier as follows.
In [22]:
# import the model
from sklearn.ensemble import VotingClassifier

# load the best performing decision tree and logistic regression models that we have
import pickle

with open('DT.pickle', 'rb') as f:
    dt_best, roc_index_dt, fpr_dt, tpr_dt = pickle.load(f)

with open('LR.pickle', 'rb') as f:
    lr_best, roc_index_lr, fpr_lr, tpr_lr = pickle.load(f)

# select the best performing neural network
nn_best = cv_sel_model

# initialise the classifier with 3 different estimators
# (soft voting, so that predict_proba is available for the ROC score below)
voting = VotingClassifier(estimators=[('dt', dt_best), ('lr', lr_best), ('nn', nn_best)],
                          voting='soft')
There are two possible values for the voting hyperparameter: hard voting and soft voting.
With hard voting, predictions are made using the majority predicted class, while soft voting
averages the predicted class probabilities and picks the class with the highest average
probability. With well-calibrated models (we did calibration through GridSearchCV), sklearn
recommends soft voting. The toy example below illustrates the difference.
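A toy illustration of the two schemes, with made-up probabilities from three classifiers for a
single case:

import numpy as np

# predicted P(class 1) from three hypothetical classifiers for one case
probas = np.array([0.45, 0.45, 0.90])

# hard voting: each classifier casts one vote using a 0.5 threshold
votes = (probas >= 0.5).astype(int)                  # -> [0, 0, 1]
hard_prediction = int(votes.sum() > len(votes) / 2)  # majority vote -> 0

# soft voting: average the probabilities, then threshold
soft_prediction = int(probas.mean() >= 0.5)          # mean 0.60 -> 1

print(hard_prediction, soft_prediction)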
Fit the voting model to the training data. After that, evaluate the accuracy on the test data.

In [23]:
# fit the voting classifier to training data
voting.fit(X_train, y_train)

# evaluate train and test accuracy
print("Ensemble train accuracy:", voting.score(X_train, y_train))
print("Ensemble test accuracy:", voting.score(X_test, y_test))

# evaluate ROC AUC score
y_pred_proba_ensemble = voting.predict_proba(X_test)
roc_index_ensemble = roc_auc_score(y_test, y_pred_proba_ensemble[:, 1])
print("ROC score of voting classifier:", roc_index_ensemble)

Ensemble train accuracy: 0.6141592920353982
Ensemble test accuracy: 0.5774260151410874
ROC score of voting classifier: 0.6061105271908181

It can be seen that the ensemble method managed to produce a slightly higher test accuracy and
ROC score than the three individual models.
End notes

In this practical, we learned how to build, tune and explore the structure of neural network
models. We explored dimensionality reduction and transformation techniques to reduce the size
of the feature set and improve the performance of neural network models. In addition, we used
ROC curves to compare the performance of all the models we have built so far. Lastly, we
built a voting-based bagging ensemble model combining the three individual predictive models.