Week 12 Computer Tutorial - Predictive Mining Using Neural Networks
Dr Richi Nayak, r.nayak@qut.edu.au

Topics:
1. Loading the pre-processed dataset
2. Building the first neural network model
3. Finding optimal hyperparameters with GridSearchCV
4. Feature selection
5. Comparison and finding the best performing model using the ROC curve
6. Ensemble modeling

Part 1 - Reflective exercises

In this tutorial, you will reflect on the concepts of predictive modelling using neural networks.

Exercise 1: Neural Network - Basics

1. A feed-forward neural network is said to be fully connected when
   a. all nodes are connected to each other.
   b. all nodes at the same layer are connected to each other.
   c. all nodes at one layer are connected to all nodes in the next higher layer.
   d. all hidden layer nodes are connected to all output layer nodes.
   Which one is true?

2. The values input into a feed-forward neural network
   a. may be categorical or numeric.
   b. must be either all categorical or all numeric but not both.
   c. must be numeric.
   d. must be categorical.
   Which one is true?

3. Neural network training is accomplished by repeatedly passing the training data through the network while
   a. individual network weights are modified.
   b. training instance attribute values are modified.
   c. the ordering of the training instances is modified.
   d. individual network nodes have the coefficients on their corresponding functional parameters modified.
   Which one is true?

4. Epochs represent the total number of
   a. input layer nodes.
   b. passes of the training data through the network.
   c. network nodes.
   d. passes of the test data through the network.

5. What happens when a neural network is over-trained?

6. Why are neural networks called universal approximators?

7. Which statement is true about neural network and linear regression models?
   a. Both models require input attributes to be numeric.
   b. Both models require numeric attributes to range between 0 and 1.
   c. The output of both models is a categorical attribute value.
   d. Both techniques build models whose output is determined by a linear sum of weighted input attribute values.
   e. More than one of a, b, c or d is true.

8. This supervised learning technique can process both numeric and categorical output attributes.
   a. linear regression
   b. decision tree
   c. logistic regression
   d. neural network learning

9. Compare the classification algorithms (Linear Regression, DT and ANN).

Exercise 2: Predictive mining using neural networks

1. Consider the following network with three input nodes, three links and one output node. Calculate the output of f(x), assuming the node has a linear activation function. What will the values be if the logistic (sigmoid) and ReLU functions are applied on the output node? (A worked sketch follows at the end of this exercise.)

2. Consider the neural network shown below in (a) and the table of sample data in (b). Compute values for nodes D, E, and F using the logistic function as the activation function.

3. The following table consists of training data from an employee database. The data have been generalized. For a given row entry, the count represents the number of data tuples having the values for department, status, age, and salary given in that row. Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers.

4. Neural network training is accomplished by repeatedly passing the training data through the network while
   a. individual network weights are modified.
   b. training instance attribute values are modified.
   c. the ordering of the training instances is modified.
   d. individual network nodes have the coefficients on their corresponding functional parameters modified.

5. What happens when a neural network is over-trained?

6. Why are neural networks called universal approximators?
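The network diagram and weights for question 1 are not reproduced in this extract, so the sketch below uses hypothetical inputs and weights purely to illustrate the calculation: the output node first computes the weighted sum of its inputs, then the linear, logistic (sigmoid) or ReLU activation is applied to that sum.

import numpy as np

# Hypothetical values - the actual figure is not included in this extract
x = np.array([1.0, 0.5, -0.2])   # three input node values
w = np.array([0.3, -0.1, 0.6])   # weights on the three links

net = np.dot(w, x)                    # weighted sum feeding the output node

linear_out = net                      # linear activation: f(net) = net
sigmoid_out = 1 / (1 + np.exp(-net))  # logistic activation: f(net) = 1 / (1 + e^(-net))
relu_out = max(0.0, net)              # ReLU activation: f(net) = max(0, net)

print(linear_out, sigmoid_out, relu_out)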
Exercise 3: Comparison of algorithms

1. Discuss how you would control overfitting in algorithms such as decision trees, neural networks, and logistic regression.

2. Compare the three classification algorithms: decision tree, neural network, and logistic regression. Identify specific features of the dataset/learning task, such as the presence of missing values, a large dataset, the need for speed, and many others. Comment on the performance of each algorithm in terms of how well it will work if that feature is present in the data/problem.

3. Given the two data sets shown in the following figures, explain how well (or how badly) the three different classification algorithms (DT, Logistic Regression and NN) will perform on these data sets.

Part 2 - Practical exercises

These practical notes contain the instructions for neural network modelling in Python. The objective is to build a neural network that classifies the lapsing donors based on their responses to the greeting card mailing campaign conducted by the national veterans' organisation. We will use the Veteran dataset to predict the TARGETB variable.

With its exotic-sounding name, a neural network model is often regarded as a mysterious yet powerful predictive tool. Perhaps surprisingly, the most typical form of neural network is, in fact, a natural extension of the regression model. This form of neural network is called the multilayer perceptron, which is the subject of today's practical. A single-node neural network, called a perceptron, with a linear activation function can be considered a linear regression.

By the end of this practical, we will have built a series of predictive models including a decision tree, a logistic regression and a neural network. We will compare all the models to understand the strengths and weaknesses of each modelling method.

In the financial and health domains, the performance of a predictive model is crucial. To achieve better performance, multiple models can be combined to achieve higher predictive performance than the individual models. This approach is called ensemble modelling and it is covered in the last part of this practical.

1. Loading the Pre-processed Dataset

We will reuse the data preprocessing code developed in the previous practicals.

In [1]:
# libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from dm_tools import data_prep
from sklearn.preprocessing import StandardScaler

# set the random seed - consistent
rs = 10

# load the data
df, X, y, X_train, X_test, y_train, y_test = data_prep()

# standardise the features - neural networks are sensitive to feature scales
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)
2. Building the first neural network model

Let us start by importing a neural network model from the library. In sklearn, a neural network classifier is implemented in MLPClassifier, short for multilayer perceptron classifier.

In [2]:
from sklearn.neural_network import MLPClassifier

Let's train our first MLPClassifier. Initiate the model without any additional parameters (other than the random state, for consistency), fit it to the training data and test its performance on the test data.

In [3]:
model_1 = MLPClassifier(random_state=rs)
model_1.fit(X_train, y_train)

print("Train accuracy:", model_1.score(X_train, y_train))
print("Test accuracy:", model_1.score(X_test, y_test))

y_pred = model_1.predict(X_test)
print(classification_report(y_test, y_pred))

print(model_1)

C:\Users\bsthi\anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

Train accuracy: 0.8952802359882006
Test accuracy: 0.5350997935306263

              precision    recall  f1-score   support
           0       0.53      0.55      0.54      1453
           1       0.54      0.52      0.53      1453
    accuracy                           0.54      2906
   macro avg       0.54      0.54      0.53      2906
weighted avg       0.54      0.54      0.53      2906

MLPClassifier(random_state=10)

This default neural network achieved high accuracy on the training dataset. However, the test accuracy is much lower (53%), indicating overfitting to the training data. You should also notice a convergence warning.
In sklearn, if a neural network does not achieve convergence before the maximum number of iterations, it raises a convergence warning. This first neural network raised that warning. If the warning is raised, the max_iter hyperparameter of the neural network should be increased. However, if the network fails to reach convergence even with a very large number of maximum iterations, this might indicate a problem with the error computation. The following code shows the result of MLPClassifier training with max_iter set to 700.

In [4]:
model_2 = MLPClassifier(max_iter=700, random_state=rs)
model_2.fit(X_train, y_train)

print("Train accuracy:", model_2.score(X_train, y_train))
print("Test accuracy:", model_2.score(X_test, y_test))

y_pred = model_2.predict(X_test)
print(classification_report(y_test, y_pred))

print(model_2)

Train accuracy: 0.9960176991150442
Test accuracy: 0.534755677907777

              precision    recall  f1-score   support
           0       0.53      0.53      0.53      1453
           1       0.53      0.54      0.54      1453
    accuracy                           0.53      2906
   macro avg       0.53      0.53      0.53      2906
weighted avg       0.53      0.53      0.53      2906

MLPClassifier(max_iter=700, random_state=10)

With 700 iterations, the neural network achieved much higher accuracy on the training dataset (99%). However, the test accuracy remains about the same (53%). This clearly indicates overfitting to the training data. Next, we will use GridSearch tuning and dimensionality reduction techniques to control the overfitting of the model to the training data.

Solver and activation function

Finding the best combination of weights in a neural network is a significant search problem. The algorithm used to find this optimal weight set is called the solver in sklearn, and the most common one is gradient descent. Gradient descent starts with a set of randomly generated weights. In each iteration of gradient descent, predictions are made on X_train and the error value (cost) is computed; the weight set is then altered to reduce this error value. Each iteration is called an epoch. To stop the gradient descent iterations, a strategy such as a maximum number of iterations, a minimum error threshold, or convergence (the error has not improved over a certain number of epochs) is used. A combination of maximum iterations and convergence is the most commonly used criterion.
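As an illustration only (this is not how sklearn's optimiser is implemented internally), the sketch below spells out this loop for plain gradient descent on a single linear neuron with a squared-error cost: each epoch makes predictions on the training data, computes the cost, and moves the weights a small step against the gradient of that cost, stopping after a fixed maximum number of iterations.

import numpy as np

# Toy data: 4 records, 2 features, numeric target (illustration only)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 2.0, 0.0])

rng = np.random.default_rng(10)
w = rng.normal(size=2)       # randomly generated starting weights
b = 0.0
learning_rate = 0.1

for epoch in range(100):                 # maximum-iterations stopping strategy
    y_pred = X @ w + b                   # predictions on the training data
    error = y_pred - y
    cost = np.mean(error ** 2)           # error value (cost) for this epoch

    # alter the weights to reduce the cost (gradient of the squared error)
    w -= learning_rate * (2 / len(y)) * (X.T @ error)
    b -= learning_rate * (2 / len(y)) * error.sum()

print("final cost:", cost, "weights:", w, "bias:", b)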
Observe the MLPClassifier object hyperparameters printed out above. You should see that the solver hyperparameter is set by default to adam (adaptive moment estimation). Adam is an extension of gradient descent, designed to speed up the training process and be more computationally efficient. Adam is the solver algorithm of choice for many deep neural networks because of its efficiency, and we will use adam instead of plain gradient descent here.

A great explanation of adam

Another important hyperparameter to observe is activation, which refers to the activation function used in the hidden layers of the neural network. There are a number of options, including:

- identity: f(x) = x, no transformation
- tanh: f(x) = tanh(x)
- sigmoid (called logistic in sklearn): f(x) = 1 / (1 + e^(-x)), commonly used in logistic regression
- relu (rectified linear unit): f(x) = max(0, x), the default option in sklearn

The identity function turns the neural network into a linear model, so it is not commonly used. In the past, the tanh and sigmoid functions were very popular; however, recent research suggests that the relu function can produce similarly accurate results at a much lower training time. Therefore, we will use relu as the activation function in this practical.
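To make these defaults explicit, here is a minimal sketch (not part of the original notebook) that simply spells out the solver and hidden-layer activation when constructing the classifier; since 'adam' and 'relu' are sklearn's defaults, this is equivalent to model_2 trained above.

# 'adam' and 'relu' are the sklearn defaults, so this matches model_2 above
model_explicit = MLPClassifier(solver='adam', activation='relu',
                               max_iter=700, random_state=rs)
model_explicit.fit(X_train, y_train)
print("Test accuracy:", model_explicit.score(X_test, y_test))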
3. Finding optimal hyperparameters with GridSearchCV

Next we will find the optimal hyperparameters using GridSearchCV. A neural network is harder to tune than decision trees or regression models because it has relatively many types of parameters and a slow training process. In this practical, we will focus on tuning two parameters:

1. hidden_layer_sizes: takes a tuple in which the i-th element is the number of neurons in the i-th hidden layer.
2. alpha: the L2 regularisation (weight penalty) parameter added to the loss function.

Let us start by tuning the hidden layer sizes. There is no official guideline on how many neurons we should have in each layer, but for most data mining tasks a single hidden layer with no more neurons than the number of input variables and no fewer than the number of output neurons (one here, as this is a binary classification task) is sufficient.

Deep Learning

In the past decade, deep learning models have become highly popular. Deep learning is the process of building very complex neural networks (up to hundreds of layers and thousands of neurons, hence "deep"). Deep neural networks are typically used for complex tasks, such as image recognition, Siri-like voice assistants, machine translation and self-driving cars.

See how many input features we have by printing out the training set shape.

In [5]:
print(X_train.shape)

(6780, 85)

With 85 features, we will start tuning with one hidden layer of 5 to 85 neurons, in increments of 20. This might take a bit of time.

In [6]:
params = {'hidden_layer_sizes': [(x,) for x in range(5, 86, 20)]}

cv_1 = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                    cv=10, n_jobs=-1, return_train_score=True)
cv_1.fit(X_train, y_train)

Tip: setting max_iter to a higher value (say 700) would complete the GridSearchCV without the convergence warning; however, the process would be expensive and take much longer to complete.

C:\Users\bsthi\anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

Out[6]:
GridSearchCV(cv=10, estimator=MLPClassifier(random_state=10), n_jobs=-1,
             param_grid={'hidden_layer_sizes': [(5,), (25,), (45,), (65,), (85,)]},
             return_train_score=True)

In [7]:
result_set = cv_1.cv_results_
print(result_set)

{'mean_fit_time': array([ 5.53638704,  8.25172894, 10.54289422, 11.23771729, 12.78879881]),
 'std_fit_time': array([0.20275097, 0.89427211, 1.1670461 , 0.4234314 , 2.09495097]),
 'mean_score_time': array([0.00199392, 0.00189514, 0.00268989, 0.00409091, 0.00269284]),
 'std_score_time': array([0.00109221, 0.00029912, 0.00077612, 0.00104152, 0.00109638]),
 'param_hidden_layer_sizes': masked_array(data=[(5,), (25,), (45,), (65,), (85,)],
              mask=[False, False, False, False, False], fill_value='?', dtype=object),
 'params': [{'hidden_layer_sizes': (5,)}, {'hidden_layer_sizes': (25,)}, {'hidden_layer_sizes': (45,)},
            {'hidden_layer_sizes': (65,)}, {'hidden_layer_sizes': (85,)}],
 'split0_test_score': array([0.55014749, 0.54129794, 0.51474926, 0.52654867, 0.53687316]),
 'split1_test_score': array([0.55899705, 0.59734513, 0.58702065, 0.57079646, 0.57669617]),
 'split2_test_score': array([0.55457227, 0.54277286, 0.55899705, 0.55309735, 0.55014749]),
 'split3_test_score': array([0.54129794, 0.54867257, 0.52212389, 0.51474926, 0.51769912]),
 'split4_test_score': array([0.54719764, 0.54277286, 0.53982301, 0.53834808, 0.54129794]),
 'split5_test_score': array([0.53982301, 0.50737463, 0.51327434, 0.49115044, 0.53982301]),
 'split6_test_score': array([0.54719764, 0.52064897, 0.53982301, 0.55014749, 0.5       ]),
 'split7_test_score': array([0.55309735, 0.57964602, 0.52949853, 0.50589971, 0.52064897]),
 'split8_test_score': array([0.52064897, 0.5280236 , 0.51179941, 0.51327434, 0.49262537]),
 'split9_test_score': array([0.53834808, 0.52212389, 0.5280236 , 0.50442478, 0.52212389]),
 'mean_test_score': array([0.54513274, 0.54306785, 0.53451327, 0.52684366, 0.52979351]),
 'std_test_score': array([0.0103287 , 0.02600057, 0.02234112, 0.0241708 , 0.02335605]),
 'rank_test_score': array([1, 2, 3, 5, 4]),
 'split0_train_score': array([0.63569322, 0.73549656, 0.79531301, 0.84152737, 0.89233038]),
 'split1_train_score': array([0.63176008, 0.7359882 , 0.78629957, 0.84791872, 0.87708948]),
 'split2_train_score': array([0.63552933, 0.7258276 , 0.7928548 , 0.84398558, 0.88888889]),
 'split3_train_score': array([0.63339889, 0.75008194, 0.80481809, 0.84824648, 0.88905277]),
 'split4_train_score': array([0.63765978, 0.74434612, 0.8064569 , 0.85414618, 0.88757784]),
 'split5_train_score': array([0.64323173, 0.73566044, 0.80268764, 0.84955752, 0.88872501]),
 'split6_train_score': array([0.64044576, 0.74385447, 0.80694854, 0.85283514, 0.88659456]),
 'split7_train_score': array([0.63225172, 0.74287119, 0.80072108, 0.85365454, 0.88757784]),
 'split8_train_score': array([0.63929859, 0.74451   , 0.80350705, 0.84873812, 0.89708292]),
 'split9_train_score': array([0.6374959 , 0.74139626, 0.80612914, 0.84103573, 0.88643068]),
 'mean_train_score': array([0.6366765 , 0.74000328, 0.80057358, 0.84816454, 0.88813504]),
 'std_train_score': array([0.00349953, 0.00649607, 0.00654393, 0.00449897, 0.00476789])}

As usual, let us plot the train and test scores of split 0.

In [8]:
import matplotlib.pyplot as plt

train_result = result_set['split0_train_score']
test_result = result_set['split0_test_score']
print("Total number of models: ", len(test_result))

# plot hidden layer size hyperparameter values vs training and test accuracy scores
plt.plot(range(0, len(train_result)), train_result, 'b',
         range(0, len(test_result)), test_result, 'r')
plt.xlabel('Hyperparameter Hidden_layers\nBlue = training acc. Red = test acc.')
plt.xticks(range(0, len(train_result)), range(5, 86, 20))
plt.ylabel('score')
plt.show()

Total number of models: 5

(Figure: split 0 training and test accuracy vs hidden layer size.)

Now, plot the mean train and test scores of all runs.

In [9]:
### Enter your code
train_result = result_set['mean_train_score']
test_result = result_set['mean_test_score']
print("Total number of models: ", len(test_result))

# plot hidden layer size hyperparameter values vs mean training and test accuracy scores
plt.plot(range(0, len(train_result)), train_result, 'b',
         range(0, len(test_result)), test_result, 'r')
plt.xlabel('Hyperparameter Hidden_layers\nBlue = training acc. Red = test acc.')
plt.xticks(range(0, len(train_result)), range(5, 86, 20))
plt.ylabel('score')
plt.show()

Total number of models: 5

(Figure: mean training and test accuracy vs hidden layer size across all folds.)
In [10]:
print("Train accuracy:", cv_1.score(X_train, y_train))
print("Test accuracy:", cv_1.score(X_test, y_test))

y_pred = cv_1.predict(X_test)
print(classification_report(y_test, y_pred))

print(cv_1.best_params_)

Train accuracy: 0.6368731563421829
Test accuracy: 0.5633172746042671

              precision    recall  f1-score   support
           0       0.56      0.59      0.57      1453
           1       0.57      0.54      0.55      1453
    accuracy                           0.56      2906
   macro avg       0.56      0.56      0.56      2906
weighted avg       0.56      0.56      0.56      2906

{'hidden_layer_sizes': (5,)}

This GridSearchCV returns 5 as the optimal number of neurons in the hidden layer. Based on the performance of the previous decision tree and regression models, it seems that less complex models (smaller trees, smaller feature sets) tend to generalise better on this dataset. We should therefore attempt to tune the model with an even lower number of neurons in the hidden layer.

In [11]:
# new parameters
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)]}

cv_2 = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                    cv=10, n_jobs=-1)  # trailing arguments truncated in the source; cv=10 and n_jobs=-1 assumed, as in cv_1
cv_2.fit(X_train, y_train)

print("Train accuracy:", cv_2.score(X_train, y_train))
print("Test accuracy:", cv_2.score(X_test, y_test))

y_pred = cv_2.predict(X_test)
print(classification_report(y_test, y_pred))

print(cv_2.best_params_)

Train accuracy: 0.6122418879056047
Test accuracy: 0.569511355815554

              precision    recall  f1-score   support
           0       0.56      0.61      0.58      1453
           1       0.57      0.53      0.55      1453
    accuracy                           0.57      2906
   macro avg       0.57      0.57      0.57      2906
weighted avg       0.57      0.57      0.57      2906

{'hidden_layer_sizes': (3,)}

We now have the optimal value for the neuron count in the hidden layer. Next, we will tune the second hyperparameter, alpha. In sklearn's MLPClassifier, alpha is the strength of the L2 regularisation (weight penalty) term added to the loss function: a larger alpha penalises large weights more strongly and produces a simpler model that may underfit, while a smaller alpha applies less regularisation, leaving the model more prone to overfitting the training data.

The default value for alpha is 0.0001, so we will try alpha values around this number.

In [12]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)],
          'alpha': [0.01, 0.001, 0.0001, 0.00001]}
          # the alpha list is truncated in the source after 0.001;
          # 0.0001 and 0.00001 are inferred from the text and the best_params_ output below

cv_3 = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                    cv=10, n_jobs=-1)  # trailing arguments truncated in the source; cv=10, n_jobs=-1 assumed
cv_3.fit(X_train, y_train)

print("Train accuracy:", cv_3.score(X_train, y_train))
print("Test accuracy:", cv_3.score(X_test, y_test))

y_pred = cv_3.predict(X_test)
print(classification_report(y_test, y_pred))

print(cv_3.best_params_)

Train accuracy: 0.6100294985250737
Test accuracy: 0.5667584308327598

              precision    recall  f1-score   support
           0       0.56      0.60      0.58      1453
           1       0.57      0.53      0.55      1453
    accuracy                           0.57      2906
   macro avg       0.57      0.57      0.57      2906
weighted avg       0.57      0.57      0.57      2906

{'alpha': 1e-05, 'hidden_layer_sizes': (3,)}

The grid search returned a hidden layer of 3 neurons and an alpha value of 0.00001 as the optimal hyperparameters. However, this model is not better than the previous model cv_2.

4. Dimensionality reduction

Next, we will try to improve the performance of the model through the dimensionality reduction and transformation techniques covered in the regression modelling practical.

4.1. Recursive Feature Elimination

We will first try to reduce the feature set size using RFE. RFE requires a base elimination model of a type that assigns a weight or feature importance to each feature (such as a regression or a decision tree). Neural networks provide neither, so we will use logistic regression as the base elimination model.

In [13]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfe = RFECV(estimator=LogisticRegression(random_state=rs), cv=10)
rfe.fit(X_train, y_train)

print(rfe.n_features_)

40

RFE with logistic regression has selected 40 features as the best set of features. With these selected features, tune an MLPClassifier model.

In [14]:
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# step = int((X_train_rfe.shape[1] + 5)/5);

params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)],
          'alpha': [0.01, 0.001, 0.0001, 0.00001]}  # alpha list truncated in the source; reconstructed as above

rfe_cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                      cv=10, n_jobs=-1)  # trailing arguments truncated in the source
rfe_cv.fit(X_train_rfe, y_train)

print("Train accuracy:", rfe_cv.score(X_train_rfe, y_train))
print("Test accuracy:", rfe_cv.score(X_test_rfe, y_test))

y_pred = rfe_cv.predict(X_test_rfe)
print(classification_report(y_test, y_pred))

print(rfe_cv.best_params_)

Train accuracy: 0.6128318584070797
Test accuracy: 0.5684790089470062

              precision    recall  f1-score   support
           0       0.57      0.57      0.57      1453
           1       0.57      0.57      0.57      1453
    accuracy                           0.57      2906
   macro avg       0.57      0.57      0.57      2906
weighted avg       0.57      0.57      0.57      2906

{'alpha': 0.001, 'hidden_layer_sizes': (7,)}

The RFE-selected feature set showed an improvement over the original data set. However, compared with the previous best model (cv_2), this model performed slightly worse. This indicates that elimination with logistic regression did not produce a feature set well suited to the neural network.

4.2. Selecting features using a decision tree

Lastly, we will use the decision tree model to perform feature selection for the neural network modelling. To start, we need a decision tree tuned with GridSearchCV; here we load the one saved in an earlier practical.

In [15]:
import pickle

with open('DT.pickle', 'rb') as f:
    dt_best, roc_index_dt_cv, fpr_dt_cv, tpr_dt_cv = pickle.load(f)

print(dt_best.best_params_)

{'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 10}

In [16]:
from dm_tools import analyse_feature_importance

analyse_feature_importance(dt_best.best_estimator_, X.columns)

GiftCnt36 : 0.2786889679489785
DemMedHomeValue : 0.16527761052028647
GiftAvgLast : 0.12080196710144986
GiftCntAll : 0.06936170003476945
GiftTimeLast : 0.06852490725467034
StatusCatStarAll : 0.039760012572680054
DemAge : 0.0395320434572485
PromCnt36 : 0.034873357233237874
GiftCntCardAll : 0.030230871689451603
GiftTimeFirst : 0.027615727171569734
DemPctVeterans : 0.025370100839886917
PromCnt12 : 0.021248703587791955
DemCluster_21 : 0.016907242527362424
PromCntCard36 : 0.016068732710812456
PromCntAll : 0.01598797726061107
PromCntCard12 : 0.015718616757198205
DemMedIncome : 0.014031461331994594
DemCluster_48 : 0.0
StatusCat96NK_L : 0.0
StatusCat96NK_A : 0.0

The decision tree model identifies a set of 17 variables as important features.

In [17]:
from sklearn.feature_selection import SelectFromModel

selectmodel = SelectFromModel(dt_best.best_estimator_, prefit=True)
X_train_sel_model = selectmodel.transform(X_train)
X_test_sel_model = selectmodel.transform(X_test)

print(X_train_sel_model.shape)

(6780, 17)

Proceed to tune an MLPClassifier with this reduced dataset.

In [18]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)],
          'alpha': [0.01, 0.001, 0.0001, 0.00001]}  # alpha list truncated in the source; reconstructed as before

cv_sel_model = GridSearchCV(param_grid=params, estimator=MLPClassifier(random_state=rs),
                            cv=10, n_jobs=-1)  # trailing arguments truncated in the source
cv_sel_model.fit(X_train_sel_model, y_train)

print("Train accuracy:", cv_sel_model.score(X_train_sel_model, y_train))
print("Test accuracy:", cv_sel_model.score(X_test_sel_model, y_test))

y_pred = cv_sel_model.predict(X_test_sel_model)
print(classification_report(y_test, y_pred))

print(cv_sel_model.best_params_)

Train accuracy: 0.6028023598820059
Test accuracy: 0.5615966964900206

              precision    recall  f1-score   support
           0       0.56      0.58      0.57      1453
           1       0.56      0.54      0.55      1453
    accuracy                           0.56      2906
   macro avg       0.56      0.56      0.56      2906
weighted avg       0.56      0.56      0.56      2906

{'alpha': 1e-05, 'hidden_layer_sizes': (5,)}

In terms of accuracy, the neural network trained on the decision-tree-selected variables did not improve model performance, so on this measure the previous best model (cv_2) remains the best performing neural network.

5. Comparing the models to find the best performing model

A total of seven models has been built:

1. Default neural network (`model_1`)
2. Neural network with max_iter=700 (`model_2`, labelled "NN with relu" in the output below)
3. Neural network + grid search (`cv_1`)
4. Neural network + grid search (`cv_2`)
5. Neural network + grid search (`cv_3`)
6. Neural network + RFE feature selection + grid search (`rfe_cv`)
7. Neural network + feature selection using DT + grid search (`cv_sel_model`)

Now, let us use the ROC curve to compare these models and identify the best performing neural network model, considering both true and false positives.

In [19]:
from sklearn.metrics import roc_auc_score

y_pred_proba_nn_1 = model_1.predict_proba(X_test)
y_pred_proba_nn_2 = model_2.predict_proba(X_test)
y_pred_proba_cv_1 = cv_1.predict_proba(X_test)
y_pred_proba_cv_2 = cv_2.predict_proba(X_test)
y_pred_proba_cv_3 = cv_3.predict_proba(X_test)
y_pred_proba_rfe_cv = rfe_cv.predict_proba(X_test_rfe)
y_pred_proba_cv_sel_model = cv_sel_model.predict_proba(X_test_sel_model)

roc_index_nn_1 = roc_auc_score(y_test, y_pred_proba_nn_1[:, 1])
roc_index_nn_2 = roc_auc_score(y_test, y_pred_proba_nn_2[:, 1])
roc_index_cv_1 = roc_auc_score(y_test, y_pred_proba_cv_1[:, 1])
roc_index_cv_2 = roc_auc_score(y_test, y_pred_proba_cv_2[:, 1])
roc_index_cv_3 = roc_auc_score(y_test, y_pred_proba_cv_3[:, 1])
roc_index_rfe_cv = roc_auc_score(y_test, y_pred_proba_rfe_cv[:, 1])
roc_index_cv_sel_model = roc_auc_score(y_test, y_pred_proba_cv_sel_model[:, 1])

print("ROC index on test for NN_default:", roc_index_nn_1)
print("ROC index on test for NN with relu:", roc_index_nn_2)
print("ROC index on test for NN with gridsearch 1:", roc_index_cv_1)
print("ROC index on test for NN with gridsearch 2:", roc_index_cv_2)
print("ROC index on test for NN with gridsearch 3:", roc_index_cv_3)
print("ROC index on test for NN with feature selection and gridsearch:", roc_index_rfe_cv)
print("ROC index on test for NN with feature selection (model selection) and gridsearch:", roc_index_cv_sel_model)

from sklearn.metrics import roc_curve

fpr_nn_1, tpr_nn_1, thresholds_nn_1 = roc_curve(y_test, y_pred_proba_nn_1[:, 1])
fpr_nn_2, tpr_nn_2, thresholds_nn_2 = roc_curve(y_test, y_pred_proba_nn_2[:, 1])
fpr_cv_1, tpr_cv_1, thresholds_cv_1 = roc_curve(y_test, y_pred_proba_cv_1[:, 1])
fpr_cv_2, tpr_cv_2, thresholds_cv_2 = roc_curve(y_test, y_pred_proba_cv_2[:, 1])
fpr_cv_3, tpr_cv_3, thresholds_cv_3 = roc_curve(y_test, y_pred_proba_cv_3[:, 1])
fpr_rfe_cv, tpr_rfe_cv, thresholds_rfe_cv = roc_curve(y_test, y_pred_proba_rfe_cv[:, 1])
fpr_cv_sel_model, tpr_cv_sel_model, thresholds_cv_sel_model = roc_curve(y_test, y_pred_proba_cv_sel_model[:, 1])

ROC index on test for NN_default: 0.5496717757455563
ROC index on test for NN with relu: 0.5431669720998726
ROC index on test for NN with gridsearch 1: 0.587214008655704
ROC index on test for NN with gridsearch 2: 0.5897658640144107
ROC index on test for NN with gridsearch 3: 0.5894830876526199
ROC index on test for NN with feature selection and gridsearch: 0.5883865595495283
ROC index on test for NN with feature selection (model selection) and gridsearch: 0.5939625115277549

In [20]:
import matplotlib.pyplot as plt

# per-curve colour arguments are truncated in the source and omitted here
plt.plot(fpr_nn_1, tpr_nn_1, label='NN_default {:.3f}'.format(roc_index_nn_1))
plt.plot(fpr_nn_2, tpr_nn_2, label='NN with relu {:.3f}'.format(roc_index_nn_2))
plt.plot(fpr_cv_1, tpr_cv_1, label='NN cv_1 {:.3f}'.format(roc_index_cv_1))
plt.plot(fpr_cv_2, tpr_cv_2, label='NN cv_2 {:.3f}'.format(roc_index_cv_2))
plt.plot(fpr_cv_3, tpr_cv_3, label='NN cv_3 {:.3f}'.format(roc_index_cv_3))
plt.plot(fpr_rfe_cv, tpr_rfe_cv, label='NN rfe_cv {:.3f}'.format(roc_index_rfe_cv))
plt.plot(fpr_cv_sel_model, tpr_cv_sel_model, label='NN with cv_sel_model {:.3f}'.format(roc_index_cv_sel_model))
plt.plot([0, 1], [0, 1], color='navy', lw=0.5, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

(Figure: ROC curves for the seven neural network models.)

Based on the ROC curve, the neural network model with the decision-tree-selected features, with a ROC index of 0.594, performs marginally better than the other models.

Now, your task is to compare the best performing neural network with the best performing decision tree and logistic regression. Identify which model performs best on this dataset.

In [21]:
### Enter your code
import pickle

with open('DT.pickle', 'rb') as f:
    dt_best, roc_index_dt_cv, fpr_dt_cv, tpr_dt_cv = pickle.load(f)

with open('LR.pickle', 'rb') as f:
    lr_best, roc_index_lr_cv, fpr_lr_cv, tpr_lr_cv = pickle.load(f)

plt.plot(fpr_dt_cv, tpr_dt_cv, label='DT {:.3f}'.format(roc_index_dt_cv), color='red')
plt.plot(fpr_lr_cv, tpr_lr_cv, label='LR {:.3f}'.format(roc_index_lr_cv), color='green')  # colour truncated in the source; 'green' assumed
plt.plot(fpr_cv_sel_model, tpr_cv_sel_model, label='NN with cv_sel_model {:.3f}'.format(roc_index_cv_sel_model))
plt.plot([0, 1], [0, 1], color='navy', lw=0.5, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

(Figure: ROC curves comparing the best decision tree, logistic regression and neural network models.)
6. Ensemble Modeling

Ensemble modelling is a supervised learning technique that combines the predictions from multiple models to produce a stronger model. Ensemble models are typically more accurate and more robust than the individual models they are built on. Typically, the individual models are of different classes (e.g. combining a decision tree and a logistic regression) or they are trained on different subsets of the data (e.g. combining two decision trees, each trained on one half of the training data). There are three major techniques in ensemble modelling (a small sketch of each follows this list):

1. Bagging: predictions from the individual models are combined through a voting/averaging process. An example of a bagging model is Random Forest, which consists of many simple decision trees trained on random subsets of the data.
2. Boosting: models are combined through sequential learning. The first model is learned and evaluated on the training data. A second model is then created, focusing on correcting the errors of the first model. This process continues until a set limit is reached (commonly a number of models, or convergence of the accuracy improvement). Well-known boosting models are Gradient Boosting and AdaBoost.
3. Stacking: stacking is similar to bagging, but instead of using a simple voting/averaging process, a stacking ensemble builds another model that assigns weights to the predictions of each individual model.

A great introduction to ensemble learning and why it works very well
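As a minimal illustration (not part of the original practical), sklearn provides ready-made implementations of all three families; the sketch below only shows how they are constructed, with small illustrative hyperparameter values.

from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Bagging: many decision trees on random subsets, combined by voting/averaging
bagging_example = RandomForestClassifier(n_estimators=100, random_state=rs)

# Boosting: trees built sequentially, each correcting the errors of the previous ones
boosting_example = GradientBoostingClassifier(n_estimators=100, random_state=rs)

# Stacking: a meta-model (here a logistic regression) learns how to weight the base models
stacking_example = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=rs)),
                ('lr', LogisticRegression(random_state=rs))],
    final_estimator=LogisticRegression())

# Each of these can be fitted and scored like any other sklearn classifier, e.g.
# bagging_example.fit(X_train, y_train); bagging_example.score(X_test, y_test)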
In this practical, we will build a simple voting-based bagging model that combines the best model from each of the decision tree, logistic regression and neural network families. Start by importing the VotingClassifier from sklearn.ensemble, then initialise the voting classifier as follows.

In [22]:
# import the model
from sklearn.ensemble import VotingClassifier

# load the best performing decision tree and logistic regression models that we have saved
import pickle

with open('DT.pickle', 'rb') as f:
    dt_best, roc_index_dt, fpr_dt, tpr_dt = pickle.load(f)

with open('LR.pickle', 'rb') as f:
    lr_best, roc_index_lr, fpr_lr, tpr_lr = pickle.load(f)

# select the best performing neural network
nn_best = cv_sel_model

# initialise the classifier with 3 different estimators
voting = VotingClassifier(estimators=[('dt', dt_best), ('lr', lr_best), ('nn', nn_best)],
                          voting='soft')  # the line is truncated in the source; voting='soft' assumed, as recommended in the text below

There are two possible values for the voting hyperparameter: hard voting and soft voting. With hard voting, the prediction is the majority predicted class label of the individual models, while soft voting averages their predicted class probabilities and picks the class with the highest average probability. For example, if the three models give positive-class probabilities of 0.6, 0.4 and 0.55, soft voting averages them to about 0.52 and predicts the positive class, while hard voting counts the individual labels (1, 0, 1) and also predicts the positive class by majority. sklearn recommends soft voting for well-calibrated models (here we rely on the models tuned through GridSearchCV).

Fit the voting model to the training data, then evaluate its accuracy on the test data.

In [23]:
# fit the voting classifier to the training data
voting.fit(X_train, y_train)

# evaluate train and test accuracy
print("Ensemble train accuracy:", voting.score(X_train, y_train))
print("Ensemble test accuracy:", voting.score(X_test, y_test))

# evaluate ROC AUC score
y_pred_proba_ensemble = voting.predict_proba(X_test)
roc_index_ensemble = roc_auc_score(y_test, y_pred_proba_ensemble[:, 1])
print("ROC score of voting classifier:", roc_index_ensemble)

Ensemble train accuracy: 0.6141592920353982
Ensemble test accuracy: 0.5774260151410874
ROC score of voting classifier: 0.6061105271908181

The ensemble method produced a slightly higher test accuracy and ROC score than the three individual models.

End notes

In this practical, we learned how to build, tune and explore the structure of neural network models. We explored dimensionality reduction and transformation techniques to reduce the size of the feature set and improve the performance of the neural network models. In addition, we used ROC curves to compare the end-to-end performance of all the models we have built so far. Lastly, we built a voting (bagging-based) ensemble model combining the three individual predictive models.