Project 2 - Binary Classification Comparative Methods

For this project we're going to attempt a binary classification of a dataset using multiple methods and compare results. Our goals for this project will be to introduce you to several of the most common classification techniques, how to perform them and tweak parameters to optimize outcomes, how to produce and interpret results, and how to compare performance. You will be asked to analyze your findings and provide explanations for observed performance.

DEFINITIONS

Binary Classification: In this case a complex dataset has an added 'target' label with one of two options. Your learning algorithm will try to assign one of these labels to the data.

Supervised Learning: This data is fully supervised, which means it has been fully labeled and we can trust the veracity of the labeling.

Submission Details

The project is due May 17th at 12:00 pm (Wednesday noon). To submit the project, please save the notebook as a pdf file and submit the assignment via Gradescope. In addition, make sure that all figures are legible and sufficiently large. For best pdf results, we recommend installing LaTeX and printing the notebook using LaTeX.

Loading Essentials and Helper Functions

In [ ]: #Here are a set of libraries we imported to complete this assignment.
#Feel free to use these or equivalent libraries for your implementation
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # this is used to plot graphs
import matplotlib
import os
import time

#Sklearn classes
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn import metrics
from sklearn.svm import SVC  #SVM classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
import sklearn.metrics.cluster as smc
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, Normalizer, MinMaxScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from matplotlib import pyplot
import itertools
%matplotlib inline

#Sets random seed
import random
random.seed(42)

In [ ]: # Helper function allowing you to export a graph
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [ ]: # Helper function that allows you to draw nicely formatted confusion matrices
def draw_confusion_matrix(y, yhat, classes):
    '''
    Draws a confusion matrix for the given target and predictions
    Adapted from scikit-learn and discussion example.
    '''
    plt.cla()
    plt.clf()
    matrix = confusion_matrix(y, yhat)
    plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.YlOrBr)
    plt.title("Confusion Matrix")
    plt.colorbar()
    num_classes = len(classes)
    plt.xticks(np.arange(num_classes), classes, rotation=90)
    plt.yticks(np.arange(num_classes), classes)
    fmt = 'd'
    thresh = matrix.max() / 2.
    for i, j in itertools.product(range(matrix.shape[0]), range(matrix.shape[1])):
        plt.text(j, i, format(matrix[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if matrix[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()

In [ ]: def heatmap(data, row_labels, col_labels, figsize=(20, 12), cmap="YlGn",
            cbar_kw={}, cbarlabel="", valfmt="{x:.2f}",
            textcolors=("black", "white"), threshold=None):
    """
    Create a heatmap from a numpy array and two lists of labels.
    Taken from matplotlib example.

    Parameters
    ----------
    data
        A 2D numpy array of shape (M, N).
    row_labels
        A list or array of length M with the labels for the rows.
    col_labels
        A list or array of length N with the labels for the columns.
    cmap
        A string that specifies the colormap to use. Look at matplotlib docs for information. Optional.
    cbar_kw
        A dictionary with arguments to `matplotlib.Figure.colorbar`. Optional.
    cbarlabel
        The label for the colorbar. Optional.
    valfmt
        The format of the annotations inside the heatmap. This should either
        use the string format method, e.g. "${x:.2f}", or be a
        `matplotlib.ticker.Formatter`. Optional.
    textcolors
        A pair of colors. The first is used for values below a threshold,
        the second for those above. Optional.
    threshold
        Value in data units according to which the colors from textcolors
        are applied. If None (the default) uses the middle of the colormap
        as the separation point.
    """
    plt.figure(figsize=figsize)
    ax = plt.gca()

    # Plot the heatmap
    im = ax.imshow(data, cmap=cmap)

    # Create colorbar
    cbar = ax.figure.colorbar(im, ax=ax, **cbar_kw)
    cbar.ax.set_ylabel(cbarlabel, rotation=-90, va="bottom")

    # Show all ticks and label them with the respective list entries.
    ax.set_xticks(np.arange(data.shape[1]), labels=col_labels)
    ax.set_yticks(np.arange(data.shape[0]), labels=row_labels)

    # Let the horizontal axes labeling appear on top.
    ax.tick_params(top=True, bottom=False, labeltop=True, labelbottom=False)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=-30, ha="right", rotation_mode="anchor")

    # Turn spines off and create white grid.
    ax.spines[:].set_visible(False)
    ax.set_xticks(np.arange(data.shape[1] + 1) - .5, minor=True)
    ax.set_yticks(np.arange(data.shape[0] + 1) - .5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.tick_params(which="minor", bottom=False, left=False)

    # Normalize the threshold to the image's color range.
    if threshold is not None:
        threshold = im.norm(threshold)
    else:
        threshold = im.norm(data.max()) / 2.

    # Set default alignment to center, but allow it to be
    # overwritten by textkw.
    kw = dict(horizontalalignment="center", verticalalignment="center")

    # Get the formatter in case a string is supplied
    if isinstance(valfmt, str):
        valfmt = matplotlib.ticker.StrMethodFormatter(valfmt)

    # Loop over the data and create a `Text` for each "pixel".
    # Change the text's color depending on the data.
    texts = []
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            kw.update(color=textcolors[int(im.norm(data[i, j]) > threshold)])
            text = im.axes.text(j, i, valfmt(data[i, j], None), **kw)
            texts.append(text)

In [ ]: def make_meshgrid(x, y, h=0.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = plt.contourf(xx, yy, Z, **params)
    return out

def draw_contour(x, y, clf, class_labels=["Negative", "Positive"]):
    """
    Draws a contour line for the predictor.
    Assumes that x has only two features. This function only plots the first two columns of x.
    """
    X0, X1 = x[:, 0], x[:, 1]
    xx0, xx1 = make_meshgrid(X0, X1)
    plt.figure(figsize=(10, 6))
    plot_contours(clf, xx0, xx1, cmap="PiYG", alpha=0.8)
    scatter = plt.scatter(X0, X1, c=y, cmap="PiYG", s=30, edgecolors="k")
    plt.legend(handles=scatter.legend_elements()[0], labels=class_labels)
    plt.xlim(xx0.min(), xx0.max())
    plt.ylim(xx1.min(), xx1.max())
Example Project

In this part, we will go over how to perform a binary classification task using a variety of models. We will provide examples of how to train and evaluate these models.

Dataset Description

Healthcare is an important industry that uses machine learning to aid doctors in diagnosing many different kinds of illnesses and diseases. For this example project, we will be using the Breast Cancer Wisconsin Dataset to determine whether a mass found in a body is benign or malignant. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Feature Information:

Column 1: ID number
Column 2: Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)

Due to the statistical nature of the test, we are not able to get exact measurements of the previous values. Instead, the dataset contains the mean and standard error of the real-valued features:

Columns 3-12 present the mean of the measured values
Columns 13-22 present the standard error of the measured values

Load and Analyze the dataset

In [ ]: #Load Data
data = pd.read_csv('datasets/breast_cancer_data.csv')
Always look at your dataset after loading it. Use information from .describe and .info to learn more about the dataset.

In [ ]: data.head(5)

Out[ ]:
    id        diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  ...
0   842302    M          17.99        10.38         122.80          1001.0     0.11840          ...
1   842517    M          20.57        17.77         132.90          1326.0     0.08474          ...
2   84300903  M          19.69        21.25         130.00          1203.0     0.10960          ...
3   84348301  M          11.42        20.38         77.58           386.1      0.14250          ...
4   84358402  M          20.29        14.34         135.10          1297.0     0.10030          ...

5 rows × 22 columns

In [ ]: data.describe()

Out[ ]:
       id            radius_mean  texture_mean  perimeter_mean  area_mean    smoothness_mean  ...
count  5.690000e+02  569.000000   569.000000    569.000000      569.000000   569.000000       ...
mean   3.037183e+07  14.127292    19.289649     91.969033       654.889104   0.096360         ...
std    1.250206e+08  3.524049     4.301036      24.298981       351.914129   0.014064         ...
min    8.670000e+03  6.981000     9.710000      43.790000       143.500000   0.052630         ...
25%    8.692180e+05  11.700000    16.170000     75.170000       420.300000   0.086370         ...
50%    9.060240e+05  13.370000    18.840000     86.240000       551.100000   0.095870         ...
75%    8.813129e+06  15.780000    21.800000     104.100000      782.700000   0.105300         ...
max    9.113205e+08  28.110000    39.280000     188.500000      2501.000000  0.163400         ...

8 rows × 21 columns

In [ ]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   id                      569 non-null    int64
 1   diagnosis               569 non-null    object
 2   radius_mean             569 non-null    float64
 3   texture_mean            569 non-null    float64
 4   perimeter_mean          569 non-null    float64
 5   area_mean               569 non-null    float64
 6   smoothness_mean         569 non-null    float64
 7   compactness_mean        569 non-null    float64
 8   concavity_mean          569 non-null    float64
 9   concave points_mean     569 non-null    float64
 10  symmetry_mean           569 non-null    float64
 11  fractal_dimension_mean  569 non-null    float64
 12  radius_se               569 non-null    float64
 13  texture_se              569 non-null    float64
 14  perimeter_se            569 non-null    float64
 15  area_se                 569 non-null    float64
 16  smoothness_se           569 non-null    float64
 17  compactness_se          569 non-null    float64
 18  concavity_se            569 non-null    float64
 19  concave points_se       569 non-null    float64
 20  symmetry_se             569 non-null    float64
 21  fractal_dimension_se    569 non-null    float64
dtypes: float64(20), int64(1), object(1)
memory usage: 97.9+ KB

While .info shows that every one of the 569 entries is non-null in every column, it is good to explicitly check for nulls.

In [ ]: data.isnull().sum()

Out[ ]:
id                        0
diagnosis                 0
radius_mean               0
texture_mean              0
perimeter_mean            0
area_mean                 0
smoothness_mean           0
compactness_mean          0
concavity_mean            0
concave points_mean       0
symmetry_mean             0
fractal_dimension_mean    0
radius_se                 0
texture_se                0
perimeter_se              0
area_se                   0
smoothness_se             0
compactness_se            0
concavity_se              0
concave points_se         0
symmetry_se               0
fractal_dimension_se      0
dtype: int64

Awesome! No need for imputation! While we are looking at the dataset, we shall remove the "id" column, since it carries no predictive information.
In [ ]: data = data.drop(["id"], axis=1)

Looking at the target labels

For this project, we wish to classify the diagnosis column.

In [ ]: data["diagnosis"]

Out[ ]:
0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

We need to transform this column into a numerical column so that we can use it in our models. To do this, we will employ the LabelEncoder to automatically transform all the target labels.

In [ ]: from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis'])
print(le.classes_)

['B' 'M']

In [ ]: data['diagnosis']

Out[ ]:
0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64

Let's look at a histogram of the full dataset. It's always good to get a global view of your datasets by looking at their histograms. You might see some interesting trends.

In [ ]: data.hist(figsize=(20, 15))
Out[ ]: array of Axes objects, a 5 × 5 grid of histograms (one per column, from 'diagnosis' through 'fractal_dimension_se')

(Histogram grid appears here.)
From the histograms, we can see some interesting trends. Possible observations:

Many of the _se columns are heavily skewed toward low values and have long tails
Many of the _mean columns look more Gaussian in shape
There is a large disparity between the ranges of certain features. For example, radius_mean runs from roughly 7 to 28, while smoothness_mean lies in the range [0.050, 0.150]. This indicates we will have to normalize or standardize the features if the models are sensitive to such measures.

Looking at the correlation matrix to get an idea about which features are important

In [ ]: correlations = data.corr()
columns = list(data)
#Creates the heatmap
heatmap(correlations.values, columns, columns, figsize=(20, 12), cmap="hsv")

In [ ]: #Let's specifically look at the correlations of our target feature
correlations["diagnosis"].sort_values(ascending=False)
Out[ ]:
diagnosis                 1.000000
concave points_mean       0.776614
perimeter_mean            0.742636
radius_mean               0.730029
area_mean                 0.708984
concavity_mean            0.696360
compactness_mean          0.596534
radius_se                 0.567134
perimeter_se              0.556141
area_se                   0.548236
texture_mean              0.415185
concave points_se         0.408042
smoothness_mean           0.358560
symmetry_mean             0.330499
compactness_se            0.292999
concavity_se              0.253730
fractal_dimension_se      0.077972
symmetry_se              -0.006522
texture_se               -0.008303
fractal_dimension_mean   -0.012838
smoothness_se            -0.067016
Name: diagnosis, dtype: float64

We can see that many of the features correlate strongly with the target label. Thus, we can expect to learn something from the data.

When doing classification, check if classes are heavily imbalanced. It is important that the dataset does not prefer one class over any others; otherwise, it may bias the model to not learn the minority classes well. Let's use a histogram and count the number of elements in each class.

In [ ]: data['diagnosis'].hist(bins=2, figsize=(5, 5))
data['diagnosis'].value_counts()

Out[ ]:
0    357
1    212
Name: diagnosis, dtype: int64
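As a quick supplement (this cell is an addition, not part of the original notebook), pandas can report the same class balance as proportions via value_counts(normalize=True):

data['diagnosis'].value_counts(normalize=True)
# Expected output (357/569 and 212/569):
# 0    0.627417
# 1    0.372583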
There is a bit of an imbalance, which is something to keep in mind if we find that our models do not perform well on the minority class. For our purposes, this imbalance is not big enough to be an issue, so we will not perform balancing techniques for this dataset. Since the dataset is small, though, we want to be careful when making training and testing splits to ensure that there is enough of each class in both splits. We will show how to perform this shortly.

Setting up the data

Before starting any model training, we have to split up the target labels from our features.

In [ ]: y = data["diagnosis"]
x = data.drop(["diagnosis"], axis=1)

Now, we also split the data into training and testing data. To ensure that there is not an imbalance of classes in the training and testing set, we will use the stratify parameter in train_test_split to perform stratified sampling on the data (recall from lecture how stratified sampling is performed). Note that we named the input feature data raw to indicate that there has been no pre-processing on it, such as standardization. Shortly, we will show the effect that pre-processing has on the performance of the model.

In [ ]: train_raw, test_raw, target, target_test = train_test_split(x, y, test_size=0.2, stratify=y)

Let us quickly test that the splits are somewhat balanced.

In [ ]: #Training classes
target.hist(bins=2, figsize=(5, 5))
target.value_counts()
Out[ ]:
0    285
1    170
Name: diagnosis, dtype: int64

In [ ]: #Testing classes
target_test.hist(bins=2, figsize=(5, 5))
target_test.value_counts()

Out[ ]:
0    72
1    42
Name: diagnosis, dtype: int64
We can see that the class balance is about the same as before the split. In fact, we can see that if a classifier just guessed class 0 every time, it would have an accuracy of $100 * \frac{72}{72+42} \approx 63.16\%$. We can consider this the baseline accuracy to compare against.

Models for Classification: KNN

For our first model, we will use KNN classification. This is a model we have seen many times throughout the course, and it would be interesting to see how well it performs.

Simple KNN classification with K = 3

Let us try KNN on the raw data with simply 3 nearest neighbors. We use the sklearn metrics library to calculate the measures of interest. In this case, we focus on accuracy.

In [ ]: # k-Nearest Neighbors algorithm
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_raw, target)
predicted = knn.predict(test_raw)

In [ ]: print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy: 0.877193

We can see that there is already a huge improvement in accuracy in comparison to the baseline of 63.16%. Let's see the effect that standardizing the input features has on the KNN performance.

Effect of pre-processing on KNN

In [ ]: #Since all features are real-valued, we only have one pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

#Transform raw data
train = pipeline.fit_transform(train_raw)
test = pipeline.transform(test_raw)  #Note that there are no fit calls

In [ ]: # k-Nearest Neighbors algorithm
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train, target)
predicted = knn.predict(test)

In [ ]: print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy: 0.921053

We can see that with pre-processing we were able to get a much better classification accuracy. Here we only used StandardScaler. Let's see if other pre-processing techniques could also have worked. As such, let's look at MinMaxScaler and Normalizer:

In [ ]: preprocessors = [
    StandardScaler(),
    MinMaxScaler(),
    Normalizer()
]
for pre in preprocessors:
    pipeline = Pipeline([
        ('preprocessor', pre)
    ])

    #Transform raw data
    train = pipeline.fit_transform(train_raw)
    test = pipeline.transform(test_raw)  #Note that there are no fit calls

    # k-Nearest Neighbors algorithm
    knn = KNeighborsClassifier(n_neighbors=7)
    knn.fit(train, target)
    predicted = knn.predict(test)
    print(pre)
    print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

StandardScaler()
Accuracy:    0.903509
MinMaxScaler()
Accuracy:    0.912281
Normalizer()
Accuracy:    0.885965

We can see that MinMaxScaler performed comparably to StandardScaler (slightly better here, with n_neighbors=7), while Normalizer helped the least. This makes sense: Normalizer rescales each sample (row) to unit norm rather than putting each feature on a common scale, so it does not address the range disparity we saw in the histograms.

Visualizing decision boundaries for KNN

It's always nice to see the decision boundaries a model decides upon. Let's see how the decision boundary changes as a function of k when only using the two features most correlated with the target labels: concave points_mean and perimeter_mean.

In [ ]: #Extract the two features and use the StandardScaler
train_2 = StandardScaler().fit_transform(train_raw[['concave points_mean', 'perimeter_mean']])

k_r = [1, 3, 5, 7]
for k in k_r:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_2, target)
    draw_contour(train_2, target, knn, class_labels=['Benign', 'Malignant'])
    plt.title(f"K = {k}")
(Decision-boundary plots for K = 1, 3, 5, and 7 appear here.)
We can see that as k gets larger, the decision boundary gets smoother.

Models for Classification: Logistic Regression

While KNN is a very powerful model, it does come with a few issues:
It requires storing the full training dataset
Prediction is done by comparing the new sample with all samples in the training set, which is time-consuming

These issues arise because KNN is a non-parametric model, which means that it does not summarize the data into a finite set of parameters. Let us now look at Logistic Regression, which is an example of a parametric model.

Simple Logistic Regression

First, let us see how logistic regression performs without any regularization.

In [ ]: log_reg = LogisticRegression(penalty="l2", max_iter=1000, solver="lbfgs", C=(10**30))
#C is chosen to be high to remove regularization
#We could have chosen penalty="none" since lbfgs supports it, but this option is deprecated in newer scikit-learn versions
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy: 0.964912

/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

We can see that Logistic Regression is actually performing much better than any of the KNN models we tried. We can also see the parameters that the model learned.

In [ ]: #Parameters for each feature
log_reg.coef_

Out[ ]:
array([[ 9.49657288e+00,  5.29884067e-01, -1.70890677e+00,  1.80971480e-02,
         4.25364740e+01,  2.49869264e+01,  7.58303604e+01,  6.71108523e+01,
         2.97123432e+01,  1.51697880e-01, -3.38079100e+01, -2.88171687e+00,
         8.24165745e-01,  4.32658867e-01, -2.05081068e+00, -4.98690890e+01,
        -6.82370605e+01, -7.81039978e+00, -1.70256855e+01, -1.11729757e+01]])

In [ ]: #Intercept term
log_reg.intercept_

Out[ ]: array([-17.051348])

In [ ]: print("Number of Features in data:", train_raw.shape[1])
print("Number of Parameters:", len(log_reg.coef_[0]))

Number of Features in data: 20
Number of Parameters: 20

Since we are using Logistic Regression, where we model the log odds with a linear function, it makes sense that we have a parameter/coefficient for each input feature.
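For reference (this note is an addition, not part of the original notebook), the model being fit is

$$\log \frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)} = b + w_1 x_1 + \dots + w_{20} x_{20},$$

so the 20 entries of coef_ are the weights $w_1, \dots, w_{20}$, intercept_ is the bias $b$, and the predicted probability is recovered with the sigmoid $p = 1 / (1 + e^{-(b + w^\top x)})$.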
Parameters for Logistic Regression

In Sci-kit Learn, the following are just some of the parameters we can pass into Logistic Regression:

penalty: {'l1', 'l2', 'elasticnet', 'none'}, default="l2". Specifies the type of regularization to use. Not all penalties work for each solver.
C: positive float, default=1. Inverse of the regularization strength. You can treat C as $\frac{1}{\lambda}$ as shown in lecture. Thus, as C gets smaller, the regularization strength increases.
solver: {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'. Algorithm to use in the optimization problem. Each algorithm solves logistic regression using a different iterative method based on the gradient. Read the sci-kit learn documentation for more information.
max_iter: int, default=100. Maximum number of iterations taken for the solvers to converge.

Each parameter has a different effect on the model. Let's look at how the choice of max_iter affects the model performance on the raw data and the standardized dataset.

In [ ]: #Since all features are real-valued, we only have one pipeline
preprocesser = Pipeline([
    ('scaler', StandardScaler())
])

#Transform raw data
train = preprocesser.fit_transform(train_raw)
test = preprocesser.transform(test_raw)  #Note that there is no fit call

In [ ]: log_reg = LogisticRegression(penalty="l2", max_iter=1000, solver="lbfgs", C=0.01)

#Train raw is the data before preprocessing
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)
print("%-12s %f" % ('Raw Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

#Train is the data after preprocessing (using StandardScaler)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Preprocessed Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

Raw Data Accuracy: 0.938596
Preprocessed Data Accuracy: 0.947368

We see that the accuracies are pretty close to each other. Let's see what happens when we decrease max_iter.

In [ ]: log_reg = LogisticRegression(penalty="l2", max_iter=70, solver="lbfgs", C=0.01)

#Train raw is the data before preprocessing
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)
print("%-12s %f" % ('Raw Data Accuracy:', metrics.accuracy_score(target_test, predicted)))
Raw Data Accuracy: 0.921053

/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Oops! The model did not converge. It seems that the scale of the features strongly affects the convergence speed of the iterative algorithm. As suggested, we can fix this issue by increasing max_iter, re-scaling the data, or using a different solver.

In [ ]: #Train is the data after preprocessing (using StandardScaler)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Preprocessed Data Accuracy:', metrics.accuracy_score(target_test, predicted)))

Preprocessed Data Accuracy: 0.947368

Cross Validation for Logistic Regression

Let us do a little experiment using cross validation to see how each term affects the logistic regression. We will perform this example on the standardized data.

In [ ]: #You may even do cross validation for classification
from sklearn.model_selection import GridSearchCV

#Note that this is a list of dicts
#Each dict describes a combination of parameters to check
parameters = [
    {"penalty": ["l2"],
     "C": [0.01, 1, 100],
     "solver": ["lbfgs", "liblinear"]},   #These solvers support penalty = "l2"
    {"penalty": ["none"],
     "C": [1],                            #Specified to prevent error message
     "solver": ["lbfgs", "newton-cg"]},   #These solvers support penalty = "none"
]

#Implementing cross validation
k = 3
kf = KFold(n_splits=k, random_state=None)

#instantiate model
log_reg = LogisticRegression(penalty="none", max_iter=1000, solver="lbfgs")  #will change parameters during CV

grid = GridSearchCV(log_reg, parameters, cv=kf, scoring="accuracy")
grid.fit(train, target)
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(
(The FutureWarning above repeats once per fit of the penalty="none" candidates.)
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/scipy/optimize/_linesearch.py:457: LineSearchWarning: The line search algorithm did not converge
  warn('The line search algorithm did not converge', LineSearchWarning)
/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/utils/optimize.py:210: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations.
  warnings.warn(

Out[ ]: GridSearchCV(estimator=LogisticRegression(...), ...)

In [ ]: #Put results into a DataFrame
res = pd.DataFrame(grid.cv_results_)
res
Out[ ]:
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_C  param_penalty  param_solver
0  0.001854       0.000687      0.000315         0.000036        0.01     l2             lbfgs
1  0.000816       0.000111      0.000239         0.000015        0.01     l2             liblinear
2  0.002188       0.000165      0.000252         0.000028        1        l2             lbfgs
3  0.000956       0.000034      0.000232         0.000019        1        l2             liblinear
4  0.007632       0.000695      0.000230         0.000003        100      l2             lbfgs
5  0.001523       0.000059      0.000219         0.000009        100      l2             liblinear
6  0.025101       0.019992      0.000318         0.000037        1        none           lbfgs
7  2.844657       4.010496      0.000411         0.000169        1        none           newton-cg

(The params column, which repeats these same settings as dicts, is omitted here for legibility.)

In [ ]: #Extract the columns that specify the score and the parameters for each row
res[["rank_test_score", "param_C", "param_penalty", "param_solver", "mean_test_score"]]

Out[ ]:
   rank_test_score  param_C  param_penalty  param_solver  mean_test_score
0  7                0.01     l2             lbfgs         0.916463
1  4                0.01     l2             liblinear     0.934051
2  2                1        l2             lbfgs         0.956024
3  1                1        l2             liblinear     0.958232
4  6                100      l2             lbfgs         0.934007
5  3                100      l2             liblinear     0.936200
6  5                1        none           lbfgs         0.934007
7  8                1        none           newton-cg     0.820001
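Although the notebook does not show it, GridSearchCV also exposes the winning configuration directly (a small addition for reference; based on the rank_test_score column above, it should report the C=1, l2, liblinear combination):

print(grid.best_params_)  # e.g. {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
print(grid.best_score_)   # mean CV accuracy of the best combination (0.958232 above)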
We can see that the choice of these parameters can strongly affect the performance of the classifier. Let's check the performance of the best parameters on the test set.

In [ ]: #grid.predict uses the best estimator found during cross validation
predicted = grid.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy: 0.938596

Note that this test accuracy is not as good as some of the other logistic regression examples we've shown.

Speedtest between KNN and Logistic Regression

Let's see how long KNN and Logistic Regression take to perform training and testing.

In [ ]: scaler = StandardScaler()
train = scaler.fit_transform(train_raw)
test = scaler.transform(test_raw)  #transform only; fitting the scaler on the test set would leak information

log_reg = LogisticRegression(penalty="none", max_iter=1000)
knn = KNeighborsClassifier(n_neighbors=3)

t0 = time.time()
knn.fit(train, target)
t1 = time.time()
print("KNN Training Time : ", t1 - t0)

t0 = time.time()
log_reg.fit(train, target)
t1 = time.time()
print("Logistic Regression Training Time : ", t1 - t0)

In [ ]: t0 = time.time()
knn.predict(test)
t1 = time.time()
print("KNN Testing Time : ", t1 - t0)

t0 = time.time()
log_reg.predict(test)
t1 = time.time()
print("Logistic Regression Testing Time : ", t1 - t0)

/Users/kunalpatil/anaconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
  warnings.warn(

KNN Training Time :  0.00044083595275878906
Logistic Regression Training Time :  0.023360013961791992
KNN Testing Time :  0.01816701889038086
Logistic Regression Testing Time :  0.00018095970153808594

This simple test shows that Logistic Regression is slower than KNN during training but much faster during testing. This is the lazy-vs-parametric tradeoff: KNN "trains" by merely storing the data and defers all distance computations to prediction time, while Logistic Regression spends its time fitting coefficients up front and then predicts with a single dot product per sample.

Visualizing decision boundaries for Logistic Regression

Now, let's look at the decision boundary produced by Logistic Regression. Same as for KNN, we use the two features most correlated with the target labels: concave points_mean and perimeter_mean. This way, we can visualize the 2D decision boundary.
In [ ]: #Extract the two features and use the StandardScaler
train_2 = StandardScaler().fit_transform(train_raw[['concave points_mean', 'perimeter_mean']])

Cs = [0.001, 0.1, 1000]
for C in Cs:
    log_reg = LogisticRegression(penalty="l2", max_iter=1000, solver="lbfgs", C=C)
    log_reg.fit(train_2, target)
    draw_contour(train_2, target, log_reg, class_labels=['Benign', 'Malignant'])
    plt.title(f"C = {C}")

(Decision-boundary plots for C = 0.001, 0.1, and 1000 appear here.)
We can see that as the regularization strength changes, the decision boundary moves as well. Additionally, we can clearly see that the decision boundary is a line, since this is a linear model.

Models for Classification: SVM
We now discuss another type of linear classification model known as Support Vector Machines (SVM). Where Logistic Regression was motivated by probability theory, SVM is motivated by geometric arguments. Specifically, SVM finds a separating hyperplane that maximizes the margin (i.e., the distance to the nearest samples of each class). The hyperplane is used to classify points by designating every sample on one side of the hyperplane as the positive class and every sample on the other side as the negative class. The hyperplane is determined by a few sample points, known as support vectors, that uniquely characterize it.

(Illustration of an SVM margin with its support vectors appears here.)

Note that it may not always be possible to find a hyperplane that completely separates the classes. Thus, we use what is known as Soft-Margin SVM, which aims to maximize the margin while minimizing the penalty incurred by samples that fall on the wrong side of the margin. All Sci-kit learn implementations of SVM that we use are soft-margin SVMs.

Simple SVM classification

In [ ]: svm = SVC()
svm.fit(train, target)
predicted = svm.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy: 0.921053

Parameters for SVM

In Sci-kit Learn, the following are just some of the parameters we can pass into SVC:

C: positive float, default=1. Inverse of the regularization strength. You can treat C as $\frac{1}{\lambda}$ as shown in lecture. Thus, as C gets smaller, the regularization strength increases. SVM only uses L2 regularization.
kernel: {'linear', 'poly', 'rbf', 'sigmoid'}, default='rbf'. Specifies the kernel type to be used in the algorithm. A kernel specifies a mapping into a higher-dimensional space to allow for non-linear decision boundaries.
degree: int, default=3. Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.

Visualizing decision boundaries for SVM

Now, let's look at the decision boundaries produced by SVM with different kernels. Same as for KNN and Logistic Regression, we use the two features most correlated with the target labels: concave points_mean and perimeter_mean. This way, we can visualize the 2D decision boundary.
In [ ]: #Extract the two features and use the StandardScaler
train_2 = StandardScaler().fit_transform(train_raw[['concave points_mean', 'perimeter_mean']])

kernel = ['linear', 'poly', 'rbf', 'sigmoid']
for ker in kernel:
    svm = SVC(kernel=ker)
    svm.fit(train_2, target)
    draw_contour(train_2, target, svm, class_labels=['Benign', 'Malignant'])
    plt.title(f"Kernel = {ker}")
(Decision-boundary plots for each kernel appear here.)

We can see that the decision boundary is not always linear, because we are using non-linear kernels.
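For reference (these standard formulas are an addition, not part of the original notebook), the soft-margin SVM with slack variables $\xi_i$ solves

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0,$$

which shows the same role of C as before: a larger C penalizes margin violations more heavily (less regularization). The default 'rbf' kernel replaces inner products $x^\top x'$ with $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, which is what lets the boundary curve away from a straight line.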
Important Measures for Classifications

Now that we have gone over a few models for binary classification, let's explore the different ways we can measure the performance of these models. Here are just some of the most important measures of interest. We use the convention of referring to the class labeled $1$ as the positive class.

Accuracy: The percentage of predictions that are correct. Use metrics.accuracy_score
Precision: $\frac{\text{Number of labels correctly classified as positive}}{\text{Number of labels classified as positive}}$. Percentage of predictions that are correctly positive among all the predictions that were classified as positive. Use metrics.precision_score
Recall: $\frac{\text{Number of labels correctly classified as positive}}{\text{Number of labels where the true class is positive}}$. Percentage of predictions that are correctly positive among all the labels where the true class is positive. Also known as the probability of detecting when a class is positive. Use metrics.recall_score
F1 Score: Harmonic mean of the precision and recall. Highest value is $1$ when both precision and recall are $1$, i.e. perfect. Lowest value is $0$ when either precision or recall is zero. Provides an aggregate score to analyze both precision and recall. Use metrics.f1_score

We can calculate these measures by using a confusion matrix as well.

In [ ]: #Example classifier
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(train_raw, target)
predicted = log_reg.predict(test_raw)

In [ ]: print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("%-12s %f" % ('Precision:', metrics.precision_score(target_test, predicted, pos_label=1)))
print("%-12s %f" % ('Recall:', metrics.recall_score(target_test, predicted, pos_label=1)))
print("%-12s %f" % ('F1 Score:', metrics.f1_score(target_test, predicted, pos_label=1)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

#Draws confusion matrix
draw_confusion_matrix(target_test, predicted, ['Benign', 'Malignant'])

Accuracy:    0.947368
Precision:   0.909091
Recall:      0.952381
F1 Score:    0.930233
Confusion Matrix:
 [[68  4]
 [ 2 40]]

(Confusion-matrix plot appears here.)
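As a quick arithmetic cross-check (added here, derived from the matrix above, where rows are true labels and columns are predictions): TN = 68, FP = 4, FN = 2, TP = 40, so precision $= \frac{40}{40+4} \approx 0.909$, recall $= \frac{40}{40+2} \approx 0.952$, F1 $= \frac{2 \cdot 0.909 \cdot 0.952}{0.909 + 0.952} \approx 0.930$, and accuracy $= \frac{68+40}{114} \approx 0.947$, matching the printed scores.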
TODO: Using classification methods to classify heart disease

Now that you have seen some examples of the classifiers that Sci-kit learn has to offer, let's try to apply them to a new dataset.

Background: The Dataset

For this exercise we will be using a subset of the UCI Heart Disease dataset, leveraging the fourteen most commonly used attributes. All identifying information about the patient has been scrubbed. You will be asked to classify whether a patient is suffering from heart disease based on a host of potential medical factors.

The dataset includes 14 columns. The information provided by each column is as follows:

age: Age in years
sex: (1 = male; 0 = female)
cp: Chest pain type (0 = asymptomatic; 1 = atypical angina; 2 = non-anginal pain; 3 = typical angina)
trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
chol: Cholesterol in mg/dl
fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: Resting electrocardiographic results (0 = showing probable or definite left ventricular hypertrophy by Estes' criteria; 1 = normal; 2 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV))
thalach: Maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: The slope of the peak exercise ST segment (0 = downsloping; 1 = flat; 2 = upsloping)
ca: Number of major vessels (0-3) colored by fluoroscopy
thal: 1 = normal; 2 = fixed defect; 7 = reversible defect
sick: Indicates the presence of heart disease (True = Disease; False = No disease)

[25 pts] Part 1. Load the Data and Analyze

Let's first load our dataset so we'll be able to work with it. (Correct the relative path if your notebook is in a different directory than the csv file.)

In [ ]: data = pd.read_csv('datasets/heartdisease.csv')

[5 pts] Looking at the data

Now that our data is loaded, let's take a closer look at the dataset we're working with. Use the head method, the describe method, and the info method to display some of the rows so we can visualize the types of data fields we'll be working with.

In [ ]: data.head()

Out[ ]:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  sick
0  63   1    3   145       233   1    0        150      0      2.3      0      0   1     False
1  37   1    2   130       250   0    1        187      0      3.5      0      0   2     False
2  41   0    1   130       204   0    0        172      0      1.4      2      0   2     False
3  56   1    1   120       236   0    1        178      0      0.8      2      0   2     False
4  57   0    0   120       354   0    1        163      1      0.6      2      0   2     False

In [ ]: data.describe()

Out[ ]:
       age         sex         cp          trestbps    chol        fbs         restecg     thalach     ...
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  ...
mean   54.366337   0.683168    0.966997    131.623762  246.264026  0.148515    0.528053    149.646865  ...
std    9.082101    0.466011    1.032052    17.538143   51.830751   0.356198    0.525860    22.905161   ...
min    29.000000   0.000000    0.000000    94.000000   126.000000  0.000000    0.000000    71.000000   ...
25%    47.500000   0.000000    0.000000    120.000000  211.000000  0.000000    0.000000    133.500000  ...
50%    55.000000   1.000000    1.000000    130.000000  240.000000  0.000000    1.000000    153.000000  ...
75%    61.000000   1.000000    2.000000    140.000000  274.500000  0.000000    1.000000    166.000000  ...
max    77.000000   1.000000    3.000000    200.000000  564.000000  1.000000    2.000000    202.000000  ...

In [ ]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  sick      303 non-null    bool
dtypes: bool(1), float64(1), int64(12)
memory usage: 31.2 KB

Sometimes data will be stored in different formats (e.g., string, date, boolean), but many learning methods work strictly on numeric inputs. Additionally, some numerical features can represent categorical features which need to be pre-processed. Are there any columns that need to be transformed, and why?

All the columns in our dataframe are numeric (either int or float); however, our target variable 'sick' is a boolean and may need to be modified. Additionally, several of the numerical features represent categorical features which may need to be pre-processed/encoded, including sex, fbs, restecg, cp, thal, and slope.

Determine if we're dealing with any null values. If so, report on which columns.

In [ ]: data.isnull().sum()

Out[ ]:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
sick        0
dtype: int64

There are no null values in any column.
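As an aside (this snippet is an addition, not part of the assignment's required steps): pandas' get_dummies gives a quick preview of what one-hot encoding those categorical columns would look like, before we build the proper scikit-learn pipeline later.

#Preview only; the solution below uses OneHotEncoder inside a ColumnTransformer instead
pd.get_dummies(data, columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']).head()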
[5 pts] Transform target label into numerical value

Before we begin our analysis, we need to fix the field(s) that will be problematic. Specifically, convert our boolean "sick" variable into a binary numeric target variable (values of either '0' or '1') using the label encoder from scikit-learn, place this new array into a new column of the DataFrame named "target", and then drop the original "sick" column from the dataframe. Afterward, use .head to print the first 5 rows.

In [ ]: data['target'] = le.fit_transform(data['sick'])
data = data.drop('sick', axis=1)
data.head()

Out[ ]:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0  63   1    3   145       233   1    0        150      0      2.3      0      0   1     0
1  37   1    2   130       250   0    1        187      0      3.5      0      0   2     0
2  41   0    1   130       204   0    0        172      0      1.4      2      0   2     0
3  56   1    1   120       236   0    1        178      0      0.8      2      0   2     0
4  57   0    0   120       354   0    1        163      1      0.6      2      0   2     0

[5 pts] Plotting histogram of data

Now that we have a feel for the data-types for each of the variables, plot histograms of each field.

In [ ]: data.hist(figsize=(20, 15))

Out[ ]: array of Axes objects, a 4 × 4 grid of histograms (one per column, from 'age' through 'target')

(Histogram grid appears here.)
[5 pts] Looking at class balance

We also want to make sure we are dealing with a balanced dataset. In this case, we want to confirm whether or not we have an equitable number of sick and healthy individuals, to ensure that our classifier will have a sufficiently balanced dataset to adequately classify the two. Plot a histogram specifically of the sick target, and conduct a count of the number of sick and healthy individuals and report on the results:

In [ ]: data['target'].hist(bins=2, figsize=(5, 5))
data['target'].value_counts()

Out[ ]:
0    165
1    138
Name: target, dtype: int64
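Mirroring the baseline check from the example project (this arithmetic is an addition): a classifier that always guessed the majority class (healthy, 0) would score about $100 \cdot \frac{165}{303} \approx 54.5\%$ accuracy on the full dataset, a useful floor to keep in mind when judging the classifiers below.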
There are about 30 more healthy (0) targets than there are sick (1), but overall the dataset is well-balanced.

Balanced datasets are important to ensure that classifiers train adequately and don't overfit; however, arbitrary balancing of a dataset might introduce its own issues. Discuss some of the problems that might arise by artificially balancing a dataset.

If we artificially balance a dataset, we may reduce the accuracy of our model. Specifically, if we remove training points from the majority class, we are training our model on a smaller sample, and it may not generalize well. On the other hand, if we artificially insert data, our guesses of the correct labels may be noisy or inaccurate and thus will reduce the accuracy of our model.

[5 pts] Looking at Data Correlation

Now that we have our dataframe prepared, let's start analyzing our data. For this next question let's look at the correlations of our variables to our target value. First, use the heatmap function to plot the correlations of the data.

In [ ]: correlations = data.corr()
columns = list(data)
heatmap(correlations.values, columns, columns, figsize=(20, 12), cmap="hsv")

(Correlation heatmap appears here.)
Next, show the correlations to the "target" feature only and sort them in descending order.

In [ ]: correlations["target"].sort_values(ascending=False)

Out[ ]:
target      1.000000
exang       0.436757
oldpeak     0.430696
ca          0.391724
thal        0.344029
sex         0.280937
age         0.225439
trestbps    0.144931
chol        0.085239
fbs         0.028046
restecg    -0.137230
slope      -0.345877
thalach    -0.421741
cp         -0.433798
Name: target, dtype: float64

From the heatmap values and the description of the features, why do you think some variables correlate more highly than others? (This question is just to get you thinking, and there is no perfect answer since we have no medical background.)
Some variables, such as exercise induced angina, may tell us more about whether a patient has heart disease than other factors, such as cholesterol. There is probably some science behind why some of these features are more related, and thus have a higher coefficient, than others.

[25 pts] Part 2. Prepare the Data and run a KNN Model

Before running our various learning methods, we need to do some additional prep to finalize our data. Specifically, you'll have to cut the classification target from the data that will be used to classify, and then you'll have to divide the dataset into training and testing cohorts.

Specifically, we're going to ask you to prepare 2 batches of data. The first batch will simply be the raw numeric data that hasn't gone through any additional pre-processing. The second batch will be data that you will pipeline using pre-processing methods. We will then feed both of these datasets into a classifier to showcase just how important this step can be!

[2 pts] Separate target labels from data

Save the label column as a separate array and then make a new dataframe without the target.

In [ ]: y = data['target']
x = data.drop('target', axis=1)

[5 pts] Balanced Train Test Split

Now, create your 'raw' unprocessed training data by dividing your dataframe into training and testing cohorts, with your training cohort consisting of 60% of your total dataframe. To ensure that the train and test sets have balanced classes, use the stratify argument of train_test_split. Output the resulting shapes of your training and testing samples to confirm that your split was successful. Additionally, output the class counts for the training and testing cohorts to confirm that there is no artificial class imbalance.

Note: Use random_state = 0 to ensure that the same train/test split happens every time, for ease of grading.

In [ ]: train_raw, test_raw, target, target_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=0)
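The output of the requested shape and balance checks is not visible in this copy of the notebook; a minimal sketch of those checks (an addition, using only calls introduced above) would be:

print(train_raw.shape, test_raw.shape)  #expect (242, 13) and (61, 13) with test_size=0.2
print(target.value_counts())            #training class counts
print(target_test.value_counts())       #test class counts; roughly 33 healthy / 28 sick, consistent with the confusion matrix below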
[5 pts] KNN on raw data

Now, let's try a classification model on this data. We'll first use KNN since it is the one we are most familiar with. One thing we noted in class was that because KNN relies on Euclidean distance, it is highly sensitive to the relative magnitude of different features. Let's see that in action!

Implement a K-Nearest Neighbor algorithm on our data and report the results. For this initial implementation, simply use the default settings. Refer to the KNN Documentation for details on implementation. Report on the test accuracy of the resulting model and print out the confusion matrix. Recall that accuracy can be calculated easily using metrics.accuracy_score and that we have a helper function to draw the confusion matrix.

In [ ]: # k-Nearest Neighbors on the raw (unscaled) data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_raw, target)
predicted = knn.predict(test_raw)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.606557
Confusion Matrix:
[[24  9]
 [15 13]]

[5 pts] KNN on preprocessed data

Now let's implement a pipeline to preprocess the data. For the pipeline, use StandardScaler on the numerical features and one-hot encoding on the categorical features. For reference on how to make a pipeline, please look at project 1. For reference, the categorical features are ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'].

Now use the pipeline to transform the data and then apply the same KNN classifier with this new training/testing data. Report the test accuracy. Discuss the implications of the different results you are obtaining.

Note: Remember to use fit_transform on the training data and transform on the testing data.

In [ ]: # Scale the numeric columns and one-hot encode the categorical ones
features_num = ['trestbps', 'chol', 'thalach', 'oldpeak']
features_cat = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
pipeline = ColumnTransformer([
    ("num", StandardScaler(), features_num),
    ("cat", OneHotEncoder(), features_cat),
])

In [ ]: # Fit the pipeline on the training data only, then apply it to both splits
train = pipeline.fit_transform(train_raw)
test = pipeline.transform(test_raw)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train, target)
predicted = knn.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))

Accuracy: 0.836066

The accuracy significantly improved, jumping from roughly 60% on the raw data to roughly 84% on the preprocessed data.
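One caveat on the pipeline above: with a training set this small, the test data could contain a category level the encoder never saw during fit, and OneHotEncoder would then raise an error at transform time. If that were a concern, a tolerant variant could be used (a sketch; pipeline_safe is a hypothetical name):

In [ ]: # Variant that silently ignores category levels unseen during fit
pipeline_safe = ColumnTransformer([
    ("num", StandardScaler(), features_num),
    ("cat", OneHotEncoder(handle_unknown="ignore"), features_cat),
])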
[8 pts] KNN Parameter optimization for n_neighbors

The KNN algorithm includes an n_neighbors attribute that specifies how many neighbors to use when classifying a point. (The default value is 5; note that the models above were actually run with n_neighbors=3.) Let's now try n values of: 1, 2, 3, 5, 7, 9, 10, 20, and 50. Run your model for each value and report the test accuracy for each. (HINT: leverage Python's ability to loop to run through the array and generate results without needing to manually code each iteration.)

In [ ]: # Try a range of k values and report the test accuracy for each
k_r = [1, 2, 3, 5, 7, 9, 10, 20, 50]
for k in k_r:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train, target)
    predicted = knn.predict(test)
    print('Accuracy for k=', k, ': ', metrics.accuracy_score(target_test, predicted), sep='')

Accuracy for k=1: 0.8032786885245902
Accuracy for k=2: 0.7377049180327869
Accuracy for k=3: 0.8360655737704918
Accuracy for k=5: 0.8032786885245902
Accuracy for k=7: 0.7704918032786885
Accuracy for k=9: 0.7868852459016393
Accuracy for k=10: 0.7540983606557377
Accuracy for k=20: 0.7704918032786885
Accuracy for k=50: 0.7540983606557377

Comment on which value of n the KNN model performed best for. Did the model perform strictly better or strictly worse as the value of n increased?

The value of k=3 performed the best, with an accuracy of 83.6%. The accuracy neither strictly increased nor strictly decreased with increasing k; it went up and down a few times.
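To make the non-monotonic behavior easier to see, the loop could collect the scores and plot accuracy against k; a minimal sketch reusing the variables defined above:

In [ ]: # Plot test accuracy as a function of n_neighbors (illustrative sketch)
accs = []
for k in k_r:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train, target)
    accs.append(metrics.accuracy_score(target_test, knn.predict(test)))
plt.plot(k_r, accs, marker='o')
plt.xlabel('n_neighbors (k)')
plt.ylabel('Test accuracy')
plt.show()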
So we have a model that seems to work well. But let's see if we can do better! To do so we'll employ Logistic Regression and SVM to improve upon the model and compare the results. For the rest of the project, you will only be using the transformed data and not the raw data. DO NOT USE THE RAW DATA ANYMORE.

[20 pts] Part 3. Additional Learning Methods: Logistic Regression

Let's now try Logistic Regression. Recall that Logistic Regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.

[5 pts] Run the default Logistic Regression

Implement a Logistic Regression classifier. Review the Logistic Regression Documentation for how to implement the model. Use the default settings. Report on the test accuracy and print out the confusion matrix.

In [ ]: # Logistic Regression with default settings on the preprocessed data
log_reg = LogisticRegression()
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.852459
Confusion Matrix:
[[30  3]
 [ 6 22]]

[5 pts] Compare Logistic Regression and KNN

In your own words, describe the key differences between Logistic Regression and KNN. When would you use one over the other?

Logistic Regression leverages the sigmoid function and probability to make predictions on test data. KNN, on the other hand, uses a notion of distance to the closest points in the training data to make predictions. In KNN, there is no real training phase or loss function, since there are no parameters to fit. Fitting a Logistic Regression therefore takes much longer than "fitting" KNN, but making test predictions with it is much faster.

[5 pts] Tweaking the Logistic Regression

What are some parameters we can change that will affect the performance of Logistic Regression?

We can change parameters such as C, the inverse of the regularization strength; max_iter, the maximum number of iterations the training will take to converge; penalty, the type of norm we use for regularization; and solver, the algorithm used to solve the optimization problem for the parameters.

Implement Logistic Regression with solver='liblinear', max_iter=1000, penalty='l2', and C=1. Report on the test accuracy and print out the confusion matrix.

In [ ]: # L2-penalized Logistic Regression with mild regularization (C=1)
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l2', C=1)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.852459
Confusion Matrix:
[[30  3]
 [ 6 22]]

Now, implement Logistic Regression with solver='liblinear', max_iter=1000, penalty='l2', and C=0.0001. Report on the test accuracy and print out the confusion matrix.

In [ ]: # Same model with very strong regularization (C=0.0001)
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l2', C=0.0001)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.754098
Confusion Matrix:
[[31  2]
 [13 15]]
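One way to see the regularization at work before answering the next question is to compare the fitted coefficient magnitudes under the two settings; a minimal sketch that refits both models side by side:

In [ ]: # Strong regularization (small C) should shrink the coefficients toward zero
for C in [1, 0.0001]:
    m = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l2', C=C)
    m.fit(train, target)
    print("C=%g  max|coef| = %.4f" % (C, np.abs(m.coef_).max()))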
Did the accuracy drop or improve? Why?

Accuracy dropped. This is because a low value of C corresponds to very strong regularization, so the parameters were forced to be very small and the model likely underfitted the data.

[5 pts] Trying out different penalties

Now, implement Logistic Regression with solver='liblinear', max_iter=1000, penalty='l1', and C=1. Report on the test accuracy and print out the confusion matrix.

In [ ]: # L1-penalized Logistic Regression with C=1
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, penalty='l1', C=1)
log_reg.fit(train, target)
predicted = log_reg.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.868852
Confusion Matrix:
[[31  2]
 [ 6 22]]

Describe what the purpose of a penalty term is and how the change from L2 to L1 affected the model.

The purpose of a penalty term is to regularize the model parameters and ensure that the model does not overfit the data. The difference between L2 and L1 is that L2 penalizes the squared magnitudes of the parameters, so large values dominate the penalty, whereas L1 penalizes absolute values and tends to drive some parameters exactly to zero, producing a sparser model. Using L1 regularization in this case actually increased the accuracy of the model slightly.
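Since L1 tends to drive coefficients exactly to zero, the sparsity claim above can be verified directly; a minimal sketch, assuming log_reg is still the L1 model fit in the previous cell:

In [ ]: # Count how many coefficients the L1 penalty has zeroed out
n_zero = int((log_reg.coef_ == 0).sum())
print("Zeroed coefficients: %d of %d" % (n_zero, log_reg.coef_.size))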
[20 pts] Part 4. Additional Learning Methods: SVM (Support Vector Machine)

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, each corresponding to one of the two classes. Recall that scikit-learn uses soft-margin SVM to account for datasets that are not separable.

[5 pts] Run default SVM classifier

Implement a Support Vector Machine classifier on your pipelined data. Review the SVM Documentation for how to implement a model. For this implementation you can simply use the default settings. Report on the test accuracy and print out the confusion matrix.

In [ ]: # SVM with default settings (RBF kernel)
svm = SVC()
svm.fit(train, target)
predicted = svm.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.803279
Confusion Matrix:
[[28  5]
 [ 7 21]]

Print out the number of support vectors that SVC has determined. Look at the documentation for how to get this.

In [ ]: # Number of support vectors per class
print(svm.n_support_)

[69 68]

You may find that there are quite a few support vectors. This is due in part to the small number of samples in the training set and the choice of kernel.

[5 pts] Use a Linear SVM

Rerun your SVM, but now set the model parameter kernel to 'linear'. Report on the test accuracy and print out the confusion matrix. Also, print out the number of support vectors.

In [ ]: # SVM with a linear kernel
svm = SVC(kernel='linear')
svm.fit(train, target)
predicted = svm.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))
print(svm.n_support_)

Accuracy: 0.852459
Confusion Matrix:
[[30  3]
 [ 6 22]]
[44 46]

You will notice that the number of support vectors has decreased significantly.

[5 pts] Compare default SVM and Linear SVM

Explain what the new results you've achieved mean. Read the documentation to understand what you've changed about your model and explain why changing that input parameter might impact the results in the manner you've observed.

By default, the kernel is 'rbf', the radial basis function, which makes the decision boundary non-linear. However, our data is actually better suited to a linear separator. Thus, when we use a linear kernel, the decision boundary becomes linear and fits the data better, resulting in increased accuracy and fewer support vectors.

[5 pts] Compare SVM and Logistic Regression

Both Logistic Regression and linear SVM try to classify data points using a linear decision boundary but achieve it in different ways. In your own words, explain the difference between the ways that Logistic Regression and linear SVM find the boundary.

The loss functions for Logistic Regression and SVM are based on different principles, so the algorithms behave differently. In Logistic Regression, we look at the probability that a particular data point is positively classified, based on the sigmoid function, and the model is penalized for assigning too high or too low a probability. On the other hand, the loss function for SVM is based on geometry and aims to maximize the margin between the separator and the dataset; points that are too close to, or on the wrong side of, the separator are penalized.
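The contrast between the two loss functions can also be drawn directly: logistic loss decays smoothly and never quite reaches zero, while hinge loss is exactly zero for points beyond the margin. A small illustrative sketch, where z stands for the signed margin y·f(x):

In [ ]: # Logistic loss vs. hinge loss as a function of the signed margin z = y*f(x)
z = np.linspace(-3, 3, 200)
plt.plot(z, np.log(1 + np.exp(-z)), label='logistic loss')
plt.plot(z, np.maximum(0, 1 - z), label='hinge loss')
plt.xlabel('signed margin z')
plt.ylabel('loss')
plt.legend()
plt.show()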
[10 pts] Part 5: Cross Validation and Model Selection

You've sampled a number of different classification techniques and have seen their performance on the dataset. Before we draw any conclusions about which model is best, we want to ensure that our results are not an artifact of the particular random train-test split we made. To guard against this, we will conduct a K-Fold Cross-Validation with GridSearch to determine which model performs best and then assess its performance on the test set.

[10 pts] Model Selection

Run a GridSearchCV with 3-Fold Cross Validation. You will be running each classification model with different parameters.

KNN:
n_neighbors = [1,3,5,7]
metric = ["euclidean","manhattan"] # different distance functions

Logistic Regression:
penalty = ["l1","l2"]
solver = ["liblinear"]
C = [0.0001,0.1,10]

SVM:
kernel = ["linear","rbf"]
C = [0.0001,0.1,10]

Make sure to train and test your model on the transformed data and not on the raw data. After using GridSearchCV, put the results into a pandas DataFrame and print out the whole table.

In [ ]: # Parameter grids for the three model families
parametersKNN = [{
    "n_neighbors": [1, 3, 5, 7],
    "metric": ["euclidean", "manhattan"],
}]

parametersLR = [{
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
    "C": [0.0001, 0.1, 10],
}]

parametersSVM = [{
    "kernel": ['linear', 'rbf'],
    "C": [0.0001, 0.1, 10],
}]
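As an aside, the three searches could also be driven from a single dictionary of (estimator, grid) pairs; a minimal sketch under the same transformed data. The cells below instead run them one at a time, which makes it easier to display each results table:

In [ ]: # Hypothetical consolidation: run all three grid searches in one loop
searches = {
    "KNN": (KNeighborsClassifier(), parametersKNN),
    "LogisticRegression": (LogisticRegression(), parametersLR),
    "SVM": (SVC(), parametersSVM),
}
for name, (est, params) in searches.items():
    grid = GridSearchCV(est, params, cv=3, scoring='accuracy')
    grid.fit(train, target)
    print(name, grid.best_params_, grid.best_score_)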
In [ ]: # 3-fold cross-validated grid search for KNN
k = 3
kf = KFold(n_splits=k, random_state=None)
KNN = KNeighborsClassifier()
gridKNN = GridSearchCV(KNN, parametersKNN, cv=kf, scoring='accuracy')
gridKNN.fit(train, target)
resKNN = pd.DataFrame(gridKNN.cv_results_).sort_values(by=["mean_test_score"], ascending=False)
resKNN[["param_n_neighbors", "param_metric", "mean_test_score"]]

Out[ ]:
  param_n_neighbors param_metric  mean_test_score
7                 7    manhattan         0.830658
3                 7    euclidean         0.830556
2                 5    euclidean         0.822479
6                 5    manhattan         0.814198
5                 3    manhattan         0.805916
1                 3    euclidean         0.785185
4                 1    manhattan         0.777058
0                 1    euclidean         0.772942

In [ ]: # Grid search for Logistic Regression
LR = LogisticRegression()
gridLR = GridSearchCV(LR, parametersLR, cv=kf, scoring='accuracy')
gridLR.fit(train, target)
resLR = pd.DataFrame(gridLR.cv_results_).sort_values(by=["mean_test_score"], ascending=False)
resLR[["param_penalty", "param_solver", "param_C", "mean_test_score"]]

Out[ ]:
  param_penalty param_solver param_C  mean_test_score
3            l2    liblinear     0.1         0.834877
5            l2    liblinear      10         0.826749
4            l1    liblinear      10         0.826698
1            l2    liblinear  0.0001         0.805813
2            l1    liblinear     0.1         0.789249
0            l1    liblinear  0.0001         0.545473

In [ ]: # Grid search for SVM
SVM = SVC()
gridSVM = GridSearchCV(SVM, parametersSVM, cv=kf, scoring='accuracy')
gridSVM.fit(train, target)
In [ ]: # Display the SVM grid search results
resSVM = pd.DataFrame(gridSVM.cv_results_).sort_values(by=["mean_test_score"], ascending=False)
resSVM[["param_C", "param_kernel", "mean_test_score"]]

Out[ ]:
  param_C param_kernel  mean_test_score
2     0.1       linear         0.830761
3     0.1          rbf         0.814095
4      10       linear         0.801749
5      10          rbf         0.764506
0  0.0001       linear         0.545473
1  0.0001          rbf         0.545473

What was the best model and what was its score?

The best model was Logistic Regression with the following parameters: penalty='l2', C=0.1, and solver='liblinear'. This model had a mean_test_score of 0.834877.

Using the best model you have, report the test accuracy and print out the confusion matrix.

In [ ]: # Refit the best model from the grid search and evaluate it on the test set
best_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear")
best_model.fit(train, target)
predicted = best_model.predict(test)
print("%-12s %f" % ('Accuracy:', metrics.accuracy_score(target_test, predicted)))
print("Confusion Matrix:\n", metrics.confusion_matrix(target_test, predicted))

Accuracy: 0.836066
Confusion Matrix:
[[30  3]
 [ 7 21]]
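For a more readable final figure, the draw_confusion_matrix helper defined at the top of the notebook could render this matrix; a sketch, where the class label strings are assumptions about the target encoding (0 = no disease, 1 = disease):

In [ ]: # Render the final confusion matrix with the notebook's helper
# (label names below are assumed, not given in the dataset description)
draw_confusion_matrix(target_test, predicted, ["no disease", "disease"])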