Train a linear SVM and a polynomial SVM or an RBF-kernel SVM on the Iris dataset (train at least 2 models). Use a balanced 80%/20% train-test split (include the train and test sets
you created with your submission), specify any parameter settings used, and explain your
choice and the rationale for it. Compare the performance of the models you trained and
discuss the reasons.
(I tried to solve this question, but I got the same accuracy (screenshots are below) for the linear, polynomial, and RBF kernels. Is that correct?)
## Solution

### Importing the required libraries

We will be using the following libraries for this problem:

- `numpy`: Used for numerical computations in Python.
- `pandas`: Used for data manipulation and analysis.
- `matplotlib`: Used for data visualization.
- `seaborn`: Used for statistical data visualization.
- `sklearn`: Used for machine learning algorithms.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```

### Importing the dataset

We will be using the `iris` dataset for this problem. The dataset contains information about different species of iris flowers, with 4 features:

- `sepal_length`: Length of the sepal of the flower.
- `sepal_width`: Width of the sepal of the flower.
- `petal_length`: Length of the petal of the flower.
- `petal_width`: Width of the petal of the flower.

The dataset also contains the `target` variable, which tells us the species of the flower. There are 3 species of flowers in the dataset:

- `setosa`
- `versicolor`
- `virginica`

```python
# Importing the dataset
df = pd.read_csv('iris.csv')

# Viewing the first 5 rows of the dataset
df.head(5)
```

### Data visualization

We will now visualize the data to get a better understanding of the dataset. We will plot a pairplot to see the relationships between the features, colored by the target variable.

```python
# Plotting a pairplot
sns.pairplot(df, hue='target')
plt.show()
```

![Pairplot of the iris features, colored by species](output_7_0.png)

From the pairplot, we can see that the `setosa` species can be easily separated from the other two species using a linear boundary. The `versicolor` and `virginica` species, however, overlap and cannot be fully separated using a linear boundary.

We will now plot a correlation heatmap to see the correlation between all the features.

```python
# The species labels are strings, so encode them numerically on a copy;
# otherwise pandas cannot include the target column in the correlation matrix
df_num = df.copy()
df_num['target'] = df_num['target'].astype('category').cat.codes

# Plotting a correlation heatmap
sns.heatmap(df_num.corr(), annot=True)
plt.show()
```

From the correlation heatmap, we can see that the `sepal_length` and `sepal_width` features are only weakly correlated with the `target` variable, while the `petal_length` and `petal_width` features are strongly correlated with it.
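As an aside, if `iris.csv` is not available locally, an equivalent dataframe can be built from scikit-learn's bundled copy of the dataset. This is a minimal sketch; the column names are chosen to match the CSV layout assumed above:

```python
from sklearn.datasets import load_iris
import pandas as pd

# Build a dataframe with the same layout as iris.csv: four feature
# columns plus a string 'target' column holding the species names
iris = load_iris()
df = pd.DataFrame(iris.data,
                  columns=['sepal_length', 'sepal_width',
                           'petal_length', 'petal_width'])
df['target'] = pd.Categorical.from_codes(iris.target, iris.target_names)
```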
### Data preprocessing

We will now split the dataset into the feature set and the target set.

```python
# Splitting the dataset into the feature set and the target set
X = df.drop('target', axis=1)
y = df['target']
```

We will now split the dataset into the training set and the test set, using 80% of the dataset for training and 20% for testing. Passing `stratify=y` would keep the class proportions equal across the two splits, which is what the question's "balanced split" asks for; the outputs reported below were produced without stratification, which is why the test-set class supports come out as 11, 13 and 6 rather than 10 each.

```python
# Splitting the dataset into the training set and the test set
# (add stratify=y for a split that preserves the class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

### Training the models

We will now train the models on the training set: a linear SVM, a polynomial SVM, and an RBF-kernel SVM. All three use the default regularization strength `C=1.0`, a reasonable starting point on a small, clean dataset; the polynomial kernel uses the default `degree=3`, i.e. a cubic decision boundary.

```python
# Training the linear SVM model on the training set
svc_linear = SVC(kernel='linear', random_state=0)
svc_linear.fit(X_train, y_train)

# Training the polynomial SVM model on the training set
svc_poly = SVC(kernel='poly', degree=3, random_state=0)
svc_poly.fit(X_train, y_train)

# Training the RBF kernel SVM model on the training set
svc_rbf = SVC(kernel='rbf', random_state=0)
svc_rbf.fit(X_train, y_train)
```

```
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True,
    tol=0.001, verbose=False)
```

### Making predictions

We will now make predictions on the test set.

```python
# Making predictions on the test set for the linear SVM model
y_pred_linear = svc_linear.predict(X_test)

# Making predictions on the test set for the polynomial SVM model
y_pred_poly = svc_poly.predict(X_test)

# Making predictions on the test set for the RBF kernel SVM model
y_pred_rbf = svc_rbf.predict(X_test)
```
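Before evaluating, the question also asks for a balanced split and for the train and test sets to be included with the submission. A minimal sketch of both, assuming the variables defined above (the filenames are illustrative):

```python
# Check how balanced the split actually is: value_counts shows the
# per-class counts in the train and test portions
print(y_train.value_counts())
print(y_test.value_counts())

# Save the splits so they can be included with the submission
# (filenames are illustrative; assign aligns y with X on the index)
X_train.assign(target=y_train).to_csv('iris_train.csv', index=False)
X_test.assign(target=y_test).to_csv('iris_test.csv', index=False)
```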
### Evaluating the models

We will now evaluate the linear SVM model, the polynomial SVM model and the RBF-kernel SVM model.

```python
# Evaluating the linear SVM model
print('Linear SVM model')
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred_linear)))
print('Confusion matrix:\n {}'.format(confusion_matrix(y_test, y_pred_linear)))
print('Classification report:\n {}'.format(classification_report(y_test, y_pred_linear)))
print('\n')

# Evaluating the polynomial SVM model
print('Polynomial SVM model')
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred_poly)))
print('Confusion matrix:\n {}'.format(confusion_matrix(y_test, y_pred_poly)))
print('Classification report:\n {}'.format(classification_report(y_test, y_pred_poly)))
print('\n')

# Evaluating the RBF kernel SVM model
print('RBF kernel SVM model')
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred_rbf)))
print('Confusion matrix:\n {}'.format(confusion_matrix(y_test, y_pred_rbf)))
print('Classification report:\n {}'.format(classification_report(y_test, y_pred_rbf)))
```

```
Linear SVM model
Accuracy: 0.9666666666666667
Confusion matrix:
 [[11  0  0]
  [ 0 12  1]
  [ 0  0  6]]
Classification report:
               precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        1.00      0.92      0.96        13
  virginica        0.86      1.00      0.92         6

avg / total        0.97      0.97      0.97        30


Polynomial SVM model
Accuracy: 0.9333333333333333
Confusion matrix:
 [[11  0  0]
  [ 0 12  1]
  [ 0  1  5]]
Classification report:
               precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        0.92      0.92      0.92        13
  virginica        0.83      0.83      0.83         6

avg / total        0.93      0.93      0.93        30


RBF kernel SVM model
Accuracy: 0.9333333333333333
Confusion matrix:
 [[11  0  0]
  [ 0 12  1]
  [ 0  1  5]]
Classification report:
               precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        0.92      0.92      0.92        13
  virginica        0.83      0.83      0.83         6

avg / total        0.93      0.93      0.93        30
```

From the evaluation metrics, all three models perform well on the test set. The linear SVM misclassifies a single test sample (accuracy 96.67%), while the polynomial and RBF-kernel models each misclassify two (accuracy 93.33%), so the linear SVM performs slightly better here. This matches what the pairplot showed: `setosa` is linearly separable from the other two species, and the small overlap between `versicolor` and `virginica` accounts for the few errors. Because the test set contains only 30 samples, each misclassification moves accuracy by 3.33 percentage points, so very similar or even identical accuracies across the three kernels (as in the question above) are entirely plausible: the dataset is small and close to linearly separable, so the extra flexibility of the polynomial and RBF kernels brings no benefit.
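For a more stable comparison than a single 80/20 split, the three kernels can also be compared with k-fold cross-validation on the full dataset. A minimal sketch using scikit-learn's `cross_val_score`, which stratifies the folds by default for classifiers:

```python
from sklearn.model_selection import cross_val_score

# Compare the three kernels with 5-fold cross-validation instead of a
# single train-test split; each model sees 5 different 80/20 partitions
for kernel in ['linear', 'poly', 'rbf']:
    model = SVC(kernel=kernel, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print('{:>6} kernel: mean accuracy {:.3f} (+/- {:.3f})'
          .format(kernel, scores.mean(), scores.std()))
```

If the cross-validated means still come out nearly identical, that confirms the similarity is a property of the dataset rather than an error in the code.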