Project 3 - Classify your own data

For this project we're going to explore some of the new topics covered since the last project, including Decision Trees and unsupervised learning. The final part of the project will ask you to perform your own data science project to classify a new dataset.

Submission Details

The project is due June 14th at 11:59 am (Wednesday afternoon). To submit the project, please save the notebook as a pdf file and submit the assignment via Gradescope. In addition, make sure that all figures are legible and sufficiently large. For best pdf results, we recommend downloading LaTeX and printing the notebook using LaTeX.

Loading Essentials and Helper Functions

Example Project using new techniques

Since project 2, we have learned about a few new models for supervised learning (Decision Trees and Neural Networks) and unsupervised learning (Clustering and PCA). In this example portion, we will go over how to implement these techniques using the scikit-learn library.

Load and Process Example Project Data

For our example dataset, we will use the Breast Cancer Wisconsin Dataset to determine whether a mass found in a body is benign or malignant. Since this dataset was used as an example in project 2, you should be fairly familiar with it.

Feature Information:
Column 1: ID number
Column 2: Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:
1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)

Due to the statistical nature of the test, we are not able to get exact measurements of the previous values. Instead, the dataset contains the mean and standard error of the real-valued features. Columns 3-12 contain the mean of the measured values; columns 13-22 contain the standard error of the measured values.

Counts of each class in target:
0    285
1    170
Name: diagnosis, dtype: int64
Counts of each class in target_test:
0    72
1    42
Name: diagnosis, dtype: int64
Baseline Accuracy of using Majority Class: 0.631578947368421

Supervised Learning: Decision Tree

Classification with Decision Tree
Accuracy: 0.894737
Confusion Matrix:
[[63  9]
 [ 3 39]]

Parameters for Decision Tree Classifier

In scikit-learn, the following are just some of the parameters we can pass into the Decision Tree Classifier:

criterion: {"gini", "entropy", "log_loss"}, default="gini". The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity, and "log_loss" and "entropy" both for the Shannon information gain.
splitter: {"best", "random"}, default="best". The strategy used to choose the split at each node. "best" aims to find the best feature split amongst all features; "random" only looks for the best split amongst a random subset of features.
max_depth: int, default=None. The maximum depth of the tree.
min_samples_split: int or float, default=2. The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
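For reference, a minimal sketch of how this baseline decision-tree classification might be run, assuming the preprocessed splits are named train, target, test, and target_test (as they are later in this notebook); the original code cells are not shown in this export:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Fit a decision tree on the preprocessed training split.
# random_state is fixed so the result is reproducible.
dt = DecisionTreeClassifier(random_state=0)
dt.fit(train, target)

# Evaluate on the held-out test split.
pred = dt.predict(test)
print("Accuracy:", accuracy_score(target_test, pred))
print("Confusion Matrix:\n", confusion_matrix(target_test, pred))
```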
Visualizing Decision Trees

Scikit-learn allows us to visualize the decision tree to see which features it chose to split on and what the result is. Note that if the condition in a node is true, you traverse the left edge of the node; otherwise, you traverse the right edge.

[Text(0.5, 0.8333333333333334, 'concave points_mean <= 0.011\ngini = 0.468\nsamples = 100.0%\nvalue = [0.626, 0.374]'), Text(0.25, 0.5, 'area_mean <= 0.124\ngini = 0.101\nsamples = 61.5%\nvalue = [0.946, 0.054]'), Text(0.125, 0.16666666666666666, '\n (...) \n'), Text(0.375, 0.16666666666666666, '\n (...) \n'), Text(0.75, 0.5, 'concavity_mean <= 0.001\ngini = 0.202\nsamples = 38.5%\nvalue = [0.114, 0.886]'), Text(0.625, 0.16666666666666666, '\n (...) \n'), Text(0.875, 0.16666666666666666, '\n (...) \n')]

We can even look at the tree in a textual format.

|--- concave points_mean <= 0.01
|   |--- area_mean <= 0.12
|   |   |--- area_se <= 0.04
|   |   |   |--- compactness_mean <= 0.59
|   |   |   |   |--- fractal_dimension_se <= -0.83
|   |   |   |   |   |--- fractal_dimension_se <= -0.84
|   |   |   |   |   |   |--- smoothness_se <= -1.22
|   |   |   |   |   |   |   |--- compactness_se <= -0.98
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- compactness_se > -0.98
|   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- smoothness_se > -1.22
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- fractal_dimension_se > -0.84
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- fractal_dimension_se > -0.83
|   |   |   |   |   |--- class: 0
|   |   |   |--- compactness_mean > 0.59
|   |   |   |   |--- symmetry_se <= 0.20
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- symmetry_se > 0.20
|   |   |   |   |   |--- class: 0
|   |   |--- area_se > 0.04
|   |   |   |--- symmetry_mean <= -0.57
|   |   |   |   |--- class: 1
|   |   |   |--- symmetry_mean > -0.57
|   |   |   |   |--- class: 0
|   |--- area_mean > 0.12
|   |   |--- texture_mean <= -0.72
|   |   |   |--- class: 0
|   |   |--- texture_mean > -0.72
|   |   |   |--- smoothness_mean <= -1.52
|   |   |   |   |--- class: 0
|   |   |   |--- smoothness_mean > -1.52
|   |   |   |   |--- class: 1
|--- concave points_mean > 0.01
|   |--- concavity_mean <= 0.00
|   |   |--- fractal_dimension_mean <= -0.83
|   |   |   |--- class: 1
|   |   |--- fractal_dimension_mean > -0.83
|   |   |   |--- concave points_mean <= 0.11
|   |   |   |   |--- concavity_se <= -0.35
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- concavity_se > -0.35
|   |   |   |   |   |--- class: 0
|   |   |   |--- concave points_mean > 0.11
|   |   |   |   |--- class: 0
|   |--- concavity_mean > 0.00
|   |   |--- fractal_dimension_se <= 2.39
|   |   |   |--- smoothness_se <= 1.87
|   |   |   |   |--- radius_se <= -0.77
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- radius_se > -0.77
|   |   |   |   |   |--- concave points_se <= 2.59
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- concave points_se > 2.59
|   |   |   |   |   |   |--- radius_se <= 0.61
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |--- radius_se > 0.61
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |--- smoothness_se > 1.87
|   |   |   |   |--- concave points_se <= 1.03
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- concave points_se > 1.03
|   |   |   |   |   |--- class: 0
|   |   |--- fractal_dimension_se > 2.39
|   |   |   |--- perimeter_mean <= 0.59
|   |   |   |   |--- class: 0
|   |   |   |--- perimeter_mean > 0.59
|   |   |   |   |--- class: 1

Feature Importance in Decision Trees

Decision Trees can also assign importance to features by measuring the average decrease in impurity (i.e., information gain) for each feature. The features with higher decreases are treated as more important.

<Axes: >

We can clearly see that "concave points_mean" has the largest importance because it provides the most reduction in impurity.

Visualizing decision boundaries for Decision Trees

Similar to project 2, let's see what decision boundaries a Decision Tree creates.
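One way such boundary plots can be produced is with scikit-learn's DecisionBoundaryDisplay. A minimal sketch, assuming train is a DataFrame containing the two plotted feature columns and target holds the labels (these names, and the depths tried, are assumptions rather than the notebook's exact code):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

# Restrict to the two features used for the 2-D plots (names assumed from the notebook).
features = ["concave points_mean", "perimeter_mean"]
X2 = train[features]

fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, depth in zip(axes, [1, 5, 10, 15]):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X2, target)
    # Shade the predicted class over a grid of the two features,
    # then overlay the actual training points.
    DecisionBoundaryDisplay.from_estimator(clf, X2, response_method="predict", alpha=0.4, ax=ax)
    ax.scatter(X2.iloc[:, 0], X2.iloc[:, 1], c=target, s=10, edgecolor="k")
    ax.set_title(f"max_depth = {depth}")
plt.show()
```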
We use the two features most correlated with the target labels: concave points_mean and perimeter_mean. We can see that the model gets more and more complex with increasing depth until it converges somewhere between depth 10 and 15.

Supervised Learning: Multi-Layer Perceptron (MLP)

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks are very powerful tools that are used in a variety of applications, including image and speech processing. In class, we have discussed one of the earliest types of neural networks, known as the Multi-Layer Perceptron.

Using MLP for classification

/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:686: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (400) reached and the optimization hasn't converged yet. warnings.warn(
Accuracy: 0.929825
Confusion Matrix:
[[66  6]
 [ 2 40]]

Parameters for MLP Classifier

In scikit-learn, the following are just some of the parameters we can pass into the MLP Classifier:

hidden_layer_sizes: tuple, length = n_layers - 2, default=(100,). The ith element represents the number of neurons in the ith hidden layer.
activation: {"identity", "logistic", "tanh", "relu"}, default="relu". Activation function for the hidden layer.
alpha: float, default=0.0001. Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss.
max_iter: int, default=200. Maximum number of iterations taken for the solvers to converge.

Visualizing decision boundaries for MLP

Now, let's see how the decision boundaries change as a function of both the activation function and the number of hidden layers.

Unsupervised learning: PCA

As shown in lecture, PCA is a valuable dimensionality reduction tool that can extract a small subset of valuable features. In this section, we shall demonstrate how PCA can extract important visual features from pictures of subjects' faces. We shall use the AT&T Database of Faces. This dataset contains 40 different subjects with 10 samples per subject, which means we have a dataset of 400 samples. We extract the images from the scikit-learn dataset library. The library provides the flattened image arrays (faces.data), the 2-D images (faces.images), and which subject each image belongs to (faces.target). Each image is a 64 by 64 image with pixels converted to floating point values in [0,1].

Eigenfaces

The following code downloads and loads the face data.

Flattened Face Data shape: (400, 4096)
Face Image Data Shape: (400, 64, 64)
Shape of target data: (400,)

Now, let us see what features we can extract from these face images. The following plots the top 30 PCA components along with how much variance each component explains.

Amazing! We can see that the model has learned to focus on many features that we as humans also look at when trying to identify a face, such as the nose, eyes, eyebrows, etc. With this feature extraction, we can perform much more powerful learning.

Feature Extraction for Classification

Let's see if we can use PCA to improve the accuracy of the decision tree classifier.

Accuracy without PCA
Accuracy: 0.894737
Confusion Matrix:
[[63  9]
 [ 3 39]]
Accuracy with PCA
Accuracy: 0.912281
Confusion Matrix:
[[66  6]
 [ 4 38]]
Number of Features without PCA: 20
Number of Features with PCA: 7

Clearly, we get a much better accuracy for the model while using fewer features.
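A minimal sketch of how this with/without-PCA comparison could be set up, assuming the same train, target, test, and target_test splits; the 7 components mirror the feature count reported above and are an assumption:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Baseline: decision tree on the original features.
tree = DecisionTreeClassifier(random_state=0).fit(train, target)
print("Without PCA:", accuracy_score(target_test, tree.predict(test)))

# PCA down to 7 components, then a decision tree on the projected features.
# A Pipeline keeps the projection fit on the training split only.
pca_tree = make_pipeline(PCA(n_components=7), DecisionTreeClassifier(random_state=0))
pca_tree.fit(train, target)
pred = pca_tree.predict(test)
print("With PCA:", accuracy_score(target_test, pred))
print(confusion_matrix(target_test, pred))
```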
But are the features that PCA found important the same features that the decision tree used? Let's look at the feature importance of the tree. The following plot numbers the first principal component as 0, the second as 1, and so forth.

<Axes: >

Amazingly, the first and second components were the most important features in the decision tree. Thus, we can claim that PCA has significantly improved the performance of our model.

Unsupervised learning: Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. One major algorithm we learned in class is the K-Means algorithm.

Evaluating K-Means performance

While there are many ways to evaluate the performance of clustering algorithms, we will focus on the inertia score of the K-Means model. Inertia is another term for the sum of squared distances of samples to their closest cluster center. Let us look at how the inertia changes as a function of the number of clusters for an artificial dataset.

<matplotlib.collections.PathCollection at 0x7f8bd234cee0>

Inertia for K = 2: 13293.997460961546
Inertia for K = 3: 7169.578996856773
Inertia for K = 4: 3247.8674040695832
Inertia for K = 5: 872.8554968701878
Inertia for K = 6: 803.846686425823
Inertia for K = 7: 739.5236191503768
Inertia for K = 8: 690.2530283275607
Inertia for K = 9: 614.5138307338655

(Each KMeans fit also emits a FutureWarning from scikit-learn: the default value of `n_init` will change from 10 to 'auto' in 1.4; set `n_init` explicitly to suppress it. The repeated copies of this warning are omitted in what follows.)
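A minimal sketch of the kind of loop that produces such an inertia curve, assuming the artificial dataset comes from make_blobs (the generation parameters here are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Artificial dataset with a known number of clusters (5 here, as an illustration).
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

inertias = []
ks = range(2, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"Inertia for K = {k}: {km.inertia_}")
    inertias.append(km.inertia_)

# Plot inertia vs. K and look for the "elbow".
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.show()
```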
From the plot, we can see that when the number of clusters in K-means matches the correct number of clusters, the inertia starts decreasing at a much slower rate. This creates a kind of elbow shape in the graph. For K-means clustering, the elbow method selects the number of clusters where the elbow shape is formed. In this case, we see that this method would produce the correct number of clusters.

Let's try it on the cancer dataset.
Inertia for K = 2: 6381.278325955922
Inertia for K = 3: 5508.621446593709
Inertia for K = 4: 4972.231721973118
Inertia for K = 5: 4507.26713736607
Inertia for K = 6: 4203.777246823878
Inertia for K = 7: 3942.659550896411
Inertia for K = 8: 3745.1124228292692
Inertia for K = 9: 3532.7225156022073
Inertia for K = 10: 3371.033467027838
Inertia for K = 11: 3232.472758070737
Inertia for K = 12: 3135.1944201924534
Inertia for K = 13: 3033.3838427786477
Inertia for K = 14: 2958.3200036360367
Inertia for K = 15: 2893.798763511904
Inertia for K = 16: 2767.804761705547
Inertia for K = 17: 2737.4747101790635
Inertia for K = 18: 2662.1203080706655
Inertia for K = 19: 2617.90890694005
Inertia for K = 20: 2553.961378449726
Inertia for K = 21: 2491.9133737078346
Inertia for K = 22: 2448.777623600997
Inertia for K = 23: 2391.644588540416
Inertia for K = 24: 2374.1345787190176
Inertia for K = 25: 2334.794010981073
Inertia for K = 26: 2267.993521706617
Inertia for K = 27: 2233.585453239129
Inertia for K = 28: 2191.739402693569
Inertia for K = 29: 2165.254207641313

Here we see that the elbow is not as cleanly defined. This may be due to the dataset not being a good fit for K-means. Regardless, we can still apply the elbow method by noting that the slowdown happens around 7~14.

K-means on Eigenfaces

Now, let's see how K-means performs in clustering the face data with PCA.

[3 6 3 4 6 4 3 3 3 6 5 5 5 5 5 5 5 5 5 5 1 1 5 1 4 1 4 4 6 6 5 5 5 3 6 4 3 5 5 6 4 1 1 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 7 7 3 4 7 3 7 7 3 7 0 6 3 6 3 3 6 3 3 6 1 1 1 4 4 4 4 4 1 6 6 6 6 6 6 6 6 6 4 3 0 0 0 0 0 0 0 0 0 0 4 1 1 1 1 4 1 6 6 4 5 5 4 4 5 5 4 4 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 8 3 3 3 3 8 6 8 3 3 4 4 1 1 4 4 4 4 4 4 3 6 4 6 3 3 3 3 3 3 7 7 7 7 7 7 7 7 7 7 9 9 9 4 4 4 4 4 4 9 9 9 9 9 9 9 4 8 9 4 2 2 2 2 2 2 2 2 2 2 3 6 1 4 1 4 1 6 4 4 8 8 8 8 5 8 8 8 8 8 6 5 6 5 5 5 6 4 5 6 1 1 1 1 1 1 3 1 1 5 5 5 5 5 5 5 5 5 5 5 5 1 5 5 5 5 5 5 1 4 2 2 2 9 4 4 9 8 2 2 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 7 7 7 7 7 7 5 7 7 7 9 9 9 9 9 9 9 9 9 9 2 2 2 2 2 2 2 2 2 2 9 9 9 9 4 6 6 1 4 4 3 8 8 8 7 8 8 8 8 8 1 1 1 1 1 1 1 1 1 1 4 1 1 6 1 4 6 6 4 1 2 2 2 2 2 2 2 2 2 2 6 4 3 4 3 1 4 1 4 4]

While the algorithm isn't perfect, we can see that K-means with PCA is picking up on some facial similarity or similar expressions.
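A minimal sketch of how the faces could be clustered on their PCA features; the number of components and clusters are assumptions consistent with the output above, not necessarily the notebook's exact settings:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# Load the AT&T/Olivetti faces: 400 flattened 64x64 images.
faces = fetch_olivetti_faces()

# Project each face onto a small number of eigenfaces, then cluster.
pca = PCA(n_components=30, whiten=True, random_state=0)
faces_pca = pca.fit_transform(faces.data)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(faces_pca)
print(km.labels_)  # one cluster assignment per face
```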
(100 pts) Todo: Use new methods to classify heart disease

To compare how these new models perform against the other models discussed in the course, we will apply them to the heart disease dataset that was used in project 2.

Background: The Dataset (Recap)

For this exercise we will be using a subset of the UCI Heart Disease dataset, leveraging the fourteen most commonly used attributes. All identifying information about the patient has been scrubbed. You will be asked to classify whether a patient is suffering from heart disease based on a host of potential medical factors.

The dataset includes 14 columns. The information provided by each column is as follows:
age: Age in years
sex: (1 = male; 0 = female)
cp: Chest pain type (0 = asymptomatic; 1 = atypical angina; 2 = non-anginal pain; 3 = typical angina)
trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
chol: Cholesterol in mg/dl
fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: Resting electrocardiographic results (0 = showing probable or definite left ventricular hypertrophy by Estes' criteria; 1 = normal; 2 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV))
thalach: Maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: Depression induced by exercise relative to rest
slope: The slope of the peak exercise ST segment (0 = downsloping; 1 = flat; 2 = upsloping)
ca: Number of major vessels (0-3) colored by fluoroscopy
thal: 1 = normal; 2 = fixed defect; 7 = reversible defect
sick: Indicates the presence of heart disease (True = Disease; False = No disease)

Preprocess Data

This part is done for you since you would have already completed it in project 2. Use train, target, test, and target_test for all future parts. We also provide the column names for each transformed column for future use.

Column names after transformation by pipeline:
['num__age' 'num__trestbps' 'num__chol' 'num__thalach' 'num__oldpeak' 'cat__sex_0' 'cat__sex_1' 'cat__cp_0' 'cat__cp_1' 'cat__cp_2' 'cat__cp_3' 'cat__fbs_0' 'cat__fbs_1' 'cat__restecg_0' 'cat__restecg_1' 'cat__restecg_2' 'cat__exang_0' 'cat__exang_1' 'cat__slope_0' 'cat__slope_1' 'cat__slope_2' 'cat__ca_0' 'cat__ca_1' 'cat__ca_2' 'cat__ca_3' 'cat__ca_4' 'cat__thal_0' 'cat__thal_1' 'cat__thal_2' 'cat__thal_3']

The following shows the baseline accuracy of simply classifying every sample as the majority class.

Counts of each class in target_test:
0    66
1    56
Name: target, dtype: int64
Baseline Accuracy of using Majority Class: 0.5409836065573771

(25 pts) Decision Trees

[5 pts] Apply Decision Tree on Train Data

Apply the decision tree on the train data with the default parameters of the DecisionTreeClassifier. Report the accuracy and print the confusion matrix. Make sure to use random_state = 0 so that your results match ours.

Accuracy: 0.696721
Confusion Matrix:
[[53 13]
 [24 32]]

[5 pts] Visualize the Decision Tree

Visualize the first two layers of the decision tree that you trained.

[Text(0.5, 0.8333333333333334, 'cat__cp_0 <= 0.5\ngini = 0.496\nsamples = 100.0%\nvalue = [0.547, 0.453]'), Text(0.25, 0.5, 'num__chol <= 0.223\ngini = 0.283\nsamples = 48.6%\nvalue = [0.83, 0.17]'), Text(0.125, 0.16666666666666666, '\n (...) \n'), Text(0.375, 0.16666666666666666, '\n (...) \n'), Text(0.75, 0.5, 'cat__ca_0 <= 0.5\ngini = 0.403\nsamples = 51.4%\nvalue = [0.28, 0.72]'), Text(0.625, 0.16666666666666666, '\n (...) \n'), Text(0.875, 0.16666666666666666, '\n (...) \n')]
What is the gini index improvement of the first split?

The weighted gini index after the first split is .486(.283) + .514(.403) = 0.34468. This is a gini index improvement of .496 - 0.34468 = 0.15132.

[5 pts] Plot the importance of each feature for the Decision Tree

<Axes: >

How many features have non-zero importance for the Decision Tree? If we remove the features with zero importance, will it change the decision tree for the same sampled dataset?

We have 16 features with non-zero importance. If we remove the features with zero importance, it does not change the decision tree for the sampled dataset.

[10 pts] Optimize Decision Tree

While the default Decision Tree performs fairly well on the data, let's see if we can improve performance by optimizing the parameters. Run a GridSearchCV with 3-Fold Cross Validation for the Decision Tree. Find the best model parameters amongst the following:
max_depth = [1,2,3,4,5,10,15]
min_samples_split = [2,4,6,8]
criterion = ["gini", "entropy"]
After using GridSearchCV, print the best model parameters and the best score.

    param_max_depth  param_min_samples_split  param_criterion  mean_test_score
39  3                8                        entropy          0.798851

Using the best model you have, report the test accuracy and print out the confusion matrix.

Accuracy: 0.786885
Confusion Matrix:
[[62  4]
 [22 34]]

(20 pts) Multi-Layer Perceptron

[5 pts] Applying a Multi-Layer Perceptron

Apply the MLP on the train data with hidden_layer_sizes=(100,100) and max_iter = 800. Report the accuracy and print the confusion matrix. Make sure to set random_state=0.

Accuracy: 0.819672
Confusion Matrix:
[[63  3]
 [19 37]]

[10 pts] Speed test between Decision Tree and MLP

Let us compare the training times and prediction times of a Decision Tree and an MLP. Time how long it takes for a Decision Tree and an MLP to perform a .fit operation (i.e. training the model). Then, time how long it takes for a Decision Tree and an MLP to perform a .predict operation (i.e. predicting the testing data). Print out the timings and specify which model was quicker for each operation. We recommend using the time Python module to time your code; an example of the time module was shown in project 2. Use the default Decision Tree Classifier and the MLP with the previously mentioned parameters.

Decision Tree Training Time : 0.0023469924926757812
MLP Training Time : 0.41370391845703125
Decision Tree Prediction Time : 0.00042891502380371094
MLP Prediction Time : 0.00022602081298828125

[5 pts] Compare and contrast Decision Trees and MLPs. Describe at least one advantage and disadvantage of using an MLP over a Decision Tree.

An MLP classifier has a higher test accuracy, so it can better fit the data and generalize to complex relationships. However, it is much slower to train than a decision tree.

(35 pts) PCA

[5 pts] Transform the train data using PCA

Train a PCA model to project the train data on the top 10 components. Print out the 10 principal components. Look at the documentation of PCA for reference.
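A minimal sketch of that step, assuming the preprocessed training matrix is named train as above (the output it would produce is shown next):

```python
from sklearn.decomposition import PCA

# Fit PCA on the training data and keep the top 10 components.
pca = PCA(n_components=10)
pca.fit(train)

# Each row of components_ is one principal component expressed
# as a linear combination of the 30 transformed input features.
print(pca.components_)
```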
[ 0.06099466 0.04034864 0.01924581 -0.1017322 0.11071541 -0.12331434 0.12331434 0.34265332 -0.13458918 -0.20936122 0.00129708 -0.01288493 0.01288493 0.19693729 -0.19855631 0.00161902 -0.35054419 0.35054419 0.04595587 0.29412324 -0.34007911 -0.20553518 0.07463263 0.08348053 0.06758334 -0.02016132 -0.00039038 0.04438528 -0.31408112 0.27008622] [ 0.05231789 0.02890251 0.03826504 -0.00733246 -0.0037285 0.44442215 -0.44442215 0.07362246 -0.03171478 -0.02860787 -0.01329982 -0.02106535 0.02106535 0.42589374 -0.44447634 0.0185826 0.0184908 -0.0184908 -0.02183111 0.1435457 -0.1217146 0.02011775 -0.03199655 0.03699266 0.00382438 -0.02893824 0.00336806 0.00301279 0.29007107 -0.29645192] [-0.0427616 -0.03742143 0.00354063 -0.04733566 0.01801283 0.30699875 -0.30699875 0.09347247 0.0329172 -0.0989118 -0.02747786 0.19990151 -0.19990151 -0.43118048 0.40996579 0.0212147 -0.13223456 0.13223456 -0.02514842 0.32090865 -0.29576023 0.29149476 -0.18165312 -0.05235916 -0.0472252 -0.01025728 0.00355497 -0.03312684 0.02754858 0.00202329] [-0.01085267 0.05128855 0.02043706 0.04685242 0.03207741 -0.02899318 0.02899318 -0.03499918 0.0751887 -0.14722255 0.10703303 0.07444483 -0.07444483 0.19169119 -0.19398602 0.00229484 0.37381386 -0.37381386 0.04159629 0.09054175 -0.13213804 0.38469377 -0.41161818 0.00135049 0.04404514 -0.01847122 0.00225008 0.04153747 -0.36627833 0.32249078] [ 0.04627938 0.01970448 -0.00381961 -0.03509902 0.00213482 -0.0641167 0.0641167 -0.40342028 -0.0378483 0.32955899 0.11170959 -0.25422537 0.25422537 -0.06936845 0.06174935 0.0076191 0.1794727 -0.1794727 -0.0691121 0.4997768 -0.4306647 -0.17536448 0.15450628 -0.00161821 -0.00409491 0.02657132 0.00376026 0.04672977 0.00227974 -0.05276977] [-0.06836847 -0.02106313 -0.0364071 0.00576781 -0.04376639 -0.35938801 0.35938801 -0.00508255 0.05584308 -0.1221728 0.07141227 -0.10049637 0.10049637 0.0879698 -0.09144775 0.00347796 -0.18119896 0.18119896 0.00840746 0.11524555 -0.12365301 0.48997794 -0.22229128 -0.15548202 -0.09522074 -0.0169839 0.00997476 0.08201132 0.30430633 -0.39629241] [-0.01781269 0.0730869 0.02016699 0.01826179 0.02836503 0.18963204 -0.18963204 -0.22891469 -0.09804043 0.32499933 0.00195579 -0.36267785 0.36267785 0.02767142 -0.01728665 -0.01038476 -0.29600786 0.29600786 0.06466713 -0.18871682 0.12404969 0.31270592 -0.17972228 -0.08238154 -0.01808524 -0.03251686 0.03177332 -0.09353408 -0.20218982 0.26395057] [ 0.04335226 0.05421452 -0.03470438 -0.01533346 0.01850783 0.06961587 -0.06961587 0.33466825 0.11123006 -0.41407483 -0.03182348 -0.47183824 0.47183824 -0.18841507 0.1853052 0.00310987 0.14171697 -0.14171697 0.01938161 -0.04763557 0.02825396 -0.11728067 -0.16652438 0.21168304 0.0921391 -0.02001708 0.0110368 0.12414356 -0.00323016 -0.13195021] [-0.05976132 -0.03693706 0.00661316 0.05615859 -0.04806144 0.07173048 -0.07173048 -0.33252042 0.6471931 -0.443929 0.12925632 -0.05004393 0.05004393 0.07308993 -0.03351071 -0.03957922 -0.08972835 0.08972835 -0.00474533 0.01650628 -0.01176095 -0.00075448 0.33237063 -0.22375775 -0.07770358 -0.03015483 -0.01306916 -0.11556595 -0.02840801 0.15704312] [-0.03956388 0.01449879 -0.0018639 0.07345678 0.09644551 -0.0121838 0.0121838 -0.36630173 0.20777859 -0.01802986 0.176553 0.10857767 -0.10857767 0.01552185 -0.00336271 -0.01215914 -0.16815714 0.16815714 -0.03179433 -0.0103539 0.04214823 -0.22392515 -0.41077763 0.66321302 -0.08581595 0.05730571 -0.01300773 0.11220997 -0.00547824 -0.09372401] [5 pts] Percentage of variance explained by top 10 principal components Using PCA's "explained_variance_ratio_", print 
the percentage of variance explained by the top 10 principal components.

[0.23862094 0.1360394 0.10034179 0.08239361 0.07495304 0.06591197 0.05919248 0.04935616 0.0404145 0.0299425 ]

[5 pts] Transform the train and test data into train_pca and test_pca using PCA

Note: Use fit_transform for train and transform for test.

[5 pts] PCA+Decision Tree

Train the default Decision Tree Classifier using train_pca. Report the accuracy using test_pca and print the confusion matrix.

Accuracy with PCA
Accuracy: 0.762295
Confusion Matrix:
[[53 13]
 [16 40]]

Does the model perform better with or without PCA? The model performs better with PCA.

[5 pts] PCA+MLP

Train the MLP classifier with the same parameters as before using train_pca. Report the accuracy using test_pca and print the confusion matrix.

Accuracy with PCA
Accuracy: 0.778689
Confusion Matrix:
[[60  6]
 [21 35]]

Does the model perform better with or without PCA? This model performed better without PCA.

[10 pts] Pros and Cons of PCA

In your own words, provide at least two pros and at least two cons of using PCA.

Response: Two pros of using PCA are that it significantly decreases the amount of raw data, which can speed up training, and that it helps us visualize high-dimensional features by projecting them into lower dimensions. Two cons are that we may get lower accuracy, since we are effectively losing information about some of the features, and that the resulting components may be hard to interpret.

(20 pts) K-Means Clustering

[5 pts] Apply K-means to the train data and print out the Inertia score

Use n_clusters = 5 and random_state = 0.

491.0665663612592

[10 pts] Find the optimal cluster size using the elbow method.

Use the elbow method to find the best cluster size or range of best cluster sizes for the train data. Check the cluster sizes from 2 to 20. Make sure to plot the Inertia and state where you think the elbow starts. Make sure to use random_state = 0.
Inertia for K = 2: 619.2596852490838
Inertia for K = 3: 562.2941749488493
Inertia for K = 4: 515.3501104402982
Inertia for K = 5: 491.0665663612592
Inertia for K = 6: 458.3449062857246
Inertia for K = 7: 436.46731311592123
Inertia for K = 8: 427.64243132453544
Inertia for K = 9: 409.3453854307658
Inertia for K = 10: 393.8362013824142
Inertia for K = 11: 375.627142914177
Inertia for K = 12: 366.872574191804
Inertia for K = 13: 356.0704428612708
Inertia for K = 14: 353.0506627827143
Inertia for K = 15: 342.96648926542787
Inertia for K = 16: 335.8288846772815
Inertia for K = 17: 325.33654094415067
Inertia for K = 18: 310.3448076066395
Inertia for K = 19: 306.7095752890683
Inertia for K = 20: 295.92886291975304

It looks like the elbow starts at k = 13.

[5 pts] Find the optimal cluster size for the train_pca data

Repeat the same experiment but use train_pca instead of train.

Inertia for K = 2: 526.8986473659902
Inertia for K = 3: 469.48656326630146
Inertia for K = 4: 423.2161135203968
Inertia for K = 5: 403.5879123750675
Inertia for K = 6: 371.0784572148916
Inertia for K = 7: 350.3642628743978
Inertia for K = 8: 330.0476150629754
Inertia for K = 9: 314.69189201646253
Inertia for K = 10: 298.74474399366363
Inertia for K = 11: 288.50442948032435
Inertia for K = 12: 277.3547969513997
Inertia for K = 13: 266.20679282269055
Inertia for K = 14: 258.1239469563823
Inertia for K = 15: 250.34211491204923
Inertia for K = 16: 234.46445989492173
Inertia for K = 17: 231.17570259556672
Inertia for K = 18: 221.78373427947042
Inertia for K = 19: 220.04830474526796
Inertia for K = 20: 211.59508812745764

Notice that the inertia is much smaller for every cluster size when using PCA features. Why do you think this is happening? Hint: Think about what inertia is calculating and consider the number of features that PCA outputs.

Since there are fewer features in the PCA data, the squared distances between each data point and its closest cluster center are summed over fewer dimensions, so they are smaller. Thus, the inertia is lower.

(100 pts) Putting it all together

Through all the homeworks and projects, you have learned how to apply many different models to perform a supervised learning task. We are now asking you to take everything that you have learned and create a model that can predict whether a hotel reservation will be canceled or not.

Context

Hotels see millions of people every year and always want to keep rooms occupied and paid for. Cancellations make the business lose money, since they can make it difficult to rent the room to another customer on such short notice. As such, it is useful for a hotel to know whether a reservation is likely to be canceled. The following dataset provides a variety of information about a booking that you will use to predict whether that booking will be canceled.

Property Management System - PMS Attribute Information
(C) is for Categorical, (N) is for Numeric
1) is_canceled (C): Value indicating if the booking was canceled (1) or not (0).
2) hotel (C): The dataset contains the booking information of two hotels; one is a resort hotel and the other is a city hotel.
3) arrival_date_month (C): Month of arrival date with 12 categories: "January" to "December"
4) stays_in_weekend_nights (N): Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
5) stays_in_week_nights (N): Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
6) adults (N): Number of adults
7) children (N): Number of children
8) babies (N): Number of babies
9) meal (C): Type of meal
10) country (C): Country of origin.
11) previous_cancellations (N): Number of previous bookings that were canceled by the customer prior to the current booking
12) previous_bookings_not_canceled (N): Number of previous bookings not canceled by the customer prior to the current booking
13) reserved_room_type (C): Code of room type reserved.
The code is presented instead of the designation for anonymity reasons.
14) booking_changes (N): Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
15) deposit_type (C): No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay
16) days_in_waiting_list (N): Number of days the booking was in the waiting list before it was confirmed to the customer
17) customer_type (C): Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to another transient booking; Transient-party – when the booking is transient, but is associated to at least one other transient booking
18) adr (N): Average Daily Rate (calculated by dividing the sum of all lodging transactions by the total number of staying nights)
19) required_car_parking_spaces (N): Number of car parking spaces required by the customer
20) total_of_special_requests (N): Number of special requests made by the customer (e.g. twin bed or high floor)
21) name (C): Name of the guest (not real)
22) email (C): Email (not real)
23) phone-number (C): Phone number (not real)

This dataset is quite large, with 86989 samples. This makes it difficult to simply brute-force a large number of models. As such, you have to be thoughtful when designing your models. The file name for the training data is "hotel_booking.csv".

Challenge

This project is about being able to predict whether a reservation is likely to cancel based on the input parameters available to us. We will give you some specific instructions to lead you in the right direction, but you are given free rein on which models to use and the preprocessing steps you make. We will ask you to write out a description of which models you chose and why you chose them.

(50 pts) Preprocessing

For the dataset, the following are mandatory pre-processing steps for your data:
Use One-Hot Encoding on all categorical features (specify whether you keep the extra feature or not for features with multiple values)
Determine which fields need to be dropped
Handle missing values (specify your strategy)
Rescale the real-valued features using any strategy you choose (StandardScaler, MinMaxScaler, Normalizer, etc.)
Augment at least one feature
Implement a train-test split with 20% of the data going to the test data. Make sure that the test and train data are balanced in terms of the desired class.

After writing your preprocessing code, write out a description of what you did for each step and provide a justification for your choices. All descriptions should be written in the markdown cells of the Jupyter notebook. Make sure your writing is clear and professional. We highly recommend reading through the scikit-learn documentation to make this part easier.
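A minimal sketch of one way to satisfy these steps with a ColumnTransformer, assuming the raw frame is read from "hotel_booking.csv" and the column names match the description above; the specific missing-value strategy (dropping rows) and scaling choices here are illustrative, not required:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("hotel_booking.csv")

# Drop identifying fields and rows with missing values (one possible strategy).
df = df.drop(columns=["name", "email", "phone-number"]).dropna()

# Augment a feature: total number of guests on the booking.
df["total_guests"] = df["adults"] + df["children"] + df["babies"]

y = df["is_canceled"]
X = df.drop(columns=["is_canceled"])

categorical = X.select_dtypes(include="object").columns
numeric = X.select_dtypes(exclude="object").columns

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# 20% test split, stratified so both splits stay balanced on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train = preprocess.fit_transform(X_train)
X_test = preprocess.transform(X_test)
```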
   hotel  lead_time  arrival_date_month  stays_in_weekend_nights  stays_in_week_nights  adults  children  babies  meal  country  previous_cancellations  previous_bookings_not_canceled  reserved_room_type  booking_changes  deposit_type  days_in_waiting_list
0  Resort Hotel  4    February   1  2  2  0.0  0  FB  ESP  0  0  A  0  No Deposit  0
1  City Hotel    172  June       0  2  1  0.0  0  BB  PRT  0  0  A  0  No Deposit  0
2  City Hotel    4    November   2  1  1  0.0  0  BB  PRT  0  0  A  0  No Deposit  0
3  City Hotel    68   September  0  2  2  0.0  0  HB  PRT  0  0  A  0  No Deposit  0
4  City Hotel    149  July       2  4  3  0.0  0  BB  DEU  0  0  D  0  No Deposit  0

I implemented several different strategies here. First, I dropped any rows with null values so we don't have to deal with them, and I also dropped irrelevant columns such as name, email, and phone number. Second, I applied a one-hot encoding to all categorical features and used a StandardScaler for all the numerical features. Additionally, I augmented a new feature called total_guests, which is the sum of adults, children, and babies.

(50 pts) Try out a few models

Now that you have pre-processed your data, you are ready to try out different models. For this part of the project, we want you to experiment with the different models demonstrated in the course to determine which one performs best on the dataset.

You must perform classification using at least 3 of the following models:
Logistic Regression
K-nearest neighbors
SVM
Decision Tree
Multi-Layer Perceptron

Due to the size of the dataset, be careful which models you use and look at their documentation to see how you should tackle this size issue for each model. For full credit, you must perform some hyperparameter optimization on your models of choice. You may find the scikit-learn documentation on hyperparameter optimization useful. For each model chosen, write a description of which models were chosen, which parameters you optimized, and which parameters you chose for your best model. While the previous part of the project asked you to pre-process the data in a specific manner, you may alter the pre-processing steps as you wish to adjust for your chosen classification models.

First, let's try Logistic Regression!

(scikit-learn repeats a FutureWarning here: `penalty='none'` has been deprecated in 1.2 and will be removed in 1.4; to keep the past behaviour, set `penalty=None`. The repeated copies are omitted.)
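A minimal sketch of the kind of grid search behind the results table below, assuming the preprocessed splits X_train and y_train from the previous part; the exact parameter grid is inferred from the columns shown and is an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Separate dicts because liblinear does not support an unpenalized model.
param_grid = [
    {"penalty": ["l2"], "C": [0.01, 1, 100], "solver": ["lbfgs", "liblinear"]},
    {"penalty": [None], "C": [1], "solver": ["lbfgs", "newton-cg"]},
]

# 3-fold cross-validated search over regularization strength and solver.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```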
Cross-validation results for logistic regression (grid.cv_results_, sorted by mean test score):

(index) | mean_fit_time (s) | std_fit_time | mean_score_time (s) | std_score_time | C    | penalty | solver    | split0_score | split1_score | split2_score | mean_test_score | std_test_score | rank
3       | 0.314936          | 0.011435     | 0.001873            | 0.000020       | 1    | l2      | liblinear | 0.810940     | 0.811889     | 0.803171     | 0.808667        | 0.003905       | 1
2       | 0.989980          | 0.067458     | 0.001791            | 0.000018       | 1    | l2      | lbfgs     | 0.810892     | 0.811889     | 0.803123     | 0.808635        | 0.003919       | 2
6       | 1.099155          | 0.229325     | 0.002223            | 0.000400       | 1    | none    | lbfgs     | 0.810653     | 0.811985     | 0.803219     | 0.808619        | 0.003857       | 3
5       | 0.396446          | 0.046772     | 0.002154            | 0.000455       | 100  | l2      | liblinear | 0.810701     | 0.811937     | 0.803171     | 0.808603        | 0.003874       | 4
7       | 1.973559          | 0.788458     | 0.001831            | 0.000045       | 1    | none    | newton-cg | 0.810701     | 0.811889     | 0.803219     | 0.808603        | 0.003838       | 4
4       | 1.032954          | 0.146655     | 0.002191            | 0.000330       | 100  | l2      | lbfgs     | 0.810653     | 0.811937     | 0.803171     | 0.808587        | 0.003865       | 6
1       | 0.097108          | 0.003048     | 0.001777            | 0.000017       | 0.01 | l2      | liblinear | 0.810461     | 0.809686     | 0.801495     | 0.807214        | 0.004057       | 7
0       | 0.218072          | 0.016401     | 0.002211            | 0.000210       | 0.01 | l2      | lbfgs     | 0.810365     | 0.809590     | 0.801447     | 0.807134        | 0.004034       | 8

For the first model, I chose logistic regression. This is a linear decision-boundary model that applies the sigmoid function to predict the probability that a given input belongs to class 1. Based on my grid-search hyperparameter optimization, the best values for logistic regression are C=1, penalty='l2', and solver='liblinear'. This resulted in a mean test score of 0.809.

Second, let's try KNN!

Cross-validation results for KNN (gridKNN.cv_results_, sorted by mean test score):

(index) | mean_fit_time (s) | std_fit_time | mean_score_time (s) | std_score_time | metric    | n_neighbors | split0_score | split1_score | split2_score | mean_test_score | std_test_score | rank
7       | 0.004696          | 0.000130     | 6.759436            | 0.012030       | manhattan | 7           | 0.835992     | 0.830188     | 0.831050     | 0.832410        | 0.002557       | 1
6       | 0.004820          | 0.000088     | 6.788852            | 0.022589       | manhattan | 5           | 0.832926     | 0.829038     | 0.833733     | 0.831899        | 0.002050       | 2
3       | 0.004869          | 0.000051     | 1.857856            | 0.011214       | euclidean | 7           | 0.828903     | 0.829996     | 0.829469     | 0.829456        | 0.000447       | 3
2       | 0.004837          | 0.000042     | 1.845561            | 0.006627       | euclidean | 5           | 0.830819     | 0.827984     | 0.828703     | 0.829169        | 0.001203       | 4
5       | 0.005099          | 0.000189     | 6.720227            | 0.035194       | manhattan | 3           | 0.830148     | 0.826643     | 0.829373     | 0.828721        | 0.001503       | 5
1       | 0.004716          | 0.000026     | 1.834653            | 0.009913       | euclidean | 3           | 0.827514     | 0.825925     | 0.828176     | 0.827205        | 0.000945       | 6
4       | 0.004715          | 0.000144     | 6.547981            | 0.066679       | manhattan | 1           | 0.827466     | 0.821853     | 0.819985     | 0.823101        | 0.003179       | 7
0       | 0.004926          | 0.000311     | 1.806952            | 0.003616       | euclidean | 1           | 0.824735     | 0.820128     | 0.819314     | 0.821393        | 0.002387       | 8

For my second model, I chose KNN. This model classifies a data point based on the classes of its closest neighbors in feature space. I optimized the value of K (the number of neighbors) as well as the distance metric (euclidean or manhattan). The best parameters were the manhattan distance with 7 neighbors.
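The tables above report only cross-validation scores on the training split; the held-out test split is never scored directly. A short sketch of how the selected model could be evaluated on it, assuming the gridKNN, test, and target_test objects defined in the code cells at the end of this notebook:

from sklearn import metrics

# GridSearchCV refits the best parameter combination on the full training
# split by default, so best_estimator_ is ready to use for prediction.
best_knn = gridKNN.best_estimator_
predicted = best_knn.predict(test)

print("Test accuracy:", metrics.accuracy_score(target_test, predicted))
print("Confusion matrix:\n", metrics.confusion_matrix(target_test, predicted))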
Now, let's try using a Decision Tree Classifier!

Cross-validation results for the decision tree (gridDT.cv_results_, sorted by mean test score):

(index) | mean_fit_time (s) | std_fit_time | mean_score_time (s) | std_score_time | criterion | max_depth | split0_score | split1_score | split2_score | mean_test_score | std_test_score | rank
7       | 0.073220          | 0.000809     | 0.002477            | 0.000019       | entropy   | 7         | 0.813000     | 0.813614     | 0.806476     | 0.811030        | 0.003230       | 1
3       | 0.080363          | 0.000050     | 0.002467            | 0.000044       | gini      | 7         | 0.809216     | 0.809877     | 0.805087     | 0.808060        | 0.002119       | 2
2       | 0.060638          | 0.000098     | 0.002317            | 0.000022       | gini      | 5         | 0.787134     | 0.789567     | 0.785447     | 0.787383        | 0.001691       | 3
6       | 0.056366          | 0.000779     | 0.002288            | 0.000027       | entropy   | 5         | 0.789098     | 0.788417     | 0.784010     | 0.787175        | 0.002255       | 4
1       | 0.039469          | 0.000217     | 0.002041            | 0.000017       | gini      | 3         | 0.784404     | 0.782717     | 0.777783     | 0.781635        | 0.002809       | 5
5       | 0.038518          | 0.000544     | 0.002064            | 0.000029       | entropy   | 3         | 0.761556     | 0.757521     | 0.754742     | 0.757940        | 0.002797       | 6
0       | 0.019083          | 0.002645     | 0.002249            | 0.000440       | gini      | 1         | 0.761029     | 0.757042     | 0.754503     | 0.757524        | 0.002686       | 7
4       | 0.017222          | 0.000044     | 0.001930            | 0.000018       | entropy   | 1         | 0.761029     | 0.757042     | 0.754503     | 0.757524        | 0.002686       | 7

The decision tree classifier splits the data at each level on a chosen feature until each leaf node is purely of one class or there are no more features. I used this classifier, optimized the criterion and max_depth parameters, and found the optimal settings to be the entropy criterion with a max_depth of 7.

As it turns out, the best classifier that I tested was the KNN, with a mean accuracy of 0.83599.

Extra Credit

We have provided an extra test dataset named "hotel_booking_test.csv" that does not have the target labels. Classify the samples in the dataset with your best model and write them into a CSV file. Submit your CSV file to our Kaggle contest. The website will specify your classification accuracy on the test set. We will award a bonus point for the project for every percentage point over 75% that you get on your Kaggle test accuracy. To get the bonus points, you must also write out a summary of the model that you submit, including any changes you made to the pre-processing steps. The summary must be written in a markdown cell of the Jupyter notebook. Note that you should not change earlier parts of the project to complete the extra credit.

Kaggle Submission Instruction

Submit a two-column CSV where the first column is named "ID" and is the row number. The second column is named "target" and is the classification for each sample. Make sure that the sample order is preserved.

(78287, 55)
(8699, 55)

The model I used was a KNN with 7 neighbors and the manhattan distance metric, which was the optimal one from my GridSearch. Additionally, I had to process the test data to add one more column, since my pre-processing resulted in more columns for the training data than for the test data, likely due to the one-hot encodings.

In [ ]: #Here are a set of libraries we imported to complete this assignment. 
#Feel free to use these or equivalent libraries for your implementation import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) import matplotlib.pyplot as plt # this is used for the plot the graph import matplotlib import os import time #Sklearn classes from sklearn.model_selection import train_test_split , cross_val_score , GridSearchCV , KFold from sklearn import metrics from sklearn.metrics import confusion_matrix , silhouette_score import sklearn.metrics.cluster as smc from sklearn.cluster import KMeans from sklearn.tree import DecisionTreeClassifier from sklearn.pipeline import Pipeline , FeatureUnion from sklearn.preprocessing import StandardScaler , OneHotEncoder , LabelEncoder , MinMaxScaler from sklearn.compose import ColumnTransformer , make_column_transformer from sklearn import datasets from sklearn.decomposition import PCA from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_blobs from matplotlib import pyplot import itertools % matplotlib inline #Sets random seed import random random . seed ( 42 ) In [ ]: #Helper functions def draw_confusion_matrix ( y , yhat , classes ): ''' Draws a confusion matrix for the given target and predictions Adapted from scikit-learn and discussion example. ''' plt . cla () plt . clf () matrix = confusion_matrix ( y , yhat ) plt . imshow ( matrix , interpolation = 'nearest' , cmap = plt . cm . YlOrBr ) plt . title ( "Confusion Matrix" ) plt . colorbar () num_classes = len ( classes ) plt . xticks ( np . arange ( num_classes ), classes , rotation = 0 ) plt . yticks ( np . arange ( num_classes ), classes ) plt . tick_params ( top = True , bottom = False , labeltop = True , labelbottom = False ) fmt = 'd' thresh = matrix . max () / 2. for i , j in itertools . product ( range ( matrix . shape [ 0 ]), range ( matrix . shape [ 1 ])): plt . text ( j , i , format ( matrix [ i , j ], fmt ), horizontalalignment = "center" , color = "white" if matrix [ i , j ] > thresh else "black" ) plt . ylabel ( 'True label' ) plt . xlabel ( 'Predicted label' ) plt . gca () . xaxis . set_label_position ( 'top' ) plt . tight_layout () plt . show () def heatmap ( data , row_labels , col_labels , figsize = ( 20 , 12 ), cmap = "YlGn" , cbar_kw = {}, cbarlabel = "" , valfmt = "{x:.2f}" , textcolors = ( "black" , "white" ), threshold = None ): """ Create a heatmap from a numpy array and two lists of labels. Taken from matplotlib example. Parameters ---------- data A 2D numpy array of shape (M, N). row_labels A list or array of length M with the labels for the rows. col_labels A list or array of length N with the labels for the columns. ax A `matplotlib.axes.Axes` instance to which the heatmap is plotted. If not provided, use current axes or create a new one. Optional. cmap A string that specifies the colormap to use. Look at matplotlib docs for information. Optional. cbar_kw A dictionary with arguments to `matplotlib.Figure.colorbar`. Optional. cbarlabel The label for the colorbar. Optional. valfmt The format of the annotations inside the heatmap. This should either use the string format method, e.g. "$ {x:.2f}", or be a `matplotlib.ticker.Formatter`. Optional. textcolors A pair of colors. The first is used for values below a threshold, the second for those above. Optional. threshold Value in data units according to which the colors from textcolors are applied. If None (the default) uses the middle of the colormap as """ plt . figure ( figsize = figsize ) ax = plt . 
gca () # Plot the heatmap im = ax . imshow ( data , cmap = cmap ) # Create colorbar cbar = ax . figure . colorbar ( im , ax = ax , ** cbar_kw ) cbar . ax . set_ylabel ( cbarlabel , rotation =- 90 , va = "bottom" ) # Show all ticks and label them with the respective list entries. ax . set_xticks ( np . arange ( data . shape [ 1 ]), labels = col_labels ) ax . set_yticks ( np . arange ( data . shape [ 0 ]), labels = row_labels ) # Let the horizontal axes labeling appear on top. ax . tick_params ( top = True , bottom = False , labeltop = True , labelbottom = False ) # Rotate the tick labels and set their alignment. plt . setp ( ax . get_xticklabels (), rotation =- 30 , ha = "right" , rotation_mode = "anchor" ) # Turn spines off and create white grid. ax . spines [:] . set_visible ( False ) ax . set_xticks ( np . arange ( data . shape [ 1 ] + 1 ) - .5 , minor = True ) ax . set_yticks ( np . arange ( data . shape [ 0 ] + 1 ) - .5 , minor = True ) ax . grid ( which = "minor" , color = "w" , linestyle = '-' , linewidth = 3 ) ax . tick_params ( which = "minor" , bottom = False , left = False ) # Normalize the threshold to the images color range. if threshold is not None : threshold = im . norm ( threshold ) else : threshold = im . norm ( data . max ()) / 2. # Set default alignment to center, but allow it to be # overwritten by textkw. kw = dict ( horizontalalignment = "center" , verticalalignment = "center" ) # Get the formatter in case a string is supplied if isinstance ( valfmt , str ): valfmt = matplotlib . ticker . StrMethodFormatter ( valfmt ) # Loop over the data and create a `Text` for each "pixel". # Change the text's color depending on the data. texts = [] for i in range ( data . shape [ 0 ]): for j in range ( data . shape [ 1 ]): kw . update ( color = textcolors [ int ( im . norm ( data [ i , j ]) > threshold )]) text = im . axes . text ( j , i , valfmt ( data [ i , j ], None ), ** kw ) texts . append ( text ) def make_meshgrid ( x , y , h = 0.02 ): """Create a mesh of points to plot in Parameters ---------- x: data to base x-axis meshgrid on y: data to base y-axis meshgrid on h: stepsize for meshgrid, optional Returns ------- xx, yy : ndarray """ x_min , x_max = x . min () - 1 , x . max () + 1 y_min , y_max = y . min () - 1 , y . max () + 1 xx , yy = np . meshgrid ( np . arange ( x_min , x_max , h ), np . arange ( y_min , y_max , h )) return xx , yy def plot_contours ( clf , xx , yy , ** params ): """Plot the decision boundaries for a classifier. Parameters ---------- ax: matplotlib axes object clf: a classifier xx: meshgrid ndarray yy: meshgrid ndarray params: dictionary of params to pass to contourf, optional """ Z = clf . predict ( np . c_ [ xx . ravel (), yy . ravel ()]) Z = Z . reshape ( xx . shape ) out = plt . contourf ( xx , yy , Z , ** params ) return out def draw_contour ( x , y , clf , class_labels = [ "Negative" , "Positive" ]): """ Draws a contour line for the predictor Assumption that x has only two features. This functions only plots the first two columns of x. """ X0 , X1 = x [:, 0 ], x [:, 1 ] xx0 , xx1 = make_meshgrid ( X0 , X1 ) plt . figure ( figsize = ( 10 , 6 )) plot_contours ( clf , xx0 , xx1 , cmap = "PiYG" , alpha = 0.8 ) scatter = plt . scatter ( X0 , X1 , c = y , cmap = "PiYG" , s = 30 , edgecolors = "k" ) plt . legend ( handles = scatter . legend_elements ()[ 0 ], labels = class_labels ) plt . xlim ( xx0 . min (), xx0 . max ()) plt . ylim ( xx1 . min (), xx1 . max ()) In [ ]: #Preprocess Data #Load Data data = pd . 
read_csv ( 'datasets/breast_cancer_data.csv' ) #Drop id column data = data . drop ([ "id" ], axis = 1 ) #Transform target feature into numerical le = LabelEncoder () data [ 'diagnosis' ] = le . fit_transform ( data [ 'diagnosis' ]) #Split target and data y = data [ "diagnosis" ] x = data . drop ([ "diagnosis" ], axis = 1 ) #Train test split train_raw , test_raw , target , target_test = train_test_split ( x , y , test_size = 0.2 , stratify = y , random_state = 0 ) #Standardize data #Since all features are real-valued, we only have one pipeline pipeline = Pipeline ([ ( 'scaler' , StandardScaler ()) ]) #Transform raw data train = pipeline . fit_transform ( train_raw ) test = pipeline . transform ( test_raw ) #Note that there is no fit calls #Names of Features after Pipeline feature_names = list ( pipeline . get_feature_names_out ( list ( x . columns ))) In [ ]: target . value_counts () Out[ ]: In [ ]: #Baseline accuracy of using the majority class ct = target_test . value_counts () print ( "Counts of each class in target_test: " ) print ( ct ) print ( "Baseline Accuraccy of using Majority Class: " , np . max ( ct ) / np . sum ( ct )) In [ ]: from sklearn import tree from sklearn.tree import DecisionTreeClassifier clf = DecisionTreeClassifier ( criterion = "gini" , random_state = 0 ) clf . fit ( train , target ) predicted = clf . predict ( test ) In [ ]: print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'healthy' , 'sick' ]) In [ ]: plt . figure ( figsize = ( 30 , 15 )) #Note that we have to pass the feature names into the plotting function to get the actual names #We pass the column names through the pipeline in case any feature augmentation was made #For example, a categorical feature will be split into multiple features with one hot encoding #and this way assigns a name to each column based on the feature value and the original feature name tree . plot_tree ( clf , max_depth = 1 , proportion = True , feature_names = feature_names , filled = True ) Out[ ]: In [ ]: from sklearn.tree import export_text r = export_text ( clf , feature_names = feature_names ) print ( r ) In [ ]: imp_pd = pd . Series ( data = clf . feature_importances_ , index = feature_names ) imp_pd = imp_pd . sort_values ( ascending = False ) imp_pd . plot . bar () Out[ ]: In [ ]: #Extract first two feature and use the standardscaler train_2 = StandardScaler () . fit_transform ( train_raw [[ 'concave points_mean' , 'perimeter_mean' ]]) depth = [ 1 , 2 , 3 , 4 , 5 , 10 , 15 ] for d in depth : dt = DecisionTreeClassifier ( max_depth = d , min_samples_split = 7 ) dt . fit ( train_2 , target ) draw_contour ( train_2 , target , dt , class_labels = [ 'Benign' , 'Malignant' ]) plt . title ( f"Max Depth ={ d }" ) In [ ]: from sklearn.neural_network import MLPClassifier clf = MLPClassifier ( hidden_layer_sizes = ( 100 ,), max_iter = 400 ) clf . fit ( train , target ) predicted = clf . predict ( test ) In [ ]: print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'Benign' , 'Malignant' ]) In [ ]: #Example of using the default Relu activation while altering the number of hidden layers train_2 = StandardScaler () . 
fit_transform ( train_raw [[ 'concave points_mean' , 'perimeter_mean' ]]) layers = [ 50 , 100 , 150 , 200 ] for l in layers : mlp = MLPClassifier ( hidden_layer_sizes = ( l ,), max_iter = 400 ) mlp . fit ( train_2 , target ) draw_contour ( train_2 , target , mlp , class_labels = [ 'Benign' , 'Malignant' ]) plt . title ( f"Hidden Layer Size ={ l }" ) In [ ]: #Example of using the default Relu activation #while altering the number of hidden layers with 2 groups of hidden layers train_2 = StandardScaler () . fit_transform ( train_raw [[ 'concave points_mean' , 'perimeter_mean' ]]) layers = [ 50 , 100 , 150 , 200 ] for l in layers : mlp = MLPClassifier ( hidden_layer_sizes = ( l , l ), max_iter = 400 ) mlp . fit ( train_2 , target ) draw_contour ( train_2 , target , mlp , class_labels = [ 'Benign' , 'Malignant' ]) plt . title ( f"Hidden Layer Sizes ={ ( l , l ) }" ) In [ ]: #Example of using 2 hidden layers of 100 units each with varying activations train_2 = StandardScaler () . fit_transform ( train_raw [[ 'concave points_mean' , 'perimeter_mean' ]]) acts = [ 'identity' , 'logistic' , 'tanh' , 'relu' ] for act in acts : mlp = MLPClassifier ( hidden_layer_sizes = ( 100 , 100 ), activation = act , max_iter = 400 ) mlp . fit ( train_2 , target ) draw_contour ( train_2 , target , mlp , class_labels = [ 'Benign' , 'Malignant' ]) plt . title ( f"Activation = { act }" ) In [ ]: #Import faces from scikit library faces = datasets . fetch_olivetti_faces () print ( "Flattened Face Data shape:" , faces . data . shape ) print ( "Face Image Data Shape:" , faces . images . shape ) print ( "Shape of target data:" , faces . target . shape ) In [ ]: #Extract image shape for future use im_shape = faces . images [ 0 ] . shape In [ ]: #Prints some example faces faceimages = faces . images [ np . random . choice ( len ( faces . images ), size = 16 , replace = False )] # take random 16 images fig , axes = plt . subplots ( 4 , 4 , sharex = True , sharey = True , figsize = ( 8 , 10 )) for i in range ( 16 ): axes [ i % 4 ][i//4].imshow(faceimages[i], cmap="gray") plt . show () In [ ]: #Perform PCA from sklearn.decomposition import PCA pca = PCA () pca_pipe = Pipeline ([( "scaler" , StandardScaler ()), #Scikit learn PCA does not standardize so we need to add ( "pca" , pca )]) pca_pipe . fit ( faces . data ) Out[ ]: In [ ]: fig = plt . figure ( figsize = ( 16 , 6 )) for i in range ( 30 ): ax = fig . add_subplot ( 3 , 10 , i + 1 , xticks = [], yticks = []) ax . imshow ( pca . components_ [ i ] . reshape ( im_shape ), cmap = plt . cm . bone ) ax . set_title ( f"Var={ pca . explained_variance_ratio_ [ i ] :.2%}" ) In [ ]: #Without PCA clf = DecisionTreeClassifier ( random_state = 0 ) clf . fit ( train , target ) predicted = clf . predict ( test ) print ( "Accuracy without PCA" ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'Benign' , 'Malignant' ]) #With PCA pca = PCA ( n_components = 0.9 ) #Take components that explain at lest 90% variance train_new = pca . fit_transform ( train ) test_new = pca . transform ( test ) clf_pca = DecisionTreeClassifier ( random_state = 0 ) clf_pca . fit ( train_new , target ) predicted = clf_pca . predict ( test_new ) print ( "Accuracy with PCA" ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . 
confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'Benign' , 'Malignant' ]) In [ ]: print ( "Number of Features without PCA: " , train . shape [ 1 ]) print ( "Number of Features with PCA: " , train_new . shape [ 1 ]) In [ ]: feature_names_new = list ( range ( train_new . shape [ 1 ])) imp_pd = pd . Series ( data = clf_pca . feature_importances_ , index = feature_names_new ) imp_pd = imp_pd . sort_values ( ascending = False ) imp_pd . plot . bar () Out[ ]: In [ ]: #Artifical Dataset X , y = make_blobs ( n_samples = 500 , n_features = 2 , centers = 5 , cluster_std = 1 , center_box = ( - 10.0 , 10.0 ), shuffle = True , random_state = 10 , ) # For reproducibility plt . scatter ( X [:, 0 ], X [:, 1 ], marker = "." , s = 30 , lw = 0 , alpha = 0.7 , edgecolor = "k" ) Out[ ]: In [ ]: ks = list ( range ( 2 , 10 )) inertia = [] for k in ks : kmeans = KMeans ( n_clusters = k , init = 'k-means++' , random_state = 0 ) kmeans . fit ( X ) # inertia method returns wcss for that model inertia . append ( kmeans . inertia_ ) print ( f"Inertia for K = { k }: { kmeans . inertia_ }" ) In [ ]: plt . figure ( figsize = ( 10 , 5 )) plt . plot ( ks , inertia , marker = 'o' , color = 'red' ) plt . title ( 'The Elbow Method' ) plt . xlabel ( 'Number of clusters' ) plt . ylabel ( 'Inertia' ) plt . show () In [ ]: ks = list ( range ( 2 , 30 )) inertia = [] for k in ks : kmeans = KMeans ( n_clusters = k , init = 'k-means++' , random_state = 0 ) kmeans . fit ( train ) # inertia method returns wcss for that model inertia . append ( kmeans . inertia_ ) print ( f"Inertia for K = { k }: { kmeans . inertia_ }" ) In [ ]: plt . figure ( figsize = ( 10 , 5 )) plt . plot ( ks , inertia , marker = 'o' , color = 'red' ) plt . title ( 'The Elbow Method' ) plt . xlabel ( 'Number of clusters' ) plt . ylabel ( 'Inertia' ) plt . show () In [ ]: from sklearn.cluster import KMeans n_clusters = 10 #We know there are 10 subjects km = KMeans ( n_clusters = n_clusters , random_state = 0 ) pipe = Pipeline ([( "scaler" , StandardScaler ()), #First standardize ( "pca" , PCA ()), #Transform using pca ( "kmeans" , km )]) #Then apply k means In [ ]: clusters = pipe . fit_predict ( faces . data ) print ( clusters ) In [ ]: for labelID in range ( n_clusters ): # find all indexes into the `data` array that belong to the # current label ID, then randomly sample a maximum of 25 indexes # from the set idxs = np . where ( clusters == labelID )[ 0 ] idxs = np . random . choice ( idxs , size = min ( 25 , len ( idxs )), replace = False ) # Extract the sampled indexes id_face = faces . images [ idxs ] #Plots sampled faces fig = plt . figure ( figsize = ( 10 , 5 )) for i in range ( min ( 25 , len ( idxs ))): ax = fig . add_subplot ( 5 , 5 , i + 1 , xticks = [], yticks = []) ax . imshow ( id_face [ i ], cmap = plt . cm . bone ) fig . suptitle ( f"Id={ labelID }" ) In [ ]: #Preprocess Data #Load Data data = pd . read_csv ( 'datasets/heartdisease.csv' ) #Transform target feature into numerical le = LabelEncoder () data [ 'target' ] = le . fit_transform ( data [ 'sick' ]) data = data . drop ([ "sick" ], axis = 1 ) #Split target and data y = data [ "target" ] x = data . 
drop ([ "target" ], axis = 1 ) #Train test split #40% in test data as was in project 2 train_raw , test_raw , target , target_test = train_test_split ( x , y , test_size = 0.4 , stratify = y , random_state = 0 ) #Feature Transformation #This is the only change from project 2 since we replaced standard scaler to minmax #This was done to ensure that the numerical features were still of the same scale #as the one hot encoded features num_pipeline = Pipeline ([ ( 'minmax' , MinMaxScaler ()) ]) heart_num = train_raw . drop ([ 'sex' , 'cp' , 'fbs' , 'restecg' , 'exang' , 'slope' , 'ca' , 'thal' ], axis = 1 ) numerical_features = list ( heart_num ) categorical_features = [ 'sex' , 'cp' , 'fbs' , 'restecg' , 'exang' , 'slope' , 'ca' , 'thal' ] full_pipeline = ColumnTransformer ([ ( "num" , num_pipeline , numerical_features ), ( "cat" , OneHotEncoder ( categories = 'auto' ), categorical_features ), ]) #Transform raw data train = full_pipeline . fit_transform ( train_raw ) test = full_pipeline . transform ( test_raw ) #Note that there is no fit calls #Extracts features names for each transformed column feature_names = full_pipeline . get_feature_names_out ( list ( x . columns )) In [ ]: print ( "Column names after transformation by pipeline: " , feature_names ) In [ ]: #Baseline accuracy of using the majority class ct = target_test . value_counts () print ( "Counts of each class in target_test: " ) print ( ct ) print ( "Baseline Accuraccy of using Majority Class: " , np . max ( ct ) / np . sum ( ct )) In [ ]: clf = DecisionTreeClassifier ( random_state = 0 ) clf . fit ( train , target ) predicted = clf . predict ( test ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'healthy' , 'sick' ]) In [ ]: plt . figure ( figsize = ( 30 , 15 )) tree . plot_tree ( clf , max_depth = 1 , proportion = True , feature_names = feature_names , filled = True ) Out[ ]: In [ ]: imp_pd = pd . Series ( data = clf . feature_importances_ , index = feature_names ) imp_pd = imp_pd . sort_values ( ascending = False ) imp_pd . plot . bar () Out[ ]: In [ ]: parameters = [ { "max_depth" :[ 1 , 2 , 3 , 4 , 5 , 10 , 15 ], "min_samples_split" :[ 2 , 4 , 6 , 8 ], "criterion" :[ "gini" , "entropy" ] } ] dt = DecisionTreeClassifier ( random_state = 0 ) kf = KFold ( n_splits = k , random_state = None ) grid = GridSearchCV ( dt , parameters , cv = kf , scoring = 'accuracy' ) grid . fit ( train , target ) res = pd . DataFrame ( grid . cv_results_ ) . sort_values ( by = [ "mean_test_score" ], ascending = False ) res [[ "param_max_depth" , "param_min_samples_split" , "param_criterion" , "mean_test_score" ]] . head ( 1 ) Out[ ]: In [ ]: clf = DecisionTreeClassifier ( random_state = 0 , max_depth = 3 , min_samples_split = 8 , criterion = "entropy" ) clf . fit ( train , target ) predicted = clf . predict ( test ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'healthy' , 'sick' ]) In [ ]: clf = MLPClassifier ( hidden_layer_sizes = ( 100 , 100 ), max_iter = 800 , random_state = 0 ) clf . fit ( train , target ) predicted = clf . predict ( test ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . 
confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'healthy' , 'sick' ]) In [ ]: import time dt = DecisionTreeClassifier ( random_state = 0 ) mlp = MLPClassifier ( hidden_layer_sizes = ( 100 , 100 ), max_iter = 800 , random_state = 0 ) t0 = time . time () dt . fit ( train , target ) t1 = time . time () print ( "Decision Tree Training Time : " , t1 - t0 ) t0 = time . time () mlp . fit ( train , target ) t1 = time . time () print ( "MLP Training Time : " , t1 - t0 ) t0 = time . time () dt . predict ( test ) t1 = time . time () print ( "Decision Tree Prediction Time : " , t1 - t0 ) t0 = time . time () mlp . predict ( test ) t1 = time . time () print ( "MLP Prediction Time : " , t1 - t0 ) In [ ]: pca = PCA ( n_components = 10 ) pca . fit ( train ) for i in range ( 10 ): print ( pca . components_ [ i ]) In [ ]: print ( pca . explained_variance_ratio_ ) In [ ]: train_pca = pca . fit_transform ( train ) test_pca = pca . transform ( test ) In [ ]: clf_pca = DecisionTreeClassifier ( random_state = 0 ) clf_pca . fit ( train_pca , target ) predicted = clf_pca . predict ( test_pca ) print ( "Accuracy with PCA" ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'Benign' , 'Malignant' ]) In [ ]: clf_pca = MLPClassifier ( hidden_layer_sizes = ( 100 , 100 ), max_iter = 800 , random_state = 0 ) clf_pca . fit ( train_pca , target ) predicted = clf_pca . predict ( test_pca ) print ( "Accuracy with PCA" ) print ( "%-12s %f" % ( 'Accuracy:' , metrics . accuracy_score ( target_test , predicted ))) print ( "Confusion Matrix: \n" , metrics . confusion_matrix ( target_test , predicted )) draw_confusion_matrix ( target_test , predicted , [ 'Benign' , 'Malignant' ]) In [ ]: k = KMeans ( n_clusters = 5 , random_state = 0 ) k . fit ( train ) print ( k . inertia_ ) In [ ]: ks = list ( range ( 2 , 21 )) inertia = [] for k in ks : kmeans = KMeans ( n_clusters = k , init = 'k-means++' , random_state = 0 ) kmeans . fit ( train ) # inertia method returns wcss for that model inertia . append ( kmeans . inertia_ ) print ( f"Inertia for K = { k }: { kmeans . inertia_ }" ) plt . figure ( figsize = ( 10 , 5 )) plt . plot ( ks , inertia , marker = 'o' , color = 'red' ) plt . title ( 'The Elbow Method' ) plt . xlabel ( 'Number of clusters' ) plt . ylabel ( 'Inertia' ) plt . show () In [ ]: ks = list ( range ( 2 , 21 )) inertia = [] for k in ks : kmeans = KMeans ( n_clusters = k , init = 'k-means++' , random_state = 0 ) kmeans . fit ( train_pca ) # inertia method returns wcss for that model inertia . append ( kmeans . inertia_ ) print ( f"Inertia for K = { k }: { kmeans . inertia_ }" ) plt . figure ( figsize = ( 10 , 5 )) plt . plot ( ks , inertia , marker = 'o' , color = 'red' ) plt . title ( 'The Elbow Method' ) plt . xlabel ( 'Number of clusters' ) plt . ylabel ( 'Inertia' ) plt . show () In [ ]: data = pd . read_csv ( './datasets/hotel_booking.csv' ) data = data . drop ([ 'email' , 'phone-number' , 'name' ], axis = 1 ) data = data . dropna () features = data . drop ( 'is_canceled' , axis = 1 ) targets = data [ 'is_canceled' ] features . 
head()

Out[ ]:

In [ ]:
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

class AugmentFeatures(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        total_guests = X['children'] + X['adults'] + X['babies']
        return np.c_[X, total_guests]

num_pipeline = Pipeline([
    ('attribs_adder', AugmentFeatures()),
    ('std_scaler', StandardScaler()),
    ('imputer', SimpleImputer())
])

features_cat = ['hotel', 'arrival_date_month', 'meal', 'country',
                'reserved_room_type', 'deposit_type', 'customer_type']
features_num = list(features.drop(features_cat, axis=1))

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, features_num),
    ("cat", OneHotEncoder(), features_cat)
])

features_prep = full_pipeline.fit_transform(features)

In [ ]:
from sklearn.model_selection import train_test_split

train, test, target, target_test = train_test_split(features, targets, test_size=0.2, random_state=0)
train = full_pipeline.fit_transform(train)
test = full_pipeline.fit_transform(test)

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

#Note that this a list of dict
#Each dict describes the combination of parameters to check
parameters = [
    {"penalty": ["l2"], "C": [0.01, 1, 100],
     "solver": ["lbfgs", "liblinear"]},      #These solvers support penalty = "l2"
    {"penalty": ["none"], "C": [1],          #Specified to prevent error message
     "solver": ["lbfgs", "newton-cg"]},      #These solvers support penalty = "none"
]

#Implementing cross validation
k = 3
kf = KFold(n_splits=k, random_state=None)

log_reg = LogisticRegression(penalty="none", max_iter=1000, solver="lbfgs")  #will change parameters during CV
grid = GridSearchCV(log_reg, parameters, cv=kf, scoring="accuracy")
grid.fit(train, target)

res = pd.DataFrame(grid.cv_results_).sort_values(by=["mean_test_score"], ascending=False)
res

Out[ ]:

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

parametersKNN = [
    {"n_neighbors": [1, 3, 5, 7], "metric": ["euclidean", "manhattan"]}
]
k = 3
kf = KFold(n_splits=k, random_state=None)
KNN = KNeighborsClassifier()
gridKNN = GridSearchCV(KNN, parametersKNN, cv=kf, scoring='accuracy')
gridKNN.fit(train, target)
resKNN = pd.DataFrame(gridKNN.cv_results_).sort_values(by=["mean_test_score"], ascending=False)
resKNN

Out[ ]:

In [ ]:
DT = DecisionTreeClassifier()
parameters = [
    {"max_depth": [1, 3, 5, 7], "criterion": ["gini", "entropy"]}
]
k = 3
kf = KFold(n_splits=k, random_state=None)
gridDT = GridSearchCV(DT, parameters, cv=kf, scoring='accuracy')
gridDT.fit(train, target)
res = pd.DataFrame(gridDT.cv_results_).sort_values(by=["mean_test_score"], ascending=False)
res

Out[ ]:

In [ ]:
train = full_pipeline.fit_transform(features)
data = pd.read_csv('./datasets/hotel_booking_test.csv')
data = data.drop(['email', 'phone-number', 'name'], axis=1)
data = full_pipeline.fit_transform(data)
zeros = np.zeros((1, 8699))
data = np.c_[data, zeros.T]
knn = KNeighborsClassifier(n_neighbors=7, metric='manhattan')
knn.fit(train, targets)
preds = knn.predict(data)
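One detail worth noting in the cells above: calling full_pipeline.fit_transform on the test data refits the OneHotEncoder, so the test matrix can end up with a different number of columns than the training matrix whenever a category appears in only one of the splits (this is why a zero column is appended to the Kaggle test matrix). A sketch of an alternative that keeps the columns aligned; this is a suggestion rather than what the notebook actually ran, and train_raw / test_raw are hypothetical names for the un-transformed split:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Let OneHotEncoder ignore categories it has not seen during fit, so the
# encoded width never changes between splits.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, features_num),
    ("cat", OneHotEncoder(handle_unknown="ignore"), features_cat),
])

train = full_pipeline.fit_transform(train_raw)   # fit on the training split only
test = full_pipeline.transform(test_raw)         # transform only; no refit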
In [ ]:
p = []
cnt = 0
for pred in preds:
    p.append([str(cnt), str(pred)])
    cnt += 1

In [ ]:
import csv

with open('eggs.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(["ID", "target"])
    for i in p:
        spamwriter.writerow(i)

Out[ ]: rendered pipeline diagram (Pipeline: StandardScaler -> PCA)
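Since the submission format is just two columns, the same file could also be written with pandas. This is only a usage sketch: 'submission.csv' is a placeholder file name, and preds is the prediction array from the cell above.

import pandas as pd

# Build the two required columns: "ID" is the row number and "target" is the
# predicted class for that row, preserving the sample order of preds.
submission = pd.DataFrame({"ID": range(len(preds)), "target": preds})
submission.to_csv("submission.csv", index=False)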