Project_3
pdf
keyboard_arrow_up
School
University of California, Los Angeles *
*We aren’t endorsed by this school
Course
M148
Subject
Computer Science
Date
Jan 9, 2024
Type
Pages
1
Uploaded by patilkunal919
Project 3 - Classify your own data
For this project we're going to explore some of the new topics since the last project including Decision Trees and Un-supervised learning. The final part of the project will ask you to perform your own data science project to classify a new dataset.
Submission Details
Project is due June 14th at 11:59 am (Wednesday Afternoon). To submit the project, please save the notebook as a pdf file and submit the assignment via Gradescope. In addition, make sure that all figures are legible and su
ff
iciently large. For best pdf results,
we recommend downloading Latex and print the notebook using Latex.
Loading Essentials and Helper Functions
Example Project using new techniques
Since project 2, we have learned about a few new models for supervised learning(Decision Trees and Neural Networks) and un-supervised learning (Clustering and PCA). In this example portion, we will go over how to implement these techniques using the Sci-kit learn library.
Load and Process Example Project Data
For our example dataset, we will use the Breast Cancer Wisconsin Dataset to determine whether a mass found in a body is benign or malignant. Since this dataset was used as an example in project 2, you should be fairly familiar with it.
Feature Information:
Column 1: ID number
Column 2: Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)
Due to the statistical nature of the test, we are not able to get exact measurements of the previous values. Instead, the dataset contains the mean and standard error of the real-valued features.
Columns 3-12 present the mean of the measured values
Columns 13-22 present the standard error of the measured values
0 285
1 170
Name: diagnosis, dtype: int64
Counts of each class in target_test: 0 72
1 42
Name: diagnosis, dtype: int64
Baseline Accuraccy of using Majority Class: 0.631578947368421
Supervised Learning: Decision Tree
Classification with Decision Tree
Accuracy: 0.894737
Confusion Matrix: [[63 9]
[ 3 39]]
Parameters for Decision Tree Classifier
In Sci-kit Learn, the following are just some of the parameters we can pass into the Decision Tree Classifier:
criterion: {‘gini’, ‘entropy’, ‘log_loss’} default="gini"
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain
splitter: {“best”, “random”}, default=”best”
The strategy used to choose the split at each node. “best” aims to find the best feature split amongst all features. "random" only looks for the best split amongst a random subset of features.
max_depth: int, default = 2 {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
The maximum depth of the tree.
min_samples_split: int or float, default=2
The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each
split.
Visualizing Decision Trees
Scikit-learn allows us to visualize the decision tree to see what features it choose to split and what the result is. Note that if the condition in the node is true, you traverse the left edge of the node. Otherwise, you traverse the right edge.
[Text(0.5, 0.8333333333333334, 'concave points_mean <= 0.011\ngini = 0.468\nsamples = 100.0%\nvalue = [0.626, 0.374]'),
Text(0.25, 0.5, 'area_mean <= 0.124\ngini = 0.101\nsamples = 61.5%\nvalue = [0.946, 0.054]'),
Text(0.125, 0.16666666666666666, '\n (...) \n'),
Text(0.375, 0.16666666666666666, '\n (...) \n'),
Text(0.75, 0.5, 'concavity_mean <= 0.001\ngini = 0.202\nsamples = 38.5%\nvalue = [0.114, 0.886]'),
Text(0.625, 0.16666666666666666, '\n (...) \n'),
Text(0.875, 0.16666666666666666, '\n (...) \n')]
We can even look at the tree in a textual format.
|--- concave points_mean <= 0.01
| |--- area_mean <= 0.12
| | |--- area_se <= 0.04
| | | |--- compactness_mean <= 0.59
| | | | |--- fractal_dimension_se <= -0.83
| | | | | |--- fractal_dimension_se <= -0.84
| | | | | | |--- smoothness_se <= -1.22
| | | | | | | |--- compactness_se <= -0.98
| | | | | | | | |--- class: 0
| | | | | | | |--- compactness_se > -0.98
| | | | | | | | |--- class: 1
| | | | | | |--- smoothness_se > -1.22
| | | | | | | |--- class: 0
| | | | | |--- fractal_dimension_se > -0.84
| | | | | | |--- class: 1
| | | | |--- fractal_dimension_se > -0.83
| | | | | |--- class: 0
| | | |--- compactness_mean > 0.59
| | | | |--- symmetry_se <= 0.20
| | | | | |--- class: 1
| | | | |--- symmetry_se > 0.20
| | | | | |--- class: 0
| | |--- area_se > 0.04
| | | |--- symmetry_mean <= -0.57
| | | | |--- class: 1
| | | |--- symmetry_mean > -0.57
| | | | |--- class: 0
| |--- area_mean > 0.12
| | |--- texture_mean <= -0.72
| | | |--- class: 0
| | |--- texture_mean > -0.72
| | | |--- smoothness_mean <= -1.52
| | | | |--- class: 0
| | | |--- smoothness_mean > -1.52
| | | | |--- class: 1
|--- concave points_mean > 0.01
| |--- concavity_mean <= 0.00
| | |--- fractal_dimension_mean <= -0.83
| | | |--- class: 1
| | |--- fractal_dimension_mean > -0.83
| | | |--- concave points_mean <= 0.11
| | | | |--- concavity_se <= -0.35
| | | | | |--- class: 1
| | | | |--- concavity_se > -0.35
| | | | | |--- class: 0
| | | |--- concave points_mean > 0.11
| | | | |--- class: 0
| |--- concavity_mean > 0.00
| | |--- fractal_dimension_se <= 2.39
| | | |--- smoothness_se <= 1.87
| | | | |--- radius_se <= -0.77
| | | | | |--- class: 0
| | | | |--- radius_se > -0.77
| | | | | |--- concave points_se <= 2.59
| | | | | | |--- class: 1
| | | | | |--- concave points_se > 2.59
| | | | | | |--- radius_se <= 0.61
| | | | | | | |--- class: 0
| | | | | | |--- radius_se > 0.61
| | | | | | | |--- class: 1
| | | |--- smoothness_se > 1.87
| | | | |--- concave points_se <= 1.03
| | | | | |--- class: 1
| | | | |--- concave points_se > 1.03
| | | | | |--- class: 0
| | |--- fractal_dimension_se > 2.39
| | | |--- perimeter_mean <= 0.59
| | | | |--- class: 0
| | | |--- perimeter_mean > 0.59
| | | | |--- class: 1
Feature Importance in Decision Trees
Decision Trees can also assign importance to features by measuring the average decrease in impurity (i.e. information gain) for each feature. The features with higher decreases are treated as more important.
<Axes: >
We can clearly see that "concave points_mean" has the largest importance due to it providing the most reduction in the impurity.
Visualizing decision boundaries for Decision Trees
Similar to project 2, lets see what decision boundaries that a Decision Tree creates. We use the two most correlated features to the target labels: concave_points_mean and perimeter_mean.
We can see that the model gets more and more complex with increasing depth until it converges somewhere in between depth 10 and 15.
Supervised Learning: Multi-Layer Perceptron (MLP)
A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks are very powerful tools that are used a in a variety of applications including image
and speech processing. In class, we have discussed one of the earliest types of neural networks known as a Multi-Layer Perceptron.
steps
Using MLP for classification
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:686: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (400) reached and the optimization hasn't c
onverged yet.
warnings.warn(
Accuracy: 0.929825
Confusion Matrix: [[66 6]
[ 2 40]]
Parameters for MLP Classifier
In Sci-kit Learn, the following are just some of the parameters we can pass into MLP Classifier:
hidden_layer_sizes: tuple, length = n_layers - 2, default=(100,)
The ith element represents the number of neurons in the ith hidden layer.
activation: {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default=’relu’
Activation function for the hidden layer.
alpha: float, default = 0.0001
Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss.
max_iter: int, default=200
Maximum number of iterations taken for the solvers to converge.
Visualizing decision boundaries for MLP
Now, lets see how the decision boundaries change as a function of both the activation function and the number of hidden layers.
Unsupervised learning: PCA
As shown in lecture, PCA is a valuable dimensionality reduction tool that can extract a small subset of valuable features. In this section, we shall demonstrate how PCA can extract important visual features from pictures of subjects faces. We shall use the AT&T Database of
Faces
. This dataset contains 40 different subjects with 10 samples per subject which means we a dataset of 400 samples.
We extract the images from the scikit-learn dataset library
. The library imports the images (faces.data), the flatten array of images (faces.images), and which subject eacj image belongs to (faces.target). Each image is a 64 by 64 image with pixels converted to floating point
values in [0,1].
Eigenfaces
The following codes downloads and loads the face data.
Flattened Face Data shape: (400, 4096)
Face Image Data Shape: (400, 64, 64)
Shape of target data: (400,)
Now, let us see what features we can extract from these face images.
The following plots the top 30 PCA components with how much variance does this feature explain.
Amazing! We can see that the model has learned to focus on many features that we as humans also look at when trying to identify a face such as the nose,eyes, eyebrows, etc.
With this feature extraction, we can perform much more powerful learning.
Feature Extraction for Classification
Lets see if we can use PCA to improve the accuracy of the decision tree classifier.
Accuracy without PCA
Accuracy: 0.894737
Confusion Matrix: [[63 9]
[ 3 39]]
Accuracy with PCA
Accuracy: 0.912281
Confusion Matrix: [[66 6]
[ 4 38]]
Number of Features without PCA: 20
Number of Features with PCA: 7
Clearly, we get a much better accuracy for the model while using fewer features. But does the features the PCA thought were important the same features that the decision tree used. Lets look at the feature importance of the tree. The following plot numbers the first principal
component as 0, the second as 1, and so forth.
<Axes: >
Amazingly, the first and second components were the most important features in the decision tree. Thus, we can claim that PCA has significantly improved the performance of our model.
Unsupervised learning: Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. One major algorithm we learned in class is the K-Means algorithm.
Evaluating K-Means performance
While there are many ways to evaluate the performance measure of clustering algorithsm
, we will focus on the inertia score of the K-Means model. Inertia is another term for the sum of squared distances of samples to their closest cluster center.
Let us look at how the Inertia changes as a function of the number of clusters for an artificial dataset.
<matplotlib.collections.PathCollection at 0x7f8bd234cee0>
Inertia for K = 2: 13293.997460961546
Inertia for K = 3: 7169.578996856773
Inertia for K = 4: 3247.8674040695832
Inertia for K = 5: 872.8554968701878
Inertia for K = 6: 803.846686425823
Inertia for K = 7: 739.5236191503768
Inertia for K = 8: 690.2530283275607
Inertia for K = 9: 614.5138307338655
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
From the plot, we can see that when the number of clusters of K-means is the correct number of clusters, Inertia starts decreasing at a much slower rate. This creates a kind of elbow shape in the graph. For K-means clustering, the elbow method selects the number of clusters
where the elbow shape is formed. In this case, we see that this method would produce the correct number of clusters.
Lets try it on the cancer dataset.
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Inertia for K = 2: 6381.278325955922
Inertia for K = 3: 5508.621446593709
Inertia for K = 4: 4972.231721973118
Inertia for K = 5: 4507.26713736607
Inertia for K = 6: 4203.777246823878
Inertia for K = 7: 3942.659550896411
Inertia for K = 8: 3745.1124228292692
Inertia for K = 9: 3532.7225156022073
Inertia for K = 10: 3371.033467027838
Inertia for K = 11: 3232.472758070737
Inertia for K = 12: 3135.1944201924534
Inertia for K = 13: 3033.3838427786477
Inertia for K = 14: 2958.3200036360367
Inertia for K = 15: 2893.798763511904
Inertia for K = 16: 2767.804761705547
Inertia for K = 17: 2737.4747101790635
Inertia for K = 18: 2662.1203080706655
Inertia for K = 19: 2617.90890694005
Inertia for K = 20: 2553.961378449726
Inertia for K = 21: 2491.9133737078346
Inertia for K = 22: 2448.777623600997
Inertia for K = 23: 2391.644588540416
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Inertia for K = 24: 2374.1345787190176
Inertia for K = 25: 2334.794010981073
Inertia for K = 26: 2267.993521706617
Inertia for K = 27: 2233.585453239129
Inertia for K = 28: 2191.739402693569
Inertia for K = 29: 2165.254207641313
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Here we see that the elbow is not as cleanly defined. This may be due to the dataset not being a good fit for K-means. Regardless, we can still apply the elbow method by noting that the slow down happens around 7~14.
Kmeans on Eigenfaces
Now, lets see how K-means performs in clustering the face data with PCA.
[3 6 3 4 6 4 3 3 3 6 5 5 5 5 5 5 5 5 5 5 1 1 5 1 4 1 4 4 6 6 5 5 5 3 6 4 3
5 5 6 4 1 1 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 7 7 3 4 7 3 7 7 3 7 0 6 3 6
3 3 6 3 3 6 1 1 1 4 4 4 4 4 1 6 6 6 6 6 6 6 6 6 4 3 0 0 0 0 0 0 0 0 0 0 4
1 1 1 1 4 1 6 6 4 5 5 4 4 5 5 4 4 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 8 3 3 3 3 8 6 8 3 3 4 4 1 1 4 4 4 4 4 4 3 6 4 6 3 3 3 3 3 3 7 7 7 7 7
7 7 7 7 7 9 9 9 4 4 4 4 4 4 9 9 9 9 9 9 9 4 8 9 4 2 2 2 2 2 2 2 2 2 2 3 6
1 4 1 4 1 6 4 4 8 8 8 8 5 8 8 8 8 8 6 5 6 5 5 5 6 4 5 6 1 1 1 1 1 1 3 1 1
5 5 5 5 5 5 5 5 5 5 5 5 1 5 5 5 5 5 5 1 4 2 2 2 9 4 4 9 8 2 2 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 7 7 7 7 7 7 5 7 7 7 9 9 9 9 9 9 9 9 9 9 2 2 2
2 2 2 2 2 2 2 9 9 9 9 4 6 6 1 4 4 3 8 8 8 7 8 8 8 8 8 1 1 1 1 1 1 1 1 1 1
4 1 1 6 1 4 6 6 4 1 2 2 2 2 2 2 2 2 2 2 6 4 3 4 3 1 4 1 4 4]
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
While the algorithm isn't perfect, we can see that K-means with PCA is picking up on some facial similarity or similar expressions.
(100 pts) Todo: Use new methods to classify heart disease
To compare how these new models perform with the other models discussed in the course, we will apply these new models on the heart disease dataset that was used in project 2.
Background: The Dataset (Recap)
For this exercise we will be using a subset of the UCI Heart Disease dataset, leveraging the fourteen most commonly used attributes. All identifying information about the patient has been scrubbed. You will be asked to classify whether a patient is suffering from heart disease
based on a host of potential medical factors.
The dataset includes 14 columns. The information provided by each column is as follows:
age:
Age in years
sex:
(1 = male; 0 = female)
cp:
Chest pain type (0 = asymptomatic; 1 = atypical angina; 2 = non-anginal pain; 3 = typical angina)
trestbps:
Resting blood pressure (in mm Hg on admission to the hospital)
chol:
cholesterol in mg/dl
fbs
Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg:
Resting electrocardiographic results (0= showing probable or definite left ventricular hypertrophy by Estes' criteria; 1 = normal; 2 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV))
thalach:
Maximum heart rate achieved
exang:
Exercise induced angina (1 = yes; 0 = no)
oldpeak:
Depression induced by exercise relative to rest
slope:
The slope of the peak exercise ST segment (0 = downsloping; 1 = flat; 2 = upsloping)
ca:
Number of major vessels (0-3) colored by flourosopy
thal:
1 = normal; 2 = fixed defect; 7 = reversable defect
sick:
Indicates the presence of Heart disease (True = Disease; False = No disease)
Preprocess Data
This part is done for you since you would have already completed it in project 2. Use the train, target, test, and target_test for all future parts. We also provide the column names for each transformed column for future use.
Column names after transformation by pipeline: ['num__age' 'num__trestbps' 'num__chol' 'num__thalach' 'num__oldpeak'
'cat__sex_0' 'cat__sex_1' 'cat__cp_0' 'cat__cp_1' 'cat__cp_2' 'cat__cp_3'
'cat__fbs_0' 'cat__fbs_1' 'cat__restecg_0' 'cat__restecg_1'
'cat__restecg_2' 'cat__exang_0' 'cat__exang_1' 'cat__slope_0'
'cat__slope_1' 'cat__slope_2' 'cat__ca_0' 'cat__ca_1' 'cat__ca_2'
'cat__ca_3' 'cat__ca_4' 'cat__thal_0' 'cat__thal_1' 'cat__thal_2'
'cat__thal_3']
The following shows the baseline accuracy of simply classifying every sample as the majority class.
Counts of each class in target_test: 0 66
1 56
Name: target, dtype: int64
Baseline Accuraccy of using Majority Class: 0.5409836065573771
(25 pts) Decision Trees
[5 pts] Apply Decision Tree on Train Data
Apply the decision tree on the train data
with default parameters of the DecisionTreeClassifier. Report the accuracy and print the confusion matrix
. Make sure to use random_state = 0 so that your results match ours.
Accuracy: 0.696721
Confusion Matrix: [[53 13]
[24 32]]
[5 pts] Visualize the Decision Tree
Visualize the first two layers of the decision tree that you trained.
[Text(0.5, 0.8333333333333334, 'cat__cp_0 <= 0.5\ngini = 0.496\nsamples = 100.0%\nvalue = [0.547, 0.453]'),
Text(0.25, 0.5, 'num__chol <= 0.223\ngini = 0.283\nsamples = 48.6%\nvalue = [0.83, 0.17]'),
Text(0.125, 0.16666666666666666, '\n (...) \n'),
Text(0.375, 0.16666666666666666, '\n (...) \n'),
Text(0.75, 0.5, 'cat__ca_0 <= 0.5\ngini = 0.403\nsamples = 51.4%\nvalue = [0.28, 0.72]'),
Text(0.625, 0.16666666666666666, '\n (...) \n'),
Text(0.875, 0.16666666666666666, '\n (...) \n')]
What is the gini index improvement of the first split?
The gini index of the first split is .486(.283) + .514(.403) = 0.34468. This is a gini index improvement of .496-0.34468 = 0.15132.
[5 pts] Plot the importance of each feature for the Decision Tree
<Axes: >
How many features have non-zero importance for the Decision Tree? If we remove the features with zero importance, will it change the decision tree for the same sampled dataset?
We have 16 features with non-zero importance. If we remoev features with zero importance, it does not change the decision tree for the sampled dataset.
[10 pts] Optimize Decision Tree
While the default Decision Tree performs fairly well on the data, lets see if we can improve performance by optimizing the parameters.
Run a GridSearchCV with 3-Fold Cross Validation for the Decision Tree. Find the best model parameters amongst the following:
max_depth = [1,2,3,4,5,10,15]
min_samples_split = [2,4,6,8]
criterion = ["gini", "entropy"]
After using GridSearchCV, print the best model parameters and the best score.
param_max_depth
param_min_samples_split
param_criterion
mean_test_score
39
3
8
entropy
0.798851
Using the best model you have, report the test accuracy and print out the confusion matrix
Accuracy: 0.786885
Confusion Matrix: [[62 4]
[22 34]]
(20 pts) Multi-Layer Perceptron
[5 pts] Applying a Multi-Layer Perceptron
Apply the MLP on the train data
with hidden_layer_sizes=(100,100) and max_iter = 800. Report the accuracy and print the confusion matrix
. Make sure to set random_state=0.
Accuracy: 0.819672
Confusion Matrix: [[63 3]
[19 37]]
[10 pts] Speedtest between Decision Tree and MLP
Let us compare the training times and prediction times of a Decision Tree and an MLP. Time how long it takes for a Decision Tree and an MLP to perform a .fit operation (i.e. training the model). Then, time how long it takes for a Decision Tree and an MLP to perform
a .predict operation (i.e. predicting the testing data). Print out the timings and specify which model was quicker for each operation.
We recommend using the time python module to time your code. An example of the time module was shown in project 2. Use the default
Decision Tree Classifier and the MLP with the previously mentioned parameters.
Decision Tree Training Time : 0.0023469924926757812
MLP Training Time : 0.41370391845703125
Decision Tree Prediction Time : 0.00042891502380371094
MLP Prediction Time : 0.00022602081298828125
[5 pts] Compare and contrast Decision Trees and MLPs.
Describe at least one advantage and disadvantage of using an MLP over a Decision Tree.
An MLP classifier has a higher test accuracy, so it can better fit data and generalize to complex relationships. However, it is much slower to train than a decision tree.
(35 pts) PCA
[5 pts] Transform the train data using PCA
Train a PCA model to project the train data on the top 10 components. Print out the 10 principal components
. Look at the documentation of PCA for reference.
[ 0.06099466 0.04034864 0.01924581 -0.1017322 0.11071541 -0.12331434
0.12331434 0.34265332 -0.13458918 -0.20936122 0.00129708 -0.01288493
0.01288493 0.19693729 -0.19855631 0.00161902 -0.35054419 0.35054419
0.04595587 0.29412324 -0.34007911 -0.20553518 0.07463263 0.08348053
0.06758334 -0.02016132 -0.00039038 0.04438528 -0.31408112 0.27008622]
[ 0.05231789 0.02890251 0.03826504 -0.00733246 -0.0037285 0.44442215
-0.44442215 0.07362246 -0.03171478 -0.02860787 -0.01329982 -0.02106535
0.02106535 0.42589374 -0.44447634 0.0185826 0.0184908 -0.0184908
-0.02183111 0.1435457 -0.1217146 0.02011775 -0.03199655 0.03699266
0.00382438 -0.02893824 0.00336806 0.00301279 0.29007107 -0.29645192]
[-0.0427616 -0.03742143 0.00354063 -0.04733566 0.01801283 0.30699875
-0.30699875 0.09347247 0.0329172 -0.0989118 -0.02747786 0.19990151
-0.19990151 -0.43118048 0.40996579 0.0212147 -0.13223456 0.13223456
-0.02514842 0.32090865 -0.29576023 0.29149476 -0.18165312 -0.05235916
-0.0472252 -0.01025728 0.00355497 -0.03312684 0.02754858 0.00202329]
[-0.01085267 0.05128855 0.02043706 0.04685242 0.03207741 -0.02899318
0.02899318 -0.03499918 0.0751887 -0.14722255 0.10703303 0.07444483
-0.07444483 0.19169119 -0.19398602 0.00229484 0.37381386 -0.37381386
0.04159629 0.09054175 -0.13213804 0.38469377 -0.41161818 0.00135049
0.04404514 -0.01847122 0.00225008 0.04153747 -0.36627833 0.32249078]
[ 0.04627938 0.01970448 -0.00381961 -0.03509902 0.00213482 -0.0641167
0.0641167 -0.40342028 -0.0378483 0.32955899 0.11170959 -0.25422537
0.25422537 -0.06936845 0.06174935 0.0076191 0.1794727 -0.1794727
-0.0691121 0.4997768 -0.4306647 -0.17536448 0.15450628 -0.00161821
-0.00409491 0.02657132 0.00376026 0.04672977 0.00227974 -0.05276977]
[-0.06836847 -0.02106313 -0.0364071 0.00576781 -0.04376639 -0.35938801
0.35938801 -0.00508255 0.05584308 -0.1221728 0.07141227 -0.10049637
0.10049637 0.0879698 -0.09144775 0.00347796 -0.18119896 0.18119896
0.00840746 0.11524555 -0.12365301 0.48997794 -0.22229128 -0.15548202
-0.09522074 -0.0169839 0.00997476 0.08201132 0.30430633 -0.39629241]
[-0.01781269 0.0730869 0.02016699 0.01826179 0.02836503 0.18963204
-0.18963204 -0.22891469 -0.09804043 0.32499933 0.00195579 -0.36267785
0.36267785 0.02767142 -0.01728665 -0.01038476 -0.29600786 0.29600786
0.06466713 -0.18871682 0.12404969 0.31270592 -0.17972228 -0.08238154
-0.01808524 -0.03251686 0.03177332 -0.09353408 -0.20218982 0.26395057]
[ 0.04335226 0.05421452 -0.03470438 -0.01533346 0.01850783 0.06961587
-0.06961587 0.33466825 0.11123006 -0.41407483 -0.03182348 -0.47183824
0.47183824 -0.18841507 0.1853052 0.00310987 0.14171697 -0.14171697
0.01938161 -0.04763557 0.02825396 -0.11728067 -0.16652438 0.21168304
0.0921391 -0.02001708 0.0110368 0.12414356 -0.00323016 -0.13195021]
[-0.05976132 -0.03693706 0.00661316 0.05615859 -0.04806144 0.07173048
-0.07173048 -0.33252042 0.6471931 -0.443929 0.12925632 -0.05004393
0.05004393 0.07308993 -0.03351071 -0.03957922 -0.08972835 0.08972835
-0.00474533 0.01650628 -0.01176095 -0.00075448 0.33237063 -0.22375775
-0.07770358 -0.03015483 -0.01306916 -0.11556595 -0.02840801 0.15704312]
[-0.03956388 0.01449879 -0.0018639 0.07345678 0.09644551 -0.0121838
0.0121838 -0.36630173 0.20777859 -0.01802986 0.176553 0.10857767
-0.10857767 0.01552185 -0.00336271 -0.01215914 -0.16815714 0.16815714
-0.03179433 -0.0103539 0.04214823 -0.22392515 -0.41077763 0.66321302
-0.08581595 0.05730571 -0.01300773 0.11220997 -0.00547824 -0.09372401]
[5 pts] Percentage of variance explained by top 10 principal components
Using PCA's "explained_variance_ratio_", print the percentage of variance explained by the top 10 principal components.
[0.23862094 0.1360394 0.10034179 0.08239361 0.07495304 0.06591197
0.05919248 0.04935616 0.0404145 0.0299425 ]
[5 pts] Transform the train and test data into train_pca and test_pca using PCA
Note: Use fit_transform for train and transform for test
[5 pts] PCA+Decision Tree
Train the default Decision Tree Classifier using train_pca. Report the accuracy using test_pca and print the confusion matrix
.
Accuracy with PCA
Accuracy: 0.762295
Confusion Matrix: [[53 13]
[16 40]]
Does the model perform better with or without PCA?
The model performs better with PCA.
[5 pts] PCA+MLP
Train the MLP classifier with the same parameters as before using train_pca. Report the accuracy using test_pca and print the confusion matrix
.
Accuracy with PCA
Accuracy: 0.778689
Confusion Matrix: [[60 6]
[21 35]]
Does the model perform better with or without PCA?
This model performed better without PCA.
[10 pts] Pros and Cons of PCA
In your own words, provide at least two pros and at least two cons for using PCA
Response:
Two pros of using PCA is that it decreases the amount of raw data you have significantly which can increase training speed, and also it helps us visualize high-dimensional features much better by translating them into lower dimensions. Two cons are that we may get a lower
accuracy since we are effectively losing information about some of the features, and also it may be hard to interpret components.
(20 pts) K-Means Clustering
[5 pts] Apply K-means to the train data and print out the Inertia score
Use n_cluster = 5 and random_state = 0.
491.0665663612592
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
[10 pts] Find the optimal cluster size using the elbow method.
Use the elbow method to find the best cluster size or range of best cluster sizes for the train data. Check the cluster sizes from 2 to 20. Make sure to plot the Inertia and state where you think the elbow starts. Make sure to use random_state = 0.
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Inertia for K = 2: 619.2596852490838
Inertia for K = 3: 562.2941749488493
Inertia for K = 4: 515.3501104402982
Inertia for K = 5: 491.0665663612592
Inertia for K = 6: 458.3449062857246
Inertia for K = 7: 436.46731311592123
Inertia for K = 8: 427.64243132453544
Inertia for K = 9: 409.3453854307658
Inertia for K = 10: 393.8362013824142
Inertia for K = 11: 375.627142914177
Inertia for K = 12: 366.872574191804
Inertia for K = 13: 356.0704428612708
Inertia for K = 14: 353.0506627827143
Inertia for K = 15: 342.96648926542787
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Inertia for K = 16: 335.8288846772815
Inertia for K = 17: 325.33654094415067
Inertia for K = 18: 310.3448076066395
Inertia for K = 19: 306.7095752890683
Inertia for K = 20: 295.92886291975304
It looks like the elbow starts at k=13.
[5 pts] Find the optimal cluster size for the train_pca data
Repeat the same experiment but use train_pca instead of train.
Inertia for K = 2: 526.8986473659902
Inertia for K = 3: 469.48656326630146
Inertia for K = 4: 423.2161135203968
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Inertia for K = 5: 403.5879123750675
Inertia for K = 6: 371.0784572148916
Inertia for K = 7: 350.3642628743978
Inertia for K = 8: 330.0476150629754
Inertia for K = 9: 314.69189201646253
Inertia for K = 10: 298.74474399366363
Inertia for K = 11: 288.50442948032435
Inertia for K = 12: 277.3547969513997
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to sup
press the warning
warnings.warn(
Inertia for K = 13: 266.20679282269055
Inertia for K = 14: 258.1239469563823
Inertia for K = 15: 250.34211491204923
Inertia for K = 16: 234.46445989492173
Inertia for K = 17: 231.17570259556672
Inertia for K = 18: 221.78373427947042
Inertia for K = 19: 220.04830474526796
Inertia for K = 20: 211.59508812745764
Notice that the inertia is much smaller for every cluster size when using PCA features. Why do you think this is happening? Hint: Think about what Inertia is calculating and consider the number of features that PCA outputs.
Since there are less features in the PCA data, there is less distance between the clusters and each data point since we are in a lower dimensional space. Thus, the inertia is less.
(100 pts) Putting it all together
Through all the homeworks and projects, you have learned how to apply many different models to perform a supervised learning task. We are now asking you to take everything that you learned to create a model that can predict whether a hotel reservation will be canceled or
not.
Context
Hotels see millions of people every year and always wants to keep rooms occupied and payed for. Cancellations make the business lose money since it may make it difficult to reserve to another customer on such short notice. As such, it is useful for a hotel to know whether a
reservation is likely to cancel or not. The following dataset will provide a variety of information about a booking that you will use to predict whether that booking will cancel or not.
Property Management System - PMS
Attribute Information
(C) is for Categorical
(N) is for Numeric
1) is_canceled (C) : Value indicating if the booking was canceled (1) or not (0).
2) hotel (C) : The datasets contains the booking information of two hotel. One of the hotels is a resort hotel and the other is a city hotel.
3) arrival_date_month (C): Month of arrival date with 12 categories: “January” to “December”
4) stays_in_weekend_nights (N): Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
5) stays_in_week_nights (N): Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel BO and BL/Calculated by counting the number of week nights
6) adults (N): Number of adults
7) children (N): Number of children
8) babies (N): Number of babies
9) meal (C): Type of meal
10) country (C): Country of origin.
11) previous_cancellations (N): Number of previous bookings that were canceled by the customer prior to the current booking
12) previous_bookings_not_canceled (N) : Number of previous bookings not canceled by the customer prior to the current booking
13) reserved_room_type (C): Code of room type reserved. Code is presented instead of designation for anonymity reasons
14) booking_changes (N) : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
15) deposit_type (C) : No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay
16) days_in_waiting_list (N): Number of days the booking was in the waiting list before it was confirmed to the customer
17) customer_type (C): Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
18) adr (N): Average Daily Rate (Calculated by dividing the sum of all lodging transactions by the total number of staying nights)
19) required_car_parking_spaces (N): Number of car parking spaces required by the customer
20) total_of_special_requests (N): Number of special requests made by the customer (e.g. twin bed or high floor)
21) name (C): Name of the Guest (Not Real)
22) email (C): Email (Not Real)
23) phone-number (C): Phone number (not real)
This dataset is quite large with 86989 samples. This makes it difficult to just brute force running a lot of models. As such, you have to be thoughtful when designing your models.
The file name for the training data is "hotel_booking.csv".
Challenge
This project is about being able to predict whether a reservation is likely to cancel based on the input parameters available to us. We will ask you to perform some specific instructions to lead you in the right direction but you are given free reign on which models to use and the
preprocessing steps you make. We will ask you to write out a description of what models you choose and why you choose them
.
(50 pts) Preprocessing
Preprocessing:
For the dataset, the following are mandatory pre-processing steps for your data:
Use One-Hot Encoding on all categorical features (specify whether you keep the extra feature or not for features with multiple values)
Determine which fields need to be dropped
Handle missing values (Specify your strategy)
Rescale the real valued features using any strategy you choose (StandardScaler, MinMaxScaler, Normalizer, etc)
Augment at least one feature
Implement a train-test split with 20% of the data going to the test data. Make sure that the test and train data are balanced in terms of the desired class.
After writing your preprocessing code, write out a description of what you did for each step and provide a justification for your choices. All descriptions should be written in the markdown cells of the jupyter notebook. Make sure your writing is clear and professional.
We highly recommend reading through the scikit-learn documentation to make this part easier.
hotel
lead_time
arrival_date_month
stays_in_weekend_nights
stays_in_week_nights
adults
children
babies
meal
country
previous_cancellations
previous_bookings_not_canceled
reserved_room_type
booking_changes
deposit_type
days_in_waiting_list
0
Resort
Hotel
4
February
1
2
2
0.0
0
FB
ESP
0
0
A
0
No Deposit
0
1
City
Hotel
172
June
0
2
1
0.0
0
BB
PRT
0
0
A
0
No Deposit
0
2
City
Hotel
4
November
2
1
1
0.0
0
BB
PRT
0
0
A
0
No Deposit
0
3
City
Hotel
68
September
0
2
2
0.0
0
HB
PRT
0
0
A
0
No Deposit
0
4
City
Hotel
149
July
2
4
3
0.0
0
BB
DEU
0
0
D
0
No Deposit
0
I implemented several different strategies here. Firstly, I dropped any rows with null values so we don't have to deal with them and I also dropped irrelevant columns such as name, email, and phone number. Secondly, I applied a one-hot encoding to all categorical featurse and I
used a standard scalar for all the numerical features. Additionally, I augmented a new feature called total_guests which takes the sum of children and adults and babies.
(50 pts) Try out a few models
Now that you have pre-processed your data, you are ready to try out different models.
For this part of the project, we want you to experiment with all the different models demonstrated in the course to determine which one performs best on the dataset.
You must perform classification using at least 3 of the following models:
Logistic Regression
K-nearest neighbors
SVM
Decision Tree
Multi-Layer Perceptron
Due to the size of the dataset, be careful which models you use and look at their documentation to see how you should tackle this size issue for each model.
For full credit, you must perform some hyperparameter optimization on your models of choice. You may find the following scikit-learn library on hyperparameter optimization useful.
For each model chosen, write a description of which models were chosen, which parameters you optimized, and which parameters you choose for your best model. While the previous part of the project asked you to pre-process the data in a specific manner, you may alter pre-
processing step as you wish to adjust for your chosen classification models.
First, let's try Logisitc Regression!
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'`has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'`has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'`has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'`has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'`has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
warnings.warn(
/Users/kunalpatil/anaconda3/envs/m148/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1173: FutureWarning: `penalty='none'`has been deprecated in 1.2 and will be removed in 1.4. To keep the past behaviour, set `penalty=None`.
warnings.warn(
mean_fit_time
std_fit_time
mean_score_time
std_score_time
param_C
param_penalty
param_solver
params
split0_test_score
split1_test_score
split2_test_score
mean_test_score
std_test_score
rank_test_score
3
0.314936
0.011435
0.001873
0.000020
1
l2
liblinear
{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
0.810940
0.811889
0.803171
0.808667
0.003905
1
2
0.989980
0.067458
0.001791
0.000018
1
l2
lbfgs
{'C': 1, 'penalty': 'l2', 'solver': 'lbfgs'}
0.810892
0.811889
0.803123
0.808635
0.003919
2
6
1.099155
0.229325
0.002223
0.000400
1
none
lbfgs
{'C': 1, 'penalty': 'none', 'solver': 'lbfgs'}
0.810653
0.811985
0.803219
0.808619
0.003857
3
5
0.396446
0.046772
0.002154
0.000455
100
l2
liblinear
{'C': 100, 'penalty': 'l2', 'solver': 'libline...
0.810701
0.811937
0.803171
0.808603
0.003874
4
7
1.973559
0.788458
0.001831
0.000045
1
none
newton-cg
{'C': 1, 'penalty': 'none', 'solver': 'newton-...
0.810701
0.811889
0.803219
0.808603
0.003838
4
4
1.032954
0.146655
0.002191
0.000330
100
l2
lbfgs
{'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
0.810653
0.811937
0.803171
0.808587
0.003865
6
1
0.097108
0.003048
0.001777
0.000017
0.01
l2
liblinear
{'C': 0.01, 'penalty': 'l2', 'solver': 'liblin...
0.810461
0.809686
0.801495
0.807214
0.004057
7
0
0.218072
0.016401
0.002211
0.000210
0.01
l2
lbfgs
{'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}
0.810365
0.809590
0.801447
0.807134
0.004034
8
For the first model, I chose logistic regression. This is a linear-decision boundary model that works by applying the sigmoid function to predict probabilites that a certain input belongs to class 1. Based on my gridsearch hyperparameter optimization for logsitic regression, the best
values to use for logistic regression are C=1, penality=l2, and solver=liblinear. This resulted in a mean test score of .809.
Second, let's try KNN!
mean_fit_time
std_fit_time
mean_score_time
std_score_time
param_metric
param_n_neighbors
params
split0_test_score
split1_test_score
split2_test_score
mean_test_score
std_test_score
rank_test_score
7
0.004696
0.000130
6.759436
0.012030
manhattan
7
{'metric': 'manhattan', 'n_neighbors': 7}
0.835992
0.830188
0.831050
0.832410
0.002557
1
6
0.004820
0.000088
6.788852
0.022589
manhattan
5
{'metric': 'manhattan', 'n_neighbors': 5}
0.832926
0.829038
0.833733
0.831899
0.002050
2
3
0.004869
0.000051
1.857856
0.011214
euclidean
7
{'metric': 'euclidean', 'n_neighbors': 7}
0.828903
0.829996
0.829469
0.829456
0.000447
3
2
0.004837
0.000042
1.845561
0.006627
euclidean
5
{'metric': 'euclidean', 'n_neighbors': 5}
0.830819
0.827984
0.828703
0.829169
0.001203
4
5
0.005099
0.000189
6.720227
0.035194
manhattan
3
{'metric': 'manhattan', 'n_neighbors': 3}
0.830148
0.826643
0.829373
0.828721
0.001503
5
1
0.004716
0.000026
1.834653
0.009913
euclidean
3
{'metric': 'euclidean', 'n_neighbors': 3}
0.827514
0.825925
0.828176
0.827205
0.000945
6
4
0.004715
0.000144
6.547981
0.066679
manhattan
1
{'metric': 'manhattan', 'n_neighbors': 1}
0.827466
0.821853
0.819985
0.823101
0.003179
7
0
0.004926
0.000311
1.806952
0.003616
euclidean
1
{'metric': 'euclidean', 'n_neighbors': 1}
0.824735
0.820128
0.819314
0.821393
0.002387
8
For my second model, I chose KNN. This model works by classifying data points based on the closest neighbors of the data point in euclidean space. I optimized the parameters of teh value of K (number of neighbors), as well as using the metric of either euclidean or manhattan
distance. The best parameters to use were manhattan distance and 7 neighbors.
Now, let's try using a Decision Tree Classiifer!
mean_fit_time
std_fit_time
mean_score_time
std_score_time
param_criterion
param_max_depth
params
split0_test_score
split1_test_score
split2_test_score
mean_test_score
std_test_score
rank_test_score
7
0.073220
0.000809
0.002477
0.000019
entropy
7
{'criterion': 'entropy', 'max_depth': 7}
0.813000
0.813614
0.806476
0.811030
0.003230
1
3
0.080363
0.000050
0.002467
0.000044
gini
7
{'criterion': 'gini', 'max_depth': 7}
0.809216
0.809877
0.805087
0.808060
0.002119
2
2
0.060638
0.000098
0.002317
0.000022
gini
5
{'criterion': 'gini', 'max_depth': 5}
0.787134
0.789567
0.785447
0.787383
0.001691
3
6
0.056366
0.000779
0.002288
0.000027
entropy
5
{'criterion': 'entropy', 'max_depth': 5}
0.789098
0.788417
0.784010
0.787175
0.002255
4
1
0.039469
0.000217
0.002041
0.000017
gini
3
{'criterion': 'gini', 'max_depth': 3}
0.784404
0.782717
0.777783
0.781635
0.002809
5
5
0.038518
0.000544
0.002064
0.000029
entropy
3
{'criterion': 'entropy', 'max_depth': 3}
0.761556
0.757521
0.754742
0.757940
0.002797
6
0
0.019083
0.002645
0.002249
0.000440
gini
1
{'criterion': 'gini', 'max_depth': 1}
0.761029
0.757042
0.754503
0.757524
0.002686
7
4
0.017222
0.000044
0.001930
0.000018
entropy
1
{'criterion': 'entropy', 'max_depth': 1}
0.761029
0.757042
0.754503
0.757524
0.002686
7
The decision tree classiifer is a tree that splits the data at each level based on a certain feature, until each leaf node becomes purely of one class or there are no more features. I used this classifier and optimized the parameteres of criterion and max_depth and found the
optimal set of parameters to be using entropy for the criterion and a max_depth of 7.
As it turns out, the best classifier that I tested was the KNN with a mean accuracy of .83599.
Extra Credit
We have provided an extra test dataset named "hotel_booking_test.csv" that does not have the target labels. Classify the samples in the dataset with your best model and write them into a csv file. Submit your csv file to our Kaggle contest. The website will specify your
classification accuracy on the test set. We will award a bonus point for the project for every percentage point over 75% that you get on your kaggle test accuracy.
To get the bonus points, you must also write out a summary of the model that you submit including any changes you made to the pre-processing steps. The summary must be written in a markdown cell of the jupyter notebook. Note that you should not change earlier parts of the
project to complete the extra credit.
Kaggle Submission Instruction
Submit a two column csv where the first column is named "ID" and is the row number. The second column is named "target" and is the classification for each sample. Make sure that the sample order is preserved.
(78287, 55) (8699, 55)
The model I used was a KNN with 7 neighbors and the manhattan distance metric, which was the optimal one from my GridSearch. Additionally, I had to process the test data to add one more row since my pre-processing resulted in more columns for the training data than the
test data, likely due to the one-hot-encodings.
In [ ]:
#Here are a set of libraries we imported to complete this assignment. #Feel free to use these or equivalent libraries for your implementation
import
numpy as
np # linear algebra
import
pandas as
pd # data processing, CSV file I/O (e.g. pd.read_csv)
import
matplotlib.pyplot as
plt # this is used for the plot the graph import
matplotlib
import
os
import
time
#Sklearn classes
from
sklearn.model_selection import
train_test_split
, cross_val_score
, GridSearchCV
, KFold
from
sklearn import
metrics
from
sklearn.metrics import
confusion_matrix
,
silhouette_score
import
sklearn.metrics.cluster as
smc
from
sklearn.cluster import
KMeans
from
sklearn.tree import
DecisionTreeClassifier
from
sklearn.pipeline import
Pipeline
, FeatureUnion
from
sklearn.preprocessing import
StandardScaler
, OneHotEncoder ,
LabelEncoder
, MinMaxScaler
from
sklearn.compose import
ColumnTransformer
, make_column_transformer
from
sklearn import
datasets
from
sklearn.decomposition import
PCA
from
sklearn.neural_network import
MLPClassifier
from
sklearn.datasets import
make_blobs
from
matplotlib import
pyplot
import
itertools
%
matplotlib
inline
#Sets random seed
import
random random
.
seed
(
42
) In [ ]:
#Helper functions
def
draw_confusion_matrix
(
y
, yhat
, classes
):
'''
Draws a confusion matrix for the given target and predictions
Adapted from scikit-learn and discussion example.
'''
plt
.
cla
()
plt
.
clf
()
matrix =
confusion_matrix
(
y
, yhat
)
plt
.
imshow
(
matrix
, interpolation
=
'nearest'
, cmap
=
plt
.
cm
.
YlOrBr
)
plt
.
title
(
"Confusion Matrix"
)
plt
.
colorbar
()
num_classes =
len
(
classes
)
plt
.
xticks
(
np
.
arange
(
num_classes
), classes
, rotation
=
0
)
plt
.
yticks
(
np
.
arange
(
num_classes
), classes
)
plt
.
tick_params
(
top
=
True
, bottom
=
False
, labeltop
=
True
, labelbottom
=
False
) fmt =
'd'
thresh =
matrix
.
max
() /
2.
for
i
, j in
itertools
.
product
(
range
(
matrix
.
shape
[
0
]), range
(
matrix
.
shape
[
1
])):
plt
.
text
(
j
, i
, format
(
matrix
[
i
, j
], fmt
),
horizontalalignment
=
"center"
,
color
=
"white" if
matrix
[
i
, j
] >
thresh else
"black"
)
plt
.
ylabel
(
'True label'
)
plt
.
xlabel
(
'Predicted label'
)
plt
.
gca
()
.
xaxis
.
set_label_position
(
'top'
)
plt
.
tight_layout
()
plt
.
show
()
def
heatmap
(
data
, row_labels
, col_labels
, figsize =
(
20
,
12
), cmap =
"YlGn"
,
cbar_kw
=
{}, cbarlabel
=
""
, valfmt
=
"{x:.2f}"
,
textcolors
=
(
"black"
, "white"
), threshold
=
None
):
"""
Create a heatmap from a numpy array and two lists of labels. Taken from matplotlib example.
Parameters
----------
data
A 2D numpy array of shape (M, N).
row_labels
A list or array of length M with the labels for the rows.
col_labels
A list or array of length N with the labels for the columns.
ax
A `matplotlib.axes.Axes` instance to which the heatmap is plotted. If
not provided, use current axes or create a new one. Optional.
cmap
A string that specifies the colormap to use. Look at matplotlib docs for information.
Optional.
cbar_kw
A dictionary with arguments to `matplotlib.Figure.colorbar`. Optional.
cbarlabel
The label for the colorbar. Optional.
valfmt
The format of the annotations inside the heatmap. This should either
use the string format method, e.g. "$ {x:.2f}", or be a
`matplotlib.ticker.Formatter`. Optional.
textcolors
A pair of colors. The first is used for values below a threshold,
the second for those above. Optional.
threshold
Value in data units according to which the colors from textcolors are
applied. If None (the default) uses the middle of the colormap as
"""
plt
.
figure
(
figsize =
figsize
)
ax =
plt
.
gca
()
# Plot the heatmap
im =
ax
.
imshow
(
data
,
cmap
=
cmap
)
# Create colorbar
cbar =
ax
.
figure
.
colorbar
(
im
, ax
=
ax
, **
cbar_kw
)
cbar
.
ax
.
set_ylabel
(
cbarlabel
, rotation
=-
90
, va
=
"bottom"
)
# Show all ticks and label them with the respective list entries.
ax
.
set_xticks
(
np
.
arange
(
data
.
shape
[
1
]), labels
=
col_labels
)
ax
.
set_yticks
(
np
.
arange
(
data
.
shape
[
0
]), labels
=
row_labels
)
# Let the horizontal axes labeling appear on top.
ax
.
tick_params
(
top
=
True
, bottom
=
False
,
labeltop
=
True
, labelbottom
=
False
)
# Rotate the tick labels and set their alignment.
plt
.
setp
(
ax
.
get_xticklabels
(), rotation
=-
30
, ha
=
"right"
,
rotation_mode
=
"anchor"
)
# Turn spines off and create white grid.
ax
.
spines
[:]
.
set_visible
(
False
)
ax
.
set_xticks
(
np
.
arange
(
data
.
shape
[
1
]
+
1
)
-
.5
, minor
=
True
)
ax
.
set_yticks
(
np
.
arange
(
data
.
shape
[
0
]
+
1
)
-
.5
, minor
=
True
)
ax
.
grid
(
which
=
"minor"
, color
=
"w"
, linestyle
=
'-'
, linewidth
=
3
)
ax
.
tick_params
(
which
=
"minor"
, bottom
=
False
, left
=
False
)
# Normalize the threshold to the images color range.
if
threshold is
not
None
:
threshold =
im
.
norm
(
threshold
)
else
:
threshold =
im
.
norm
(
data
.
max
())
/
2.
# Set default alignment to center, but allow it to be
# overwritten by textkw.
kw =
dict
(
horizontalalignment
=
"center"
,
verticalalignment
=
"center"
)
# Get the formatter in case a string is supplied
if
isinstance
(
valfmt
, str
):
valfmt =
matplotlib
.
ticker
.
StrMethodFormatter
(
valfmt
)
# Loop over the data and create a `Text` for each "pixel".
# Change the text's color depending on the data.
texts =
[]
for
i in
range
(
data
.
shape
[
0
]):
for
j in
range
(
data
.
shape
[
1
]):
kw
.
update
(
color
=
textcolors
[
int
(
im
.
norm
(
data
[
i
, j
]) >
threshold
)])
text =
im
.
axes
.
text
(
j
, i
, valfmt
(
data
[
i
, j
], None
), **
kw
)
texts
.
append
(
text
)
def
make_meshgrid
(
x
, y
, h
=
0.02
):
"""Create a mesh of points to plot in
Parameters
----------
x: data to base x-axis meshgrid on
y: data to base y-axis meshgrid on
h: stepsize for meshgrid, optional
Returns
-------
xx, yy : ndarray
"""
x_min
, x_max =
x
.
min
() -
1
, x
.
max
() +
1
y_min
, y_max =
y
.
min
() -
1
, y
.
max
() +
1
xx
, yy =
np
.
meshgrid
(
np
.
arange
(
x_min
, x_max
, h
), np
.
arange
(
y_min
, y_max
, h
))
return
xx
, yy
def
plot_contours
(
clf
, xx
, yy
, **
params
):
"""Plot the decision boundaries for a classifier.
Parameters
----------
ax: matplotlib axes object
clf: a classifier
xx: meshgrid ndarray
yy: meshgrid ndarray
params: dictionary of params to pass to contourf, optional
"""
Z =
clf
.
predict
(
np
.
c_
[
xx
.
ravel
(), yy
.
ravel
()])
Z =
Z
.
reshape
(
xx
.
shape
)
out =
plt
.
contourf
(
xx
, yy
, Z
, **
params
)
return
out
def
draw_contour
(
x
,
y
,
clf
, class_labels =
[
"Negative"
, "Positive"
]):
"""
Draws a contour line for the predictor
Assumption that x has only two features. This functions only plots the first two columns of x.
"""
X0
, X1 =
x
[:, 0
], x
[:, 1
]
xx0
, xx1 =
make_meshgrid
(
X0
,
X1
)
plt
.
figure
(
figsize =
(
10
,
6
))
plot_contours
(
clf
, xx0
, xx1
, cmap
=
"PiYG"
, alpha
=
0.8
)
scatter
=
plt
.
scatter
(
X0
, X1
, c
=
y
, cmap
=
"PiYG"
, s
=
30
, edgecolors
=
"k"
)
plt
.
legend
(
handles
=
scatter
.
legend_elements
()[
0
], labels
=
class_labels
)
plt
.
xlim
(
xx0
.
min
(), xx0
.
max
())
plt
.
ylim
(
xx1
.
min
(), xx1
.
max
())
In [ ]:
#Preprocess Data
#Load Data
data =
pd
.
read_csv
(
'datasets/breast_cancer_data.csv'
)
#Drop id column
data =
data
.
drop
([
"id"
],
axis
=
1
)
#Transform target feature into numerical
le =
LabelEncoder
() data
[
'diagnosis'
] =
le
.
fit_transform
(
data
[
'diagnosis'
])
#Split target and data
y =
data
[
"diagnosis"
]
x =
data
.
drop
([
"diagnosis"
],
axis =
1
)
#Train test split
train_raw
, test_raw
, target
, target_test =
train_test_split
(
x
,
y
, test_size
=
0.2
, stratify
=
y
, random_state
=
0
)
#Standardize data
#Since all features are real-valued, we only have one pipeline
pipeline =
Pipeline
([
(
'scaler'
, StandardScaler
())
])
#Transform raw data train =
pipeline
.
fit_transform
(
train_raw
)
test =
pipeline
.
transform
(
test_raw
) #Note that there is no fit calls
#Names of Features after Pipeline
feature_names =
list
(
pipeline
.
get_feature_names_out
(
list
(
x
.
columns
)))
In [ ]:
target
.
value_counts
()
Out[ ]:
In [ ]:
#Baseline accuracy of using the majority class ct =
target_test
.
value_counts
()
print
(
"Counts of each class in target_test: "
)
print
(
ct
)
print
(
"Baseline Accuraccy of using Majority Class: "
, np
.
max
(
ct
)
/
np
.
sum
(
ct
))
In [ ]:
from
sklearn import
tree
from
sklearn.tree import
DecisionTreeClassifier
clf =
DecisionTreeClassifier
(
criterion
=
"gini"
, random_state =
0
)
clf
.
fit
(
train
, target
)
predicted =
clf
.
predict
(
test
)
In [ ]:
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'healthy'
, 'sick'
])
In [ ]:
plt
.
figure
(
figsize =
(
30
,
15
)) #Note that we have to pass the feature names into the plotting function to get the actual names
#We pass the column names through the pipeline in case any feature augmentation was made
#For example, a categorical feature will be split into multiple features with one hot encoding
#and this way assigns a name to each column based on the feature value and the original feature name
tree
.
plot_tree
(
clf
,
max_depth
=
1
, proportion
=
True
,
feature_names
=
feature_names
, filled
=
True
)
Out[ ]:
In [ ]:
from
sklearn.tree import
export_text
r =
export_text
(
clf
, feature_names
=
feature_names
)
print
(
r
)
In [ ]:
imp_pd =
pd
.
Series
(
data =
clf
.
feature_importances_ ,
index =
feature_names
)
imp_pd
=
imp_pd
.
sort_values
(
ascending
=
False
)
imp_pd
.
plot
.
bar
()
Out[ ]:
In [ ]:
#Extract first two feature and use the standardscaler train_2 =
StandardScaler
()
.
fit_transform
(
train_raw
[[
'concave points_mean'
,
'perimeter_mean'
]])
depth =
[
1
,
2
,
3
,
4
,
5
,
10
,
15
]
for
d in
depth
:
dt =
DecisionTreeClassifier
(
max_depth =
d
, min_samples_split
=
7
) dt
.
fit
(
train_2
, target
)
draw_contour
(
train_2
,
target
,
dt
,
class_labels =
[
'Benign'
, 'Malignant'
])
plt
.
title
(
f"Max Depth ={
d
}"
)
In [ ]:
from
sklearn.neural_network import
MLPClassifier
clf =
MLPClassifier
(
hidden_layer_sizes
=
(
100
,), max_iter =
400
)
clf
.
fit
(
train
, target
)
predicted =
clf
.
predict
(
test
)
In [ ]:
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'Benign'
, 'Malignant'
])
In [ ]:
#Example of using the default Relu activation while altering the number of hidden layers
train_2 =
StandardScaler
()
.
fit_transform
(
train_raw
[[
'concave points_mean'
,
'perimeter_mean'
]])
layers =
[
50
,
100
,
150
,
200
]
for
l in
layers
:
mlp =
MLPClassifier
(
hidden_layer_sizes
=
(
l
,), max_iter =
400
)
mlp
.
fit
(
train_2
, target
)
draw_contour
(
train_2
,
target
,
mlp
,
class_labels =
[
'Benign'
, 'Malignant'
])
plt
.
title
(
f"Hidden Layer Size ={
l
}"
)
In [ ]:
#Example of using the default Relu activation #while altering the number of hidden layers with 2 groups of hidden layers
train_2 =
StandardScaler
()
.
fit_transform
(
train_raw
[[
'concave points_mean'
,
'perimeter_mean'
]])
layers =
[
50
,
100
,
150
,
200
]
for
l in
layers
:
mlp =
MLPClassifier
(
hidden_layer_sizes
=
(
l
,
l
), max_iter =
400
)
mlp
.
fit
(
train_2
, target
)
draw_contour
(
train_2
,
target
,
mlp
,
class_labels =
[
'Benign'
, 'Malignant'
])
plt
.
title
(
f"Hidden Layer Sizes ={
(
l
,
l
)
}"
)
In [ ]:
#Example of using 2 hidden layers of 100 units each with varying activations
train_2 =
StandardScaler
()
.
fit_transform
(
train_raw
[[
'concave points_mean'
,
'perimeter_mean'
]])
acts =
[
'identity'
, 'logistic'
, 'tanh'
, 'relu'
]
for
act in
acts
:
mlp =
MLPClassifier
(
hidden_layer_sizes
=
(
100
,
100
), activation =
act
, max_iter =
400
)
mlp
.
fit
(
train_2
, target
)
draw_contour
(
train_2
,
target
,
mlp
,
class_labels =
[
'Benign'
, 'Malignant'
])
plt
.
title
(
f"Activation = {
act
}"
)
In [ ]:
#Import faces from scikit library
faces =
datasets
.
fetch_olivetti_faces
()
print
(
"Flattened Face Data shape:"
, faces
.
data
.
shape
)
print
(
"Face Image Data Shape:"
, faces
.
images
.
shape
)
print
(
"Shape of target data:"
, faces
.
target
.
shape
)
In [ ]:
#Extract image shape for future use
im_shape =
faces
.
images
[
0
]
.
shape
In [ ]:
#Prints some example faces faceimages =
faces
.
images
[
np
.
random
.
choice
(
len
(
faces
.
images
),
size
=
16
, replace =
False
)] # take random 16 images
fig
, axes =
plt
.
subplots
(
4
,
4
,
sharex
=
True
,
sharey
=
True
,
figsize
=
(
8
,
10
))
for
i in
range
(
16
):
axes
[
i
%
4
][i//4].imshow(faceimages[i], cmap="gray")
plt
.
show
()
In [ ]:
#Perform PCA
from
sklearn.decomposition import
PCA
pca =
PCA
()
pca_pipe =
Pipeline
([(
"scaler"
,
StandardScaler
()), #Scikit learn PCA does not standardize so we need to add
(
"pca"
,
pca
)])
pca_pipe
.
fit
(
faces
.
data
)
Out[ ]:
In [ ]:
fig =
plt
.
figure
(
figsize
=
(
16
, 6
))
for
i in
range
(
30
):
ax =
fig
.
add_subplot
(
3
, 10
, i +
1
, xticks
=
[], yticks
=
[])
ax
.
imshow
(
pca
.
components_
[
i
]
.
reshape
(
im_shape
),
cmap
=
plt
.
cm
.
bone
)
ax
.
set_title
(
f"Var={
pca
.
explained_variance_ratio_
[
i
]
:.2%}"
)
In [ ]:
#Without PCA
clf =
DecisionTreeClassifier
(
random_state
=
0
)
clf
.
fit
(
train
, target
)
predicted =
clf
.
predict
(
test
)
print
(
"Accuracy without PCA"
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'Benign'
, 'Malignant'
])
#With PCA
pca =
PCA
(
n_components =
0.9
) #Take components that explain at lest 90% variance
train_new =
pca
.
fit_transform
(
train
)
test_new =
pca
.
transform
(
test
)
clf_pca =
DecisionTreeClassifier
(
random_state
=
0
)
clf_pca
.
fit
(
train_new
, target
)
predicted =
clf_pca
.
predict
(
test_new
)
print
(
"Accuracy with PCA"
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'Benign'
, 'Malignant'
])
In [ ]:
print
(
"Number of Features without PCA: "
, train
.
shape
[
1
])
print
(
"Number of Features with PCA: "
, train_new
.
shape
[
1
])
In [ ]:
feature_names_new =
list
(
range
(
train_new
.
shape
[
1
]))
imp_pd =
pd
.
Series
(
data =
clf_pca
.
feature_importances_ ,
index =
feature_names_new
)
imp_pd
=
imp_pd
.
sort_values
(
ascending
=
False
)
imp_pd
.
plot
.
bar
()
Out[ ]:
In [ ]:
#Artifical Dataset
X
, y =
make_blobs
(
n_samples
=
500
,
n_features
=
2
,
centers
=
5
,
cluster_std
=
1
,
center_box
=
(
-
10.0
, 10.0
),
shuffle
=
True
,
random_state
=
10
,
) # For reproducibility
plt
.
scatter
(
X
[:, 0
], X
[:, 1
], marker
=
"."
, s
=
30
, lw
=
0
, alpha
=
0.7
,
edgecolor
=
"k"
)
Out[ ]:
In [ ]:
ks =
list
(
range
(
2
,
10
))
inertia =
[]
for
k in
ks
:
kmeans =
KMeans
(
n_clusters =
k
, init =
'k-means++'
, random_state =
0
)
kmeans
.
fit
(
X
)
# inertia method returns wcss for that model
inertia
.
append
(
kmeans
.
inertia_
)
print
(
f"Inertia for K = {
k
}: {
kmeans
.
inertia_
}"
)
In [ ]:
plt
.
figure
(
figsize
=
(
10
,
5
))
plt
.
plot
(
ks
, inertia
,
marker
=
'o'
,
color
=
'red'
)
plt
.
title
(
'The Elbow Method'
)
plt
.
xlabel
(
'Number of clusters'
)
plt
.
ylabel
(
'Inertia'
)
plt
.
show
()
In [ ]:
ks =
list
(
range
(
2
,
30
))
inertia =
[]
for
k in
ks
:
kmeans =
KMeans
(
n_clusters =
k
, init =
'k-means++'
, random_state =
0
)
kmeans
.
fit
(
train
)
# inertia method returns wcss for that model
inertia
.
append
(
kmeans
.
inertia_
)
print
(
f"Inertia for K = {
k
}: {
kmeans
.
inertia_
}"
)
In [ ]:
plt
.
figure
(
figsize
=
(
10
,
5
))
plt
.
plot
(
ks
, inertia
,
marker
=
'o'
,
color
=
'red'
)
plt
.
title
(
'The Elbow Method'
)
plt
.
xlabel
(
'Number of clusters'
)
plt
.
ylabel
(
'Inertia'
)
plt
.
show
()
In [ ]:
from
sklearn.cluster import
KMeans
n_clusters =
10 #We know there are 10 subjects
km =
KMeans
(
n_clusters =
n_clusters
,
random_state
=
0
)
pipe
=
Pipeline
([(
"scaler"
,
StandardScaler
()), #First standardize
(
"pca"
,
PCA
()), #Transform using pca
(
"kmeans"
, km )]) #Then apply k means
In [ ]:
clusters =
pipe
.
fit_predict
(
faces
.
data
)
print
(
clusters
)
In [ ]:
for
labelID in
range
(
n_clusters
):
# find all indexes into the `data` array that belong to the
# current label ID, then randomly sample a maximum of 25 indexes
# from the set
idxs =
np
.
where
(
clusters ==
labelID
)[
0
]
idxs =
np
.
random
.
choice
(
idxs
, size
=
min
(
25
, len
(
idxs
)),
replace
=
False
)
# Extract the sampled indexes
id_face =
faces
.
images
[
idxs
]
#Plots sampled faces
fig =
plt
.
figure
(
figsize
=
(
10
,
5
))
for
i in
range
(
min
(
25
,
len
(
idxs
))):
ax =
fig
.
add_subplot
(
5
, 5
, i +
1
, xticks
=
[], yticks
=
[])
ax
.
imshow
(
id_face
[
i
],
cmap
=
plt
.
cm
.
bone
)
fig
.
suptitle
(
f"Id={
labelID
}"
)
In [ ]:
#Preprocess Data
#Load Data
data =
pd
.
read_csv
(
'datasets/heartdisease.csv'
)
#Transform target feature into numerical
le =
LabelEncoder
() data
[
'target'
] =
le
.
fit_transform
(
data
[
'sick'
])
data =
data
.
drop
([
"sick"
], axis =
1
)
#Split target and data
y =
data
[
"target"
]
x =
data
.
drop
([
"target"
],
axis =
1
)
#Train test split
#40% in test data as was in project 2
train_raw
, test_raw
, target
, target_test =
train_test_split
(
x
,
y
, test_size
=
0.4
, stratify
=
y
, random_state
=
0
)
#Feature Transformation
#This is the only change from project 2 since we replaced standard scaler to minmax
#This was done to ensure that the numerical features were still of the same scale
#as the one hot encoded features
num_pipeline =
Pipeline
([
(
'minmax'
, MinMaxScaler
()) ])
heart_num =
train_raw
.
drop
([
'sex'
, 'cp'
, 'fbs'
, 'restecg'
, 'exang'
, 'slope'
, 'ca'
,
'thal'
], axis
=
1
)
numerical_features =
list
(
heart_num
)
categorical_features =
[
'sex'
, 'cp'
, 'fbs'
, 'restecg'
, 'exang'
, 'slope'
, 'ca'
,
'thal'
]
full_pipeline =
ColumnTransformer
([
(
"num"
, num_pipeline
, numerical_features
),
(
"cat"
, OneHotEncoder
(
categories
=
'auto'
), categorical_features
),
])
#Transform raw data train =
full_pipeline
.
fit_transform
(
train_raw
)
test =
full_pipeline
.
transform
(
test_raw
) #Note that there is no fit calls
#Extracts features names for each transformed column
feature_names =
full_pipeline
.
get_feature_names_out
(
list
(
x
.
columns
))
In [ ]:
print
(
"Column names after transformation by pipeline: "
, feature_names
)
In [ ]:
#Baseline accuracy of using the majority class ct =
target_test
.
value_counts
()
print
(
"Counts of each class in target_test: "
)
print
(
ct
)
print
(
"Baseline Accuraccy of using Majority Class: "
, np
.
max
(
ct
)
/
np
.
sum
(
ct
))
In [ ]:
clf =
DecisionTreeClassifier
(
random_state =
0
)
clf
.
fit
(
train
, target
)
predicted =
clf
.
predict
(
test
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'healthy'
, 'sick'
])
In [ ]:
plt
.
figure
(
figsize =
(
30
,
15
)) tree
.
plot_tree
(
clf
,
max_depth
=
1
, proportion
=
True
,
feature_names
=
feature_names
, filled
=
True
)
Out[ ]:
In [ ]:
imp_pd =
pd
.
Series
(
data =
clf
.
feature_importances_ ,
index =
feature_names
)
imp_pd
=
imp_pd
.
sort_values
(
ascending
=
False
)
imp_pd
.
plot
.
bar
()
Out[ ]:
In [ ]:
parameters
=
[
{
"max_depth"
:[
1
,
2
,
3
,
4
,
5
,
10
,
15
],
"min_samples_split"
:[
2
,
4
,
6
,
8
],
"criterion"
:[
"gini"
, "entropy"
]
}
]
dt =
DecisionTreeClassifier
(
random_state
=
0
)
kf =
KFold
(
n_splits
=
k
, random_state
=
None
)
grid =
GridSearchCV
(
dt
, parameters
, cv =
kf
, scoring
=
'accuracy'
) grid
.
fit
(
train
,
target
)
res =
pd
.
DataFrame
(
grid
.
cv_results_
)
.
sort_values
( by
=
[
"mean_test_score"
], ascending
=
False
)
res
[[
"param_max_depth"
, "param_min_samples_split"
, "param_criterion"
, "mean_test_score"
]]
.
head
(
1
)
Out[ ]:
In [ ]:
clf =
DecisionTreeClassifier
(
random_state =
0 , max_depth =
3
, min_samples_split =
8
, criterion =
"entropy"
)
clf
.
fit
(
train
, target
)
predicted =
clf
.
predict
(
test
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'healthy'
, 'sick'
])
In [ ]:
clf =
MLPClassifier
(
hidden_layer_sizes
=
(
100
,
100
), max_iter =
800
, random_state
=
0
)
clf
.
fit
(
train
, target
)
predicted =
clf
.
predict
(
test
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'healthy'
, 'sick'
])
In [ ]:
import
time
dt =
DecisionTreeClassifier
(
random_state =
0 )
mlp =
MLPClassifier
(
hidden_layer_sizes
=
(
100
,
100
), max_iter =
800
, random_state
=
0
)
t0 =
time
.
time
()
dt
.
fit
(
train
, target
)
t1 =
time
.
time
()
print
(
"Decision Tree Training Time : "
, t1
-
t0
)
t0 =
time
.
time
()
mlp
.
fit
(
train
, target
)
t1 =
time
.
time
()
print
(
"MLP Training Time : "
, t1
-
t0
)
t0 =
time
.
time
()
dt
.
predict
(
test
)
t1 =
time
.
time
()
print
(
"Decision Tree Prediction Time : "
, t1
-
t0
)
t0 =
time
.
time
()
mlp
.
predict
(
test
)
t1 =
time
.
time
()
print
(
"MLP Prediction Time : "
, t1
-
t0
)
In [ ]:
pca =
PCA
(
n_components
=
10
)
pca
.
fit
(
train
)
for
i in
range
(
10
):
print
(
pca
.
components_
[
i
])
In [ ]:
print
(
pca
.
explained_variance_ratio_
)
In [ ]:
train_pca =
pca
.
fit_transform
(
train
)
test_pca =
pca
.
transform
(
test
)
In [ ]:
clf_pca =
DecisionTreeClassifier
(
random_state
=
0
)
clf_pca
.
fit
(
train_pca
, target
)
predicted =
clf_pca
.
predict
(
test_pca
)
print
(
"Accuracy with PCA"
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'Benign'
, 'Malignant'
])
In [ ]:
clf_pca =
MLPClassifier
(
hidden_layer_sizes
=
(
100
,
100
), max_iter =
800
, random_state
=
0
)
clf_pca
.
fit
(
train_pca
, target
)
predicted =
clf_pca
.
predict
(
test_pca
)
print
(
"Accuracy with PCA"
)
print
(
"%-12s %f" %
(
'Accuracy:'
, metrics
.
accuracy_score
(
target_test
,
predicted
)))
print
(
"Confusion Matrix: \n"
, metrics
.
confusion_matrix
(
target_test
,
predicted
))
draw_confusion_matrix
(
target_test
, predicted
, [
'Benign'
, 'Malignant'
])
In [ ]:
k =
KMeans
(
n_clusters
=
5
,
random_state
=
0
)
k
.
fit
(
train
)
print
(
k
.
inertia_
)
In [ ]:
ks =
list
(
range
(
2
,
21
))
inertia =
[]
for
k in
ks
:
kmeans =
KMeans
(
n_clusters =
k
, init =
'k-means++'
, random_state =
0
)
kmeans
.
fit
(
train
)
# inertia method returns wcss for that model
inertia
.
append
(
kmeans
.
inertia_
)
print
(
f"Inertia for K = {
k
}: {
kmeans
.
inertia_
}"
)
plt
.
figure
(
figsize
=
(
10
,
5
))
plt
.
plot
(
ks
, inertia
,
marker
=
'o'
,
color
=
'red'
)
plt
.
title
(
'The Elbow Method'
)
plt
.
xlabel
(
'Number of clusters'
)
plt
.
ylabel
(
'Inertia'
)
plt
.
show
()
In [ ]:
ks =
list
(
range
(
2
,
21
))
inertia =
[]
for
k in
ks
:
kmeans =
KMeans
(
n_clusters =
k
, init =
'k-means++'
, random_state =
0
)
kmeans
.
fit
(
train_pca
)
# inertia method returns wcss for that model
inertia
.
append
(
kmeans
.
inertia_
)
print
(
f"Inertia for K = {
k
}: {
kmeans
.
inertia_
}"
)
plt
.
figure
(
figsize
=
(
10
,
5
))
plt
.
plot
(
ks
, inertia
,
marker
=
'o'
,
color
=
'red'
)
plt
.
title
(
'The Elbow Method'
)
plt
.
xlabel
(
'Number of clusters'
)
plt
.
ylabel
(
'Inertia'
)
plt
.
show
()
In [ ]:
data =
pd
.
read_csv
(
'./datasets/hotel_booking.csv'
)
data =
data
.
drop
([
'email'
, 'phone-number'
, 'name' ], axis
=
1
)
data
=
data
.
dropna
()
features =
data
.
drop
(
'is_canceled'
, axis
=
1
)
targets =
data
[
'is_canceled'
]
features
.
head
()
Out[ ]:
In [ ]:
from
sklearn.impute import
SimpleImputer
from
sklearn.base import
BaseEstimator
, TransformerMixin
class
AugmentFeatures
(
BaseEstimator
, TransformerMixin
):
def
fit
(
self
,
X
,
y
=
None
):
return
self
def
transform
(
self
,
X
):
total_guests =
X
[
'children'
]
+
X
[
'adults'
]
+
X
[
'babies'
]
return
np
.
c_
[
X
, total_guests
]
num_pipeline =
Pipeline
(
[
(
'attribs_adder'
, AugmentFeatures
()),
(
'std_scaler'
, StandardScaler
()),
(
'imputer'
, SimpleImputer
())
]
)
features_cat =
[
'hotel'
, 'arrival_date_month'
, 'meal'
, 'country'
, 'reserved_room_type'
, 'deposit_type'
, 'customer_type'
]
features_num =
list
(
features
.
drop
(
features_cat
, axis
=
1
))
full_pipeline =
ColumnTransformer
([
(
"num"
, num_pipeline
, features_num
),
(
"cat"
, OneHotEncoder
(), features_cat
)
]
)
features_prep =
full_pipeline
.
fit_transform
(
features
)
In [ ]:
from
sklearn.model_selection import
train_test_split
train
, test
, target
, target_test =
train_test_split
(
features
, targets
, test_size
=
0.2
, random_state
=
0
)
train =
full_pipeline
.
fit_transform
(
train
)
test =
full_pipeline
.
fit_transform
(
test
)
In [ ]:
from
sklearn.linear_model import
LogisticRegression
from
sklearn.model_selection import
GridSearchCV
#Note that this a list of dict
#Each dict describes the combination of parameters to check
parameters =
[
{
"penalty"
: [
"l2"
],
"C"
: [
0.01
,
1
,
100
],
"solver"
: [
"lbfgs"
,
"liblinear"
]},
#These solvers support penalty = "l2"
{
"penalty"
: [
"none"
],
"C"
: [
1
], #Specified to prevent error message
"solver"
: [
"lbfgs"
,
"newton-cg"
]},
#These solvers support penalty = "none"
]
#Implementing cross validation
k =
3
kf =
KFold
(
n_splits
=
k
, random_state
=
None
)
log_reg =
LogisticRegression
(
penalty =
"none"
,
max_iter =
1000
, solver =
"lbfgs"
) #will change parameters during CV
grid =
GridSearchCV
(
log_reg , parameters
, cv =
kf
, scoring =
"accuracy"
)
grid
.
fit
(
train
,
target
)
res
=
pd
.
DataFrame
(
grid
.
cv_results_
)
.
sort_values
( by
=
[
"mean_test_score"
], ascending
=
False
)
res
Out[ ]:
In [ ]:
from
sklearn.neighbors import
KNeighborsClassifier
parametersKNN
=
[
{
"n_neighbors"
:[
1
,
3
,
5
,
7
],
"metric"
:[
"euclidean"
, "manhattan"
]
}
]
k =
3
kf =
KFold
(
n_splits
=
k
, random_state
=
None
)
KNN =
KNeighborsClassifier
()
gridKNN =
GridSearchCV
(
KNN
, parametersKNN
, cv =
kf
, scoring
=
'accuracy'
)
gridKNN
.
fit
(
train
,
target
)
resKNN =
pd
.
DataFrame
(
gridKNN
.
cv_results_
)
.
sort_values
( by
=
[
"mean_test_score"
], ascending
=
False
)
resKNN
Out[ ]:
In [ ]:
DT =
DecisionTreeClassifier
()
parameters
=
[
{
"max_depth"
:[
1
,
3
,
5
,
7
],
"criterion"
:[
"gini"
, "entropy"
]
}
]
k =
3
kf =
KFold
(
n_splits
=
k
, random_state
=
None
)
gridDT =
GridSearchCV
(
DT
, parameters
, cv =
kf
, scoring
=
'accuracy'
)
gridDT
.
fit
(
train
,
target
)
res =
pd
.
DataFrame
(
gridDT
.
cv_results_
)
.
sort_values
( by
=
[
"mean_test_score"
], ascending
=
False
)
res
Out[ ]:
In [ ]:
train =
full_pipeline
.
fit_transform
(
features
)
data =
pd
.
read_csv
(
'./datasets/hotel_booking_test.csv'
)
data =
data
.
drop
([
'email'
, 'phone-number'
, 'name' ], axis
=
1
)
data =
full_pipeline
.
fit_transform
(
data
)
zeros =
np
.
zeros
((
1
,
8699
))
data =
np
.
c_
[
data
,
zeros
.
T
]
knn =
KNeighborsClassifier
(
n_neighbors
=
7
, metric
=
'manhattan'
)
knn
.
fit
(
train
, targets
)
preds =
knn
.
predict
(
data
)
In [ ]:
p =
[]
cnt =
0 for
pred in
preds
:
p
.
append
([
str
(
cnt
), str
(
pred
)])
cnt
+=
1
In [ ]:
import
csv
with
open
(
'eggs.csv'
, 'w'
, newline
=
''
) as
csvfile
:
spamwriter =
csv
.
writer
(
csvfile
, delimiter
=
','
,
quoting
=
csv
.
QUOTE_MINIMAL
)
spamwriter
.
writerow
([
"ID"
, "target"
])
for
i in
p
:
spamwriter
.
writerow
(
i
)
▸
Pipeline
▸
StandardScaler
▸
PCA
Discover more documents: Sign up today!
Unlock a world of knowledge! Explore tailored content for a richer learning experience. Here's what you'll get:
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Recommended textbooks for you

Microsoft Visual C#
Computer Science
ISBN:9781337102100
Author:Joyce, Farrell.
Publisher:Cengage Learning,
COMPREHENSIVE MICROSOFT OFFICE 365 EXCE
Computer Science
ISBN:9780357392676
Author:FREUND, Steven
Publisher:CENGAGE L
Np Ms Office 365/Excel 2016 I Ntermed
Computer Science
ISBN:9781337508841
Author:Carey
Publisher:Cengage

Recommended textbooks for you
- Microsoft Visual C#Computer ScienceISBN:9781337102100Author:Joyce, Farrell.Publisher:Cengage Learning,COMPREHENSIVE MICROSOFT OFFICE 365 EXCEComputer ScienceISBN:9780357392676Author:FREUND, StevenPublisher:CENGAGE LNp Ms Office 365/Excel 2016 I NtermedComputer ScienceISBN:9781337508841Author:CareyPublisher:Cengage

Microsoft Visual C#
Computer Science
ISBN:9781337102100
Author:Joyce, Farrell.
Publisher:Cengage Learning,
COMPREHENSIVE MICROSOFT OFFICE 365 EXCE
Computer Science
ISBN:9780357392676
Author:FREUND, Steven
Publisher:CENGAGE L
Np Ms Office 365/Excel 2016 I Ntermed
Computer Science
ISBN:9781337508841
Author:Carey
Publisher:Cengage
