IE6400_Day19
IE6400 Foundations of Data Analytics Engineering
¶
Fall 2023
¶
Module 3: Clustering Methods Part 2
¶
Clustering Methods
¶
Clustering is an unsupervised learning technique used to group similar data points into
clusters. The goal is to partition a dataset such that points in the same group are more
similar to each other than to those in other groups. There are several clustering methods, each suitable for different types of data and applications.
1. K-Means Clustering
¶
• Description: Partitions the data into K distinct, non-overlapping clusters based on distances to the center of each cluster.
• Algorithm:
  1. Initialize K cluster centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Recompute centroids based on the current cluster assignments.
  4. Repeat steps 2-3 until convergence.
2. Hierarchical Clustering
¶
• Description: Creates a tree of clusters. Can be agglomerative (bottom-up) or divisive (top-down).
• Algorithm:
  1. Treat each data point as a single cluster (n clusters in total).
  2. Find the two clusters that are closest and merge them.
  3. Repeat step 2 until only one cluster remains.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
¶
• Description: Clusters are dense regions in the data space, separated by areas of lower point density. Can find arbitrarily shaped clusters.
• Algorithm:
  1. For each point, count the number of points within a specified radius ε (eps).
  2. If a point has more than a specified number (MinPts) of neighbors, it is a core point.
  3. Connect core points that lie within ε of one another into clusters, and attach reachable border points to those clusters.
  4. Points not in any cluster are treated as noise.
4. Mean Shift Clustering
¶
• Description: Based on shifting centroids to the modes of the data density.
• Algorithm:
  1. Initialize centroids randomly.
  2. Update centroids to the mean of data points within a given window.
  3. Repeat step 2 until convergence.
5. Gaussian Mixture Models (GMM)
¶
• Description: Assumes data is generated from a mixture of several Gaussian distributions.
• Algorithm: Uses the Expectation-Maximization (EM) algorithm to estimate model parameters.
6. Spectral Clustering
¶
• Description: Uses the eigenvalues of the similarity matrix to reduce dimensionality before clustering in a lower-dimensional space.
• Algorithm:
  1. Build a similarity graph.
  2. Compute the Laplacian of this graph.
  3. Extract eigenvectors and use them as features.
  4. Use K-means or another method on these new features.
7. Affinity Propagation
¶
• Description: Works by passing messages between data points to determine clusters.
• Algorithm:
  1. Send responsibility and availability messages between points.
  2. Iteratively update messages and decide cluster exemplars.
8. OPTICS (Ordering Points To Identify Clustering Structure)
¶
• Description: Similar to DBSCAN but can handle clusters of varying densities.
• Algorithm: Builds a reachability plot and extracts clusters from it.
9. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
¶
• Description: Designed for very large datasets. Builds a tree structure during clustering.
• Algorithm:
  1. Incrementally construct a CF (Clustering Feature) tree.
  2. Use the CF tree for clustering.
10. Agglomerative Clustering
¶
• Description: Similar to hierarchical clustering but typically stops when it reaches a desired number of clusters.
• Algorithm:
  1. Start with each data point as a single cluster.
  2. Recursively merge the closest pair of clusters.
  3. Stop when the desired number of clusters is reached.
11. CURE (Clustering Using Representatives)
¶
• Description: A robust method for clustering with outliers by representing clusters using multiple scattered points.
• Algorithm:
  1. Select a set of representative points for each cluster.
  2. Shrink the representatives towards the center of the cluster.
  3. Assign points to the closest representative's cluster.
12. K-Medoids or PAM (Partitioning Around Medoids)
¶
• Description: Similar to K-means but uses actual data points as cluster centers (medoids) to reduce the influence of outliers.
• Algorithm:
  1. Randomly select K data points as initial medoids.
  2. Assign each data point to the nearest medoid.
  3. Recompute medoids.
  4. Repeat steps 2-3 until convergence.
13. COBWEB
¶
• Description: A hierarchical clustering method for categorical data.
• Algorithm:
  1. Construct a tree structure.
  2. Place and categorize instances based on maximizing the category utility.
14. Fuzzy C-Means
¶
• Description: A soft clustering method where each data point has a degree of belonging to clusters, rather than belonging strictly to one cluster.
• Algorithm:
  1. Assign each data point a weight for each cluster.
  2. Compute cluster centers based on the weights.
  3. Update the weights for each data point based on its distances to the cluster centers.
  4. Repeat steps 2-3 until convergence.
15. ROCK (RObust Clustering using linKs)
¶
• Description: Designed for categorical data and uses links (similar pairs) for clustering.
• Algorithm:
  1. Count the number of common neighbors between points to form links.
  2. Form clusters based on these links.
16. CLIQUE (CLustering In QUEst)
¶
• Description: A grid-based clustering algorithm designed to find dense clusters in subspaces of any dimensionality.
• Algorithm:
  1. Partition each dimension into non-overlapping intervals.
  2. Identify dense units in each dimension.
  3. Form clusters based on the connectivity of dense units.
17. SNN (Shared Nearest Neighbor)
¶
• Description: Considers two objects similar if they have many neighbors in common.
• Algorithm:
  1. Compute the k-nearest neighbors for each point.
  2. Assign similarity based on shared neighbors.
  3. Cluster based on these similarities.
18. STREAM (Spatio-TEmporal Real-time Algorithm for clustering Moving objects)
¶
• Description: Designed for real-time clustering of moving objects.
• Algorithm:
  1. Uses micro-clusters to summarize the current state.
  2. Periodically merges these micro-clusters to form final clusters.
19. SUBCLU (SUBspace CLUstering)
¶
• Description: Discovers clusters in arbitrary subspaces of a dataset.
• Algorithm:
  1. Uses a bottom-up approach.
  2. Discovers dense regions in each dimension and merges them to identify clusters.
Conclusion
¶
The choice of clustering method often depends on the dataset's size, dimensionality, and nature; the desired shape and structure of the clusters; and any underlying assumptions about the data. Pre-processing, such as normalization or standardization, and the choice of distance or similarity measure also play crucial roles in clustering outcomes.
The variety of clustering methods available offers flexibility in handling different types of datasets and challenges such as noise, outliers, and non-globular cluster shapes. The choice of method should be guided by the nature of the data and the specific requirements of the application.
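Because pre-processing is singled out above as crucial, here is a minimal, hedged sketch of standardizing features before K-Means; the two-feature dataset is invented purely for illustration and is not part of the exercises that follow.
# Sketch (illustrative data, not from the lecture): standardizing features
# before K-Means so that no variable dominates the distance computation
# purely because of its scale.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
rng = np.random.RandomState(0)
raw = np.column_stack([rng.normal(0, 1, 300),      # feature measured in units of ~1
                       rng.normal(0, 1000, 300)])  # feature measured in units of ~1000
scaled = StandardScaler().fit_transform(raw)       # zero mean, unit variance per feature
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)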
Exercise 1 Understanding K-Means Clustering
¶
Introduction
¶
K-Means is one of the most popular clustering algorithms. It aims to partition a set of observations into a number of clusters (k), resulting in the partitioning of the dataset into Voronoi cells. It works iteratively to assign each data point to one of K groups based on the features provided.
Problem Statement
¶
Given a dataset of 2D points, use the K-Means clustering algorithm to group them into clusters. The goal is to understand how K-Means works and how to visualize and
interpret the clustering results.
In [1]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
import warnings  # Setting the warnings to be ignored
warnings.filterwarnings('ignore')
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset.
The steps are as follows:
1. Define the number of clusters k.
2. Apply KMeans clustering to the dataset.
3. Visualize the clusters and centroids.
4. Interpret the result.
In [2]:
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')
# Defining the number of clusters
k = 4
# Applying KMeans clustering
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Visualizing the clusters and centroids
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=50, c='yellow', label='Cluster 4')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.legend()
plt.title("Clusters of data points")
plt.show()
Interpretation
¶
The visualization shows the data points grouped into four distinct clusters, as indicated
by the different colors. The black 'X' markers represent the centroids of these clusters. KMeans has successfully identified the underlying patterns in the dataset and grouped the data points accordingly.
Conclusion
¶
K-Means clustering is a powerful algorithm for partitioning data into distinct groups or clusters. By visualizing and interpreting the results, we can gain insights into the underlying structure of the dataset and make informed decisions based on the clustering results.
Exercise 2 Understanding Hierarchical Clustering
¶
Introduction
¶
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters by either a bottom-up or top-down approach. The bottom-up approach (Agglomerative) starts with individual data points and merges them into larger
clusters, while the top-down approach (Divisive) starts with the entire dataset and divides it into smaller clusters.
Problem Statement
¶
Given a dataset of 2D points, use the Hierarchical clustering algorithm to group them into clusters. The goal is to understand how Hierarchical clustering works, how to visualize the dendrogram, and how to interpret the clustering results.
In [3]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the Agglomerative
Hierarchical clustering algorithm to this dataset.
The steps are as follows:
1. Compute the distance matrix.
2. Apply Agglomerative Hierarchical clustering.
3. Visualize the dendrogram.
4. Interpret the result.
In [4]:
import scipy.cluster.hierarchy as sch
# Computing the linkage matrix and plotting the dendrogram
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distances')
plt.show()
Interpretation
¶
The dendrogram visualizes the way clusters are merged. The y-axis represents the Euclidean distances between data points (or clusters). By drawing a horizontal line across the dendrogram, we can decide the number of clusters for our dataset.
For instance, if we draw a horizontal line at a distance of 25, it intersects the dendrogram at four vertical lines, suggesting an optimal cluster number of four.
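To turn that horizontal cut into actual cluster labels, one option (a sketch, not part of the original exercise) is SciPy's fcluster with a distance criterion; the threshold of 25 is the value discussed above.
# Sketch: extracting flat cluster labels by cutting the dendrogram at distance 25
from scipy.cluster.hierarchy import linkage, fcluster
Z = linkage(X, method='ward')                        # same linkage used for the dendrogram
labels_cut = fcluster(Z, t=25, criterion='distance')
print("Clusters at distance 25:", len(set(labels_cut)))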
Conclusion
¶
Hierarchical clustering provides a tree-based representation of data points, which can be useful for understanding the hierarchical structure of the dataset. The dendrogram is a powerful tool for visualizing and deciding the number of clusters.
Exercise 3 Understanding DBSCAN Clustering
¶
Introduction
¶
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-
based clustering algorithm that can find clusters of different shapes and sizes from a large amount of data. Unlike K-Means or Hierarchical Clustering, DBSCAN does not require the number of clusters to be specified. It can also identify points that are not part of any cluster, which are typically considered as noise.
Problem Statement
¶
Given a dataset of 2D points, use the DBSCAN clustering algorithm to group them into clusters. The goal is to understand how DBSCAN works, its parameters, and how to interpret the clustering results.
In [5]:
# Generating a sample dataset with clusters of different shapes
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with two crescent-shaped clusters. Traditional clustering methods like K-Means would struggle to identify these non-spherical clusters. Let's see
how DBSCAN performs on this dataset.
The steps are as follows:
1. Apply the DBSCAN algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [6]:
from sklearn.cluster import DBSCAN
# Applying DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Clusters identified by DBSCAN')
plt.show()
Interpretation
¶
The DBSCAN algorithm has successfully identified the two crescent-shaped clusters. Points that are not part of any cluster are labeled as -1, which represents noise.
The two main parameters for DBSCAN are:
• eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
• min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
By adjusting these parameters, we can control the density required to form a cluster and the minimum size of clusters.
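As a hedged illustration of how eps interacts with the resulting clusters and noise, the short sketch below (not part of the original notebook) refits DBSCAN on the same moons data X for a few eps values.
# Sketch: effect of eps on the number of clusters and noise points
import numpy as np
from sklearn.cluster import DBSCAN
for eps in (0.1, 0.3, 0.5):
    labels_eps = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
    n_noise = int(np.sum(labels_eps == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")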
Conclusion
¶
DBSCAN is a powerful clustering algorithm that can identify clusters of various shapes without the need to specify the number of clusters. It's particularly useful for datasets with non-spherical clusters and noise.
Exercise 4 Understanding Mean Shift Clustering
¶
Introduction
¶
Mean Shift Clustering is a non-parametric, density-based clustering algorithm. Unlike K-Means, which tries to minimize the distance between cluster members and their centroid, Mean Shift assigns data points to clusters iteratively by shifting points towards the mode, i.e., the region with the highest density of data points.
Problem Statement
¶
Given a dataset of 2D points, use the Mean Shift clustering algorithm to group them into clusters. The goal is to understand how Mean Shift works, its parameters, and how
to interpret the clustering results.
In [7]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's see how Mean Shift performs on this dataset.
The steps are as follows:
1. Apply the Mean Shift algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [8]:
from sklearn.cluster import MeanShift
# Applying Mean Shift
mean_shift = MeanShift(bandwidth=1.5)
clusters = mean_shift.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Clusters identified by Mean Shift')
plt.show()
Interpretation
¶
The Mean Shift algorithm has identified the clusters in the dataset. The number of clusters is determined by the algorithm based on the data distribution, and we don't need to specify it beforehand.
The main parameter for Mean Shift is:
• bandwidth: It can be understood as the radius of the region (or window) in which the algorithm searches for datapoints to form a cluster. A smaller bandwidth will result in more clusters, while a larger bandwidth may merge distinct clusters.
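Instead of hand-picking the bandwidth, scikit-learn's estimate_bandwidth can derive one from the data; in the sketch below (not part of the original notebook) the quantile is an illustrative assumption, not a tuned value.
# Sketch: estimating a bandwidth from the data rather than fixing it manually
from sklearn.cluster import MeanShift, estimate_bandwidth
bw = estimate_bandwidth(X, quantile=0.2, random_state=42)   # quantile is an assumed choice
labels_auto = MeanShift(bandwidth=bw).fit_predict(X)
print("Estimated bandwidth:", round(bw, 3), "- clusters found:", len(set(labels_auto)))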
Conclusion
¶
Mean Shift is a versatile clustering algorithm that can identify clusters without the need to specify the number of clusters beforehand. It's particularly useful for datasets where the number of clusters is not known a priori.
Exercise 5 Understanding Gaussian Mixture Models (GMM)
¶
Introduction
¶
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of several Gaussian distributions with unknown parameters. It attempts to find the mixture of multi-dimensional Gaussian probability distributions that best models the input dataset.
Problem Statement
¶
Given a dataset of 2D points, use the GMM algorithm to group them into clusters. The goal is to understand how GMM works, its parameters, and how to interpret the clustering results.
In [9]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's see how GMM performs on this dataset.
The steps are as follows:
1. Apply the GMM algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [10]:
from sklearn.mixture import GaussianMixture
# Applying GMM
gmm = GaussianMixture(n_components=5)
gmm.fit(X)
labels = gmm.predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by GMM')
plt.show()
Interpretation
¶
The GMM algorithm has identified the clusters in the dataset. Unlike K-means, which assigns each point to a single cluster, GMM determines the probability that a given data point belongs to each of the clusters. This makes GMM a soft clustering method.
The main parameter for GMM is:
• n_components: The number of mixture components, i.e., the number of Gaussian distributions.
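Because GMM assigns probabilities rather than hard labels, here is a small sketch (not part of the original notebook) of inspecting those soft memberships on the model fitted above via predict_proba.
# Sketch: soft cluster memberships from the fitted GMM above
probs = gmm.predict_proba(X)            # shape (n_samples, n_components); each row sums to 1
print("Membership probabilities of the first three points:")
print(probs[:3].round(3))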
Conclusion
¶
Gaussian Mixture Models provide a probabilistic approach to clustering, allowing for more flexibility in cluster shape compared to methods like K-means. They are particularly useful for datasets where clusters are not necessarily spherical or have different sizes.
Exercise 6 Understanding Spectral Clustering
¶
Introduction
¶
Spectral Clustering is a modern clustering technique that can capture complex cluster structures. It can identify clusters that are non-convex and even clusters that have a nested ring structure, which traditional algorithms like K-means might not capture effectively.
The technique uses the eigenvalues of the similarity matrix of the data to reduce dimensionality before applying another clustering algorithm (like K-means).
Problem Statement
¶
Given a dataset of 2D points, use the Spectral Clustering algorithm to group them into clusters. The goal is to understand how Spectral Clustering works, its parameters, and how to interpret the clustering results.
In [11]:
# Generating a sample dataset with moons
from sklearn.datasets import make_moons
X, _ = make_moons(200, noise=.05, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with two crescent moon-shaped clusters. Traditional clustering algorithms might struggle with such data. Let's see how Spectral Clustering performs on this dataset.
The steps are as follows:
1. Apply the Spectral Clustering algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [12]:
from sklearn.cluster import SpectralClustering
# Applying Spectral Clustering
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = sc.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by Spectral Clustering')
plt.show()
Interpretation
¶
Spectral Clustering has effectively identified the two crescent moon-shaped clusters. The algorithm works by creating a similarity matrix of the data points and then finding the eigenvectors corresponding to the top eigenvalues. These eigenvectors are then used to transform the data into a lower-dimensional space, where a traditional clustering algorithm (like K-means) can be applied.
The main parameters for Spectral Clustering are:
• n_clusters: The number of clusters to form.
• affinity: How to construct the affinity matrix. 'nearest_neighbors' computes the similarity matrix using the nearest neighbors of each point.
• assign_labels: The strategy to use to assign labels in the embedding space. Here, we used 'kmeans'.
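As a point of comparison for the affinity option above, the sketch below (the gamma value is an illustrative assumption, not a tuned setting, and this is not part of the original notebook) builds the affinity matrix with an RBF kernel instead of nearest neighbors on the same moons data.
# Sketch: Spectral Clustering on the same data with an RBF affinity
from sklearn.cluster import SpectralClustering
sc_rbf = SpectralClustering(n_clusters=2, affinity='rbf', gamma=100, assign_labels='kmeans')
labels_rbf = sc_rbf.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels_rbf, cmap='viridis', s=50)
plt.title('Spectral Clustering with RBF affinity')
plt.show()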
Conclusion
¶
Spectral Clustering is a powerful technique that can capture complex cluster structures, making it suitable for datasets where traditional methods might fail.
Exercise 7 Understanding Affinity Propagation
¶
Introduction
¶
Affinity Propagation is a clustering algorithm that identifies "exemplars" among the data points and forms clusters based on these exemplars. Unlike other clustering methods that require the number of clusters to be specified in advance, Affinity Propagation determines the number of clusters based on the data, making it particularly useful for datasets where the number of clusters is not known a priori.
Problem Statement
¶
Given a dataset of 2D points, use the Affinity Propagation algorithm to group them into
clusters. The goal is to understand how Affinity Propagation works, its parameters, and
how to interpret the clustering results.
In [13]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(300, centers=5, cluster_std=0.60, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Affinity Propagation algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the Affinity Propagation algorithm.
2. Visualize the clusters and exemplars.
3. Interpret the result.
In [14]:
from sklearn.cluster import AffinityPropagation
# Applying Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(X[cluster_centers_indices, 0], X[cluster_centers_indices, 1], s=200, c='red', marker='X', label='Exemplars')
plt.title('Clusters identified by Affinity Propagation')
plt.legend()
plt.show()
Interpretation
¶
Affinity Propagation has identified clusters and their corresponding exemplars (marked
with red 'X'). The algorithm works by sending messages between pairs of data points until convergence. A set of data points are chosen as exemplars and the remaining data points are assigned to the nearest exemplar.
The main parameter for Affinity Propagation is:
• preference: Controls how many exemplars are used. A higher value makes more points likely to be chosen as exemplars, producing more clusters, while a lower (more negative) value results in fewer clusters.
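To see the effect of preference directly, a short sketch (the preference values are arbitrary choices for illustration; not part of the original notebook) that refits Affinity Propagation on X and reports the number of clusters found.
# Sketch: how the preference value changes the number of exemplars/clusters
from sklearn.cluster import AffinityPropagation
for pref in (-200, -50, -10):
    af_p = AffinityPropagation(preference=pref).fit(X)
    print(f"preference={pref}: {len(af_p.cluster_centers_indices_)} clusters")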
Conclusion
¶
Affinity Propagation is an adaptive clustering method that determines the number of clusters based on the data. It can be particularly useful when the number of clusters is not known in advance.
Exercise 8 Understanding OPTICS Clustering
¶
Introduction
¶
The OPTICS (Ordering Points To Identify Clustering Structure) algorithm is a density-
based clustering method similar to DBSCAN. However, unlike DBSCAN, which partitions
the dataset into clusters, OPTICS creates an augmented ordering of the dataset representing its density-based clustering structure. This ordering can then be visualized and analyzed to extract the clusters.
Problem Statement
¶
Given a dataset of 2D points, use the OPTICS algorithm to identify its clustering structure. The goal is to understand how OPTICS works, its parameters, and how to interpret the clustering results.
In [15]:
# Generating a sample dataset with two crescent-shaped clusters and some noise
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with two crescent-shaped clusters. Let's apply the OPTICS algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the OPTICS algorithm.
2. Visualize the reachability plot.
3. Visualize the clusters.
4. Interpret the result.
In [16]:
from sklearn.cluster import OPTICS
# Applying OPTICS
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05)
labels = optics.fit_predict(X)
# Visualizing the reachability plot
plt.figure(figsize=(10, 4))
plt.title('Reachability plot')
plt.plot(optics.reachability_[optics.ordering_])
plt.xlabel('Points')
plt.ylabel('Reachability Distance')
plt.show()
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by OPTICS')
plt.show()
Interpretation
¶
The reachability plot shows the reachability distances for each point in the dataset. Peaks in the plot represent cluster boundaries.
In the scatter plot, OPTICS has identified clusters based on the density of the data points. The main parameters for OPTICS are:
• min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
• xi: Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.
• min_cluster_size: The minimum number of samples in a cluster, expressed as an absolute number or as a fraction of the dataset (0.05 here).
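Because the fitted OPTICS model above stores the full reachability ordering, scikit-learn's cluster_optics_dbscan can extract DBSCAN-style clusters at a chosen eps without refitting; in this sketch (not part of the original notebook) the eps value is an illustrative assumption.
# Sketch: extracting DBSCAN-like clusters from the fitted OPTICS model at a fixed eps
from sklearn.cluster import cluster_optics_dbscan
labels_eps = cluster_optics_dbscan(reachability=optics.reachability_,
                                   core_distances=optics.core_distances_,
                                   ordering=optics.ordering_,
                                   eps=0.3)   # assumed cut level
n_clusters_eps = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
print("Clusters at eps=0.3:", n_clusters_eps)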
Conclusion
¶
OPTICS is a versatile clustering method that can identify clusters of varying shapes
and densities. It provides a reachability plot that offers insights into the clustering structure of the dataset.
Exercise 9 Understanding BIRCH Clustering
¶
Introduction
¶
The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm is a hierarchical clustering method designed specifically for very large datasets. It incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources.
Problem Statement
¶
Given a dataset of 2D points, use the BIRCH algorithm to identify its clustering structure. The goal is to understand how BIRCH works, its parameters, and how to interpret the clustering results.
In [17]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the BIRCH algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the BIRCH algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [18]:
from sklearn.cluster import Birch
# Applying BIRCH
birch = Birch(n_clusters=5, threshold=0.5, branching_factor=50)
labels = birch.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by BIRCH')
plt.show()
Interpretation
¶
BIRCH has identified clusters based on the density and distribution of the data points. The main parameters for BIRCH are:
• n_clusters: The number of clusters after the final clustering step, which treats the leaf clusters from the CF-tree as new samples.
• threshold: The radius of the sub-cluster obtained by merging a new sample and the closest sub-cluster should be less than the threshold; otherwise, a new sub-cluster is started.
• branching_factor: Maximum number of CF sub-clusters in each node.
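Since BIRCH is aimed at datasets too large to process at once, the hedged sketch below (the chunking is artificial, purely for illustration, and not part of the original notebook) feeds the same data incrementally through partial_fit before predicting labels.
# Sketch: feeding data to BIRCH in chunks via partial_fit
import numpy as np
from sklearn.cluster import Birch
brc = Birch(n_clusters=5, threshold=0.5, branching_factor=50)
for chunk in np.array_split(X, 5):   # pretend the data arrives in 5 batches
    brc.partial_fit(chunk)           # the CF-tree is updated incrementally
labels_stream = brc.predict(X)       # final cluster assignments
print("Clusters found:", len(set(labels_stream)))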
Conclusion
¶
BIRCH is an efficient clustering method that can handle large datasets by summarizing data in a tree structure and then clustering the leaf nodes. It's particularly useful for datasets where in-memory processing of the full dataset is not feasible.
Exercise 10 Understanding Agglomerative Clustering
¶
Introduction
¶
Agglomerative Clustering is a type of hierarchical clustering method that builds a hierarchy of clusters by successively merging or "agglomerating" groups of data points. The process starts with each data point as its own cluster and then pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all data points.
Problem Statement
¶
Given a dataset of 2D points, use the Agglomerative Clustering algorithm to identify its
clustering structure. The goal is to understand how Agglomerative Clustering works, its
linkage criteria, and how to interpret the clustering results.
In [19]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Agglomerative Clustering algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the Agglomerative Clustering algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [20]:
from sklearn.cluster import AgglomerativeClustering
# Applying Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agg_clustering.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by Agglomerative Clustering')
plt.show()
Interpretation
¶
Agglomerative Clustering has identified clusters based on the density and distribution of the data points. The main parameters for Agglomerative Clustering are:
• n_clusters: The number of clusters to find.
• linkage: The linkage criterion determines the distance used between two clusters. The 'ward' method minimizes the variance of the clusters being merged.
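To make the linkage choice concrete, a brief sketch (not part of the original exercise) comparing several linkage criteria on the same blobs data using the silhouette score.
# Sketch: comparing linkage criteria by silhouette score
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
for link in ('ward', 'complete', 'average', 'single'):
    lbls = AgglomerativeClustering(n_clusters=5, linkage=link).fit_predict(X)
    print(f"{link:<8s} silhouette = {silhouette_score(X, lbls):.3f}")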
Conclusion
¶
Agglomerative Clustering is a bottom-up approach where each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy. It's particularly useful for understanding hierarchical structures in the data.
Exercise 11 Understanding Clustering Using Representatives
¶
Introduction
¶
Clustering using representatives involves selecting a subset of data points as representatives and then assigning other data points to the nearest representative. This approach can be seen as a variation of the K-Means clustering algorithm, where the representatives play the role of the centroids.
Problem Statement
¶
Given a dataset of 2D points, use the Clustering Using Representatives method to identify its clustering structure. The goal is to understand how this method works, how to select representatives, and how to interpret the clustering results.
In [21]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Clustering Using Representatives method to this dataset and see how it performs.
The steps are as follows:
1. Randomly select a subset of data points as representatives.
2. Assign each data point to the nearest representative.
3. Visualize the clusters.
4. Interpret the result.
In [22]:
import numpy as np
from sklearn.metrics import pairwise_distances_argmin
# Randomly select representatives
rng = np.random.RandomState(42)
i = rng.permutation(X.shape[0])[:5]
representatives = X[i]
while True:
    # Assign labels based on closest representative
    labels = pairwise_distances_argmin(X, representatives)
    # Find new representatives from data points
    new_representatives = np.array([X[labels == i].mean(0) for i in range(5)])
    # Check for convergence
    if np.all(representatives == new_representatives):
        break
    representatives = new_representatives
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(representatives[:, 0], representatives[:, 1], c='black', s=200, alpha=0.75)
plt.title('Clusters identified by Clustering Using Representatives')
plt.show()
Interpretation
¶
The black points in the plot represent the representatives of each cluster. All other points are colored based on the representative they are closest to.
The main idea behind this method is to reduce the computational complexity by working with a subset of representative points rather than the entire dataset. This can be particularly useful for large datasets.
Conclusion
¶
Clustering Using Representatives is a heuristic method that can provide good clustering results with reduced computational overhead. It's particularly useful for preliminary analysis or when dealing with very large datasets.
Exercise 12 Understanding Partitioning Around Medoids (PAM)
¶
Introduction
¶
Partitioning Around Medoids (PAM) is a clustering algorithm similar to K-Means. However, instead of computing the mean of the items in each cluster, it identifies actual data points as the medoids. A medoid is the most centrally located data point in
a cluster.
Problem Statement
¶
Given a dataset of 2D points, apply the PAM algorithm to identify its clustering structure. The goal is to understand how PAM works, how medoids are selected, and how to interpret the clustering results.
In [23]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the PAM algorithm to this dataset and see how it performs.
The steps are as follows:
1. Initialize medoids randomly.
2. Assign each data point to the nearest medoid.
3. For each medoid and non-medoid data point, swap them and see if the total distance decreases. If so, perform the swap.
4. Repeat steps 2-3 until no change.
5. Visualize the clusters.
6. Interpret the result.
In [24]:
#!pip install scikit-learn-extra
In [25]:
from sklearn_extra.cluster import KMedoids
# Applying PAM clustering
kmedoids = KMedoids(n_clusters=5, random_state=42).fit(X)
# Getting labels and medoids
labels = kmedoids.labels_
medoids = kmedoids.cluster_centers_
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('Clusters identified by PAM')
plt.show()
Interpretation
¶
The red "X" markers in the plot represent the medoids of each cluster. All other points are colored based on the medoid they are closest to.
Unlike K-Means, which calculates the centroid of a cluster, PAM selects actual data points as medoids. This can make PAM more robust to outliers compared to K-Means.
Conclusion
¶
Partitioning Around Medoids (PAM) is a robust clustering method that works well when clusters are non-spherical or when the dataset contains noise and outliers. It's particularly useful when the notion of a "center" of a cluster (like in K-Means) doesn't make sense.
Exercise 13 Understanding Fuzzy C-Means Clustering
¶
Introduction
¶
Fuzzy C-Means (FCM) is a clustering method that allows data points to belong to multiple clusters with varying degrees of membership. This method is based on the minimization of the following objective function:
$J_m(U, C) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m ||x_i - c_j||^2$
Where:
• $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$
• $x_i$ is the $i^{th}$ point of the $d$-dimensional measured data
• $c_j$ is the $d$-dimensional center of the cluster
• $m$ is any real number greater than 1
• $\|\cdot\|$ is any norm expressing the similarity between the measured data and the center
Problem Statement
¶
Given a dataset of 2D points, apply the Fuzzy C-Means clustering algorithm to identify its clustering structure. The goal is to understand how FCM works, how cluster centers are determined, and how to interpret the fuzzy memberships of data points.
In [26]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Fuzzy C-Means algorithm to this dataset and see how it performs.
The steps are as follows:
1. Initialize cluster centers randomly.
2. Calculate the membership matrix based on the distances between data points and cluster centers.
3. Update the cluster centers based on the membership matrix.
4. Repeat steps 2-3 until convergence.
5. Visualize the clusters.
6. Interpret the result.
In [27]:
#!pip install fuzzy-c-means
In [28]:
from fcmeans import FCM
# Applying FCM clustering
fcm = FCM(n_clusters=5)
fcm.fit(X)
# Getting labels and centers
labels = fcm.predict(X)
centers = fcm.centers
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('Clusters identified by Fuzzy C-Means')
plt.show()
Interpretation
¶
The red "X" markers in the plot represent the centers of each cluster. All other points are colored based on the cluster they have the highest membership in.
However, unlike hard clustering methods, each data point has memberships in all clusters. These memberships sum to 1 and indicate the degree to which a data point belongs to each cluster.
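To inspect those membership degrees directly, the sketch below assumes the fcmeans package exposes a soft_predict method returning one membership column per cluster; verify the method name against the installed version before relying on it.
# Sketch: inspecting fuzzy membership degrees (method name assumed; check your fcmeans version)
memberships = fcm.soft_predict(X)     # assumed API: shape (n_samples, n_clusters), rows sum to 1
print("Memberships of the first three points:")
print(memberships[:3].round(3))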
Conclusion
¶
Fuzzy C-Means clustering provides a more nuanced picture of cluster membership than hard clustering methods. It's handy when the boundaries between clusters are ambiguous or when we want to capture the degree of membership of data points in multiple clusters.
Exercise 14 Case Study: K-Means Clustering on the Iris Dataset
¶
Introduction
¶
The Iris dataset is one of the most popular datasets in the field of machine learning. It contains measurements of 150 iris flowers from three different species: Setosa, Versicolour, and Virginica. In this case study, we will apply the k-means clustering algorithm to this dataset to see if we can cluster these flowers into groups that match their actual species.
Problem Statement
¶
Given the Iris dataset, apply the k-means clustering algorithm to identify potential clusters within the data. Use the elbow method to determine the optimal number of clusters. The goal is to understand how k-means works, how to validate the number of clusters using the elbow method, and how the clusters compare to the actual species of the flowers.
In [29]:
# Importing necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Loading the Iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Displaying the first few rows of the dataset
df.head()
Out[29]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
We have loaded the Iris dataset, which contains four features for each flower: sepal length, sepal width, petal length, and petal width.
Next, we will apply the k-means clustering algorithm to this dataset. But first, we need to determine the optimal number of clusters using the elbow method.
In [30]:
# Using the elbow method to find the optimal number of clusters
wcss = [] # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df)
    wcss.append(kmeans.inertia_)
# Plotting the elbow method graph
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
From the elbow method graph, we can observe that the optimal number of clusters is around 3, as this is where the decrease in WCSS begins to slow down, forming an "elbow".
With this information, we can proceed to apply k-means clustering with 3 clusters.
In [31]:
# Applying k-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(df)
# Visualizing the clusters on the first two columns
plt.figure(figsize=(12,6))
plt.scatter(df.iloc[y_kmeans == 0, 0], df.iloc[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(df.iloc[y_kmeans == 1, 0], df.iloc[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(df.iloc[y_kmeans == 2, 0], df.iloc[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of Iris Flowers')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend()
plt.show()
Interpretation
¶
The visualization shows the three clusters formed by the k-means algorithm on the basis of sepal length and sepal width. The yellow points represent the centroids of the clusters.
• Cluster 1 (Red): Flowers with small sepal length and width.
• Cluster 2 (Blue): Flowers with medium to large sepal length and small sepal width.
• Cluster 3 (Green): Flowers with medium sepal length and large sepal width.
These clusters provide insights into the natural groupings within the Iris dataset based on the features provided.
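Since the problem statement asks how the clusters compare to the actual species, a short sketch (not part of the original notebook) cross-tabulates the k-means labels against the true targets and computes the adjusted Rand index.
# Sketch: comparing the k-means clusters with the true species labels
from sklearn.metrics import adjusted_rand_score
print(pd.crosstab(data.target, y_kmeans, rownames=['true species'], colnames=['cluster']))
print("Adjusted Rand index:", round(adjusted_rand_score(data.target, y_kmeans), 3))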
Conclusion
¶
K-means clustering provides a powerful method for identifying clusters within data. By using the elbow method, we can determine the optimal number of clusters, ensuring that our clustering results are meaningful and interpretable.
Summary
¶
When employing clustering techniques on real-world datasets, beginning with a clear understanding of the dataset and its features is essential. Start by performing exploratory data analysis (EDA) to visualize the data, detect outliers, and determine any evident patterns or structures. Pre-processing is a critical next step; normalize or standardize numerical data to ensure no variable unduly influences the clustering outcome because of its scale. Appropriate encoding techniques, like one-hot encoding,
should be applied for categorical data.
Choosing the suitable clustering algorithm depends on the nature of your data and the
problem at hand. For datasets with distinct globular structures, K-means can be a good choice; however, DBSCAN or hierarchical clustering may be more suitable if the clusters have varying shapes. The number of clusters, a crucial parameter for some algorithms, can often be determined using the elbow method for K-means. For high-
dimensional datasets, consider dimensionality reduction techniques like PCA before clustering to mitigate the curse of dimensionality.
Lastly, once clusters are formed, their quality and relevance must be evaluated and validated. Use metrics like the silhouette score or the Davies-Bouldin index when true labels are unknown. Visualization tools, such as scatter plots, pair plots, or t-SNE, can offer insights into the clustering results. Always interpret the clusters in the context of the application domain, as their actionable insights ultimately determine the real-world significance and utility of the clusters.
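As a concrete illustration of the internal validation metrics mentioned above, a hedged sketch that scores the Iris clustering from Exercise 14 (assuming df and y_kmeans are still in scope):
# Sketch: internal validation of the Exercise 14 clustering
from sklearn.metrics import silhouette_score, davies_bouldin_score
print("Silhouette score    :", round(silhouette_score(df, y_kmeans), 3))
print("Davies-Bouldin index:", round(davies_bouldin_score(df, y_kmeans), 3))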
Revised Date: November 5, 2023
¶
In [ ]: