IE6400_Day19

IE6400 Foundations of Data Analytics Engineering, Fall 2023
Module 3: Clustering Methods, Part 2

Clustering Methods

Clustering is an unsupervised learning technique used to group similar data points into clusters. The goal is to partition a dataset such that points in the same group are more similar to each other than to those in other groups. There are several clustering methods, each suitable for different types of data and applications.

1. K-Means Clustering
Description: Partitions the data into K distinct, non-overlapping clusters based on distances to the center of each cluster.
Algorithm:
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recompute the centroids based on the current cluster assignments.
4. Repeat steps 2-3 until convergence.

2. Hierarchical Clustering
Description: Creates a tree of clusters. Can be agglomerative (bottom-up) or divisive (top-down).
Algorithm (agglomerative):
1. Treat each data point as a single cluster (n clusters in total).
2. Find the two clusters that are closest and merge them.
3. Repeat step 2 until only one cluster remains.

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Description: Clusters are dense regions in the data space, separated by areas of lower point density. Can find arbitrarily shaped clusters.
Algorithm:
1. For each point, count the number of points within a specified radius (ε).
2. If a point has more than a specified number (MinPts) of neighbors, it is a core point.
3. Connect core points that lie within ε of each other into clusters, and assign reachable border points to those clusters.
4. Points not in any cluster are treated as noise.

4. Mean Shift Clustering
Description: Based on shifting centroids towards the modes of the data density.
Algorithm:
1. Initialize centroids randomly.
2. Update each centroid to the mean of the data points within a given window.
3. Repeat step 2 until convergence.

5. Gaussian Mixture Models (GMM)
Description: Assumes the data is generated from a mixture of several Gaussian distributions.
Algorithm: Uses the Expectation-Maximization (EM) algorithm to estimate the model parameters.

6. Spectral Clustering
Description: Uses the eigenvalues of the similarity matrix to reduce dimensionality before clustering in a lower-dimensional space.
Algorithm:
1. Build a similarity graph.
2. Compute the Laplacian of this graph.
3. Extract eigenvectors and use them as features.
4. Apply K-means or another method to these new features.

7. Affinity Propagation
Description: Works by passing messages between data points to determine clusters.
Algorithm:
1. Send responsibility and availability messages between points.
2. Iteratively update the messages and decide the cluster exemplars.

8. OPTICS (Ordering Points To Identify the Clustering Structure)
Description: Similar to DBSCAN but can handle clusters of varying densities.
Algorithm: Builds a reachability plot and extracts clusters from it.

9. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Description: Designed for very large datasets. Builds a tree structure during clustering.
Algorithm:
1. Incrementally construct a CF (Clustering Feature) tree.
2. Use the CF tree for clustering.

10. Agglomerative Clustering
Description: Similar to hierarchical clustering but typically stops when it reaches a desired number of clusters.
Algorithm:
1. Start with each data point as a single cluster.
2. Recursively merge the closest pair of clusters.
3. Stop when the desired number of clusters is reached.

11. CURE (Clustering Using Representatives)
Description: A robust method for clustering with outliers that represents each cluster by multiple scattered points.
Algorithm:
1. Select a set of representative points for each cluster.
2. Shrink the representatives towards the center of the cluster.
3. Assign points to the cluster of the closest representative.

12. K-Medoids or PAM (Partitioning Around Medoids)
Description: Similar to K-means but uses actual data points as cluster centers (medoids) to reduce the influence of outliers.
Algorithm:
1. Randomly select K data points as initial medoids.
2. Assign each data point to the nearest medoid.
3. Recompute the medoids.
4. Repeat steps 2-3 until convergence.

13. COBWEB
Description: A hierarchical clustering method for categorical data.
Algorithm:
1. Construct a tree structure.
2. Place and categorize instances so as to maximize the category utility.

14. Fuzzy C-Means
Description: A soft clustering method in which each data point has a degree of belonging to every cluster, rather than belonging strictly to one cluster.
Algorithm:
1. Assign each data point a membership weight for each cluster.
2. Compute cluster centers based on the weights.
3. Update the weights for each data point based on its distances to the cluster centers.
4. Repeat steps 2-3 until convergence.

15. ROCK (RObust Clustering using linKs)
Description: Designed for categorical data; uses links (similar pairs) for clustering.
Algorithm:
1. Count the number of common neighbors between points to form links.
2. Form clusters based on these links.

16. CLIQUE (CLustering In QUEst)
Description: A grid-based clustering algorithm designed to find dense clusters in subspaces of any dimensionality.
Algorithm:
1. Partition each dimension into non-overlapping intervals.
2. Identify dense units in each dimension.
3. Form clusters based on the connectivity of dense units.

17. SNN (Shared Nearest Neighbor)
Description: Considers two objects similar if they have many neighbors in common.
Algorithm:
1. Compute the k-nearest neighbors of each point.
2. Assign similarity based on shared neighbors.
3. Cluster based on these similarities.

18. STREAM (Spatio-TEmporal Real-time Algorithm for clustering Moving objects)
Description: Designed for real-time clustering of moving objects.
Algorithm:
1. Uses micro-clusters to summarize the current state.
2. Periodically merges these micro-clusters to form the final clusters.

19. SUBCLU (SUBspace CLUstering)
Description: Discovers clusters in arbitrary subspaces of a dataset.
Algorithm:
1. Uses a bottom-up approach.
2. Discovers dense regions in each dimension and merges them to identify clusters.

Conclusion
The choice of clustering method often depends on the dataset's size, dimensionality, and nature, the desired shape and structure of the clusters, and any underlying assumptions about the data. Pre-processing, such as normalization or standardization, and the right distance or similarity measure can also play crucial roles in the clustering outcome. The variety of clustering methods available offers flexibility in handling different types of datasets and challenges such as noise, outliers, and non-globular cluster shapes. The choice of method should be guided by the nature of the data and the specific requirements of the application.

Exercise 1
Understanding K-Means Clustering

Introduction
K-Means is one of the most popular clustering algorithms. It aims to partition a set of observations into a number of clusters (k), resulting in a partition of the dataset into Voronoi cells. It works iteratively, assigning each data point to one of K groups based on the features provided.

Problem Statement
Given a dataset of 2D points, use the K-Means clustering algorithm to group them into clusters. The goal is to understand how K-Means works and how to visualize and
interpret the clustering results.

In [1]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
import warnings

# Setting the warnings to be ignored
warnings.filterwarnings('ignore')

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset. The steps are as follows:
1. Define the number of clusters k.
2. Apply KMeans clustering to the dataset.
3. Visualize the clusters and centroids.
4. Interpret the result.

In [2]:
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

# Defining the number of clusters
k = 4

# Applying KMeans clustering
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters and centroids
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=50, c='yellow', label='Cluster 4')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.legend()
plt.title("Clusters of data points")
plt.show()

Interpretation
The visualization shows the data points grouped into four distinct clusters, as indicated by the different colors. The black 'X' markers represent the centroids of these clusters. KMeans has successfully identified the underlying patterns in the dataset and grouped the data points accordingly.

Conclusion
K-Means clustering is a powerful algorithm for partitioning data into distinct groups or clusters. By visualizing and interpreting the results, we can gain insights into the underlying structure of the dataset and make informed decisions based on the clustering results.

Exercise 2
Understanding Hierarchical Clustering

Introduction
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters by either a bottom-up or top-down approach. The bottom-up approach (agglomerative) starts with individual data points and merges them into larger clusters, while the top-down approach (divisive) starts with the entire dataset and divides it into smaller clusters.

Problem Statement
Given a dataset of 2D points, use the hierarchical clustering algorithm to group them into clusters. The goal is to understand how hierarchical clustering works, how to visualize the dendrogram, and how to interpret the clustering results.

In [3]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with four clusters. Next, we will apply the agglomerative hierarchical clustering algorithm to this dataset. The steps are as follows:
1. Compute the distance matrix.
2. Apply agglomerative hierarchical clustering.
3. Visualize the dendrogram.
4. Interpret the result.

In [4]:
import scipy.cluster.hierarchy as sch

# Computing the linkage matrix (Ward) and plotting the dendrogram
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distances')
plt.show()

Interpretation
The dendrogram visualizes the way clusters are merged. The y-axis represents the Euclidean distances between data points (or clusters). By drawing a horizontal line across the dendrogram, we can decide the number of clusters for our dataset. For instance, if we draw a horizontal line at a distance of 25, it intersects the dendrogram at four vertical lines, suggesting an optimal cluster number of four.

Conclusion
Hierarchical clustering provides a tree-based representation of the data points, which can be useful for understanding the hierarchical structure of the dataset. The dendrogram is a powerful tool for visualizing and deciding the number of clusters.

Exercise 3
Understanding DBSCAN Clustering

Introduction
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can find clusters of different shapes and sizes in a large amount of data. Unlike K-Means or hierarchical clustering, DBSCAN does not require the number of clusters to be specified. It can also identify points that are not part of any cluster, which are typically considered noise.

Problem Statement
Given a dataset of 2D points, use the DBSCAN clustering algorithm to group them into clusters. The goal is to understand how DBSCAN works, its parameters, and how to interpret the clustering results.

In [5]:
# Generating a sample dataset with clusters of different shapes
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with two crescent-shaped clusters. Traditional clustering methods like K-Means would struggle to identify these non-spherical clusters. Let's see how DBSCAN performs on this dataset. The steps are as follows:
1. Apply the DBSCAN algorithm.
2. Visualize the clusters.
3. Interpret the result.

In [6]:
from sklearn.cluster import DBSCAN

# Applying DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Clusters identified by DBSCAN')
plt.show()
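Before interpreting the plot, it can help to count how many clusters DBSCAN found and how many points it flagged as noise. The snippet below is a minimal sketch that reuses the clusters labels from the cell above (the label -1 denotes noise):

import numpy as np

# Number of clusters, excluding the noise label (-1)
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
# Number of points labeled as noise
n_noise = int(np.sum(clusters == -1))

print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")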
Interpretation
The DBSCAN algorithm has successfully identified the two crescent-shaped clusters. Points that are not part of any cluster are labeled -1, which represents noise. The two main parameters of DBSCAN are:
eps : The maximum distance between two samples for one to be considered as in the neighborhood of the other.
min_samples : The number of samples in a neighborhood for a point to be considered a core point.
By adjusting these parameters, we can control the density required to form a cluster and the minimum size of clusters.

Conclusion
DBSCAN is a powerful clustering algorithm that can identify clusters of various shapes without the need to specify the number of clusters. It is particularly useful for datasets with non-spherical clusters and noise.

Exercise 4
Understanding Mean Shift Clustering

Introduction
Mean Shift clustering is a non-parametric, density-based clustering algorithm. Unlike K-Means, which tries to minimize the distance between cluster members and their centroid, Mean Shift assigns data points to clusters iteratively by shifting points towards the mode, i.e., the region with the highest density of data points.

Problem Statement
Given a dataset of 2D points, use the Mean Shift clustering algorithm to group them into clusters. The goal is to understand how Mean Shift works, its parameters, and how to interpret the clustering results.

In [7]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's see how Mean Shift performs on this dataset. The steps are as follows:
1. Apply the Mean Shift algorithm.
2. Visualize the clusters.
3. Interpret the result.

In [8]:
from sklearn.cluster import MeanShift

# Applying Mean Shift
mean_shift = MeanShift(bandwidth=1.5)
clusters = mean_shift.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Clusters identified by Mean Shift')
plt.show()
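The bandwidth above was chosen by hand. As an alternative, scikit-learn's estimate_bandwidth helper derives a bandwidth from the pairwise distances of the data; the sketch below is illustrative, and the quantile=0.2 value is an arbitrary choice rather than part of the original exercise:

from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate a bandwidth from the data instead of hard-coding it
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw)
labels_auto = ms.fit_predict(X)
print(f"Estimated bandwidth: {bw:.2f}, clusters found: {len(ms.cluster_centers_)}")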
Interpretation
The Mean Shift algorithm has identified the clusters in the dataset. The number of clusters is determined by the algorithm based on the data distribution; we do not need to specify it beforehand. The main parameter of Mean Shift is:
bandwidth : The radius of the region (or window) in which the algorithm searches for data points to form a cluster. A smaller bandwidth will result in more clusters, while a larger bandwidth may merge distinct clusters.

Conclusion
Mean Shift is a versatile clustering algorithm that can identify clusters without the need to specify their number beforehand. It is particularly useful for datasets where the number of clusters is not known a priori.

Exercise 5
Understanding Gaussian Mixture Models (GMM)

Introduction
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of several Gaussian distributions with unknown parameters. It attempts to find the mixture of multi-dimensional Gaussian probability distributions that best models the input dataset.

Problem Statement
Given a dataset of 2D points, use the GMM algorithm to group them into clusters. The goal is to understand how GMM works, its parameters, and how to interpret the clustering results.

In [9]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's see how GMM performs on this dataset. The steps are as follows:
1. Apply the GMM algorithm.
2. Visualize the clusters.
3. Interpret the result.

In [10]:
from sklearn.mixture import GaussianMixture

# Applying GMM
gmm = GaussianMixture(n_components=5)
gmm.fit(X)
labels = gmm.predict(X)

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by GMM')
plt.show()
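Because GMM is a soft clustering method, each point also receives a probability of belonging to every component. A short sketch using the fitted gmm object from the cell above:

import numpy as np

# Posterior probabilities: one row per point, one column per Gaussian component
probs = gmm.predict_proba(X)
print(probs.shape)             # (300, 5)
print(np.round(probs[:3], 3))  # memberships of the first three points (each row sums to 1)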
Interpretation
The GMM algorithm has identified the clusters in the dataset. Unlike K-means, which assigns each point to a single cluster, GMM estimates the probability that a given data point belongs to each of the clusters. This makes GMM a soft clustering method. The main parameter of GMM is:
n_components : The number of mixture components, i.e., the number of Gaussian distributions.

Conclusion
Gaussian Mixture Models provide a probabilistic approach to clustering, allowing for more flexibility in cluster shape compared to methods like K-means. They are particularly useful for datasets where clusters are not necessarily spherical or have different sizes.

Exercise 6
Understanding Spectral Clustering

Introduction
Spectral Clustering is a modern clustering technique that can capture complex cluster structures. It can identify clusters that are non-convex, and even clusters with a nested ring structure, which traditional algorithms like K-means might not capture effectively. The technique uses the eigenvalues of the similarity matrix of the data to reduce dimensionality before applying another clustering algorithm (such as K-means).

Problem Statement
Given a dataset of 2D points, use the Spectral Clustering algorithm to group them into clusters. The goal is to understand how Spectral Clustering works, its parameters, and how to interpret the clustering results.

In [11]:
# Generating a sample dataset with moons
from sklearn.datasets import make_moons
X, _ = make_moons(200, noise=.05, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with two crescent moon-shaped clusters. Traditional clustering algorithms might struggle with such data. Let's see how Spectral Clustering performs on this dataset. The steps are as follows:
1. Apply the Spectral Clustering algorithm.
2. Visualize the clusters.
3. Interpret the result.

In [12]:
from sklearn.cluster import SpectralClustering

# Applying Spectral Clustering
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = sc.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by Spectral Clustering')
plt.show()
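For contrast, running plain K-means on the same two-moons data typically cuts the moons in half, since K-means assumes roughly spherical clusters. A quick comparison sketch (not part of the original exercise), reusing X from above:

from sklearn.cluster import KMeans

# K-means on the same data, for comparison with the spectral result
km_labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis', s=50)
plt.title('K-means on the two-moons data (for comparison)')
plt.show()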
Interpretation
Spectral Clustering has effectively identified the two crescent moon-shaped clusters. The algorithm works by creating a similarity matrix of the data points and then finding the eigenvectors corresponding to the top eigenvalues. These eigenvectors are used to transform the data into a lower-dimensional space, where a traditional clustering algorithm (like K-means) can be applied. The main parameters of Spectral Clustering are:
n_clusters : The number of clusters to form.
affinity : How to construct the affinity matrix. 'nearest_neighbors' computes the similarity matrix using the nearest neighbors of each point.
assign_labels : The strategy used to assign labels in the embedding space. Here, we used 'kmeans'.

Conclusion
Spectral Clustering is a powerful technique that can capture complex cluster structures, making it suitable for datasets where traditional methods might fail.

Exercise 7
Understanding Affinity Propagation

Introduction
Affinity Propagation is a clustering algorithm that identifies "exemplars" among the data points and forms clusters based on these exemplars. Unlike other clustering methods that require the number of clusters to be specified in advance, Affinity Propagation determines the number of clusters from the data, making it particularly useful for datasets where the number of clusters is not known a priori.

Problem Statement
Given a dataset of 2D points, use the Affinity Propagation algorithm to group them into clusters. The goal is to understand how Affinity Propagation works, its parameters, and how to interpret the clustering results.

In [13]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(300, centers=5, cluster_std=0.60, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's apply the Affinity Propagation algorithm to this dataset and see how it performs. The steps are as follows:
1. Apply the Affinity Propagation algorithm.
2. Visualize the clusters and exemplars.
3. Interpret the result.

In [14]:
from sklearn.cluster import AffinityPropagation

# Applying Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(X[cluster_centers_indices, 0], X[cluster_centers_indices, 1], s=200, c='red', marker='X', label='Exemplars')
plt.title('Clusters identified by Affinity Propagation')
plt.legend()
plt.show()
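Since Affinity Propagation chooses the number of clusters itself, it is worth reporting how many exemplars it settled on. A minimal sketch using the fitted objects from the cell above:

# Number of exemplars (one per cluster) chosen by the algorithm
n_exemplars = len(cluster_centers_indices)
print(f"Affinity Propagation found {n_exemplars} clusters")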
Interpretation
Affinity Propagation has identified clusters and their corresponding exemplars (marked with red 'X'). The algorithm works by sending messages between pairs of data points until convergence. A set of data points is chosen as exemplars, and the remaining data points are assigned to the nearest exemplar. The main parameter of Affinity Propagation is:
preference : Controls how many exemplars are used. A higher value results in more exemplars (more clusters), while a lower (more negative) value, such as the -50 used here, results in fewer clusters.

Conclusion
Affinity Propagation is an adaptive clustering method that determines the number of clusters from the data. It can be particularly useful when the number of clusters is not known in advance.

Exercise 8
Understanding OPTICS Clustering

Introduction
The OPTICS (Ordering Points To Identify the Clustering Structure) algorithm is a density-based clustering method similar to DBSCAN. However, unlike DBSCAN, which partitions the dataset into clusters, OPTICS creates an augmented ordering of the dataset representing its density-based clustering structure. This ordering can then be visualized and analyzed to extract the clusters.

Problem Statement
Given a dataset of 2D points, use the OPTICS algorithm to identify its clustering structure. The goal is to understand how OPTICS works, its parameters, and how to interpret the clustering results.

In [15]:
# Generating a sample dataset with two crescent-shaped clusters
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with two crescent-shaped clusters. Let's apply the OPTICS algorithm to this dataset and see how it performs. The steps are as follows:
1. Apply the OPTICS algorithm.
2. Visualize the reachability plot.
3. Visualize the clusters.
4. Interpret the result.

In [16]:
from sklearn.cluster import OPTICS

# Applying OPTICS
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05)
labels = optics.fit_predict(X)

# Visualizing the reachability plot
plt.figure(figsize=(10, 4))
plt.title('Reachability plot')
plt.plot(optics.reachability_[optics.ordering_])
plt.xlabel('Points')
plt.ylabel('Reachability Distance')
plt.show()

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by OPTICS')
plt.show()
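A useful property of OPTICS is that a DBSCAN-style clustering at any given eps can be extracted from the same ordering without refitting, via scikit-learn's cluster_optics_dbscan helper. The sketch below is illustrative; eps=0.5 is an arbitrary value, not one prescribed by the exercise:

from sklearn.cluster import cluster_optics_dbscan

# Extract a DBSCAN-like labeling at eps = 0.5 from the fitted OPTICS model
labels_05 = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)
print(f"Clusters at eps=0.5: {len(set(labels_05)) - (1 if -1 in labels_05 else 0)}")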
Interpretation
The reachability plot shows the reachability distance for each point in the dataset, in the order produced by OPTICS. Peaks in the plot represent cluster boundaries. In the scatter plot, OPTICS has identified clusters based on the density of the data points. The main parameters of OPTICS are:
min_samples : The number of samples in a neighborhood for a point to be considered a core point.
xi : The minimum steepness on the reachability plot that constitutes a cluster boundary.
min_cluster_size : The minimum number of samples in a cluster.

Conclusion
OPTICS is a versatile clustering method that can identify clusters of varying shapes and densities. It provides a reachability plot that offers insight into the clustering structure of the dataset.

Exercise 9
Understanding BIRCH Clustering

Introduction
The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm is a hierarchical clustering method designed specifically for very large datasets. It incrementally and dynamically clusters incoming multi-dimensional metric data points, aiming to produce the best quality clustering with the available resources.

Problem Statement
Given a dataset of 2D points, use the BIRCH algorithm to identify its clustering structure. The goal is to understand how BIRCH works, its parameters, and how to interpret the clustering results.

In [17]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's apply the BIRCH algorithm to this dataset and see how it performs. The steps are as follows:
1. Apply the BIRCH algorithm.
2. Visualize the clusters.
3. Interpret the result.

In [18]:
from sklearn.cluster import Birch

# Applying BIRCH
birch = Birch(n_clusters=5, threshold=0.5, branching_factor=50)
labels = birch.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by BIRCH')
plt.show()

Interpretation
BIRCH has identified clusters based on the density and distribution of the data points. The main parameters of BIRCH are:
n_clusters : The number of clusters after the final clustering step, which treats the leaf clusters from the CF-tree as new samples.
threshold : The radius of the sub-cluster obtained by merging a new sample and the closest sub-cluster must be smaller than this threshold; otherwise, a new sub-cluster is started.
branching_factor : The maximum number of CF sub-clusters in each node.

Conclusion
BIRCH is an efficient clustering method that can handle large datasets by summarizing the data in a tree structure and then clustering the leaf nodes. It is particularly useful for datasets where in-memory processing of all points is not feasible.

Exercise 10
Understanding Agglomerative Clustering

Introduction
Agglomerative Clustering is a type of hierarchical clustering that builds a hierarchy of clusters by successively merging, or "agglomerating", groups of data points. The process starts with each data point as its own cluster, and pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all data points.

Problem Statement
Given a dataset of 2D points, use the Agglomerative Clustering algorithm to identify its clustering structure. The goal is to understand how Agglomerative Clustering works, its
linkage criteria, and how to interpret the clustering results.

In [19]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's apply the Agglomerative Clustering algorithm to this dataset and see how it performs. The steps are as follows:
1. Apply the Agglomerative Clustering algorithm.
2. Visualize the clusters.
3. Interpret the result.

In [20]:
from sklearn.cluster import AgglomerativeClustering

# Applying Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agg_clustering.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by Agglomerative Clustering')
plt.show()
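To attach a number to the quality of this partition, we can compute the silhouette score (values closer to 1 indicate better-separated clusters). A short sketch on the labels produced above:

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples for the agglomerative labels
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.3f}")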
Interpretation
Agglomerative Clustering has identified clusters based on the distances between the data points and the chosen linkage criterion. The main parameters of Agglomerative Clustering are:
n_clusters : The number of clusters to find.
linkage : The linkage criterion determines the distance between two clusters. The 'ward' method minimizes the variance of the clusters being merged.

Conclusion
Agglomerative Clustering is a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. It is particularly useful for understanding hierarchical structure in the data.

Exercise 11
Understanding Clustering Using Representatives

Introduction
Clustering using representatives involves selecting a subset of data points as representatives and then assigning the other data points to the nearest representative. This approach can be seen as a variation of the K-Means clustering algorithm, with the representatives playing the role of the centroids.

Problem Statement
Given a dataset of 2D points, use the Clustering Using Representatives method to identify its clustering structure. The goal is to understand how this method works, how to select representatives, and how to interpret the clustering results.

In [21]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's apply the Clustering Using Representatives method to this dataset and see how it performs. The steps are as follows:
1. Randomly select a subset of data points as representatives.
2. Assign each data point to the nearest representative.
3. Visualize the clusters.
4. Interpret the result.

In [22]:
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Randomly select representatives
rng = np.random.RandomState(42)
i = rng.permutation(X.shape[0])[:5]
representatives = X[i]

while True:
    # Assign labels based on the closest representative
    labels = pairwise_distances_argmin(X, representatives)

    # Find new representatives from the data points
    new_representatives = np.array([X[labels == i].mean(0) for i in range(5)])

    # Check for convergence
    if np.all(representatives == new_representatives):
        break
    representatives = new_representatives

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(representatives[:, 0], representatives[:, 1], c='black', s=200, alpha=0.75)
plt.title('Clusters identified by Clustering Using Representatives')
plt.show()

Interpretation
The black points in the plot represent the representatives of each cluster. All other points are colored based on the representative they are closest to. The main idea behind this method is to reduce the computational complexity by working with a small set of representative points rather than the entire dataset, which can be particularly useful for large datasets.

Conclusion
Clustering Using Representatives is a heuristic method that can provide good clustering results with reduced computational overhead. It is particularly useful for preliminary analysis or when dealing with very large datasets.

Exercise 12
Understanding Partitioning Around Medoids (PAM)

Introduction
Partitioning Around Medoids (PAM) is a clustering algorithm similar to K-Means. However, instead of computing the mean of the items in each cluster, it identifies actual data points as the medoids. A medoid is the most centrally located data point of a cluster.

Problem Statement
Given a dataset of 2D points, apply the PAM algorithm to identify its clustering structure. The goal is to understand how PAM works, how medoids are selected, and how to interpret the clustering results.

In [23]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's apply the PAM algorithm to this dataset and see how it performs. The steps are as follows:
1. Initialize the medoids randomly.
2. Assign each data point to the nearest medoid.
3. For each medoid and non-medoid data point, swap them and see if the total distance decreases. If so, perform the swap.
4. Repeat steps 2-3 until no change.
5. Visualize the clusters.
6. Interpret the result.

In [24]:
#!pip install scikit-learn-extra

In [25]:
from sklearn_extra.cluster import KMedoids

# Applying PAM clustering
kmedoids = KMedoids(n_clusters=5, random_state=42).fit(X)

# Getting labels and medoids
labels = kmedoids.labels_
medoids = kmedoids.cluster_centers_

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('Clusters identified by PAM')
plt.show()
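Unlike K-means centroids, the medoids returned here should coincide with actual rows of X. A quick sanity-check sketch (not part of the original exercise), reusing the medoids array from above:

import numpy as np

# Each medoid should match some data point in X exactly (up to floating-point tolerance)
for m in medoids:
    assert np.any(np.all(np.isclose(X, m), axis=1))
print("All medoids are actual data points from X")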
Interpretation
The red "X" markers in the plot represent the medoids of each cluster. All other points are colored based on the medoid they are closest to. Unlike K-Means, which calculates the centroid of a cluster, PAM selects actual data points as medoids. This can make PAM more robust to outliers than K-Means.

Conclusion
Partitioning Around Medoids (PAM) is a robust clustering method that works well when clusters are non-spherical or when the dataset contains noise and outliers. It is particularly useful when the notion of a "center" of a cluster (as in K-Means) does not make sense.

Exercise 13
Understanding Fuzzy C-Means Clustering

Introduction
Fuzzy C-Means (FCM) is a clustering method that allows data points to belong to multiple clusters with varying degrees of membership. The method is based on the minimization of the following objective function:

$J_m(U, C) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m \|x_i - c_j\|^2$

where:
$u_{ij}$ is the degree of membership of $x_i$ in cluster $j$,
$x_i$ is the $i^{th}$ point of the $d$-dimensional measured data,
$c_j$ is the $d$-dimensional center of cluster $j$,
$m$ is any real number greater than 1, and
$\|\cdot\|$ is any norm expressing the similarity between a measured point and a center.

Problem Statement
Given a dataset of 2D points, apply the Fuzzy C-Means clustering algorithm to identify its clustering structure. The goal is to understand how FCM works, how cluster centers are determined, and how to interpret the fuzzy memberships of data points.

In [26]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with five distinct clusters. Let's apply the Fuzzy C-Means algorithm to this dataset and see how it performs. The steps are as follows:
1. Initialize the cluster centers randomly.
2. Calculate the membership matrix based on the distances between data points and cluster centers.
3. Update the cluster centers based on the membership matrix.
4. Repeat steps 2-3 until convergence.
5. Visualize the clusters.
6. Interpret the result.

In [27]:
#!pip install fuzzy-c-means

In [28]:
from fcmeans import FCM

# Applying FCM clustering
fcm = FCM(n_clusters=5)
fcm.fit(X)

# Getting labels and centers
labels = fcm.predict(X)
centers = fcm.centers

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('Clusters identified by Fuzzy C-Means')
plt.show()

Interpretation
The red "X" markers in the plot represent the centers of each cluster. All other points are colored based on the cluster in which they have the highest membership. However, unlike hard clustering methods, each data point has a membership in every cluster. These memberships sum to 1 and indicate the degree to which a data point belongs to each cluster.

Conclusion
Fuzzy C-Means clustering provides a more nuanced picture than hard clustering methods. It is handy when the boundaries between clusters are ambiguous or when we want to capture the degree of membership of data points in multiple clusters.

Exercise 14
Case Study: K-Means Clustering on the Iris Dataset

Introduction
The Iris dataset is one of the most popular datasets in the field of machine learning. It contains measurements of 150 iris flowers from three different species: Setosa, Versicolour, and Virginica. In this case study, we will apply the k-means clustering algorithm to this dataset to see if we can cluster these flowers into groups that match their actual species.

Problem Statement
Given the Iris dataset, apply the k-means clustering algorithm to identify potential clusters within the data. Use the elbow method to determine the optimal number of clusters. The goal is to understand how k-means works, how to validate the number of clusters using the elbow method, and how the clusters compare to the actual species of the flowers.

In [29]:
# Importing necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Loading the Iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Displaying the first few rows of the dataset
df.head()

Out[29]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

We have loaded the Iris dataset, which contains four features for each flower: sepal length, sepal width, petal length, and petal width. Next, we will apply the k-means clustering algorithm to this dataset. But first, we need to determine the optimal number of clusters using the elbow method.

In [30]:
# Using the elbow method to find the optimal number of clusters
wcss = []  # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df)
    wcss.append(kmeans.inertia_)

# Plotting the elbow method graph
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
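The elbow can be ambiguous to read, so a common complement is to compute the silhouette score for each candidate number of clusters and favor the value of k with the highest score. A brief sketch over a similar range (silhouette is undefined for a single cluster, so we start at k = 2):

from sklearn.metrics import silhouette_score

# Silhouette score for k = 2..10 on the Iris features
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42)
    labels_k = km.fit_predict(df)
    print(f"k={k}: silhouette={silhouette_score(df, labels_k):.3f}")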
From the elbow method graph, we can observe that the optimal number of clusters is around 3, as this is where the decrease in WCSS begins to slow down, forming an "elbow". With this information, we can proceed to apply k-means clustering with 3 clusters.

In [31]:
# Applying k-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(df)

# Visualizing the clusters on the first two columns
plt.figure(figsize=(12, 6))
plt.scatter(df.iloc[y_kmeans == 0, 0], df.iloc[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(df.iloc[y_kmeans == 1, 0], df.iloc[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(df.iloc[y_kmeans == 2, 0], df.iloc[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of Iris Flowers')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend()
plt.show()
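Because the Iris dataset comes with species labels, we can also compare the clusters against the ground truth, for example with a cross-tabulation and the adjusted Rand index. A short sketch reusing data.target from the load_iris call above:

from sklearn.metrics import adjusted_rand_score

# Cross-tabulate cluster assignments against the true species
print(pd.crosstab(y_kmeans, data.target, rownames=['cluster'], colnames=['species']))

# Agreement between clustering and species labels (1.0 = perfect match)
print(f"Adjusted Rand index: {adjusted_rand_score(data.target, y_kmeans):.3f}")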
Interpretation
The visualization shows the three clusters formed by the k-means algorithm on the basis of sepal length and sepal width. The yellow points represent the centroids of the clusters.
Cluster 1 (Red): Flowers with small sepal length and width.
Cluster 2 (Blue): Flowers with medium to large sepal length and small sepal width.
Cluster 3 (Green): Flowers with medium sepal length and large sepal width.
These clusters provide insights into the natural groupings within the Iris dataset based on the features provided.

Conclusion
K-means clustering provides a powerful method for identifying clusters within data. By using the elbow method, we can determine the optimal number of clusters, ensuring that our clustering results are meaningful and interpretable.

Summary
When employing clustering techniques on real-world datasets, beginning with a clear understanding of the dataset and its features is essential. Start by performing exploratory data analysis (EDA) to visualize the data, detect outliers, and determine any evident patterns or structures. Pre-processing is a critical next step: normalize or standardize numerical data so that no variable unduly influences the clustering outcome because of its scale. Appropriate encoding techniques, such as one-hot encoding, should be applied to categorical data.

Choosing a suitable clustering algorithm depends on the nature of your data and the
problem at hand. For datasets with distinct globular structures, K-means may work well; if the clusters are of varying shapes, DBSCAN or hierarchical clustering may be more suitable. The number of clusters, a crucial parameter for some algorithms, can often be determined using the elbow method for K-means. For high-dimensional datasets, consider dimensionality reduction techniques such as PCA before clustering to mitigate the curse of dimensionality.

Lastly, once clusters are formed, their quality and relevance must be evaluated and validated. Use internal metrics such as the silhouette score or the Davies-Bouldin index when ground-truth labels are unknown. Visualization tools such as scatter plots, pair plots, or t-SNE can offer insights into the clustering results. Always interpret the clusters in the context of the application domain, as their actionable insights ultimately determine the real-world significance and utility of the clusters.

Revised Date: November 5, 2023

In [ ]: