IE6400_Day19
IE6400 Foundations of Data Analytics Engineering
¶
Fall 2023
¶
Module 3: Clustering Methods Part 2
¶
Clustering Methods
¶
Clustering is an unsupervised learning technique used to group similar data points into
clusters. The goal is to partition a dataset such that points in the same group are more
similar to each other than to those in other groups. There are several clustering methods, each suitable for different types of data and applications.
1. K-Means Clustering
¶
• Description: Partitions the data into K distinct, non-overlapping clusters based on distances to the center of each cluster.
• Algorithm:
  1. Initialize K cluster centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Recompute centroids based on the current cluster assignments.
  4. Repeat steps 2-3 until convergence.
2. Hierarchical Clustering
¶
• Description: Creates a tree of clusters. Can be agglomerative (bottom-up) or divisive (top-down).
• Algorithm:
  1. Treat each data point as a single cluster (n clusters in total).
  2. Find the two clusters that are closest and merge them.
  3. Repeat step 2 until only one cluster remains.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
¶
• Description: Clusters are dense regions in the data space, separated by areas of lower point density. Can find arbitrarily shaped clusters.
• Algorithm:
  1. For each point, count the number of points within a specified radius ε (eps).
  2. If a point has more than a specified number (MinPts) of neighbors, it is a core point.
  3. Connect core points that lie within ε of one another into clusters, and attach reachable border points to those clusters.
  4. Points not in any cluster are treated as noise.
4. Mean Shift Clustering
¶
• Description: Based on shifting centroids to the modes of the data density.
• Algorithm:
  1. Initialize centroids randomly.
  2. Update centroids to the mean of data points within a given window.
  3. Repeat step 2 until convergence.
5. Gaussian Mixture Models (GMM)
¶
• Description: Assumes data is generated from a mixture of several Gaussian distributions.
• Algorithm: Uses the Expectation-Maximization (EM) algorithm to estimate model parameters.
6. Spectral Clustering
¶
• Description: Uses the eigenvalues of the similarity matrix to reduce dimensionality before clustering in a lower-dimensional space.
• Algorithm:
  1. Build a similarity graph.
  2. Compute the Laplacian of this graph.
  3. Extract eigenvectors and use them as features.
  4. Use K-means or another method on these new features.
7. Affinity Propagation
¶
• Description: Works by passing messages between data points to determine clusters.
• Algorithm:
  1. Send responsibility and availability messages between points.
  2. Iteratively update messages and decide cluster exemplars.
8. OPTICS (Ordering Points To Identify Clustering Structure)
¶
• Description: Similar to DBSCAN but can handle clusters of varying densities.
• Algorithm: Builds a reachability plot and extracts clusters from it.
9. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
¶
• Description: Designed for very large datasets. Builds a tree structure during clustering.
• Algorithm:
  1. Incrementally construct a CF (Clustering Feature) tree.
  2. Use the CF tree for clustering.
10. Agglomerative Clustering
¶
• Description: Similar to hierarchical clustering but typically stops when it reaches a desired number of clusters.
• Algorithm:
  1. Start with each data point as a single cluster.
  2. Recursively merge the closest pair of clusters.
  3. Stop when the desired number of clusters is reached.
11. CURE (Clustering Using Representatives)
¶
• Description: A robust method for clustering with outliers by representing clusters using multiple scattered points.
• Algorithm:
  1. Select a set of representative points for each cluster.
  2. Shrink the representatives towards the center of the cluster.
  3. Assign points to the closest representative's cluster.
12. K-Medoids or PAM (Partitioning Around Medoids)
¶
• Description: Similar to K-means but uses actual data points as cluster centers (medoids) to reduce the influence of outliers.
• Algorithm:
  1. Randomly select K data points as initial medoids.
  2. Assign each data point to the nearest medoid.
  3. Recompute medoids.
  4. Repeat steps 2-3 until convergence.
13. COBWEB
¶
• Description: A hierarchical clustering method for categorical data.
• Algorithm:
  1. Construct a tree structure.
  2. Place and categorize instances based on maximizing the category utility.
14. Fuzzy C-Means
¶
• Description: A soft clustering method where each data point has a degree of belonging to clusters, rather than belonging strictly to one cluster.
• Algorithm:
  1. Assign each data point a weight for each cluster.
  2. Compute cluster centers based on the weights.
  3. Update the weights for each data point based on its distances to the cluster centers.
  4. Repeat steps 2-3 until convergence.
15. ROCK (RObust Clustering using linKs)
¶
• Description: Designed for categorical data and uses links (similar pairs) for clustering.
• Algorithm:
  1. Count the number of common neighbors between points to form links.
  2. Form clusters based on these links.
16. CLIQUE (CLustering In QUEst)
¶
• Description: A grid-based clustering algorithm designed to find dense clusters in subspaces of any dimensionality.
• Algorithm:
  1. Partition each dimension into non-overlapping intervals.
  2. Identify dense units in each dimension.
  3. Form clusters based on the connectivity of dense units.
17. SNN (Shared Nearest Neighbor)
¶
• Description: Considers two objects similar if they have many neighbors in common.
• Algorithm:
  1. Compute the k-nearest neighbors for each point.
  2. Assign similarity based on shared neighbors.
  3. Cluster based on these similarities.
18. STREAM (Spatio-TEmporal Real-time Algorithm for clustering Moving objects)
¶
• Description: Designed for real-time clustering of moving objects.
• Algorithm:
  1. Uses micro-clusters to summarize the current state.
  2. Periodically merges these micro-clusters to form final clusters.
19. SUBCLU (SUBspace CLUstering)
¶
• Description: Discovers clusters in arbitrary subspaces of a dataset.
• Algorithm:
  1. Uses a bottom-up approach.
  2. Discovers dense regions in each dimension and merges them to identify clusters.
Conclusion
¶
The choice of clustering method often depends on the dataset's size, dimensionality, and nature; the desired shape and structure of the clusters; and any underlying assumptions about the data. Pre-processing, such as normalization or standardization, and the choice of distance or similarity measure also play crucial roles in clustering outcomes.
The variety of clustering methods available offers flexibility in handling different types of datasets and challenges such as noise, outliers, and non-globular cluster shapes. The choice of method should be guided by the nature of the data and the specific requirements of the application.
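Because pre-processing is singled out above as crucial, here is a minimal, hedged sketch of standardizing features before K-Means; the two-feature dataset is invented purely for illustration and is not part of the exercises that follow.
# Sketch (illustrative data, not from the lecture): standardizing features
# before K-Means so that no variable dominates the distance computation
# purely because of its scale.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
rng = np.random.RandomState(0)
raw = np.column_stack([rng.normal(0, 1, 300),      # feature measured in units of ~1
                       rng.normal(0, 1000, 300)])  # feature measured in units of ~1000
scaled = StandardScaler().fit_transform(raw)       # zero mean, unit variance per feature
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)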
Exercise 1 Understanding K-Means Clustering
¶
Introduction
¶
K-Means is one of the most popular clustering algorithms. It aims to partition a set of observations into a number of clusters (k), resulting in the partitioning of the dataset into Voronoi cells. It works iteratively to assign each data point to one of K groups based on the features provided.
Problem Statement
¶
Given a dataset of 2D points, use the K-Means clustering algorithm to group them into clusters. The goal is to understand how K-Means works and how to visualize and
interpret the clustering results.
In [1]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
import warnings  # Setting the warnings to be ignored
warnings.filterwarnings('ignore')
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset.
The steps are as follows:
1. Define the number of clusters k.
2. Apply KMeans clustering to the dataset.
3. Visualize the clusters and centroids.
4. Interpret the result.
In [2]:
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')
# Defining the number of clusters
k = 4
# Applying KMeans clustering
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Visualizing the clusters and centroids
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=50, c='yellow', label='Cluster 4')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.legend()
plt.title("Clusters of data points")
plt.show()
Interpretation
¶
The visualization shows the data points grouped into four distinct clusters, as indicated
by the different colors. The black 'X' markers represent the centroids of these clusters. KMeans has successfully identified the underlying patterns in the dataset and grouped the data points accordingly.
Conclusion
¶
K-Means clustering is a powerful algorithm for partitioning data into distinct groups or clusters. By visualizing and interpreting the results, we can gain insights into the underlying structure of the dataset and make informed decisions based on the clustering results.
Exercise 2 Understanding Hierarchical Clustering
¶
Introduction
¶
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters by either a bottom-up or top-down approach. The bottom-up approach (Agglomerative) starts with individual data points and merges them into larger
clusters, while the top-down approach (Divisive) starts with the entire dataset and divides it into smaller clusters.
Problem Statement
¶
Given a dataset of 2D points, use the Hierarchical clustering algorithm to group them into clusters. The goal is to understand how Hierarchical clustering works, how to visualize the dendrogram, and how to interpret the clustering results.
In [3]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the Agglomerative
Hierarchical clustering algorithm to this dataset.
The steps are as follows:
1. Compute the distance matrix.
2. Apply Agglomerative Hierarchical clustering.
3. Visualize the dendrogram.
4. Interpret the result.
In [4]:
import scipy.cluster.hierarchy as sch
# Computing the linkage matrix and plotting the dendrogram
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distances')
plt.show()
Interpretation
¶
The dendrogram visualizes the way clusters are merged. The y-axis represents the Euclidean distances between data points (or clusters). By drawing a horizontal line across the dendrogram, we can decide the number of clusters for our dataset.
For instance, if we draw a horizontal line at a distance of 25, it intersects the dendrogram at four vertical lines, suggesting an optimal cluster number of four.
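To turn that horizontal cut into actual cluster labels, one option (a sketch, not part of the original exercise) is SciPy's fcluster with a distance criterion; the threshold of 25 is the value discussed above.
# Sketch: extracting flat cluster labels by cutting the dendrogram at distance 25
from scipy.cluster.hierarchy import linkage, fcluster
Z = linkage(X, method='ward')                        # same linkage used for the dendrogram
labels_cut = fcluster(Z, t=25, criterion='distance')
print("Clusters at distance 25:", len(set(labels_cut)))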
Conclusion
¶
Hierarchical clustering provides a tree-based representation of data points, which can be useful for understanding the hierarchical structure of the dataset. The dendrogram is a powerful tool for visualizing and deciding the number of clusters.
Exercise 3 Understanding DBSCAN Clustering
¶
Introduction
¶
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-
based clustering algorithm that can find clusters of different shapes and sizes from a large amount of data. Unlike K-Means or Hierarchical Clustering, DBSCAN does not require the number of clusters to be specified. It can also identify points that are not part of any cluster, which are typically considered as noise.
Problem Statement
¶
Given a dataset of 2D points, use the DBSCAN clustering algorithm to group them into clusters. The goal is to understand how DBSCAN works, its parameters, and how to interpret the clustering results.
In [5]:
# Generating a sample dataset with clusters of different shapes
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with two crescent-shaped clusters. Traditional clustering methods like K-Means would struggle to identify these non-spherical clusters. Let's see
how DBSCAN performs on this dataset.
The steps are as follows:
1. Apply the DBSCAN algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [6]:
from sklearn.cluster import DBSCAN
# Applying DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Clusters identified by DBSCAN')
plt.show()
Interpretation
¶
The DBSCAN algorithm has successfully identified the two crescent-shaped clusters. Points that are not part of any cluster are labeled as -1, which represents noise.
The two main parameters for DBSCAN are:
• eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
• min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
By adjusting these parameters, we can control the density required to form a cluster and the minimum size of clusters.
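As a hedged illustration of how eps interacts with the resulting clusters and noise, the short sketch below (not part of the original notebook) refits DBSCAN on the same moons data X for a few eps values.
# Sketch: effect of eps on the number of clusters and noise points
import numpy as np
from sklearn.cluster import DBSCAN
for eps in (0.1, 0.3, 0.5):
    labels_eps = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
    n_noise = int(np.sum(labels_eps == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")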
Conclusion
¶
DBSCAN is a powerful clustering algorithm that can identify clusters of various shapes without the need to specify the number of clusters. It's particularly useful for datasets with non-spherical clusters and noise.
Exercise 4 Understanding Mean Shift Clustering
¶
Introduction
¶
Mean Shift Clustering is a non-parametric, density-based clustering algorithm. Unlike K-Means, which tries to minimize the distance between cluster members and their centroid, Mean Shift assigns data points to clusters iteratively by shifting points towards the mode, i.e., the region with the highest density of data points.
Problem Statement
¶
Given a dataset of 2D points, use the Mean Shift clustering algorithm to group them into clusters. The goal is to understand how Mean Shift works, its parameters, and how
to interpret the clustering results.
In [7]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's see how Mean Shift performs on this dataset.
The steps are as follows:
1. Apply the Mean Shift algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [8]:
from sklearn.cluster import MeanShift
# Applying Mean Shift
mean_shift = MeanShift(bandwidth=1.5)
clusters = mean_shift.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.title('Clusters identified by Mean Shift')
plt.show()
Interpretation
¶
The Mean Shift algorithm has identified the clusters in the dataset. The number of clusters is determined by the algorithm based on the data distribution, and we don't need to specify it beforehand.
The main parameter for Mean Shift is:
• bandwidth: It can be understood as the radius of the region (or window) in which the algorithm searches for datapoints to form a cluster. A smaller bandwidth will result in more clusters, while a larger bandwidth may merge distinct clusters.
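Instead of hand-picking the bandwidth, scikit-learn's estimate_bandwidth can derive one from the data; in the sketch below (not part of the original notebook) the quantile is an illustrative assumption, not a tuned value.
# Sketch: estimating a bandwidth from the data rather than fixing it manually
from sklearn.cluster import MeanShift, estimate_bandwidth
bw = estimate_bandwidth(X, quantile=0.2, random_state=42)   # quantile is an assumed choice
labels_auto = MeanShift(bandwidth=bw).fit_predict(X)
print("Estimated bandwidth:", round(bw, 3), "- clusters found:", len(set(labels_auto)))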
Conclusion
¶
Mean Shift is a versatile clustering algorithm that can identify clusters without the need to specify the number of clusters beforehand. It's particularly useful for datasets where the number of clusters is not known a priori.
Exercise 5 Understanding Gaussian Mixture Models (GMM)
¶
Introduction
¶
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of several Gaussian distributions with unknown parameters. It attempts to find the mixture of multi-dimensional Gaussian probability distributions that best models the input dataset.
Problem Statement
¶
Given a dataset of 2D points, use the GMM algorithm to group them into clusters. The goal is to understand how GMM works, its parameters, and how to interpret the clustering results.
In [9]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's see how GMM performs on this dataset.
The steps are as follows:
1. Apply the GMM algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [10]:
from sklearn.mixture import GaussianMixture
# Applying GMM
gmm = GaussianMixture(n_components=5)
gmm.fit(X)
labels = gmm.predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by GMM')
plt.show()
Interpretation
¶
The GMM algorithm has identified the clusters in the dataset. Unlike K-means, which assigns each point to a single cluster, GMM determines the probability that a given data point belongs to each of the clusters. This makes GMM a soft clustering method.
The main parameter for GMM is:
• n_components: The number of mixture components, i.e., the number of Gaussian distributions.
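Because GMM assigns probabilities rather than hard labels, here is a small sketch (not part of the original notebook) of inspecting those soft memberships on the model fitted above via predict_proba.
# Sketch: soft cluster memberships from the fitted GMM above
probs = gmm.predict_proba(X)            # shape (n_samples, n_components); each row sums to 1
print("Membership probabilities of the first three points:")
print(probs[:3].round(3))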
Conclusion
¶
Gaussian Mixture Models provide a probabilistic approach to clustering, allowing for more flexibility in cluster shape compared to methods like K-means. They are particularly useful for datasets where clusters are not necessarily spherical or have different sizes.
Exercise 6 Understanding Spectral Clustering
¶
Introduction
¶
Spectral Clustering is a modern clustering technique that can capture complex cluster structures. It can identify clusters that are non-convex and even clusters that have a nested ring structure, which traditional algorithms like K-means might not capture effectively.
The technique uses the eigenvalues of the similarity matrix of the data to reduce dimensionality before applying another clustering algorithm (like K-means).
Problem Statement
¶
Given a dataset of 2D points, use the Spectral Clustering algorithm to group them into clusters. The goal is to understand how Spectral Clustering works, its parameters, and how to interpret the clustering results.
In [11]:
# Generating a sample dataset with moons
from sklearn.datasets import make_moons
X, _ = make_moons(200, noise=.05, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with two crescent moon-shaped clusters. Traditional clustering algorithms might struggle with such data. Let's see how Spectral Clustering performs on this dataset.
The steps are as follows:
1. Apply the Spectral Clustering algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [12]:
from sklearn.cluster import SpectralClustering
# Applying Spectral Clustering
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = sc.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by Spectral Clustering')
plt.show()
Interpretation
¶
Spectral Clustering has effectively identified the two crescent moon-shaped clusters. The algorithm works by creating a similarity matrix of the data points and then finding the eigenvectors corresponding to the top eigenvalues. These eigenvectors are then used to transform the data into a lower-dimensional space, where a traditional clustering algorithm (like K-means) can be applied.
The main parameters for Spectral Clustering are:
• n_clusters: The number of clusters to form.
• affinity: How to construct the affinity matrix. 'nearest_neighbors' computes the similarity matrix using the nearest neighbors of each point.
• assign_labels: The strategy to use to assign labels in the embedding space. Here, we used 'kmeans'.
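As a point of comparison for the affinity option above, the sketch below (the gamma value is an illustrative assumption, not a tuned setting, and this is not part of the original notebook) builds the affinity matrix with an RBF kernel instead of nearest neighbors on the same moons data.
# Sketch: Spectral Clustering on the same data with an RBF affinity
from sklearn.cluster import SpectralClustering
sc_rbf = SpectralClustering(n_clusters=2, affinity='rbf', gamma=100, assign_labels='kmeans')
labels_rbf = sc_rbf.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels_rbf, cmap='viridis', s=50)
plt.title('Spectral Clustering with RBF affinity')
plt.show()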
Conclusion
¶
Spectral Clustering is a powerful technique that can capture complex cluster structures, making it suitable for datasets where traditional methods might fail.
Exercise 7 Understanding Affinity Propagation
¶
Introduction
¶
Affinity Propagation is a clustering algorithm that identifies "exemplars" among the data points and forms clusters based on these exemplars. Unlike other clustering methods that require the number of clusters to be specified in advance, Affinity Propagation determines the number of clusters based on the data, making it particularly useful for datasets where the number of clusters is not known a priori.
Problem Statement
¶
Given a dataset of 2D points, use the Affinity Propagation algorithm to group them into
clusters. The goal is to understand how Affinity Propagation works, its parameters, and
how to interpret the clustering results.
In [13]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(300, centers=5, cluster_std=0.60, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Affinity Propagation algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the Affinity Propagation algorithm.
2. Visualize the clusters and exemplars.
3. Interpret the result.
In [14]:
from sklearn.cluster import AffinityPropagation
# Applying Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(X[cluster_centers_indices, 0], X[cluster_centers_indices, 1], s=200, c='red', marker='X', label='Exemplars')
plt.title('Clusters identified by Affinity Propagation')
plt.legend()
plt.show()
Interpretation
¶
Affinity Propagation has identified clusters and their corresponding exemplars (marked
with red 'X'). The algorithm works by sending messages between pairs of data points until convergence. A set of data points are chosen as exemplars and the remaining data points are assigned to the nearest exemplar.
The main parameter for Affinity Propagation is:
• preference: Controls how many exemplars are used. A higher value makes more points likely to be chosen as exemplars, producing more clusters, while a lower (more negative) value results in fewer clusters.
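To see the effect of preference directly, a short sketch (the preference values are arbitrary choices for illustration; not part of the original notebook) that refits Affinity Propagation on X and reports the number of clusters found.
# Sketch: how the preference value changes the number of exemplars/clusters
from sklearn.cluster import AffinityPropagation
for pref in (-200, -50, -10):
    af_p = AffinityPropagation(preference=pref).fit(X)
    print(f"preference={pref}: {len(af_p.cluster_centers_indices_)} clusters")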
Conclusion
¶
Affinity Propagation is an adaptive clustering method that determines the number of clusters based on the data. It can be particularly useful when the number of clusters is not known in advance.
Exercise 8 Understanding OPTICS Clustering
¶
Introduction
¶
The OPTICS (Ordering Points To Identify Clustering Structure) algorithm is a density-
based clustering method similar to DBSCAN. However, unlike DBSCAN, which partitions
the dataset into clusters, OPTICS creates an augmented ordering of the dataset representing its density-based clustering structure. This ordering can then be visualized and analyzed to extract the clusters.
Problem Statement
¶
Given a dataset of 2D points, use the OPTICS algorithm to identify its clustering structure. The goal is to understand how OPTICS works, its parameters, and how to interpret the clustering results.
In [15]:
# Generating a sample dataset with two crescent-shaped clusters and some noise
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with two crescent-shaped clusters. Let's apply the OPTICS algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the OPTICS algorithm.
2. Visualize the reachability plot.
3. Visualize the clusters.
4. Interpret the result.
In [16]:
from sklearn.cluster import OPTICS
# Applying OPTICS
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05)
labels = optics.fit_predict(X)
# Visualizing the reachability plot
plt.figure(figsize=(10, 4))
plt.title('Reachability plot')
plt.plot(optics.reachability_[optics.ordering_])
plt.xlabel('Points')
plt.ylabel('Reachability Distance')
plt.show()
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by OPTICS')
plt.show()
Interpretation
¶
The reachability plot shows the reachability distances for each point in the dataset. Peaks in the plot represent cluster boundaries.
In the scatter plot, OPTICS has identified clusters based on the density of the data points. The main parameters for OPTICS are:
• min_samples: The number of samples in a neighborhood for a point to be considered as a core point.
• xi: Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.
• min_cluster_size: The minimum number of samples in a cluster, expressed as an absolute number or as a fraction of the dataset (0.05 here).
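Because the fitted OPTICS model above stores the full reachability ordering, scikit-learn's cluster_optics_dbscan can extract DBSCAN-style clusters at a chosen eps without refitting; in this sketch (not part of the original notebook) the eps value is an illustrative assumption.
# Sketch: extracting DBSCAN-like clusters from the fitted OPTICS model at a fixed eps
from sklearn.cluster import cluster_optics_dbscan
labels_eps = cluster_optics_dbscan(reachability=optics.reachability_,
                                   core_distances=optics.core_distances_,
                                   ordering=optics.ordering_,
                                   eps=0.3)   # assumed cut level
n_clusters_eps = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
print("Clusters at eps=0.3:", n_clusters_eps)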
Conclusion
¶
OPTICS is a versatile clustering method that can identify clusters of varying shapes
and densities. It provides a reachability plot that offers insights into the clustering structure of the dataset.
Exercise 9 Understanding BIRCH Clustering
¶
Introduction
¶
The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm is a hierarchical clustering method designed specifically for very large datasets. It incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources.
Problem Statement
¶
Given a dataset of 2D points, use the BIRCH algorithm to identify its clustering structure. The goal is to understand how BIRCH works, its parameters, and how to interpret the clustering results.
In [17]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the BIRCH algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the BIRCH algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [18]:
from sklearn.cluster import Birch
# Applying BIRCH
birch = Birch(n_clusters=5, threshold=0.5, branching_factor=50)
labels = birch.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by BIRCH')
plt.show()
Interpretation
¶
BIRCH has identified clusters based on the density and distribution of the data points. The main parameters for BIRCH are:
• n_clusters: The number of clusters after the final clustering step, which treats the leaf clusters from the CF-tree as new samples.
• threshold: The radius of the sub-cluster obtained by merging a new sample and the closest sub-cluster should be less than the threshold; otherwise, a new sub-cluster is started.
• branching_factor: Maximum number of CF sub-clusters in each node.
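Since BIRCH is aimed at datasets too large to process at once, the hedged sketch below (the chunking is artificial, purely for illustration, and not part of the original notebook) feeds the same data incrementally through partial_fit before predicting labels.
# Sketch: feeding data to BIRCH in chunks via partial_fit
import numpy as np
from sklearn.cluster import Birch
brc = Birch(n_clusters=5, threshold=0.5, branching_factor=50)
for chunk in np.array_split(X, 5):   # pretend the data arrives in 5 batches
    brc.partial_fit(chunk)           # the CF-tree is updated incrementally
labels_stream = brc.predict(X)       # final cluster assignments
print("Clusters found:", len(set(labels_stream)))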
Conclusion
¶
BIRCH is an efficient clustering method that can handle large datasets by summarizing data in a tree structure and then clustering the leaf nodes. It's particularly useful for datasets where in-memory processing of the full dataset is not feasible.
Exercise 10 Understanding Agglomerative Clustering
¶
Introduction
¶
Agglomerative Clustering is a type of hierarchical clustering method that builds a hierarchy of clusters by successively merging or "agglomerating" groups of data points. The process starts with each data point as its own cluster and then pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all data points.
Problem Statement
¶
Given a dataset of 2D points, use the Agglomerative Clustering algorithm to identify its
clustering structure. The goal is to understand how Agglomerative Clustering works, its
linkage criteria, and how to interpret the clustering results.
In [19]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Agglomerative Clustering algorithm to this dataset and see how it performs.
The steps are as follows:
1. Apply the Agglomerative Clustering algorithm.
2. Visualize the clusters.
3. Interpret the result.
In [20]:
from sklearn.cluster import AgglomerativeClustering
# Applying Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agg_clustering.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Clusters identified by Agglomerative Clustering')
plt.show()
Interpretation
¶
Agglomerative Clustering has identified clusters based on the density and distribution of the data points. The main parameters for Agglomerative Clustering are:
• n_clusters: The number of clusters to find.
• linkage: The linkage criterion determines the distance used between two clusters. The 'ward' method minimizes the variance of the clusters being merged.
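To make the linkage choice concrete, a brief sketch (not part of the original exercise) comparing several linkage criteria on the same blobs data using the silhouette score.
# Sketch: comparing linkage criteria by silhouette score
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
for link in ('ward', 'complete', 'average', 'single'):
    lbls = AgglomerativeClustering(n_clusters=5, linkage=link).fit_predict(X)
    print(f"{link:<8s} silhouette = {silhouette_score(X, lbls):.3f}")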
Conclusion
¶
Agglomerative Clustering is a bottom-up approach where each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy. It's particularly useful for understanding hierarchical structures in the data.
Exercise 11 Understanding Clustering Using Representatives
¶
Introduction
¶
Clustering using representatives involves selecting a subset of data points as representatives and then assigning other data points to the nearest representative. This approach can be seen as a variation of the K-Means clustering algorithm, where the representatives play the role of the centroids.
Problem Statement
¶
Given a dataset of 2D points, use the Clustering Using Representatives method to identify its clustering structure. The goal is to understand how this method works, how to select representatives, and how to interpret the clustering results.
In [21]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Clustering Using Representatives method to this dataset and see how it performs.
The steps are as follows:
1. Randomly select a subset of data points as representatives.
2. Assign each data point to the nearest representative.
3. Visualize the clusters.
4. Interpret the result.
In [22]:
import numpy as np
from sklearn.metrics import pairwise_distances_argmin
# Randomly select representatives
rng = np.random.RandomState(42)
i = rng.permutation(X.shape[0])[:5]
representatives = X[i]
while True:
    # Assign labels based on closest representative
    labels = pairwise_distances_argmin(X, representatives)
    # Find new representatives from data points
    new_representatives = np.array([X[labels == i].mean(0) for i in range(5)])
    # Check for convergence
    if np.all(representatives == new_representatives):
        break
    representatives = new_representatives
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(representatives[:, 0], representatives[:, 1], c='black', s=200, alpha=0.75)
plt.title('Clusters identified by Clustering Using Representatives')
plt.show()
Interpretation
¶
The black points in the plot represent the representatives of each cluster. All other points are colored based on the representative they are closest to.
The main idea behind this method is to reduce the computational complexity by working with a subset of representative points rather than the entire dataset. This can be particularly useful for large datasets.
Conclusion
¶
Clustering Using Representatives is a heuristic method that can provide good clustering results with reduced computational overhead. It's particularly useful for preliminary analysis or when dealing with very large datasets.
Exercise 12 Understanding Partitioning Around Medoids (PAM)
¶
Introduction
¶
Partitioning Around Medoids (PAM) is a clustering algorithm similar to K-Means. However, instead of computing the mean of the items in each cluster, it identifies actual data points as the medoids. A medoid is the most centrally located data point in
a cluster.
Problem Statement
¶
Given a dataset of 2D points, apply the PAM algorithm to identify its clustering structure. The goal is to understand how PAM works, how medoids are selected, and how to interpret the clustering results.
In [23]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the PAM algorithm to this dataset and see how it performs.
The steps are as follows:
1. Initialize medoids randomly.
2. Assign each data point to the nearest medoid.
3. For each medoid and non-medoid data point, swap them and see if the total distance decreases. If so, perform the swap.
4. Repeat steps 2-3 until no change.
5. Visualize the clusters.
6. Interpret the result.
In [24]:
#!pip install scikit-learn-extra
In [25]:
from sklearn_extra.cluster import KMedoids
# Applying PAM clustering
kmedoids = KMedoids(n_clusters=5, random_state=42).fit(X)
# Getting labels and medoids
labels = kmedoids.labels_
medoids = kmedoids.cluster_centers_
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('Clusters identified by PAM')
plt.show()
Interpretation
¶
The red "X" markers in the plot represent the medoids of each cluster. All other points are colored based on the medoid they are closest to.
Unlike K-Means, which calculates the centroid of a cluster, PAM selects actual data points as medoids. This can make PAM more robust to outliers compared to K-Means.
Conclusion
¶
Partitioning Around Medoids (PAM) is a robust clustering method that works well when clusters are non-spherical or when the dataset contains noise and outliers. It's particularly useful when the notion of a "center" of a cluster (like in K-Means) doesn't make sense.
Exercise 13 Understanding Fuzzy C-Means Clustering
¶
Introduction
¶
Fuzzy C-Means (FCM) is a clustering method that allows data points to belong to multiple clusters with varying degrees of membership. This method is based on the minimization of the following objective function:
$J_m(U, C) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m ||x_i - c_j||^2$
Where:
• $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$
• $x_i$ is the $i^{th}$ point of the $d$-dimensional measured data
• $c_j$ is the $d$-dimensional center of the cluster
• $m$ is any real number greater than 1
• $\|\cdot\|$ is any norm expressing the similarity between the measured data and the center
Problem Statement
¶
Given a dataset of 2D points, apply the Fuzzy C-Means clustering algorithm to identify its clustering structure. The goal is to understand how FCM works, how cluster centers are determined, and how to interpret the fuzzy memberships of data points.
In [26]:
# Generating a sample dataset with blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with five distinct clusters. Let's apply the Fuzzy C-Means algorithm to this dataset and see how it performs.
The steps are as follows:
1. Initialize cluster centers randomly.
2. Calculate the membership matrix based on the distances between data points and cluster centers.
3. Update the cluster centers based on the membership matrix.
4. Repeat steps 2-3 until convergence.
5. Visualize the clusters.
6. Interpret the result.
In [27]:
#!pip install fuzzy-c-means
In [28]:
from fcmeans import FCM
# Applying FCM clustering
fcm = FCM(n_clusters=5)
fcm.fit(X)
# Getting labels and centers
labels = fcm.predict(X)
centers = fcm.centers
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('Clusters identified by Fuzzy C-Means')
plt.show()
Interpretation
¶
The red "X" markers in the plot represent the centers of each cluster. All other points are colored based on the cluster they have the highest membership in.
However, unlike hard clustering methods, each data point has memberships in all clusters. These memberships sum to 1 and indicate the degree to which a data point belongs to each cluster.
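To inspect those membership degrees directly, the sketch below assumes the fcmeans package exposes a soft_predict method returning one membership column per cluster; verify the method name against the installed version before relying on it.
# Sketch: inspecting fuzzy membership degrees (method name assumed; check your fcmeans version)
memberships = fcm.soft_predict(X)     # assumed API: shape (n_samples, n_clusters), rows sum to 1
print("Memberships of the first three points:")
print(memberships[:3].round(3))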
Conclusion
¶
Fuzzy C-Means clustering provides a more nuanced picture of cluster membership than hard clustering methods. It's handy when the boundaries between clusters are ambiguous or when we want to capture the degree of membership of data points in multiple clusters.
Exercise 14 Case Study: K-Means Clustering on the Iris Dataset
¶
Introduction
¶
The Iris dataset is one of the most popular datasets in the field of machine learning. It contains measurements of 150 iris flowers from three different species: Setosa, Versicolour, and Virginica. In this case study, we will apply the k-means clustering algorithm to this dataset to see if we can cluster these flowers into groups that match their actual species.
Problem Statement
¶
Given the Iris dataset, apply the k-means clustering algorithm to identify potential clusters within the data. Use the elbow method to determine the optimal number of clusters. The goal is to understand how k-means works, how to validate the number of clusters using the elbow method, and how the clusters compare to the actual species of the flowers.
In [29]:
# Importing necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Loading the Iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Displaying the first few rows of the dataset
df.head()
Out[29]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
We have loaded the Iris dataset, which contains four features for each flower: sepal length, sepal width, petal length, and petal width.
Next, we will apply the k-means clustering algorithm to this dataset. But first, we need to determine the optimal number of clusters using the elbow method.
In [30]:
# Using the elbow method to find the optimal number of clusters
wcss = [] # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df)
    wcss.append(kmeans.inertia_)
# Plotting the elbow method graph
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
From the elbow method graph, we can observe that the optimal number of clusters is around 3, as this is where the decrease in WCSS begins to slow down, forming an "elbow".
With this information, we can proceed to apply k-means clustering with 3 clusters.
In [31]:
# Applying k-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(df)
# Visualizing the clusters on the first two columns
plt.figure(figsize=(12,6))
plt.scatter(df.iloc[y_kmeans == 0, 0], df.iloc[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(df.iloc[y_kmeans == 1, 0], df.iloc[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(df.iloc[y_kmeans == 2, 0], df.iloc[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of Iris Flowers')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend()
plt.show()
Interpretation
¶
The visualization shows the three clusters formed by the k-means algorithm on the basis of sepal length and sepal width. The yellow points represent the centroids of the clusters.
• Cluster 1 (Red): Flowers with small sepal length and width.
• Cluster 2 (Blue): Flowers with medium to large sepal length and small sepal width.
• Cluster 3 (Green): Flowers with medium sepal length and large sepal width.
These clusters provide insights into the natural groupings within the Iris dataset based on the features provided.
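Since the problem statement asks how the clusters compare to the actual species, a short sketch (not part of the original notebook) cross-tabulates the k-means labels against the true targets and computes the adjusted Rand index.
# Sketch: comparing the k-means clusters with the true species labels
from sklearn.metrics import adjusted_rand_score
print(pd.crosstab(data.target, y_kmeans, rownames=['true species'], colnames=['cluster']))
print("Adjusted Rand index:", round(adjusted_rand_score(data.target, y_kmeans), 3))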
Conclusion
¶
K-means clustering provides a powerful method for identifying clusters within data. By using the elbow method, we can determine the optimal number of clusters, ensuring that our clustering results are meaningful and interpretable.
Summary
¶
When employing clustering techniques on real-world datasets, beginning with a clear understanding of the dataset and its features is essential. Start by performing exploratory data analysis (EDA) to visualize the data, detect outliers, and determine any evident patterns or structures. Pre-processing is a critical next step; normalize or standardize numerical data to ensure no variable unduly influences the clustering outcome because of its scale. Appropriate encoding techniques, like one-hot encoding,
should be applied for categorical data.
Choosing the suitable clustering algorithm depends on the nature of your data and the
problem at hand. For datasets with distinct globular structures, K-means can be a good choice; however, DBSCAN or hierarchical clustering may be more suitable if the clusters have varying shapes. The number of clusters, a crucial parameter for some algorithms, can often be determined using the elbow method for K-means. For high-
dimensional datasets, consider dimensionality reduction techniques like PCA before clustering to mitigate the curse of dimensionality.
Lastly, once clusters are formed, their quality and relevance must be evaluated and validated. Use metrics like the silhouette score or the Davies-Bouldin index when true labels are unknown. Visualization tools, such as scatter plots, pair plots, or t-SNE, can offer insights into the clustering results. Always interpret the clusters in the context of the application domain, as their actionable insights ultimately determine the real-world significance and utility of the clusters.
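As a concrete illustration of the internal validation metrics mentioned above, a hedged sketch that scores the Iris clustering from Exercise 14 (assuming df and y_kmeans are still in scope):
# Sketch: internal validation of the Exercise 14 clustering
from sklearn.metrics import silhouette_score, davies_bouldin_score
print("Silhouette score    :", round(silhouette_score(df, y_kmeans), 3))
print("Davies-Bouldin index:", round(davies_bouldin_score(df, y_kmeans), 3))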
Revised Date: November 5, 2023
¶
In [ ]: