IE6400_Day18

IE6400 Foundations of Data Analytics Engineering
Fall 2023
Module 3: Clustering Methods, Part 1

Proximity Measures

Proximity measures are metrics used to determine the similarity or dissimilarity between data points. They play a crucial role in various machine learning and data analysis techniques, especially clustering and classification. The choice of a proximity measure often depends on the nature of the data and the specific problem at hand.

Types of Proximity Measures

There are two main types of proximity measures:

1. Similarity Measures: These quantify how similar two data points are. Higher values indicate greater similarity.
2. Dissimilarity Measures (or Distance Measures): These represent the "distance" or dissimilarity between two data points. Higher values indicate greater dissimilarity.

Common Proximity Measures

For Continuous Data:

1. Euclidean Distance: The "ordinary" straight-line distance between two points in Euclidean space.
   $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
2. Manhattan Distance (or L1 norm): The distance between two points measured along axes at right angles (taxicab or city-block distance).
   $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
3. Minkowski Distance: A generalized metric. When p = 2 it becomes the Euclidean distance; when p = 1 it is the Manhattan distance.
   $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$

For Categorical Data:

1. Hamming Distance: Used for categorical variables. It is the number of positions at which the corresponding symbols in two strings of equal length are different.
2. Jaccard Similarity: Measures the similarity between two sets. It is the size of the intersection divided by the size of the union of the two sets.
   $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

For Mixed-Type Data:

1. Gower Distance: Combines various distance metrics for mixed-type data.

For Binary Data:

1. Jaccard Coefficient: Similar to Jaccard similarity but specifically tailored for binary attributes.
2. Cosine Similarity: Measures the cosine of the angle between two non-zero vectors. It is often used in text analysis to determine the similarity between documents.
   $\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$
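For reference, most of the measures listed above are available directly in SciPy; the minimal sketch below (not part of the original notes, and assuming SciPy is installed) shows the corresponding library calls. Note that SciPy's hamming, jaccard, and cosine functions return dissimilarities: hamming as a fraction of positions, and cosine as one minus the similarity.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))       # Euclidean distance
print(distance.cityblock(x, y))       # Manhattan (L1) distance
print(distance.minkowski(x, y, p=3))  # Minkowski distance with p = 3

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])

print(distance.hamming(a, b))         # fraction of differing positions (multiply by len(a) for the count)
print(1 - distance.jaccard(a, b))     # Jaccard similarity (SciPy returns the dissimilarity)
print(1 - distance.cosine(a, b))      # cosine similarity (SciPy returns the distance)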
Choosing the Right Proximity Measure

The choice of proximity measure depends on:

- Nature of Data: Continuous, categorical, binary, or mixed.
- Domain Knowledge: The context in which data points are being compared.
- Problem Specifics: Certain problems may necessitate specific measures.

In general, it's crucial to understand the data and the problem's requirements before choosing a proximity measure.

Exercise 1: Understanding Euclidean Distance

Introduction

Euclidean distance is a measure of the straight-line distance between two points in Euclidean space. It's a fundamental concept in mathematics and has wide applications in machine learning, especially in clustering algorithms like K-Means.

Problem Statement

Imagine you are working on a recommendation system for a retail store. You have data on the purchase history of customers for two products: A and B. You want to find out how similar two customers are based on their purchase patterns of these two products. One way to measure this similarity is by calculating the Euclidean distance between their purchase histories. Given the purchase data for the two products, can you compute the Euclidean distance between two customers?

Dataset

For simplicity, let's consider a small dataset representing the number of units of products A and B bought by different customers:

Customer  Product A  Product B
1         5          3
2         2          8
3         9          1
4         4          7

In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the dataset
customers = np.array([[5, 3], [2, 8], [9, 1], [4, 7]])

# Visualize the data
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Purchase History of Customers')
plt.grid(True)
plt.show()
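Before comparing a single pair, it can help to see every pairwise distance at once; a minimal sketch (not part of the original notebook) using scipy.spatial.distance.cdist on the customers array defined above:

from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between all four customers (a 4x4 matrix with zeros on the diagonal)
pairwise = cdist(customers, customers, metric='euclidean')
print(np.round(pairwise, 2))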
From the scatter plot, we can visualize the purchase patterns of the four customers. Each point represents a customer's purchase history for products A and B.

Euclidean Distance

The Euclidean distance between two points $P(x_1, y_1)$ and $Q(x_2, y_2)$ in a 2D plane is given by:

$d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Let's compute the Euclidean distance between Customer 1 and Customer 2.

In [2]:
# Define a function to compute Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Calculate the distance between Customer 1 and Customer 2
distance = euclidean_distance(customers[0], customers[1])
print(f"Euclidean Distance between Customer 1 and Customer 2: {distance:.2f}")

Euclidean Distance between Customer 1 and Customer 2: 5.83

The computed distance gives us a measure of how similar or dissimilar the two customers are based on their purchase patterns. A smaller distance indicates similar purchase behaviors, while a larger distance indicates dissimilarity.

Visualization

To better understand the concept, let's visualize the Euclidean distance between Customer 1 and Customer 2 on our scatter plot.

In [3]:
# Visualize the data with the Euclidean distance
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.plot([customers[0][0], customers[1][0]], [customers[0][1], customers[1][1]], 'ro-')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Euclidean Distance between Customer 1 and Customer 2')
plt.grid(True)
plt.show()

The red line represents the Euclidean distance between Customer 1 and Customer 2. As we can see, despite both customers purchasing different quantities of products A and B, we can quantify their similarity using this distance.

Conclusion

Euclidean distance is a powerful metric to measure the similarity between data points. In this exercise, we applied it to a retail scenario, but its applications are vast, spanning areas like clustering, image processing, and more.

Exercise 2: Understanding Manhattan Distance

Introduction

Manhattan distance, also known as L1 distance or taxicab distance, is a measure of distance between two points along a grid-based path. It is computed as the sum of the absolute differences of their coordinates. Unlike the Euclidean distance, which measures the shortest path (a straight line), the Manhattan distance measures the distance "traveled" on a grid.

Problem Statement

Imagine you are a taxi driver in a city where the roads are laid out in a perfect grid. You pick up a passenger at one intersection and need to drop them off at another. The Manhattan distance gives you the total number of blocks you'd need to drive, in straight lines horizontally and/or vertically, to reach the destination. Given the coordinates of the starting and ending intersections, can you compute the Manhattan distance between them?

Dataset

For simplicity, let's consider a small dataset representing the starting and ending coordinates of a few taxi rides:

Ride  Start (x, y)  End (x, y)
1     (2, 3)        (5, 6)
2     (1, 4)        (4, 2)
3     (3, 3)        (3, 7)
4     (6, 1)        (2, 5)

In [4]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the dataset
rides = {
    'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
    'End': [(5, 6), (4, 2), (3, 7), (2, 5)]
}

# Visualize the data
for start, end in zip(rides['Start'], rides['End']):
    plt.plot([start[0], end[0]], [start[1], end[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Taxi Rides')
plt.grid(True)
plt.show()

From the plot, we can visualize the starting and ending points of the taxi rides. Each line segment represents a ride.
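For reference, SciPy's cityblock function implements the Manhattan distance directly; a brief sketch (not part of the original notebook) applying it to all four rides before we compute Ride 1 by hand in the next step:

from scipy.spatial.distance import cityblock

# Manhattan distance for each ride, using the Start/End pairs defined above
for i, (start, end) in enumerate(zip(rides['Start'], rides['End']), start=1):
    print(f"Ride {i}: {cityblock(start, end)} blocks")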
Manhattan Distance

The Manhattan distance between two points $P(x_1, y_1)$ and $Q(x_2, y_2)$ in a 2D plane is given by:

$d(P, Q) = |x_2 - x_1| + |y_2 - y_1|$

Let's compute the Manhattan distance for Ride 1.

In [5]:
# Define a function to compute Manhattan distance
def manhattan_distance(point1, point2):
    return abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])

# Calculate the distance for Ride 1
distance = manhattan_distance(rides['Start'][0], rides['End'][0])
print(f"Manhattan Distance for Ride 1: {distance} blocks")

Manhattan Distance for Ride 1: 6 blocks

The computed distance gives us the total number of blocks the taxi would need to drive to get from the starting to the ending intersection for Ride 1.

Visualization

To better understand the concept, let's visualize the Manhattan distance for Ride 1 on our plot.

In [6]:
# Visualize the data with the Manhattan distance for Ride 1
start, end = rides['Start'][0], rides['End'][0]
plt.plot([start[0], end[0]], [start[1], start[1]], 'bo-')
plt.plot([end[0], end[0]], [start[1], end[1]], 'bo-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Manhattan Distance for Ride 1')
plt.grid(True)
plt.show()

The blue lines represent the Manhattan distance for Ride 1. As we can see, the taxi
would first travel horizontally and then vertically to reach the destination, covering a total distance equal to the sum of the lengths of these two segments.

Conclusion

Manhattan distance is a useful metric in grid-based environments, such as urban city blocks. In this exercise, we applied it to a taxi scenario, but its applications are vast, spanning areas like computer vision, robotics, and more.

Exercise 3: Understanding Chebyshev Distance

Introduction

Chebyshev distance, also known as the maximum value distance, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It's often used in chess to calculate the minimum number of moves a king can take to go from one square to another.

Problem Statement

Imagine you are playing a game on a grid, and you have a piece that can move horizontally, vertically, or diagonally, but only one square at a time. How many moves would it take for your piece to go from its starting position to a target position? Given the coordinates of the starting and target positions, can you compute the Chebyshev distance between them?

Dataset

For simplicity, let's consider a small dataset representing the starting and target positions of a few game moves:

Move  Start (x, y)  Target (x, y)
1     (2, 3)        (5, 6)
2     (1, 4)        (4, 2)
3     (3, 3)        (3, 7)
4     (6, 1)        (2, 5)

In [7]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the dataset
moves = {
    'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
    'Target': [(5, 6), (4, 2), (3, 7), (2, 5)]
}

# Visualize the data
for start, target in zip(moves['Start'], moves['Target']):
    plt.plot([start[0], target[0]], [start[1], target[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Game Moves')
plt.grid(True)
plt.show()
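For reference, SciPy ships a chebyshev function that implements the same idea; a brief sketch (not part of the original notebook) applying it to all four moves before we work through Move 1 by hand:

from scipy.spatial.distance import chebyshev

# Chebyshev distance (minimum number of king-style moves) for each move,
# cross-checked against the max-of-coordinate-differences definition
for i, (start, target) in enumerate(zip(moves['Start'], moves['Target']), start=1):
    d = chebyshev(start, target)
    assert d == max(abs(start[0] - target[0]), abs(start[1] - target[1]))
    print(f"Move {i}: {d} squares")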
From the plot, we can visualize the starting and target positions of the game moves. Each line segment represents a move.

Chebyshev Distance

The Chebyshev distance between two points $P(x_1, y_1)$ and $Q(x_2, y_2)$ in a 2D plane is given by:

$d(P, Q) = \max(|x_2 - x_1|, |y_2 - y_1|)$

Let's compute the Chebyshev distance for Move 1.

In [8]:
# Define a function to compute Chebyshev distance
def chebyshev_distance(point1, point2):
    return max(abs(point1[0] - point2[0]), abs(point1[1] - point2[1]))

# Calculate the distance for Move 1
distance = chebyshev_distance(moves['Start'][0], moves['Target'][0])
print(f"Chebyshev Distance for Move 1: {distance} squares")

Chebyshev Distance for Move 1: 3 squares

The computed distance gives us the minimum number of moves required for the piece to go from the starting to the target position for Move 1.

Visualization

To better understand the concept, let's visualize the Chebyshev distance for Move 1 on our plot.

In [9]:
# Visualize the data with the Chebyshev distance for Move 1
start, target = moves['Start'][0], moves['Target'][0]
plt.scatter(*zip(*[start, target]), color=['blue', 'red'])
plt.plot([start[0], target[0]], [start[1], start[1]], 'g--')
plt.plot([target[0], target[0]], [start[1], target[1]], 'g--')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Chebyshev Distance for Move 1')
plt.grid(True)
plt.show()

The green dashed lines show the horizontal and vertical components of the move. With diagonal moves allowed, the piece covers both components at once, so the minimum number of moves equals the larger of the two displacements, which is exactly the Chebyshev distance.

Conclusion

Chebyshev distance is a useful metric in grid-based games or scenarios, especially when diagonal movements are allowed. In this exercise, we applied it to a game scenario, but its applications can be found in various fields like robotics, pathfinding algorithms, and more.

Exercise 4: Understanding Minkowski Distance

Introduction

The Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance. For two points x and y in an n-dimensional space it is defined as:

$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$

Where:
- $x_i$ and $y_i$ are the i-th coordinates of points x and y, respectively.
- p is the order parameter. When $p = 2$ it becomes the Euclidean distance, and when $p = 1$ it is the Manhattan distance.

Objective

In this exercise, we will:

1. Compute the Minkowski distance for different values of p using a sample dataset.
2. Visualize the results.
3. Interpret the significance of the Minkowski distance and its relation to other distance metrics.

In [10]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt

# Sample Dataset
points = np.array([[2, 3], [3, 5]])

# Minkowski Distance Function
def minkowski_distance(p1, p2, p):
    return np.sum(np.abs(p1 - p2) ** p) ** (1/p)

# Calculate Minkowski Distance for p=1,2,3,4
p_values = [1, 2, 3, 4]
distances = [minkowski_distance(points[0], points[1], p) for p in p_values]
distances

Out[10]: [3.0, 2.23606797749979, 2.080083823051904, 2.0305431848689306]

Explanation

The code above first imports the necessary libraries and defines a sample dataset of two points in a 2D space. We then define the Minkowski distance function and compute the distance for different values of p. The resulting distances for the different p values are:

In [11]:
for p, d in zip(p_values, distances):
    print(f"For p = {p}, Minkowski Distance = {d:.2f}")

For p = 1, Minkowski Distance = 3.00
For p = 2, Minkowski Distance = 2.24
For p = 3, Minkowski Distance = 2.08
For p = 4, Minkowski Distance = 2.03

Visualization

Let's visualize the Minkowski distance for different values of p to better understand its behavior.

In [12]:
plt.plot(p_values, distances, 'o-', color='blue')
plt.xlabel('Value of p')
plt.ylabel('Minkowski Distance')
plt.title('Minkowski Distance for Different p Values')
plt.grid(True)
plt.show()
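To see where the curve plotted above is heading, the following sketch (not part of the original notebook) compares the Minkowski distance at increasingly large p with the Chebyshev distance, its limiting case:

# As p grows, the Minkowski distance approaches the Chebyshev (maximum coordinate difference) distance
chebyshev_d = np.max(np.abs(points[0] - points[1]))
for p in [2, 5, 10, 50]:
    print(f"p = {p:>2}: Minkowski = {minkowski_distance(points[0], points[1], p):.4f}, "
          f"Chebyshev = {chebyshev_d:.4f}")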
Interpretation

From the plot, we can observe that as the value of p increases, the Minkowski distance decreases and tends to stabilize. When p = 1 it is the Manhattan distance, and when p = 2 it is the Euclidean distance. As p grows, the metric is increasingly dominated by the largest coordinate difference and approaches the Chebyshev distance in the limit.

Conclusion

The Minkowski distance is a versatile metric that encompasses other popular distance metrics like Euclidean and Manhattan. By adjusting the p value, we can control how strongly large coordinate differences influence the distance, making it adaptable to various applications.

Exercise 5: Understanding the Dissimilarity Matrix

Introduction

A dissimilarity matrix, also known as a distance matrix, is a square matrix that represents the pairwise dissimilarity between elements of a dataset. Each entry in the matrix represents the distance between two data points. The diagonal of the matrix always contains zeros, as the distance between a point and itself is zero.

Objective

In this exercise, we will:

1. Compute the dissimilarity matrix for a sample dataset using the Euclidean distance.
2. Visualize the matrix using a heatmap.
3. Interpret the significance of the dissimilarity matrix in understanding the relationships between data points.

In [13]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import distance_matrix

# Sample Dataset
data = np.array([[2, 3], [3, 5], [5, 8], [8, 9], [7, 5]])
# Compute Dissimilarity Matrix
dissimilarity = distance_matrix(data, data)
dissimilarity

Out[13]:
array([[0.        , 2.23606798, 5.83095189, 8.48528137, 5.38516481],
       [2.23606798, 0.        , 3.60555128, 6.40312424, 4.        ],
       [5.83095189, 3.60555128, 0.        , 3.16227766, 3.60555128],
       [8.48528137, 6.40312424, 3.16227766, 0.        , 4.12310563],
       [5.38516481, 4.        , 3.60555128, 4.12310563, 0.        ]])

Explanation

The code above first imports the necessary libraries and defines a sample dataset of five points in a 2D space. We then compute the dissimilarity matrix using the distance_matrix function from the scipy.spatial module. The resulting matrix represents the pairwise Euclidean distances between the data points.

In [14]:
sns.heatmap(dissimilarity, annot=True, cmap='YlGnBu', cbar=True)
plt.title('Dissimilarity Matrix Heatmap')
plt.show()

Interpretation

The heatmap visualizes the dissimilarity matrix. The color intensity in the heatmap represents the magnitude of the distance between data points. Darker shades indicate larger distances, while lighter shades indicate smaller distances. From the heatmap, we can observe patterns and relationships between data points. For instance, points that are closer in the dataset will have a lighter shade in the heatmap, indicating a smaller distance between them.

Conclusion

The dissimilarity matrix provides a comprehensive view of the pairwise distances between data points in a dataset. Visualizing this matrix as a heatmap offers insights into the relationships and patterns within the data, aiding in tasks like clustering and
data exploration.

Exercise 6: Understanding the Hamming Distance

Introduction

Hamming distance is a measure of the difference between two strings of equal length. It is the number of positions at which the corresponding symbols in the two strings are different. In other words, it measures the minimum number of substitutions required to change one string into the other.

Objective

In this exercise, we will:

1. Compute the Hamming distance between pairs of strings in a sample dataset.
2. Visualize the distances using a bar chart.
3. Interpret the significance of the Hamming distance in understanding the similarity between strings.

In [15]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt

# Sample Dataset
strings = ["101010", "100010", "111011", "101011", "110010"]

# Compute Hamming Distance
def hamming_distance(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

distances = [hamming_distance(strings[0], s) for s in strings]
distances

Out[15]: [0, 1, 2, 1, 2]

Explanation

The code above first imports the necessary libraries and defines a sample dataset of five binary strings. We then define a function hamming_distance to compute the Hamming distance between two strings. Using this function, we compute the Hamming distances between the first string in the dataset and all other strings.

In [16]:
plt.bar(strings, distances, color='skyblue')
plt.xlabel('Strings')
plt.ylabel('Hamming Distance')
plt.title('Hamming Distance from the First String')
plt.show()
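For reference, SciPy also provides a Hamming function, but it reports the result as a fraction of positions rather than a raw count; a short sketch (not part of the original notebook):

from scipy.spatial.distance import hamming

# SciPy reports the fraction of differing positions; multiply by the length to recover the count
s1, s2 = strings[0], strings[2]
fraction = hamming(list(s1), list(s2))
print(f"Fraction differing: {fraction:.3f}, count: {round(fraction * len(s1))}")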
Interpretation

The bar chart visualizes the Hamming distances between the first string and all other strings in the dataset. The height of each bar represents the number of positions at which the corresponding string differs from the first string. From the chart, we can observe which strings are most similar to the first string based on their Hamming distance. A smaller distance indicates higher similarity.

Conclusion

The Hamming distance provides a simple yet effective measure of similarity between strings. It is especially useful in applications like error detection and correction in coding theory.

Exercise 7: Understanding the Jaccard Similarity for Categorical Data

Introduction

The Jaccard Similarity, also known as the Jaccard Index or Jaccard Coefficient, is a statistic used for comparing the similarity and diversity of sample sets. It is especially useful for categorical data. The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of two sets. Mathematically, the Jaccard Similarity J for two sets A and B is given by:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

In this exercise, we will compute the Jaccard Similarity for two sets of categorical data.

Problem Statement

Given two sets of categorical data representing the preferences of two different groups of people, compute the Jaccard Similarity to determine how similar the two groups are in terms of their preferences.

In [17]:
import numpy as np
import pandas as pd

# Generating a sample dataset
np.random.seed(42)
data = {
    'Group1': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100),
    'Group2': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100)
}
df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()

Out[17]:
   Group1  Group2
0  Cherry  Cherry
1  Date    Banana
2  Apple   Banana
3  Cherry  Date
4  Cherry  Banana

To compute the Jaccard Similarity for the two groups, we will:

1. Determine the unique categories chosen by each group.
2. Compute the intersection of the two sets.
3. Compute the union of the two sets.
4. Calculate the Jaccard Similarity using the formula provided.

Let's proceed with these steps.

In [18]:
# Step 1: Determine the unique categories chosen by each group
group1_unique = set(df['Group1'].unique())
group2_unique = set(df['Group2'].unique())

# Step 2: Compute the intersection of the two sets
intersection = group1_unique.intersection(group2_unique)

# Step 3: Compute the union of the two sets
union = group1_unique.union(group2_unique)

# Step 4: Calculate the Jaccard Similarity
jaccard_similarity = len(intersection) / len(union)
jaccard_similarity

Out[18]: 1.0

The computed Jaccard Similarity gives us a measure of how similar the two groups are in terms of their categorical preferences. A value closer to 1 indicates high similarity, while a value closer to 0 indicates low similarity.

Interpretation

The Jaccard Similarity of 1.0 indicates that the two groups have identical sets of preferences: with 100 samples drawn from only four fruit categories, both groups end up using every category, so the intersection equals the union. This kind of comparison can be useful in scenarios such as recommendation systems, where understanding the similarity between different user groups can help in providing better recommendations.
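To see a value strictly between 0 and 1, here is a tiny sketch with hypothetical preference sets (not drawn from the dataset above) that only partially overlap:

# Hypothetical preference sets that only partially overlap
group_a = {'Apple', 'Banana', 'Cherry'}
group_b = {'Banana', 'Cherry', 'Date'}

jaccard = len(group_a & group_b) / len(group_a | group_b)
print(f"Jaccard similarity: {jaccard}")  # 2 shared out of 4 distinct items -> 0.5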
Visualization

To better understand the overlap between the two groups, let's visualize the unique categories chosen by each group and their intersection.

In [19]:
#!conda install -c conda-forge matplotlib-venn

In [20]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Plotting the unique categories for each group and their intersection
venn2_subsets = (len(group1_unique - group2_unique),
                 len(group2_unique - group1_unique),
                 len(intersection))
plt.figure(figsize=(8, 8))
venn2(subsets=venn2_subsets, set_labels=('Group1', 'Group2'))
plt.title("Venn Diagram of Preferences for Group1 and Group2")
plt.show()
The Venn diagram visually represents the unique preferences of each group and the common preferences between them. The overlapping region indicates the common categories chosen by both groups.

Conclusion

Understanding the Jaccard Similarity for categorical data provides insights into the similarity between different sets of data. It is a powerful tool for comparing sets and has applications in various domains, including machine learning, data analysis, and recommendation systems.

Exercise 8: Understanding the Gower Distance for Mixed-Type Data

Introduction

When working with datasets that have a mix of numerical and categorical variables, traditional distance metrics like Euclidean or Manhattan might not be suitable. This is where the Gower Distance comes into play. It is specifically designed to handle mixed-type data by computing distances in a way that respects the nature of each variable. The Gower Distance is computed by standardizing each variable type's distance to the range [0, 1] and then taking an average.

Problem Statement

Given a dataset with a mix of numerical and categorical attributes, compute the Gower Distance between pairs of data points to determine their similarity.

In [21]:
import numpy as np
import pandas as pd

# Generating a sample dataset
np.random.seed(42)
data = {
    'Age': np.random.randint(20, 60, 100),
    'Income': np.random.randint(30000, 80000, 100),
    'Fruit_Preference': np.random.choice(['Apple', 'Banana', 'Cherry'], 100),
    'Is_Smoker': np.random.choice([True, False], 100)
}
df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()

Out[21]:
   Age  Income  Fruit_Preference  Is_Smoker
0  58   53599   Banana            True
1  48   65222   Cherry            False
2  34   41837   Cherry            True
3  27   44039   Cherry            False
4  40   60818   Cherry            True

To compute the Gower Distance for the data points, we will:
1. Normalize numerical attributes to the range [0, 1].
2. Compute a dissimilarity matrix for each attribute.
3. Combine the dissimilarity matrices using a weighted average.

Let's proceed with these steps.

In [22]:
#!pip install gower

In [23]:
import gower

# Compute the Gower Distance matrix
gower_distance_matrix = gower.gower_matrix(df)

# Display a portion of the Gower Distance matrix
gower_distance_matrix[:5, :5]

Out[23]:
array([[0.        , 0.6226343 , 0.46307787, 0.74686074, 0.40173846],
       [0.6226343 , 0.        , 0.45750707, 0.24128991, 0.32345995],
       [0.46307787, 0.45750707, 0.        , 0.30596074, 0.1340471 ],
       [0.74686074, 0.24128991, 0.30596074, 0.        , 0.41782996],
       [0.40173846, 0.32345995, 0.1340471 , 0.41782996, 0.        ]], dtype=float32)

The Gower Distance matrix provides a pairwise distance between each data point in our dataset. Each value in the matrix represents the Gower Distance between two data points, with smaller values indicating higher similarity.

Interpretation

The Gower Distance is especially useful when we want to cluster or classify mixed-type data. A smaller Gower Distance between two data points suggests that they are more similar in terms of both their numerical and categorical attributes.

Visualization

To better understand the Gower Distance, let's visualize the distance matrix using a heatmap.

In [24]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(gower_distance_matrix, cmap='viridis', cbar=True)
plt.title("Gower Distance Heatmap")
plt.show()
The heatmap visually represents the Gower Distances between data points. Darker colors indicate smaller distances (higher similarity), while lighter colors indicate larger distances (lower similarity).

Conclusion

The Gower Distance is a powerful metric for computing distances in mixed-type datasets. It provides a standardized way to compare data points with both numerical and categorical attributes, making it invaluable for clustering, classification, and other machine learning tasks on mixed-type data.

Exercise 9: Understanding the Jaccard Coefficient for Binary Data

Introduction

The Jaccard Coefficient, also known as the Jaccard Index or Jaccard Similarity, is a statistic used for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of two sets. For binary data, it can be used to measure the similarity between two binary vectors.

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Where:
- A and B are two sets.
- $|A \cap B|$ is the size of the intersection of sets A and B.
- $|A \cup B|$ is the size of the union of sets A and B.

Problem Statement

Given a dataset with binary attributes, compute the Jaccard Coefficient between pairs of data points to determine their similarity.

In [25]:
import numpy as np
import pandas as pd

# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
    'Bought_Apple': np.random.choice([0, 1], 100),
    'Bought_Banana': np.random.choice([0, 1], 100),
    'Bought_Cherry': np.random.choice([0, 1], 100),
    'Is_Vegetarian': np.random.choice([0, 1], 100)
}
df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()

Out[25]:
   Bought_Apple  Bought_Banana  Bought_Cherry  Is_Vegetarian
0  0             0              0              0
1  1             1              1              0
2  0             1              0              0
3  0             1              0              1
4  0             1              1              1

To compute the Jaccard Coefficient for the data points, we will:

1. Calculate the intersection of the binary attributes.
2. Calculate the union of the binary attributes.
3. Use the Jaccard formula to compute the coefficient.

Let's proceed with these steps.

In [26]:
from sklearn.metrics import jaccard_score

# Compute the Jaccard Coefficient for the first two data points as an example
data_point_1 = df.iloc[0]
data_point_2 = df.iloc[1]

jaccard_coefficient = jaccard_score(data_point_1, data_point_2, average='macro')
jaccard_coefficient
Out[26]: 0.125

The computed Jaccard Coefficient provides a measure of similarity between the two binary vectors. A value of 1 indicates that the vectors are identical, while a value of 0 indicates no similarity.

Interpretation

The Jaccard Coefficient is a measure of similarity between two binary vectors. It takes into account the presence and absence of attributes, making it a robust metric for binary data comparison.

Visualization

To better understand the distribution of Jaccard Coefficients in our dataset, let's visualize the coefficients for all pairs of data points using a histogram.

In [27]:
jaccard_coefficients = []

# Compute Jaccard Coefficients for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = jaccard_score(df.iloc[i], df.iloc[j], average='macro')
        jaccard_coefficients.append(coef)

# Plotting the histogram
plt.hist(jaccard_coefficients, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Jaccard Coefficients")
plt.xlabel("Jaccard Coefficient")
plt.ylabel("Frequency")
plt.show()

The histogram visually represents the distribution of Jaccard Coefficients for all pairs of data points in our dataset. This gives us an insight into how similar or dissimilar the data points are to each other.
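Note that jaccard_score with average='macro' averages a per-label Jaccard over the 0 and 1 classes, which differs from the set-style coefficient defined in the introduction. A short sketch (not part of the original notebook) computing the set-style version on the items each customer actually bought, using SciPy:

from scipy.spatial.distance import jaccard as jaccard_dissimilarity

# Set-style Jaccard on the positions where either vector is 1 (SciPy returns a dissimilarity)
set_style_similarity = 1 - jaccard_dissimilarity(df.iloc[0].values, df.iloc[1].values)
# Customer 0 bought nothing, so the set of items shared with customer 1 is empty
print(f"Set-style Jaccard similarity: {set_style_similarity}")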
Conclusion

The Jaccard Coefficient is a powerful metric for comparing binary vectors. It provides a standardized way to measure the similarity between two sets, making it invaluable for tasks like clustering, classification, and other machine learning tasks on binary data.

Exercise 10: Understanding the Cosine Similarity for Binary Data

Introduction

Cosine similarity is a metric used to determine how similar two sets of data are. It measures the cosine of the angle between two vectors projected in a multi-dimensional space. For binary data, it can be used to measure the similarity between two binary vectors.

$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$

Where:
- A and B are two vectors.
- $A \cdot B$ is the dot product of the vectors.
- $\|A\|$ and $\|B\|$ are the magnitudes (or lengths) of the vectors.

Problem Statement

Given a dataset with binary attributes, compute the cosine similarity between pairs of data points to determine their similarity.

In [28]:
import numpy as np
import pandas as pd

# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
    'Bought_Apple': np.random.choice([0, 1], 100),
    'Bought_Banana': np.random.choice([0, 1], 100),
    'Bought_Cherry': np.random.choice([0, 1], 100),
    'Is_Vegetarian': np.random.choice([0, 1], 100)
}
df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()

Out[28]:
   Bought_Apple  Bought_Banana  Bought_Cherry  Is_Vegetarian
0  0             0              0              0
1  1             1              1              0
2  0             1              0              0
3  0             1              0              1
4  0             1              1              1

To compute the cosine similarity for the data points, we will:

1. Calculate the dot product of the binary attributes.
2. Calculate the magnitude of each binary vector.
3. Use the cosine similarity formula to compute the similarity.

Let's proceed with these steps.

In [29]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity for the first two data points as an example
data_point_1 = df.iloc[0].values.reshape(1, -1)
data_point_2 = df.iloc[1].values.reshape(1, -1)

cosine_sim = cosine_similarity(data_point_1, data_point_2)
cosine_sim[0][0]

Out[29]: 0.0

The computed cosine similarity provides a measure of similarity between the two binary vectors. A value close to 1 indicates that the vectors are very similar, while a value close to 0 indicates they are dissimilar. (Here the first data point is the all-zeros vector, for which scikit-learn reports a similarity of 0 to any other vector.)

Interpretation

Cosine similarity is a measure of similarity between two non-zero vectors. By using the cosine of the angle between them, we can determine how similar they are regardless of their size.

Visualization

To better understand the distribution of cosine similarities in our dataset, let's visualize the similarities for all pairs of data points using a histogram.

In [30]:
cosine_similarities = []

# Compute cosine similarities for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = cosine_similarity(df.iloc[i].values.reshape(1, -1),
                                 df.iloc[j].values.reshape(1, -1))
        cosine_similarities.append(coef[0][0])

# Plotting the histogram
import matplotlib.pyplot as plt
plt.hist(cosine_similarities, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Cosine Similarities")
plt.xlabel("Cosine Similarity")
plt.ylabel("Frequency")
plt.show()
The histogram visually represents the distribution of cosine similarities for all pairs of data points in our dataset. This gives us an insight into how similar or dissimilar the data points are to each other.

Conclusion

Cosine similarity is a powerful metric for comparing vectors, especially in high-dimensional spaces. For binary data, it offers a robust way to measure the similarity between two sets of attributes, making it invaluable for tasks like clustering, classification, and other machine learning tasks on binary data.

Evaluating Clustering Methods

Evaluating the results of clustering methods is critical to understand the quality and relevance of the clusters formed. Since clustering is unsupervised, assessing its effectiveness can be somewhat subjective. However, there are established metrics and techniques to guide this evaluation, both when ground truth labels are available and when they aren't.

Internal Evaluation: Without ground truth labels, evaluate clustering based on the dataset's intrinsic properties.

Metrics:

1. Silhouette Coefficient: Compares similarity of data points to their own cluster against other clusters. Values range from -1 (incorrect clustering) to 1 (highly dense clustering), with 0 suggesting overlapping clusters.
2. Davies-Bouldin Index: A ratio of within-cluster and between-cluster distances. Lower values indicate better clustering.
3. Calinski-Harabasz Index: Compares between-cluster dispersion to within-cluster dispersion. Higher values indicate better-defined clusters.
4. Dunn Index: Ratio of the smallest distance between points in different clusters to the
largest intra-cluster distance. Higher values indicate better clustering.

Relative Evaluation: This involves comparing the results of clustering for different configurations or numbers of clusters.

Techniques:

1. Elbow Method: Used with K-means to determine the optimal number of clusters. Plot the variance explained (or inertia) against the number of clusters. The "elbow" point, where the rate of decrease sharply changes, often indicates an optimal number of clusters.

Stability and Consistency: Evaluate the robustness of clusters by perturbing the dataset.

Techniques:

1. Sub-sampling or Bootstrapping: Repeatedly sample subsets of data and perform clustering. Examine the consistency of the clustering results.
2. Adding Noise: Introduce random noise to the data. Stable clusters should remain relatively unchanged.

Challenges and Considerations:

1. Subjectivity: Without a definitive "correct" clustering, some evaluation aspects remain subjective.
2. Scale Sensitivity: Some metrics need data normalization or standardization.
3. Choice of Metric: Different metrics might give varied evaluations for the same clustering result.

In conclusion, while evaluating clustering methods, it's often beneficial to consider multiple metrics and, when possible, combine them with domain knowledge to get a comprehensive view of the clustering quality.

Exercise 11: Evaluating Clustering with Silhouette Coefficient

Introduction

The Silhouette Coefficient is a metric used to calculate the goodness of a clustering algorithm. Its value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. The formula for a single sample is:

$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

Where:
- $a(i)$ is the mean distance between the sample and all other points in the same cluster.
- $b(i)$ is the mean distance between the sample and all other points in the nearest cluster that the sample is not a part of.

Problem Statement

Given a dataset, apply a clustering algorithm and evaluate its performance using the Silhouette Coefficient. The goal is to understand how well the data has been clustered.

In [31]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points") plt.show() We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and then evaluate the clustering using the Silhouette Coefficient. The steps are as follows: 1. Apply KMeans clustering to the dataset. 2. Calculate the Silhouette Coefficient for the clustering. 3. Visualize the clusters. 4. Interpret the results. In [32]: from sklearn.cluster import KMeans from sklearn.metrics import silhouette_score import warnings warnings.filterwarnings('ignore') # Applying KMeans clustering kmeans = KMeans(n_clusters=4) predicted_clusters = kmeans.fit_predict(X) # Calculating the Silhouette Coefficient sil_coeff = silhouette_score(X, predicted_clusters, metric='euclidean') sil_coeff Out[32]: 0.6819938690643478 The Silhouette Coefficient gives a perspective into the distance between the resulting clusters. More distant clusters lead to better clusterings. Visualization Let's visualize the clusters formed by the KMeans algorithm. In [33]: plt.scatter(X[:, 0], X[:, 1], c=predicted_clusters, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("Clusters Formed by KMeans")
plt.show()

The visualization shows the clusters formed by the KMeans algorithm. The red 'X' markers represent the centroids of the clusters.

Interpretation

The Silhouette Coefficient provides a succinct metric to measure how close each point in one cluster is to the points in the neighboring clusters. Its values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. In our exercise, the Silhouette Coefficient value suggests that the clustering is reasonably well done.

Conclusion

The Silhouette Coefficient is an effective metric to evaluate the quality of clusters created by a clustering algorithm. It provides insight into the separation distance between the resulting clusters. More distant clusters lead to better clustering.

Exercise 12: Evaluating Clustering with Davies-Bouldin Index

Introduction

The Davies-Bouldin Index (DBI) is a metric used to evaluate clustering algorithms. The index signifies the average 'similarity' ratio between each cluster and its most similar cluster. A lower Davies-Bouldin Index relates to a model with better separation between the clusters. The formula for the Davies-Bouldin Index is:

$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$

Where:
- $s_i$ is the average distance of all points in cluster i to the centroid of cluster i.
- $d_{ij}$ is the distance between cluster centroids i and j.
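To connect this formula with the library call used below, here is a minimal, self-contained sketch (not part of the original notebook; the function name davies_bouldin_by_hand and the X_demo/labels_demo variables are illustrative) that computes the index directly from the definition and compares it with sklearn's davies_bouldin_score:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

def davies_bouldin_by_hand(X, labels):
    # Average over clusters of max_{j != i} (s_i + s_j) / d_ij, following the formula above
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # s_i: mean distance of each cluster's points to its own centroid
    s = np.array([np.mean(np.linalg.norm(X[labels == c] - centroids[i], axis=1))
                  for i, c in enumerate(clusters)])
    ratios = []
    for i in range(len(clusters)):
        worst = max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                    for j in range(len(clusters)) if j != i)
        ratios.append(worst)
    return float(np.mean(ratios))

# Small demonstration on synthetic blobs; the two values should agree up to floating point
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
print(davies_bouldin_by_hand(X_demo, labels_demo))
print(davies_bouldin_score(X_demo, labels_demo))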
Problem Statement

Given a dataset, apply a clustering algorithm and evaluate its performance using the Davies-Bouldin Index. The goal is to understand how well the data has been clustered.

In [34]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and then evaluate the clustering using the Davies-Bouldin Index.

In [35]:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Davies-Bouldin Index
dbi = davies_bouldin_score(X, predicted_clusters)
dbi

Out[35]: 0.43756400782378396

The Davies-Bouldin Index gives a measure of the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering.
Interpretation

The Davies-Bouldin Index provides a measure of the average similarity ratio of each cluster with its most similar cluster. Lower values of the index indicate better clustering, implying that the clusters are dense and well separated. In our exercise, the Davies-Bouldin Index value suggests that the clustering has been done effectively, with a good separation between the clusters.

Conclusion

The Davies-Bouldin Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and density of the resulting clusters. Lower values of the index are desirable as they indicate better clustering.

Exercise 13: Evaluating Clustering with Calinski-Harabasz Index

Introduction

The Calinski-Harabasz Index (CHI), also known as the Variance Ratio Criterion, is a metric used to evaluate clustering algorithms. The index is the ratio of the sum of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index relates to a model with better-defined clusters. The formula for the Calinski-Harabasz Index is:

$CHI = \frac{B / (k - 1)}{W / (N - k)}$

Where:
- B is the between-cluster dispersion (the trace of the between-cluster scatter matrix).
- W is the within-cluster dispersion (the trace of the within-cluster scatter matrix).
- k is the number of clusters.
- N is the number of data points.

Problem Statement

Given a dataset, apply a clustering algorithm and evaluate its performance using the Calinski-Harabasz Index. The goal is to understand how well the data has been clustered.

In [36]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
In [37]:
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Calinski-Harabasz Index
chi = calinski_harabasz_score(X, predicted_clusters)
chi

Out[37]: 1210.0899142587818

The Calinski-Harabasz Index gives a measure of the ratio of the sum of between-cluster dispersion to within-cluster dispersion. Higher values indicate better clustering.

Interpretation

The Calinski-Harabasz Index provides a measure of the ratio of the sum of between-cluster dispersion to within-cluster dispersion. Higher values of the index indicate better clustering, implying that the clusters are dense and well separated. In our exercise, the Calinski-Harabasz Index value suggests that the clustering has been done effectively, with a good separation between the clusters.

Conclusion

The Calinski-Harabasz Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and density of the resulting clusters. Higher values of the index are desirable as they indicate better clustering.

Exercise 14: Evaluating Clustering with Dunn Index

Introduction

The Dunn Index is a metric used to determine the compactness and separation of clusters. A higher Dunn Index indicates better clustering. The index is defined as the ratio between the minimum inter-cluster distance and the maximum intra-cluster diameter.
The formula for the Dunn Index is:

$DI = \frac{\min(\text{distance between clusters})}{\max(\text{diameter of clusters})}$

Where:
- The distance between two clusters is the smallest distance between a point in one cluster and a point in the other.
- The diameter of a cluster is the distance between its two furthest points.

Problem Statement

Given a dataset, apply a clustering algorithm and evaluate its performance using the Dunn Index. The goal is to understand how well the data has been clustered.

In [38]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()

In [39]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
import numpy as np

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Dunn Index
def dunn_index(X, labels):
    pairwise_dists = pairwise_distances(X)
    min_intercluster_distance = np.min([pairwise_dists[labels == i][:, labels == j].min()
                                        for i in np.unique(labels)
                                        for j in np.unique(labels) if i != j])
    max_diameter = max([np.max(pairwise_distances(X[labels == i])) for i in np.unique(labels)])
    return min_intercluster_distance / max_diameter

di = dunn_index(X, predicted_clusters)
di

Out[39]: 0.20231427477727162

The Dunn Index provides a measure of the compactness and separation of the clusters. A higher Dunn Index indicates better clustering.

Interpretation

The Dunn Index provides a measure of the compactness and separation of the clusters. A higher Dunn Index suggests that the clusters are compact and well-separated from each other. In our exercise, the Dunn Index value suggests that the clustering has been done effectively, with a good separation between the clusters.

Conclusion

The Dunn Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and compactness of the resulting clusters. Higher values of the index are desirable as they indicate better clustering.

Exercise 15: Validating Clustering using the Elbow Method

Introduction

The Elbow Method is a heuristic used in determining the number of clusters in a dataset. The idea is to run k-means clustering on the dataset for a range of values of k (e.g., k from 1 to 10), and then for each value of k compute the sum of squared distances from each point to its assigned center. When these overall dispersions are plotted against the k values, the "elbow" of the curve represents an optimal value for k (a balance between precision and computational cost).

Problem Statement

Given a dataset, we aim to find the optimal number of clusters into which the data may be clustered. The goal is to understand the concept of the Elbow Method and how it can be used to determine the optimal number of clusters for a given dataset.

In [40]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset for a range of k values and compute the sum of squared distances for each value of k. We will then plot these values to visualize the "elbow" and determine the optimal number of clusters. The steps are as follows:

1. Apply KMeans clustering to the dataset for a range of k values.
2. Compute the sum of squared distances for each k.
3. Plot the sum of squared distances against the k values.
4. Determine the "elbow" or the point of inflection on the curve.

In [41]:
from sklearn.cluster import KMeans

# Applying KMeans clustering for a range of k values
wcss = []  # Within-Cluster-Sum-of-Squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plotting the Elbow Method graph
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
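As a complementary check (not part of the original notebook), the silhouette score from Exercise 11 can be computed over the same range of k; the value of k that maximizes it will often agree with the elbow:

from sklearn.metrics import silhouette_score

# Silhouette score for k = 2..10 on the same dataset (the score is undefined for k = 1)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, init='k-means++', random_state=42).fit_predict(X)
    print(f"k = {k}: silhouette = {silhouette_score(X, labels):.3f}")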
The Elbow Method graph shows the sum of squared distances (WCSS) for different values of k. As k increases, the sum of squared distances tends toward zero. The location of the "elbow" on the curve is generally considered an indicator of the appropriate number of clusters.

Interpretation

From the graph, we can observe that the "elbow" is formed when the number of clusters is around 4. This suggests that the optimal number of clusters for this dataset is 4.

Conclusion

The Elbow Method is a powerful technique to determine the optimal number of clusters for a dataset. It provides a visual representation of how the sum of squared distances changes with different numbers of clusters, helping in making an informed decision about the number of clusters to use.

Exercise 16: Validating Clustering using Cohesion

Introduction

Cohesion measures the closeness of the data points within the same cluster. It is commonly measured as the total squared distance of each sample to its cluster centroid (KMeans reports this as the inertia). Lower values of cohesion indicate that the data points are closer to the centroids of their respective clusters, which implies better clustering.

Problem Statement

Given a dataset, we aim to find the cohesion value after applying the KMeans clustering algorithm. The goal is to understand the concept of cohesion and how it can be used to evaluate the quality of clustering.

In [42]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and compute the cohesion value. The steps are as follows:

1. Apply KMeans clustering to the dataset.
2. Compute the cohesion value.
3. Interpret the result.

In [43]:
from sklearn.cluster import KMeans

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)

# Computing the cohesion value
cohesion = kmeans.inertia_
cohesion

Out[43]: 212.00599621083478

Interpretation

The cohesion value represents the sum of squared distances of samples to their closest cluster center. A lower cohesion value indicates that the data points are closer to the centroids of their respective clusters, suggesting better clustering. However, it's essential to balance cohesion with other metrics and the problem's context to
determine the best number of clusters.

Conclusion

Cohesion is a valuable metric to evaluate the quality of clustering. It provides a measure of how close the data points are to their respective cluster centroids. By analyzing cohesion and other metrics, we can make informed decisions about the clustering process and its effectiveness.

Exercise 17: Validating Clustering using Separation Score

Introduction

Separation measures how distinct or well-separated the clusters are from each other. It is commonly summarized by how spread out the cluster centroids are, for example the total distance or the total variance between them. Higher values of separation indicate that the clusters are well-separated, which implies better clustering.

Problem Statement

Given a dataset, we aim to find the separation score after applying the KMeans clustering algorithm. The goal is to understand the concept of separation and how it can be used to evaluate the quality of clustering.

In [44]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()

We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and compute the separation score. The steps are as follows:

1. Apply KMeans clustering to the dataset.
2. Compute the separation score.
3. Interpret the result.

In [45]:
from sklearn.cluster import KMeans
import numpy as np

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)

# Computing the separation score
centroids = kmeans.cluster_centers_
separation = np.sum(np.var(centroids, axis=0))
separation

Out[45]: 8.66710533030799

Interpretation

The separation score computed here represents the total variance between the cluster centroids. A higher separation score indicates that the clusters are well-separated from each other, suggesting better clustering. However, it's essential to balance separation with other metrics and the problem's context to determine the best number of clusters.

Conclusion

Separation is a valuable metric to evaluate the quality of clustering. It provides a measure of how distinct or well-separated the clusters are from each other. By analyzing separation and other metrics, we can make informed decisions about the clustering process and its effectiveness.

Revised Date: October 23, 2023

In [ ]: