IE6400_Day18
IE6400 Foundations of Data Analytics Engineering
¶
Fall 2023
¶
Module 3: Clustering Methods Part -1
¶
Proximity Measures
¶
Proximity measures are metrics used to determine the similarity or dissimilarity between data points. They play a crucial role in various machine learning and data analysis techniques, especially clustering and classification. The choice of a proximity measure often depends on the nature of the data and the specific problem at hand.
Types of Proximity Measures
¶
There are two main types of proximity measures:
1. Similarity Measures: These quantify how similar two data points are. Higher values indicate greater similarity.
2. Dissimilarity Measures (or Distance Measures): These represent the "distance" or dissimilarity between two data points. Higher values indicate greater dissimilarity.
Common Proximity Measures
¶
For Continuous Data:
¶
1. Euclidean Distance:
• It's the "ordinary" straight-line distance between two points in Euclidean space.
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
2. Manhattan Distance (or L1 norm):
• It's the distance between two points measured along axes at right angles (taxicab or city block distance).
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
3. Minkowski Distance:
• A generalized metric. When $p = 2$, it becomes the Euclidean distance; when $p = 1$, it is the Manhattan distance.
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$
For Categorical Data:
¶
1. Hamming Distance:
• Used for categorical variables. It is the number of positions at which the corresponding symbols in two strings of equal length differ.
2. Jaccard Similarity:
• Measures the similarity between two sets: the size of the intersection divided by the size of the union.
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
For Mixed-Type Data:
¶
1. Gower Distance:
• Combines various distance metrics for mixed-type data.
For Binary Data:
¶
1. Jaccard Coefficient:
• Similar to the Jaccard similarity but specifically tailored for binary attributes.
2. Cosine Similarity:
• Measures the cosine of the angle between two non-zero vectors. It is often used in text analysis to determine the similarity between documents.
$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$
Choosing the Right Proximity Measure
¶
The choice of proximity measure depends on:
• Nature of Data: Continuous, categorical, binary, or mixed.
• Domain Knowledge: The context in which data points are being compared.
• Problem Specifics: Certain problems may necessitate specific measures.
In general, it is crucial to understand the data and the problem's requirements before choosing a proximity measure.
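Most of the measures above are available directly in SciPy's scipy.spatial.distance module. The short sketch below is illustrative only (the toy vectors are made up, not drawn from the exercises that follow) and shows one way each measure could be computed in practice.
# Illustrative sketch: common proximity measures via scipy (toy vectors)
import numpy as np
from scipy.spatial import distance

x = np.array([5, 3, 9])
y = np.array([2, 8, 1])

print("Euclidean:", distance.euclidean(x, y))             # straight-line distance
print("Manhattan:", distance.cityblock(x, y))             # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))  # generalized metric

# Binary / categorical examples
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 1, 0, 0], dtype=bool)
print("Hamming count:", distance.hamming(a, b) * len(a))  # scipy returns a fraction
print("Jaccard similarity:", 1 - distance.jaccard(a, b))  # scipy returns dissimilarity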
Exercise 1 Understanding Euclidean Distance
¶
Introduction
¶
Euclidean distance is a measure of the straight line distance between two points in Euclidean space. It's a fundamental concept in mathematics and has wide applications in machine learning, especially in clustering algorithms like K-Means.
Problem Statement
¶
Imagine you are working on a recommendation system for a retail store. You have data on the purchase history of customers for two products: A and B. You want to find out how similar two customers are based on their purchase patterns of these two products. One way to measure this similarity is by calculating the Euclidean distance between their purchase histories.
Given the purchase data for two products, can you compute the Euclidean distance between two customers?
Dataset
¶
For simplicity, let's consider a small dataset representing the number of units of products A and B bought by different customers:
Customer   Product A   Product B
1          5           3
2          2           8
3          9           1
4          4           7
In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
customers = np.array([[5, 3], [2, 8], [9, 1], [4, 7]])
# Visualize the data
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Purchase History of Customers')
plt.grid(True)
plt.show()
From the scatter plot, we can visualize the purchase patterns of the four customers. Each point represents a customer's purchase history for products A and B.
Euclidean Distance
¶
The Euclidean distance between two points ( $P(x_1, y_1)$ ) and ( $Q(x_2, y_2)$ ) in a 2D plane is given by:
$d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Let's compute the Euclidean distance between Customer 1 and Customer 2.
In [2]:
# Define a function to compute Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))
# Calculate the distance between Customer 1 and Customer 2
distance = euclidean_distance(customers[0], customers[1])
print(f"Euclidean Distance between Customer 1 and Customer 2: {distance:.2f}")
Euclidean Distance between Customer 1 and Customer 2: 5.83
The computed distance gives us a measure of how similar or dissimilar the two customers are based on their purchase patterns. A smaller distance indicates similar purchase behaviors, while a larger distance indicates dissimilarity.
Visualization
¶
To better understand the concept, let's visualize the Euclidean distance between Customer 1 and Customer 2 on our scatter plot.
In [3]:
# Visualize the data with the Euclidean distance
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.plot([customers[0][0], customers[1][0]], [customers[0][1], customers[1][1]], 'ro-')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Euclidean Distance between Customer 1 and Customer 2')
plt.grid(True)
plt.show()
The red line represents the Euclidean distance between Customer 1 and Customer 2. As we can see, despite both customers purchasing different quantities of products A and B, we can quantify their similarity using this distance.
Conclusion
¶
Euclidean distance is a powerful metric to measure the similarity between data points. In this exercise, we applied it to a retail scenario, but its applications are vast, spanning areas like clustering, image processing, and more.
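As a possible follow-up (not part of the original exercise), the same calculation extends to all four customers at once; scipy.spatial.distance.cdist is one way to build the full pairwise distance matrix.
# Hypothetical extension: pairwise Euclidean distances for all four customers
import numpy as np
from scipy.spatial.distance import cdist

customers = np.array([[5, 3], [2, 8], [9, 1], [4, 7]])
pairwise = cdist(customers, customers, metric='euclidean')
print(np.round(pairwise, 2))
# Entry (i, j) is the distance between customers i+1 and j+1; the diagonal is
# zero, and entry (0, 1) reproduces the 5.83 computed above.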
Exercise 2 Understanding Manhattan Distance
¶
Introduction
¶
Manhattan distance, also known as L1 distance or taxicab distance, is a measure of distance between two points in a grid-based path. It is computed as the sum of the absolute differences of their coordinates. Unlike the Euclidean distance, which measures the shortest path (a straight line), the Manhattan distance measures the distance "traveled" on a grid.
Problem Statement
¶
Imagine you are a taxi driver in a city where the roads are laid out in a perfect grid. You pick up a passenger at one intersection and need to drop them off at another. The Manhattan distance gives you the total number of blocks you'd need to drive, in a straight line horizontally and/or vertically, to reach the destination.
Given the coordinates of the starting and ending intersections, can you compute the Manhattan distance between them?
Dataset
¶
For simplicity, let's consider a small dataset representing the starting and ending coordinates of a few taxi rides:
Ride   Start (x, y)   End (x, y)
1      (2, 3)         (5, 6)
2      (1, 4)         (4, 2)
3      (3, 3)         (3, 7)
4      (6, 1)         (2, 5)
In [4]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
rides = {
'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
'End': [(5, 6), (4, 2), (3, 7), (2, 5)]
}
# Visualize the data
for start, end in zip(rides['Start'], rides['End']):
    plt.plot([start[0], end[0]], [start[1], end[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Taxi Rides')
plt.grid(True)
plt.show()
From the plot, we can visualize the starting and ending points of the taxi rides. Each line segment represents a ride.
Manhattan Distance
¶
The Manhattan distance between two points ( $P(x_1, y_1)$ ) and ( $Q(x_2, y_2)$ ) in a
2D plane is given by:
$d(P, Q) = |x_2 - x_1| + |y_2 - y_1|$
Let's compute the Manhattan distance for Ride 1.
In [5]:
# Define a function to compute Manhattan distance
def manhattan_distance(point1, point2):
    return abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])
# Calculate the distance for Ride 1
distance = manhattan_distance(rides['Start'][0], rides['End'][0])
print(f"Manhattan Distance for Ride 1: {distance} blocks")
Manhattan Distance for Ride 1: 6 blocks
The computed distance gives us the total number of blocks the taxi would need to drive to get from the starting to the ending intersection for Ride 1.
Visualization
¶
To better understand the concept, let's visualize the Manhattan distance for Ride 1 on our plot.
In [6]:
# Visualize the data with the Manhattan distance for Ride 1
start, end = rides['Start'][0], rides['End'][0]
plt.plot([start[0], end[0]], [start[1], start[1]], 'bo-')
plt.plot([end[0], end[0]], [start[1], end[1]], 'bo-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Manhattan Distance for Ride 1')
plt.grid(True)
plt.show()
The blue lines represent the Manhattan distance for Ride 1. As we can see, the taxi
would first travel horizontally and then vertically to reach the destination, covering a total distance equal to the sum of the lengths of these two segments.
Conclusion
¶
Manhattan distance is a useful metric in grid-based environments, such as urban city blocks. In this exercise, we applied it to a taxi scenario, but its applications are vast, spanning areas like computer vision, robotics, and more.
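As an optional cross-check (not in the original exercise), the same distances can be computed for every ride at once with NumPy, or with SciPy's cityblock function.
# Sketch: Manhattan distance for all rides, plus a scipy cross-check for Ride 1
import numpy as np
from scipy.spatial.distance import cityblock

starts = np.array([(2, 3), (1, 4), (3, 3), (6, 1)])
ends = np.array([(5, 6), (4, 2), (3, 7), (2, 5)])

ride_distances = np.abs(starts - ends).sum(axis=1)  # sum of absolute differences per ride
print(ride_distances)                               # [6 5 4 8]
print(cityblock(starts[0], ends[0]))                # 6, matching Ride 1 above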
Exercise 3 Understanding Chebyshev Distance
¶
Introduction
¶
Chebyshev distance, also known as the maximum value distance, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It's often used in chess to calculate the minimum number of moves a king can take to go from one square to another.
Problem Statement
¶
Imagine you are playing a game on a grid, and you have a piece that can move horizontally, vertically, or diagonally, but only one square at a time. How many moves would it take for your piece to go from its starting position to a target position?
Given the coordinates of the starting and target positions, can you compute the Chebyshev distance between them?
Dataset
¶
For simplicity, let's consider a small dataset representing the starting and target positions of a few game moves:
Move   Start (x, y)   Target (x, y)
1      (2, 3)         (5, 6)
2      (1, 4)         (4, 2)
3      (3, 3)         (3, 7)
4      (6, 1)         (2, 5)
In [7]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
moves = {
'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
'Target': [(5, 6), (4, 2), (3, 7), (2, 5)]
}
# Visualize the data
for start, target in zip(moves['Start'], moves['Target']):
    plt.plot([start[0], target[0]], [start[1], target[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Game Moves')
plt.grid(True)
plt.show()
From the plot, we can visualize the starting and target positions of the game moves. Each line segment represents a move.
Chebyshev Distance
¶
The Chebyshev distance between two points ( $P(x_1, y_1)$ ) and ( $Q(x_2, y_2)$ ) in a 2D plane is given by:
$d(P, Q) = \max(|x_2 - x_1|, |y_2 - y_1|)$
Let's compute the Chebyshev distance for Move 1.
In [8]:
# Define a function to compute Chebyshev distance
def chebyshev_distance(point1, point2):
    return max(abs(point1[0] - point2[0]), abs(point1[1] - point2[1]))
# Calculate the distance for Move 1
distance = chebyshev_distance(moves['Start'][0], moves['Target'][0])
print(f"Chebyshev Distance for Move 1: {distance} squares")
Chebyshev Distance for Move 1: 3 squares
The computed distance gives us the minimum number of moves required for the piece to go from the starting to the target position for Move 1.
Visualization
¶
To better understand the concept, let's visualize the Chebyshev distance for Move 1 on
our plot.
In [9]:
# Visualize the data with the Chebyshev distance for Move 1
start, target = moves['Start'][0], moves['Target'][0]
plt.scatter(*zip(*[start, target]), color=['blue', 'red'])
plt.plot([start[0], target[0]], [start[1], start[1]], 'g--')
plt.plot([target[0], target[0]], [start[1], target[1]], 'g--')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Chebyshev Distance for Move 1')
plt.grid(True)
plt.show()
The green dashed lines represent the possible paths the piece can take to reach the target position using the Chebyshev distance. As we can see, the piece would move either horizontally or vertically (whichever is greater) to reach the target position.
Conclusion
¶
Chebyshev distance is a useful metric in grid-based games or scenarios, especially when diagonal movements are allowed. In this exercise, we applied it to a game scenario, but its applications can be found in various fields like robotics, pathfinding algorithms, and more.
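For reference (a small assumed cross-check, not part of the original exercise), SciPy's chebyshev function reproduces the same values for all four moves.
# Sketch: Chebyshev distance for every move via scipy
from scipy.spatial.distance import chebyshev

starts = [(2, 3), (1, 4), (3, 3), (6, 1)]
targets = [(5, 6), (4, 2), (3, 7), (2, 5)]

for i, (s, t) in enumerate(zip(starts, targets), start=1):
    print(f"Move {i}: {chebyshev(s, t)} squares")  # expected: 3, 3, 4, 4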
Exercise 4 Understanding Minkowski Distance
¶
Introduction
¶
The Minkowski distance is a metric in a normed vector space that generalizes both the Euclidean and the Manhattan distances. For two points x and y in an n-dimensional space it is defined as:
$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$
Where:
• $x_i$ and $y_i$ are the i-th coordinates of points x and y, respectively.
• $p$ is the order parameter. When $p = 2$, it becomes the Euclidean distance, and when $p = 1$, it is the Manhattan distance.
Objective
¶
In this exercise, we will:
1. Compute the Minkowski distance for different values of p using a sample dataset.
2. Visualize the results.
3. Interpret the significance of the Minkowski distance and its relation to other distance metrics.
In [10]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
# Sample Dataset
points = np.array([[2, 3], [3, 5]])
# Minkowski Distance Function
def minkowski_distance(p1, p2, p):
    return np.sum(np.abs(p1 - p2) ** p) ** (1/p)
# Calculate Minkowski Distance for p=1,2,3,4
p_values = [1, 2, 3, 4]
distances = [minkowski_distance(points[0], points[1], p) for p in p_values]
distances
Out[10]:
[3.0, 2.23606797749979, 2.080083823051904, 2.0305431848689306]
Explanation
¶
The code above first imports necessary libraries and defines a sample dataset of two points in a 2D space. We then define the Minkowski distance function and compute the
distance for different values of ( p ).
The resulting distances for different ( p ) values are:
In [11]:
for p, d in zip(p_values, distances):
print(f"For p = {p}, Minkowski Distance = {d:.2f}")
For p = 1, Minkowski Distance = 3.00
For p = 2, Minkowski Distance = 2.24
For p = 3, Minkowski Distance = 2.08
For p = 4, Minkowski Distance = 2.03
Visualization
¶
Let's visualize the Minkowski distance for different values of ( p ) to better understand its behavior.
In [12]:
plt.plot(p_values, distances, 'o-', color='blue')
plt.xlabel('Value of p')
plt.ylabel('Minkowski Distance')
plt.title('Minkowski Distance for Different p Values')
plt.grid(True)
plt.show()
Interpretation
¶
From the plot, we can observe that as the value of p increases, the Minkowski distance decreases and tends to stabilize. When p = 1, it represents the Manhattan distance, and when p = 2, it is the Euclidean distance. As p grows further, the metric is dominated by the largest coordinate difference and approaches the Chebyshev distance.
Conclusion
¶
The Minkowski distance is a versatile metric that encompasses other popular distance metrics like Euclidean and Manhattan. By adjusting the ( p ) value, we can control the sensitivity of the distance measure, making it adaptable to various applications.
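As a quick sanity check (assumed, not in the original notebook), scipy.spatial.distance.minkowski reproduces the values computed by the hand-rolled function above.
# Cross-check of the hand-rolled Minkowski function with scipy
import numpy as np
from scipy.spatial.distance import minkowski

a, b = np.array([2, 3]), np.array([3, 5])
for p in [1, 2, 3, 4]:
    print(f"p = {p}: {minkowski(a, b, p=p):.4f}")
# p = 1: 3.0000, p = 2: 2.2361, p = 3: 2.0801, p = 4: 2.0305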
Exercise 5 Understanding the Dissimilarity Matrix
¶
Introduction
¶
A dissimilarity matrix, also known as a distance matrix, is a square matrix that represents the pairwise dissimilarity between elements of a dataset. Each entry in the matrix represents the distance between two data points. The diagonal of the matrix always contains zeros, as the distance between a point and itself is zero.
Objective
¶
In this exercise, we will:
1. Compute the dissimilarity matrix for a sample dataset using the Euclidean distance.
2. Visualize the matrix using a heatmap.
3. Interpret the significance of the dissimilarity matrix in understanding the relationships between data points.
In [13]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import distance_matrix
# Sample Dataset
data = np.array([[2, 3], [3, 5], [5, 8], [8, 9], [7, 5]])
# Compute Dissimilarity Matrix
dissimilarity = distance_matrix(data, data)
dissimilarity
Out[13]:
array([[0. , 2.23606798, 5.83095189, 8.48528137, 5.38516481],
[2.23606798, 0. , 3.60555128, 6.40312424, 4. ],
[5.83095189, 3.60555128, 0. , 3.16227766, 3.60555128],
[8.48528137, 6.40312424, 3.16227766, 0. , 4.12310563],
[5.38516481, 4. , 3.60555128, 4.12310563, 0. ]])
Explanation
¶
The code above first imports the necessary libraries and defines a sample dataset of five points in a 2D space. We then compute the dissimilarity matrix using the distance_matrix function from the scipy.spatial module.
The resulting matrix represents the pairwise Euclidean distances between the data points.
In [14]:
sns.heatmap(dissimilarity, annot=True, cmap='YlGnBu', cbar=True)
plt.title('Dissimilarity Matrix Heatmap')
plt.show()
Interpretation
¶
The heatmap visualizes the dissimilarity matrix. The color intensity in the heatmap represents the magnitude of the distance between data points. Darker shades indicate
larger distances, while lighter shades indicate smaller distances.
From the heatmap, we can observe patterns and relationships between data points. For instance, points that are closer in the dataset will have a lighter shade in the heatmap, indicating a smaller distance between them.
Conclusion
¶
The dissimilarity matrix provides a comprehensive view of the pairwise distances between data points in a dataset. Visualizing this matrix as a heatmap offers insights into the relationships and patterns within the data, aiding in tasks like clustering and
data exploration.
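An equivalent route (an assumed alternative, not the notebook's code) is scipy.spatial.distance.pdist, which stores only the condensed upper triangle and can be expanded with squareform.
# Alternative sketch: the same dissimilarity matrix via pdist/squareform
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.array([[2, 3], [3, 5], [5, 8], [8, 9], [7, 5]])
condensed = pdist(data, metric='euclidean')  # the 10 unique pairwise distances
full = squareform(condensed)                 # expanded to a symmetric 5 x 5 matrix
print(np.round(full, 2))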
Exercise 6 Understanding the Hamming Distance
¶
Introduction
¶
Hamming distance is a measure of the difference between two strings of equal length. It is the number of positions at which the corresponding symbols in the two strings are different. In other words, it measures the minimum number of substitutions required to
change one string into the other.
Objective
¶
In this exercise, we will:
1. Compute the Hamming distance between pairs of strings in a sample dataset.
2. Visualize the distances using a bar chart.
3. Interpret the significance of the Hamming distance in understanding the similarity between strings.
In [15]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
# Sample Dataset
strings = ["101010", "100010", "111011", "101011", "110010"]
# Compute Hamming Distance
def hamming_distance(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
distances = [hamming_distance(strings[0], s) for s in strings]
distances
Out[15]:
[0, 1, 2, 1, 2]
Explanation
¶
The code above first imports the necessary libraries and defines a sample dataset of five binary strings. We then define a function hamming_distance
to compute the Hamming distance between two strings.
Using this function, we compute the Hamming distances between the first string in the
dataset and all other strings.
In [16]:
plt.bar(strings, distances, color='skyblue')
plt.xlabel('Strings')
plt.ylabel('Hamming Distance')
plt.title('Hamming Distance from the First String')
plt.show()
Interpretation
¶
The bar chart visualizes the Hamming distances between the first string and all other strings in the dataset. The height of each bar represents the number of positions at which the corresponding string differs from the first string.
From the chart, we can observe which strings are most similar to the first string based on their Hamming distance. A smaller distance indicates higher similarity.
Conclusion
¶
The Hamming distance provides a simple yet effective measure of similarity between strings. It is especially useful in applications like error detection and correction in coding theory.
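A small caveat worth noting (assumed sketch, not part of the original exercise): scipy.spatial.distance.hamming returns the fraction of differing positions, so it must be multiplied by the string length to recover the count used above.
# scipy's hamming returns a proportion; scale by length to get the count
from scipy.spatial.distance import hamming

s1, s2 = "101010", "111011"
u = [int(c) for c in s1]
v = [int(c) for c in s2]
print(hamming(u, v) * len(u))  # 2.0, matching the count-based definition above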
Exercise 7 Understanding the Jaccard Similarity for Categorical Data
¶
Introduction
¶
The Jaccard Similarity, also known as the Jaccard Index or Jaccard Coefficient, is a statistic used for comparing the similarity and diversity of sample sets. It is especially useful for categorical data. The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of two sets.
Mathematically, the Jaccard Similarity J for two sets A and B is given by:
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
In this exercise, we will compute the Jaccard Similarity for two sets of categorical data.
Problem Statement
¶
Given two sets of categorical data representing the preferences of two different groups
of people, compute the Jaccard Similarity to determine how similar the two groups are in terms of their preferences.
In [17]:
import numpy as np
import pandas as pd
# Generating a sample dataset
np.random.seed(42)
data = {
'Group1': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100),
'Group2': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[17]:
   Group1   Group2
0  Cherry   Cherry
1  Date     Banana
2  Apple    Banana
3  Cherry   Date
4  Cherry   Banana
To compute the Jaccard Similarity for the two groups, we will:
1. Determine the unique categories chosen by each group.
2. Compute the intersection of the two sets.
3. Compute the union of the two sets.
4. Calculate the Jaccard Similarity using the formula provided.
Let's proceed with these steps.
In [18]:
# Step 1: Determine the unique categories chosen by each group
group1_unique = set(df['Group1'].unique())
group2_unique = set(df['Group2'].unique())
# Step 2: Compute the intersection of the two sets
intersection = group1_unique.intersection(group2_unique)
# Step 3: Compute the union of the two sets
union = group1_unique.union(group2_unique)
# Step 4: Calculate the Jaccard Similarity
jaccard_similarity = len(intersection) / len(union)
jaccard_similarity
Out[18]:
1.0
The computed Jaccard Similarity gives us a measure of how similar the two groups are in terms of their categorical preferences. A value closer to 1 indicates high similarity, while a value closer to 0 indicates low similarity.
Interpretation
¶
The Jaccard Similarity value of 1.0 indicates that the two groups have identical sets of unique preferences: every category chosen by one group was also chosen by the other. This kind of comparison can be useful in scenarios such as recommendation systems, where understanding the similarity between different user groups can help in providing better recommendations.
Visualization
¶
To better understand the overlap between the two groups, let's visualize the unique categories chosen by each group and their intersection.
In [19]:
#!conda install -c conda-forge matplotlib-venn
In [20]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
# Plotting the unique categories for each group and their intersection
venn2_subsets = (len(group1_unique - group2_unique), len(group2_unique - group1_unique), len(intersection))
plt.figure(figsize=(8, 8))
venn2(subsets=venn2_subsets, set_labels=('Group1', 'Group2'))
plt.title("Venn Diagram of Preferences for Group1 and Group2")
plt.show()
The Venn diagram visually represents the unique preferences of each group and the common preferences between them. The overlapping region indicates the common categories chosen by both groups.
Conclusion
¶
Understanding the Jaccard Similarity for categorical data provides insights into the similarity between different sets of data. It is a powerful tool for comparing sets and has applications in various domains, including machine learning, data analysis, and recommendation systems.
Exercise 8 Understanding the Gower Distance for Mixed-Type Data
¶
Introduction
¶
When working with datasets that have a mix of numerical and categorical variables, traditional distance metrics like Euclidean or Manhattan might not be suitable. This is where the Gower Distance comes into play. It is specifically designed to handle mixed-type data by computing distances in a way that respects the nature of each variable.
The Gower Distance is computed by scaling each variable's contribution to the range [0, 1] (numerical variables by their range, categorical variables as a simple match/mismatch) and then averaging these per-variable dissimilarities.
Problem Statement
¶
Given a dataset with a mix of numerical and categorical attributes, compute the Gower
Distance between pairs of data points to determine their similarity.
In [21]:
import numpy as np
import pandas as pd
# Generating a sample dataset
np.random.seed(42)
data = {
'Age': np.random.randint(20, 60, 100),
'Income': np.random.randint(30000, 80000, 100),
'Fruit_Preference': np.random.choice(['Apple', 'Banana', 'Cherry'], 100),
'Is_Smoker': np.random.choice([True, False], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[21]:
   Age   Income   Fruit_Preference   Is_Smoker
0  58    53599    Banana             True
1  48    65222    Cherry             False
2  34    41837    Cherry             True
3  27    44039    Cherry             False
4  40    60818    Cherry             True
To compute the Gower Distance for the data points, we will:
1. Normalize numerical attributes to the range [0, 1].
2. Compute a dissimilarity matrix for each attribute.
3. Combine the dissimilarity matrices using a weighted average.
Let's proceed with these steps; a small hand-rolled sketch of the idea is shown below, before we switch to the gower package.
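The sketch below is a hand-rolled illustration of these steps for the first two rows of the sample data. The column ranges used for normalization are assumed round numbers, so the result only approximates what the gower package computes from the actual data.
# Hand-rolled Gower sketch for the first two rows (equal weights, assumed ranges)
age_range, income_range = 40, 50000   # assumed ranges for normalization

row_a = {'Age': 58, 'Income': 53599, 'Fruit_Preference': 'Banana', 'Is_Smoker': True}
row_b = {'Age': 48, 'Income': 65222, 'Fruit_Preference': 'Cherry', 'Is_Smoker': False}

d_age = abs(row_a['Age'] - row_b['Age']) / age_range              # range-normalized
d_income = abs(row_a['Income'] - row_b['Income']) / income_range  # range-normalized
d_fruit = 0.0 if row_a['Fruit_Preference'] == row_b['Fruit_Preference'] else 1.0  # mismatch
d_smoker = 0.0 if row_a['Is_Smoker'] == row_b['Is_Smoker'] else 1.0               # mismatch

gower_approx = (d_age + d_income + d_fruit + d_smoker) / 4
print(round(gower_approx, 3))  # roughly 0.62, close to the 0.6226 reported by the gower package below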
In [22]:
#!pip install gower
In [23]:
import gower
# Compute the Gower Distance matrix
gower_distance_matrix = gower.gower_matrix(df)
# Display a portion of the Gower Distance matrix
gower_distance_matrix[:5, :5]
Out[23]:
array([[0. , 0.6226343 , 0.46307787, 0.74686074, 0.40173846],
[0.6226343 , 0. , 0.45750707, 0.24128991, 0.32345995],
[0.46307787, 0.45750707, 0. , 0.30596074, 0.1340471 ],
[0.74686074, 0.24128991, 0.30596074, 0. , 0.41782996],
[0.40173846, 0.32345995, 0.1340471 , 0.41782996, 0. ]],
dtype=float32)
The Gower Distance matrix provides a pairwise distance between each data point in our dataset. Each value in the matrix represents the Gower Distance between two data
points, with smaller values indicating higher similarity.
Interpretation
¶
The Gower Distance is especially useful when we want to cluster or classify mixed-type
data. A smaller Gower Distance between two data points suggests that they are more similar in terms of both their numerical and categorical attributes.
Visualization
¶
To better understand the Gower Distance, let's visualize the distance matrix using a heatmap.
In [24]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
sns.heatmap(gower_distance_matrix, cmap='viridis', cbar=True)
plt.title("Gower Distance Heatmap")
plt.show()
The heatmap visually represents the Gower Distances between data points. Darker colors indicate smaller distances (higher similarity), while lighter colors indicate larger distances (lower similarity).
Conclusion
¶
The Gower Distance is a powerful metric for computing distances in mixed-type datasets. It provides a standardized way to compare data points with both numerical and categorical attributes, making it invaluable for clustering, classification, and other machine learning tasks on mixed-type data.
Exercise 9 Understanding the Jaccard Coefficient for Binary Data
¶
Introduction
¶
The Jaccard Coefficient, also known as the Jaccard Index or Jaccard Similarity, is a statistic used for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of two sets. For binary data,
it can be used to measure the similarity between two binary vectors.
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Where:
• A and B are two sets.
• $|A \cap B|$ is the size of the intersection of sets A and B.
• $|A \cup B|$ is the size of the union of sets A and B.
Problem Statement
¶
Given a dataset with binary attributes, compute the Jaccard Coefficient between pairs of data points to determine their similarity.
In [25]:
import numpy as np
import pandas as pd
# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
'Bought_Apple': np.random.choice([0, 1], 100),
'Bought_Banana': np.random.choice([0, 1], 100),
'Bought_Cherry': np.random.choice([0, 1], 100),
'Is_Vegetarian': np.random.choice([0, 1], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[25]:
   Bought_Apple   Bought_Banana   Bought_Cherry   Is_Vegetarian
0  0              0               0               0
1  1              1               1               0
2  0              1               0               0
3  0              1               0               1
4  0              1               1               1
To compute the Jaccard Coefficient for the data points, we will:
1. Calculate the intersection of the binary attributes.
2. Calculate the union of the binary attributes.
3. Use the Jaccard formula to compute the coefficient.
Let's proceed with these steps.
In [26]:
from sklearn.metrics import jaccard_score
# Compute the Jaccard Coefficient for the first two data points as an example
data_point_1 = df.iloc[0]
data_point_2 = df.iloc[1]
jaccard_coefficient = jaccard_score(data_point_1, data_point_2, average='macro')
jaccard_coefficient
Out[26]:
0.125
The computed Jaccard Coefficient provides a measure of similarity between the two binary vectors. A value of 1 indicates that the vectors are identical, while a value of 0 indicates no similarity.
Interpretation
¶
The Jaccard Coefficient is a measure of similarity between two binary vectors. It takes into account the presence and absence of attributes, making it a robust metric for binary data comparison.
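For context (an assumed aside, not part of the original notebook): the set-style Jaccard for binary vectors counts only positions where at least one vector is 1, whereas jaccard_score with average='macro' averages the per-class scores for the 0 class and the 1 class, which is why the two can differ.
# Set-style Jaccard for the first two rows, for comparison with jaccard_score
import numpy as np

a = np.array([0, 0, 0, 0])   # first sample row above
b = np.array([1, 1, 1, 0])   # second sample row above

intersection = np.sum((a == 1) & (b == 1))            # positions where both are 1
union = np.sum((a == 1) | (b == 1))                   # positions where at least one is 1
set_jaccard = intersection / union if union else 1.0  # convention for two all-zero vectors
print(set_jaccard)  # 0.0 here, versus 0.125 from average='macro' above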
Visualization
¶
To better understand the distribution of Jaccard Coefficients in our dataset, let's visualize the coefficients for all pairs of data points using a histogram.
In [27]:
jaccard_coefficients = []
# Compute Jaccard Coefficients for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = jaccard_score(df.iloc[i], df.iloc[j], average='macro')
        jaccard_coefficients.append(coef)
# Plotting the histogram
plt.hist(jaccard_coefficients, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Jaccard Coefficients")
plt.xlabel("Jaccard Coefficient")
plt.ylabel("Frequency")
plt.show()
The histogram visually represents the distribution of Jaccard Coefficients for all pairs of
data points in our dataset. This gives us an insight into how similar or dissimilar the data points are to each other.
Conclusion
¶
The Jaccard Coefficient is a powerful metric for comparing binary vectors. It provides a standardized way to measure the similarity between two sets, making it invaluable for tasks like clustering, classification, and other machine learning tasks on binary data.
Exercise 10 Understanding the Cosine Similarity for Binary Data
¶
Introduction
¶
Cosine similarity is a metric used to determine how similar two sets of data are. It measures the cosine of the angle between two vectors projected in a multi-
dimensional space. For binary data, it can be used to measure the similarity between two binary vectors.
$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$
Where:
• A and B are two vectors.
• $A \cdot B$ is the dot product of the vectors.
• $\|A\|$ and $\|B\|$ are the magnitudes (or lengths) of the vectors.
Problem Statement
In [28]:
import numpy as np
import pandas as pd
# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
'Bought_Apple': np.random.choice([0, 1], 100),
'Bought_Banana': np.random.choice([0, 1], 100),
'Bought_Cherry': np.random.choice([0, 1], 100),
'Is_Vegetarian': np.random.choice([0, 1], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[28]:
   Bought_Apple   Bought_Banana   Bought_Cherry   Is_Vegetarian
0  0              0               0               0
1  1              1               1               0
2  0              1               0               0
3  0              1               0               1
4  0              1               1               1
To compute the cosine similarity for the data points, we will:
1. Calculate the dot product of the binary attributes.
2. Calculate the magnitude of each binary vector.
3. Use the cosine similarity formula to compute the similarity.
Let's proceed with these steps.
In [29]:
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity for the first two data points as an example
data_point_1 = df.iloc[0].values.reshape(1, -1)
data_point_2 = df.iloc[1].values.reshape(1, -1)
cosine_sim = cosine_similarity(data_point_1, data_point_2)
cosine_sim[0][0]
Out[29]:
0.0
The computed cosine similarity provides a measure of similarity between the two binary vectors. A value close to 1 indicates that the vectors are very similar, while a value close to 0 indicates they are dissimilar. Here the result is 0.0 because the first customer's vector is all zeros, so it shares no non-zero attributes with the second.
Interpretation
¶
Cosine similarity is a measure of similarity between two non-zero vectors. By using the
cosine of the angle between them, we can determine how similar they are regardless of their size.
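A hand-rolled check of the formula (an assumed sketch, not the notebook's code) also makes the zero-vector caveat explicit: the first sample row is all zeros, its norm is zero, and the ratio is undefined, which is why the pair above is reported as 0.0.
# Cosine similarity from the dot-product formula, guarding the all-zero case
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return 0.0            # one vector is all zeros; the angle is undefined
    return float(a @ b / norm_product)

print(cosine_sim([0, 0, 0, 0], [1, 1, 1, 0]))  # 0.0, the first two sample rows
print(cosine_sim([0, 1, 0, 1], [0, 1, 1, 1]))  # ~0.82, sample rows 3 and 4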
Visualization
¶
To better understand the distribution of cosine similarities in our dataset, let's visualize
the similarities for all pairs of data points using a histogram.
In [30]:
cosine_similarities = []
# Compute cosine similarities for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = cosine_similarity(df.iloc[i].values.reshape(1, -1), df.iloc[j].values.reshape(1, -1))
        cosine_similarities.append(coef[0][0])
# Plotting the histogram
import matplotlib.pyplot as plt
plt.hist(cosine_similarities, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Cosine Similarities")
plt.xlabel("Cosine Similarity")
plt.ylabel("Frequency")
plt.show()
The histogram visually represents the distribution of cosine similarities for all pairs of data points in our dataset. This gives us an insight into how similar or dissimilar the data points are to each other.
Conclusion
¶
Cosine similarity is a powerful metric for comparing vectors, especially in high-
dimensional spaces. For binary data, it offers a robust way to measure the similarity between two sets of attributes, making it invaluable for tasks like clustering, classification, and other machine learning tasks on binary data.
Evaluating Clustering Methods
¶
Evaluating the results of clustering methods is critical to understand the quality and relevance of the clusters formed. Since clustering is unsupervised, assessing its effectiveness can be somewhat subjective. However, there are established metrics and
techniques to guide this evaluation, both when ground truth labels are available and when they aren't.
Internal Evaluation:
¶
Without ground truth labels, evaluate clustering based on the dataset's intrinsic properties.
Metrics:
¶
1. Silhouette Coefficient:
• Compares the similarity of data points to their own cluster against other clusters.
• Values range from -1 (incorrect clustering) to 1 (highly dense clustering), with 0 suggesting overlapping clusters.
2. Davies-Bouldin Index:
• A ratio of within-cluster and between-cluster distances.
• Lower values indicate better clustering.
3. Calinski-Harabasz Index:
• Compares between-cluster dispersion to within-cluster dispersion.
• Higher values indicate better-defined clusters.
4. Dunn Index:
• Ratio of the smallest distance between points in different clusters to the
largest intra-cluster distance.
• Higher values indicate better clustering.
Relative Evaluation:
¶
This involves comparing the results of clustering for different configurations or numbers of clusters.
Techniques:
¶
1. Elbow Method:
• Used with K-means to determine the optimal number of clusters.
• Plot the variance explained (or inertia) against the number of clusters. The "elbow" point, where the rate of decrease sharply changes, often indicates an optimal number of clusters.
Stability and Consistency:
¶
Evaluate the robustness of clusters by perturbing the dataset.
Techniques:
¶
1. Sub-sampling or Bootstrapping:
• Repeatedly sample subsets of the data and perform clustering.
• Examine the consistency of the clustering results.
2. Adding Noise:
• Introduce random noise to the data.
• Stable clusters should remain relatively unchanged.
Challenges and Considerations:
¶
1. Subjectivity: Without a definitive "correct" clustering, some evaluation aspects remain subjective.
2. Scale Sensitivity: Some metrics need data normalization or standardization.
3. Choice of Metric: Different metrics might give varied evaluations for the same clustering result.
In conclusion, while evaluating clustering methods, it is often beneficial to consider multiple metrics and, when possible, combine them with domain knowledge to get a comprehensive view of the clustering quality.
Exercise 11 Evaluating Clustering with Silhouette Coefficient
¶
Introduction
¶
The Silhouette Coefficient is a metric used to calculate the goodness of a clustering algorithm. Its value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
The formula for a single sample is:
$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
Where:
• $a(i)$ is the mean distance between the sample and all other points in the same cluster.
• $b(i)$ is the mean distance between the sample and all other points in the nearest cluster that the sample is not a part of.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Silhouette Coefficient. The goal is to understand how well the data has been clustered.
In [31]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and then evaluate the clustering using the Silhouette Coefficient.
The steps are as follows:
1. Apply KMeans clustering to the dataset.
2. Calculate the Silhouette Coefficient for the clustering.
3. Visualize the clusters.
4. Interpret the results.
In [32]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Silhouette Coefficient
sil_coeff = silhouette_score(X, predicted_clusters, metric='euclidean')
sil_coeff
Out[32]:
0.6819938690643478
The Silhouette Coefficient gives a perspective into the distance between the resulting clusters. More distant clusters lead to better clusterings.
Visualization
¶
Let's visualize the clusters formed by the KMeans algorithm.
In [33]:
plt.scatter(X[:, 0], X[:, 1], c=predicted_clusters, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("Clusters Formed by KMeans")
plt.show()
The visualization shows the clusters formed by the KMeans algorithm. The red 'X' markers represent the centroids of the clusters.
Interpretation
¶
The Silhouette Coefficient provides a succinct metric to measure how close each point in one cluster is to the points in the neighboring clusters. Its values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
In our exercise, the Silhouette Coefficient value suggests that the clustering is reasonably well done.
Conclusion
¶
The Silhouette Coefficient is an effective metric to evaluate the quality of clusters created by a clustering algorithm. It provides insight into the separation distance between the resulting clusters. More distant clusters lead to better clustering.
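As a possible follow-up (not in the original exercise), sklearn.metrics.silhouette_samples exposes the per-sample values behind the average, which can reveal individual points that sit near a cluster boundary.
# Per-sample silhouette values for the same KMeans clustering (assumed follow-up)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)

sample_scores = silhouette_samples(X, labels)
print("mean:", sample_scores.mean())            # matches silhouette_score on average
print("lowest 5:", np.sort(sample_scores)[:5])  # the least well-placed points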
Exercise 12 Evaluating Clustering with Davies-Bouldin Index
¶
Introduction
¶
The Davies-Bouldin Index (DBI) is a metric used to evaluate clustering algorithms. The index signifies the average 'similarity' ratio between each cluster and its most similar cluster. A lower Davies-Bouldin Index relates to a model with better separation between the clusters.
The formula for the Davies-Bouldin Index is:
$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$
Where:
• $s_i$ is the average distance of all points in cluster i to the centroid of cluster i.
• $d_{ij}$ is the distance between the centroids of clusters i and j.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Davies-Bouldin Index. The goal is to understand how well the data has been clustered.
In [34]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and then evaluate the clustering using the Davies-
Bouldin Index.
In [35]:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Davies-Bouldin Index
dbi = davies_bouldin_score(X, predicted_clusters)
dbi
Out[35]:
0.43756400782378396
The Davies-Bouldin Index gives a measure of the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering.
Interpretation
¶
The Davies-Bouldin Index provides a measure of the average similarity ratio of each cluster with its most similar cluster. Lower values of the index indicate better clustering, implying that the clusters are dense and well separated.
In our exercise, the Davies-Bouldin Index value suggests that the clustering has been done effectively, with a good separation between the clusters.
Conclusion
¶
The Davies-Bouldin Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and density of the resulting clusters. Lower values of the index are desirable as they indicate better clustering.
Exercise 13 Evaluating Clustering with Calinski-Harabasz Index
¶
Introduction
¶
The Calinski-Harabasz Index (CHI), also known as the Variance Ratio Criterion, is a metric used to evaluate clustering algorithms. The index is the ratio of the sum of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index relates to a model with better-defined clusters.
The formula for the Calinski-Harabasz Index is:
$CHI = \frac{\mathrm{tr}(B) / (k - 1)}{\mathrm{tr}(W) / (N - k)}$
Where:
• $B$ is the between-group dispersion matrix and $\mathrm{tr}(B)$ its trace.
• $W$ is the within-cluster dispersion matrix and $\mathrm{tr}(W)$ its trace.
• $k$ is the number of clusters.
• $N$ is the number of data points.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Calinski-Harabasz Index. The goal is to understand how well the data has been clustered.
In [36]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
In [37]:
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Calinski-Harabasz Index
chi = calinski_harabasz_score(X, predicted_clusters)
chi
Out[37]:
1210.0899142587818
The Calinski-Harabasz Index gives a measure of the ratio of the sum of between-
cluster dispersion to within-cluster dispersion. Higher values indicate better clustering.
Interpretation
¶
The Calinski-Harabasz Index provides a measure of the ratio of the sum of between-
cluster dispersion to within-cluster dispersion. Higher values of the index indicate better clustering, implying that the clusters are dense and well separated.
In our exercise, the Calinski-Harabasz Index value suggests that the clustering has been done effectively, with a good separation between the clusters.
Conclusion
¶
The Calinski-Harabasz Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and density of the resulting clusters. Higher values of the index are desirable as they indicate better clustering.
Exercise 14 Evaluating Clustering with Dunn Index
¶
Introduction
¶
The Dunn Index is a metric used to determine the compactness and separation of clusters. A higher Dunn Index indicates better clustering. The index is defined as the ratio between the minimum inter-cluster distances to the maximum intra-cluster diameter.
The formula for the Dunn Index is:
$DI = \frac{\min(\text{distance between clusters})}{\max(\text{diameter of clusters})}$
Where:
• The distance between two clusters is the smallest distance between a point in one cluster and a point in the other.
• The diameter of a cluster is the distance between its two furthest points.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Dunn Index. The goal is to understand how well the data has been clustered.
In [38]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
In [39]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
import numpy as np
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Dunn Index
def dunn_index(X, labels):
    pairwise_dists = pairwise_distances(X)
    min_intercluster_distance = np.min([pairwise_dists[labels == i][:, labels == j].min() for i in np.unique(labels) for j in np.unique(labels) if i != j])
    max_diameter = max([np.max(pairwise_distances(X[labels == i])) for i in np.unique(labels)])
    return min_intercluster_distance / max_diameter
di = dunn_index(X, predicted_clusters)
di
Out[39]:
0.20231427477727162
The Dunn Index provides a measure of the compactness and separation of the clusters. A higher Dunn Index indicates better clustering.
Interpretation
¶
The Dunn Index provides a measure of the compactness and separation of the clusters. A higher Dunn Index suggests that the clusters are compact and well-
separated from each other.
In our exercise, the Dunn Index value suggests that the clustering has been done effectively, with a good separation between the clusters.
Conclusion
¶
The Dunn Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and compactness of the resulting clusters. Higher values of the index are desirable as they indicate better clustering.
Exercise 15 Validating Clustering using the Elbow Method
¶
Introduction
¶
The Elbow Method is a heuristic used in determining the number of clusters in a dataset. The idea is to run k-means clustering on the dataset for a range of values of k (e.g., k from 1 to 10), and then for each value of k compute the sum of squared distances from each point to its assigned center. When these overall dispersions are plotted against k values, the "elbow" of the curve represents an optimal value for k (a balance between precision and computational cost).
Problem Statement
¶
Given a dataset, we aim to find the optimal number of clusters into which the data may be clustered. The goal is to understand the concept of the Elbow Method and how
it can be used to determine the optimal number of clusters for a given dataset.
In [40]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset for a range of k values and compute the sum of squared distances for each value of k. We will then plot these values to visualize the "elbow" and determine the optimal number of clusters.
The steps are as follows:
1. Apply KMeans clustering to the dataset for a range of k values.
2. Compute the sum of squared distances for each k.
3. Plot the sum of squared distances against k values.
4. Determine the "elbow" or point of inflection on the curve.
In [41]:
from sklearn.cluster import KMeans
# Applying KMeans clustering for a range of k values
wcss = [] # Within-Cluster-Sum-of-Squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plotting the Elbow Method graph
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The Elbow Method graph shows the sum of squared distances (WCSS) for different values of k. As k increases, WCSS keeps decreasing, approaching zero as k approaches the number of data points. The location of the "elbow" on the curve is generally considered an indicator of the appropriate number of clusters.
Interpretation
¶
From the graph, we can observe that the "elbow" is formed when the number of clusters is around 4. This suggests that the optimal number of clusters for this dataset is 4.
Conclusion
¶
The Elbow Method is a powerful technique to determine the optimal number of clusters
for a dataset. It provides a visual representation of how the sum of squared distances changes with different numbers of clusters, helping in making an informed decision about the number of clusters to use.
Exercise 16 Validating Clustering using Cohesion
¶
Introduction
¶
Cohesion measures the closeness of the data points within the same cluster. It is the total distance of each sample to its cluster centroid. Lower values of cohesion indicate that the data points are closer to the centroids of their respective clusters, which implies better clustering.
Problem Statement
¶
Given a dataset, we aim to find the cohesion value after applying the KMeans clustering algorithm. The goal is to understand the concept of cohesion and how it can be used to evaluate the quality of clustering.
In [42]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and compute the cohesion value.
The steps are as follows:
1. Apply KMeans clustering to the dataset.
2. Compute the cohesion value.
3. Interpret the result.
In [43]:
from sklearn.cluster import KMeans
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)
# Computing the cohesion value
cohesion = kmeans.inertia_
cohesion
Out[43]:
212.00599621083478
Interpretation
¶
The cohesion value represents the sum of squared distances of samples to their closest cluster center. A lower cohesion value indicates that the data points are closer to the centroids of their respective clusters, suggesting better clustering. However, it's essential to balance cohesion with other metrics and the problem's context to
determine the best number of clusters.
Conclusion
¶
Cohesion is a valuable metric to evaluate the quality of clustering. It provides a measure of how close the data points are to their respective cluster centroids. By analyzing cohesion and other metrics, we can make informed decisions about the clustering process and its effectiveness.
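As a sanity check (an assumed sketch, not part of the original exercise), the cohesion value can be reproduced by hand as the sum of squared distances from each point to its assigned centroid, which is exactly what kmeans.inertia_ stores.
# Hand-computed cohesion should match kmeans.inertia_ up to floating-point error
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42).fit(X)

assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]  # centroid of each point's cluster
manual_cohesion = np.sum((X - assigned_centroids) ** 2)
print(manual_cohesion, kmeans.inertia_)                       # the two values agree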
Exercise 17 Validating Clustering using Separation Score
¶
Introduction
¶
Separation measures how distinct or well-separated the clusters are from each other. It is commonly summarized by the spread of the cluster centroids, for example their total variance or the distances between them. Higher values of separation indicate that the clusters are well-separated, which implies better clustering.
Problem Statement
¶
Given a dataset, we aim to find the separation score after applying the KMeans clustering algorithm. The goal is to understand the concept of separation and how it can be used to evaluate the quality of clustering.
In [44]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and compute the separation score.
The steps are as follows:
1. Apply KMeans clustering to the dataset.
2. Compute the separation score.
3. Interpret the result.
In [45]:
from sklearn.cluster import KMeans
import numpy as np
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)
# Computing the separation score
centroids = kmeans.cluster_centers_
separation = np.sum(np.var(centroids, axis=0))
separation
Out[45]:
8.66710533030799
Interpretation
¶
The separation score represents the variance between cluster centroids. A higher separation score indicates that the clusters are well-separated from each other, suggesting better clustering. However, it's essential to balance separation with other metrics and the problem's context to determine the best number of clusters.
Conclusion
¶
Separation is a valuable metric to evaluate the quality of clustering. It provides a measure of how distinct or well-separated the clusters are from each other. By analyzing separation and other metrics, we can make informed decisions about the clustering process and its effectiveness.
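The exercise above summarizes separation as the variance of the centroids; an alternative formulation (an assumed sketch, not the notebook's method) looks directly at the pairwise distances between centroids.
# Alternative separation summary: pairwise distances between KMeans centroids
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42).fit(X)

centroid_dists = pdist(kmeans.cluster_centers_)              # the 6 pairwise centroid distances
print("mean centroid distance:", centroid_dists.mean())
print("smallest centroid distance:", centroid_dists.min())   # the two closest clusters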
Revised Date: October 23, 2023
¶
In [ ]: