IE6400_Day18
IE6400 Foundations of Data Analytics Engineering
¶
Fall 2023
¶
Module 3: Clustering Methods Part -1
¶
Proximity Measures
¶
Proximity measures are metrics used to determine the similarity or dissimilarity between data points. They play a crucial role in various machine learning and data analysis techniques, especially clustering and classification. The choice of a proximity measure often depends on the nature of the data and the specific problem at hand.
Types of Proximity Measures
¶
There are two main types of proximity measures:
1. Similarity Measures: These quantify how similar two data points are. Higher values indicate greater similarity.
2. Dissimilarity Measures (or Distance Measures): These represent the "distance" or dissimilarity between two data points. Higher values indicate greater dissimilarity.
Common Proximity Measures
¶
For Continuous Data:
¶
1. Euclidean Distance:
• It's the "ordinary" straight-line distance between two points in Euclidean space.
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
2. Manhattan Distance (or L1 norm):
• It's the distance between two points measured along axes at right angles (taxicab or city block distance).
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
3. Minkowski Distance:
• A generalized metric. When $p = 2$, it becomes the Euclidean distance; when $p = 1$, it is the Manhattan distance.
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$
For Categorical Data:
¶
1. Hamming Distance:
• Used for categorical variables. It is the number of positions at which the corresponding symbols in two strings of equal length differ.
2. Jaccard Similarity:
• Measures the similarity between two sets: the size of the intersection divided by the size of the union.
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
For Mixed-Type Data:
¶
1. Gower Distance:
• Combines various distance metrics for mixed-type data.
For Binary Data:
¶
1. Jaccard Coefficient:
• Similar to the Jaccard similarity but specifically tailored for binary attributes.
2. Cosine Similarity:
• Measures the cosine of the angle between two non-zero vectors. It is often used in text analysis to determine the similarity between documents.
$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$
Choosing the Right Proximity Measure
¶
The choice of proximity measure depends on:
• Nature of Data: Continuous, categorical, binary, or mixed.
• Domain Knowledge: The context in which data points are being compared.
• Problem Specifics: Certain problems may necessitate specific measures.
In general, it is crucial to understand the data and the problem's requirements before choosing a proximity measure.
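Most of the measures above are available directly in SciPy's scipy.spatial.distance module. The short sketch below is illustrative only (the toy vectors are made up, not drawn from the exercises that follow) and shows one way each measure could be computed in practice.
# Illustrative sketch: common proximity measures via scipy (toy vectors)
import numpy as np
from scipy.spatial import distance

x = np.array([5, 3, 9])
y = np.array([2, 8, 1])

print("Euclidean:", distance.euclidean(x, y))             # straight-line distance
print("Manhattan:", distance.cityblock(x, y))             # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))  # generalized metric

# Binary / categorical examples
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 1, 0, 0], dtype=bool)
print("Hamming count:", distance.hamming(a, b) * len(a))  # scipy returns a fraction
print("Jaccard similarity:", 1 - distance.jaccard(a, b))  # scipy returns dissimilarity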
Exercise 1 Understanding Euclidean Distance
¶
Introduction
¶
Euclidean distance is a measure of the straight line distance between two points in Euclidean space. It's a fundamental concept in mathematics and has wide applications in machine learning, especially in clustering algorithms like K-Means.
Problem Statement
¶
Imagine you are working on a recommendation system for a retail store. You have data on the purchase history of customers for two products: A and B. You want to find out how similar two customers are based on their purchase patterns of these two products. One way to measure this similarity is by calculating the Euclidean distance between their purchase histories.
Given the purchase data for two products, can you compute the Euclidean distance between two customers?
Dataset
¶
For simplicity, let's consider a small dataset representing the number of units of products A and B bought by different customers:
Customer   Product A   Product B
1          5           3
2          2           8
3          9           1
4          4           7
In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
customers = np.array([[5, 3], [2, 8], [9, 1], [4, 7]])
# Visualize the data
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Purchase History of Customers')
plt.grid(True)
plt.show()
From the scatter plot, we can visualize the purchase patterns of the four customers. Each point represents a customer's purchase history for products A and B.
Euclidean Distance
¶
The Euclidean distance between two points ( $P(x_1, y_1)$ ) and ( $Q(x_2, y_2)$ ) in a 2D plane is given by:
$d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Let's compute the Euclidean distance between Customer 1 and Customer 2.
In [2]:
# Define a function to compute Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))
# Calculate the distance between Customer 1 and Customer 2
distance = euclidean_distance(customers[0], customers[1])
print(f"Euclidean Distance between Customer 1 and Customer 2: {distance:.2f}")
Euclidean Distance between Customer 1 and Customer 2: 5.83
The computed distance gives us a measure of how similar or dissimilar the two customers are based on their purchase patterns. A smaller distance indicates similar purchase behaviors, while a larger distance indicates dissimilarity.
Visualization
¶
To better understand the concept, let's visualize the Euclidean distance between Customer 1 and Customer 2 on our scatter plot.
In [3]:
# Visualize the data with the Euclidean distance
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.plot([customers[0][0], customers[1][0]], [customers[0][1], customers[1][1]], 'ro-')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Euclidean Distance between Customer 1 and Customer 2')
plt.grid(True)
plt.show()
The red line represents the Euclidean distance between Customer 1 and Customer 2. As we can see, despite both customers purchasing different quantities of products A and B, we can quantify their similarity using this distance.
Conclusion
¶
Euclidean distance is a powerful metric to measure the similarity between data points. In this exercise, we applied it to a retail scenario, but its applications are vast, spanning areas like clustering, image processing, and more.
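As a possible follow-up (not part of the original exercise), the same calculation extends to all four customers at once; scipy.spatial.distance.cdist is one way to build the full pairwise distance matrix.
# Hypothetical extension: pairwise Euclidean distances for all four customers
import numpy as np
from scipy.spatial.distance import cdist

customers = np.array([[5, 3], [2, 8], [9, 1], [4, 7]])
pairwise = cdist(customers, customers, metric='euclidean')
print(np.round(pairwise, 2))
# Entry (i, j) is the distance between customers i+1 and j+1; the diagonal is
# zero, and entry (0, 1) reproduces the 5.83 computed above.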
Exercise 2 Understanding Manhattan Distance
¶
Introduction
¶
Manhattan distance, also known as L1 distance or taxicab distance, is a measure of distance between two points in a grid-based path. It is computed as the sum of the absolute differences of their coordinates. Unlike the Euclidean distance, which measures the shortest path (a straight line), the Manhattan distance measures the distance "traveled" on a grid.
Problem Statement
¶
Imagine you are a taxi driver in a city where the roads are laid out in a perfect grid. You pick up a passenger at one intersection and need to drop them off at another. The Manhattan distance gives you the total number of blocks you'd need to drive, in a straight line horizontally and/or vertically, to reach the destination.
Given the coordinates of the starting and ending intersections, can you compute the Manhattan distance between them?
Dataset
¶
For simplicity, let's consider a small dataset representing the starting and ending coordinates of a few taxi rides:
Ride   Start (x, y)   End (x, y)
1      (2, 3)         (5, 6)
2      (1, 4)         (4, 2)
3      (3, 3)         (3, 7)
4      (6, 1)         (2, 5)
In [4]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
rides = {
'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
'End': [(5, 6), (4, 2), (3, 7), (2, 5)]
}
# Visualize the data
for start, end in zip(rides['Start'], rides['End']):
    plt.plot([start[0], end[0]], [start[1], end[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Taxi Rides')
plt.grid(True)
plt.show()
From the plot, we can visualize the starting and ending points of the taxi rides. Each line segment represents a ride.
Manhattan Distance
¶
The Manhattan distance between two points ( $P(x_1, y_1)$ ) and ( $Q(x_2, y_2)$ ) in a
2D plane is given by:
$d(P, Q) = |x_2 - x_1| + |y_2 - y_1|$
Let's compute the Manhattan distance for Ride 1.
In [5]:
# Define a function to compute Manhattan distance
def manhattan_distance(point1, point2):
    return abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])
# Calculate the distance for Ride 1
distance = manhattan_distance(rides['Start'][0], rides['End'][0])
print(f"Manhattan Distance for Ride 1: {distance} blocks")
Manhattan Distance for Ride 1: 6 blocks
The computed distance gives us the total number of blocks the taxi would need to drive to get from the starting to the ending intersection for Ride 1.
Visualization
¶
To better understand the concept, let's visualize the Manhattan distance for Ride 1 on our plot.
In [6]:
# Visualize the data with the Manhattan distance for Ride 1
start, end = rides['Start'][0], rides['End'][0]
plt.plot([start[0], end[0]], [start[1], start[1]], 'bo-')
plt.plot([end[0], end[0]], [start[1], end[1]], 'bo-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Manhattan Distance for Ride 1')
plt.grid(True)
plt.show()
The blue lines represent the Manhattan distance for Ride 1. As we can see, the taxi
would first travel horizontally and then vertically to reach the destination, covering a total distance equal to the sum of the lengths of these two segments.
Conclusion
¶
Manhattan distance is a useful metric in grid-based environments, such as urban city blocks. In this exercise, we applied it to a taxi scenario, but its applications are vast, spanning areas like computer vision, robotics, and more.
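As an optional cross-check (not in the original exercise), the same distances can be computed for every ride at once with NumPy, or with SciPy's cityblock function.
# Sketch: Manhattan distance for all rides, plus a scipy cross-check for Ride 1
import numpy as np
from scipy.spatial.distance import cityblock

starts = np.array([(2, 3), (1, 4), (3, 3), (6, 1)])
ends = np.array([(5, 6), (4, 2), (3, 7), (2, 5)])

ride_distances = np.abs(starts - ends).sum(axis=1)  # sum of absolute differences per ride
print(ride_distances)                               # [6 5 4 8]
print(cityblock(starts[0], ends[0]))                # 6, matching Ride 1 above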
Exercise 3 Understanding Chebyshev Distance
¶
Introduction
¶
Chebyshev distance, also known as the maximum value distance, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It's often used in chess to calculate the minimum number of moves a king can take to go from one square to another.
Problem Statement
¶
Imagine you are playing a game on a grid, and you have a piece that can move horizontally, vertically, or diagonally, but only one square at a time. How many moves would it take for your piece to go from its starting position to a target position?
Given the coordinates of the starting and target positions, can you compute the Chebyshev distance between them?
Dataset
¶
For simplicity, let's consider a small dataset representing the starting and target positions of a few game moves:
Move   Start (x, y)   Target (x, y)
1      (2, 3)         (5, 6)
2      (1, 4)         (4, 2)
3      (3, 3)         (3, 7)
4      (6, 1)         (2, 5)
In [7]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the dataset
moves = {
'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
'Target': [(5, 6), (4, 2), (3, 7), (2, 5)]
}
# Visualize the data
for start, target in zip(moves['Start'], moves['Target']):
    plt.plot([start[0], target[0]], [start[1], target[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Game Moves')
plt.grid(True)
plt.show()
From the plot, we can visualize the starting and target positions of the game moves. Each line segment represents a move.
Chebyshev Distance
¶
The Chebyshev distance between two points ( $P(x_1, y_1)$ ) and ( $Q(x_2, y_2)$ ) in a 2D plane is given by:
$d(P, Q) = \max(|x_2 - x_1|, |y_2 - y_1|)$
Let's compute the Chebyshev distance for Move 1.
In [8]:
# Define a function to compute Chebyshev distance
def chebyshev_distance(point1, point2):
    return max(abs(point1[0] - point2[0]), abs(point1[1] - point2[1]))
# Calculate the distance for Move 1
distance = chebyshev_distance(moves['Start'][0], moves['Target'][0])
print(f"Chebyshev Distance for Move 1: {distance} squares")
Chebyshev Distance for Move 1: 3 squares
The computed distance gives us the minimum number of moves required for the piece to go from the starting to the target position for Move 1.
Visualization
¶
To better understand the concept, let's visualize the Chebyshev distance for Move 1 on
our plot.
In [9]:
# Visualize the data with the Chebyshev distance for Move 1
start, target = moves['Start'][0], moves['Target'][0]
plt.scatter(*zip(*[start, target]), color=['blue', 'red'])
plt.plot([start[0], target[0]], [start[1], start[1]], 'g--')
plt.plot([target[0], target[0]], [start[1], target[1]], 'g--')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Chebyshev Distance for Move 1')
plt.grid(True)
plt.show()
The green dashed lines represent the possible paths the piece can take to reach the target position using the Chebyshev distance. As we can see, the piece would move either horizontally or vertically (whichever is greater) to reach the target position.
Conclusion
¶
Chebyshev distance is a useful metric in grid-based games or scenarios, especially when diagonal movements are allowed. In this exercise, we applied it to a game scenario, but its applications can be found in various fields like robotics, pathfinding algorithms, and more.
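For reference (a small assumed cross-check, not part of the original exercise), SciPy's chebyshev function reproduces the same values for all four moves.
# Sketch: Chebyshev distance for every move via scipy
from scipy.spatial.distance import chebyshev

starts = [(2, 3), (1, 4), (3, 3), (6, 1)]
targets = [(5, 6), (4, 2), (3, 7), (2, 5)]

for i, (s, t) in enumerate(zip(starts, targets), start=1):
    print(f"Move {i}: {chebyshev(s, t)} squares")  # expected: 3, 3, 4, 4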
Exercise 4 Understanding Minkowski Distance
¶
Introduction
¶
The Minkowski distance is a metric in a normed vector space that generalizes both the Euclidean and the Manhattan distances. For two points x and y in an n-dimensional space it is defined as:
$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$
Where:
• $x_i$ and $y_i$ are the i-th coordinates of points x and y, respectively.
• $p$ is the order parameter. When $p = 2$, it becomes the Euclidean distance, and when $p = 1$, it is the Manhattan distance.
Objective
¶
In this exercise, we will:
1. Compute the Minkowski distance for different values of p using a sample dataset.
2. Visualize the results.
3. Interpret the significance of the Minkowski distance and its relation to other distance metrics.
In [10]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
# Sample Dataset
points = np.array([[2, 3], [3, 5]])
# Minkowski Distance Function
def minkowski_distance(p1, p2, p):
    return np.sum(np.abs(p1 - p2) ** p) ** (1/p)
# Calculate Minkowski Distance for p=1,2,3,4
p_values = [1, 2, 3, 4]
distances = [minkowski_distance(points[0], points[1], p) for p in p_values]
distances
Out[10]:
[3.0, 2.23606797749979, 2.080083823051904, 2.0305431848689306]
Explanation
¶
The code above first imports necessary libraries and defines a sample dataset of two points in a 2D space. We then define the Minkowski distance function and compute the
distance for different values of ( p ).
The resulting distances for different ( p ) values are:
In [11]:
for p, d in zip(p_values, distances):
print(f"For p = {p}, Minkowski Distance = {d:.2f}")
For p = 1, Minkowski Distance = 3.00
For p = 2, Minkowski Distance = 2.24
For p = 3, Minkowski Distance = 2.08
For p = 4, Minkowski Distance = 2.03
Visualization
¶
Let's visualize the Minkowski distance for different values of ( p ) to better understand its behavior.
In [12]:
plt.plot(p_values, distances, 'o-', color='blue')
plt.xlabel('Value of p')
plt.ylabel('Minkowski Distance')
plt.title('Minkowski Distance for Different p Values')
plt.grid(True)
plt.show()
Interpretation
¶
From the plot, we can observe that as the value of p increases, the Minkowski distance decreases and tends to stabilize. When p = 1, it represents the Manhattan distance, and when p = 2, it is the Euclidean distance. As p grows further, the metric is dominated by the largest coordinate difference and approaches the Chebyshev distance.
Conclusion
¶
The Minkowski distance is a versatile metric that encompasses other popular distance metrics like Euclidean and Manhattan. By adjusting the ( p ) value, we can control the sensitivity of the distance measure, making it adaptable to various applications.
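As a quick sanity check (assumed, not in the original notebook), scipy.spatial.distance.minkowski reproduces the values computed by the hand-rolled function above.
# Cross-check of the hand-rolled Minkowski function with scipy
import numpy as np
from scipy.spatial.distance import minkowski

a, b = np.array([2, 3]), np.array([3, 5])
for p in [1, 2, 3, 4]:
    print(f"p = {p}: {minkowski(a, b, p=p):.4f}")
# p = 1: 3.0000, p = 2: 2.2361, p = 3: 2.0801, p = 4: 2.0305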
Exercise 5 Understanding the Dissimilarity Matrix
¶
Introduction
¶
A dissimilarity matrix, also known as a distance matrix, is a square matrix that represents the pairwise dissimilarity between elements of a dataset. Each entry in the matrix represents the distance between two data points. The diagonal of the matrix always contains zeros, as the distance between a point and itself is zero.
Objective
¶
In this exercise, we will:
1. Compute the dissimilarity matrix for a sample dataset using the Euclidean distance.
2. Visualize the matrix using a heatmap.
3. Interpret the significance of the dissimilarity matrix in understanding the relationships between data points.
In [13]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import distance_matrix
# Sample Dataset
data = np.array([[2, 3], [3, 5], [5, 8], [8, 9], [7, 5]])
# Compute Dissimilarity Matrix
dissimilarity = distance_matrix(data, data)
dissimilarity
Out[13]:
array([[0. , 2.23606798, 5.83095189, 8.48528137, 5.38516481],
[2.23606798, 0. , 3.60555128, 6.40312424, 4. ],
[5.83095189, 3.60555128, 0. , 3.16227766, 3.60555128],
[8.48528137, 6.40312424, 3.16227766, 0. , 4.12310563],
[5.38516481, 4. , 3.60555128, 4.12310563, 0. ]])
Explanation
¶
The code above first imports the necessary libraries and defines a sample dataset of five points in a 2D space. We then compute the dissimilarity matrix using the distance_matrix function from the scipy.spatial module.
The resulting matrix represents the pairwise Euclidean distances between the data points.
In [14]:
sns.heatmap(dissimilarity, annot=True, cmap='YlGnBu', cbar=True)
plt.title('Dissimilarity Matrix Heatmap')
plt.show()
Interpretation
¶
The heatmap visualizes the dissimilarity matrix. The color intensity in the heatmap represents the magnitude of the distance between data points. Darker shades indicate
larger distances, while lighter shades indicate smaller distances.
From the heatmap, we can observe patterns and relationships between data points. For instance, points that are closer in the dataset will have a lighter shade in the heatmap, indicating a smaller distance between them.
Conclusion
¶
The dissimilarity matrix provides a comprehensive view of the pairwise distances between data points in a dataset. Visualizing this matrix as a heatmap offers insights into the relationships and patterns within the data, aiding in tasks like clustering and
data exploration.
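An equivalent route (an assumed alternative, not the notebook's code) is scipy.spatial.distance.pdist, which stores only the condensed upper triangle and can be expanded with squareform.
# Alternative sketch: the same dissimilarity matrix via pdist/squareform
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.array([[2, 3], [3, 5], [5, 8], [8, 9], [7, 5]])
condensed = pdist(data, metric='euclidean')  # the 10 unique pairwise distances
full = squareform(condensed)                 # expanded to a symmetric 5 x 5 matrix
print(np.round(full, 2))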
Exercise 6 Understanding the Hamming Distance
¶
Introduction
¶
Hamming distance is a measure of the difference between two strings of equal length. It is the number of positions at which the corresponding symbols in the two strings are different. In other words, it measures the minimum number of substitutions required to
change one string into the other.
Objective
¶
In this exercise, we will:
1. Compute the Hamming distance between pairs of strings in a sample dataset.
2. Visualize the distances using a bar chart.
3. Interpret the significance of the Hamming distance in understanding the similarity between strings.
In [15]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
# Sample Dataset
strings = ["101010", "100010", "111011", "101011", "110010"]
# Compute Hamming Distance
def hamming_distance(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
distances = [hamming_distance(strings[0], s) for s in strings]
distances
Out[15]:
[0, 1, 2, 1, 2]
Explanation
¶
The code above first imports the necessary libraries and defines a sample dataset of five binary strings. We then define a function hamming_distance
to compute the Hamming distance between two strings.
Using this function, we compute the Hamming distances between the first string in the
dataset and all other strings.
In [16]:
plt.bar(strings, distances, color='skyblue')
plt.xlabel('Strings')
plt.ylabel('Hamming Distance')
plt.title('Hamming Distance from the First String')
plt.show()
Interpretation
¶
The bar chart visualizes the Hamming distances between the first string and all other strings in the dataset. The height of each bar represents the number of positions at which the corresponding string differs from the first string.
From the chart, we can observe which strings are most similar to the first string based on their Hamming distance. A smaller distance indicates higher similarity.
Conclusion
¶
The Hamming distance provides a simple yet effective measure of similarity between strings. It is especially useful in applications like error detection and correction in coding theory.
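A small caveat worth noting (assumed sketch, not part of the original exercise): scipy.spatial.distance.hamming returns the fraction of differing positions, so it must be multiplied by the string length to recover the count used above.
# scipy's hamming returns a proportion; scale by length to get the count
from scipy.spatial.distance import hamming

s1, s2 = "101010", "111011"
u = [int(c) for c in s1]
v = [int(c) for c in s2]
print(hamming(u, v) * len(u))  # 2.0, matching the count-based definition above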
Exercise 7 Understanding the Jaccard Similarity for Categorical Data
¶
Introduction
¶
The Jaccard Similarity, also known as the Jaccard Index or Jaccard Coefficient, is a statistic used for comparing the similarity and diversity of sample sets. It is especially useful for categorical data. The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of two sets.
Mathematically, the Jaccard Similarity J for two sets A and B is given by:
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
In this exercise, we will compute the Jaccard Similarity for two sets of categorical data.
Problem Statement
¶
Given two sets of categorical data representing the preferences of two different groups
of people, compute the Jaccard Similarity to determine how similar the two groups are in terms of their preferences.
In [17]:
import numpy as np
import pandas as pd
# Generating a sample dataset
np.random.seed(42)
data = {
'Group1': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100),
'Group2': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[17]:
   Group1   Group2
0  Cherry   Cherry
1  Date     Banana
2  Apple    Banana
3  Cherry   Date
4  Cherry   Banana
To compute the Jaccard Similarity for the two groups, we will:
1. Determine the unique categories chosen by each group.
2. Compute the intersection of the two sets.
3. Compute the union of the two sets.
4. Calculate the Jaccard Similarity using the formula provided.
Let's proceed with these steps.
In [18]:
# Step 1: Determine the unique categories chosen by each group
group1_unique = set(df['Group1'].unique())
group2_unique = set(df['Group2'].unique())
# Step 2: Compute the intersection of the two sets
intersection = group1_unique.intersection(group2_unique)
# Step 3: Compute the union of the two sets
union = group1_unique.union(group2_unique)
# Step 4: Calculate the Jaccard Similarity
jaccard_similarity = len(intersection) / len(union)
jaccard_similarity
Out[18]:
1.0
The computed Jaccard Similarity gives us a measure of how similar the two groups are in terms of their categorical preferences. A value closer to 1 indicates high similarity, while a value closer to 0 indicates low similarity.
Interpretation
¶
The Jaccard Similarity value of 1.0 indicates that the two groups have identical sets of unique preferences: every category chosen by one group was also chosen by the other. This kind of comparison can be useful in scenarios such as recommendation systems, where understanding the similarity between different user groups can help in providing better recommendations.
Visualization
¶
To better understand the overlap between the two groups, let's visualize the unique categories chosen by each group and their intersection.
In [19]:
#!conda install -c conda-forge matplotlib-venn
In [20]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
# Plotting the unique categories for each group and their intersection
venn2_subsets = (len(group1_unique - group2_unique), len(group2_unique - group1_unique), len(intersection))
plt.figure(figsize=(8, 8))
venn2(subsets=venn2_subsets, set_labels=('Group1', 'Group2'))
plt.title("Venn Diagram of Preferences for Group1 and Group2")
plt.show()
The Venn diagram visually represents the unique preferences of each group and the common preferences between them. The overlapping region indicates the common categories chosen by both groups.
Conclusion
¶
Understanding the Jaccard Similarity for categorical data provides insights into the similarity between different sets of data. It is a powerful tool for comparing sets and has applications in various domains, including machine learning, data analysis, and recommendation systems.
Exercise 8 Understanding the Gower Distance for Mixed-Type Data
¶
Introduction
¶
When working with datasets that have a mix of numerical and categorical variables, traditional distance metrics like Euclidean or Manhattan might not be suitable. This is where the Gower Distance comes into play. It is specifically designed to handle mixed-type data by computing distances in a way that respects the nature of each variable.
The Gower Distance is computed by scaling each variable's contribution to the range [0, 1] (numerical variables by their range, categorical variables as a simple match/mismatch) and then averaging these per-variable dissimilarities.
Problem Statement
¶
Given a dataset with a mix of numerical and categorical attributes, compute the Gower
Distance between pairs of data points to determine their similarity.
In [21]:
import numpy as np
import pandas as pd
# Generating a sample dataset
np.random.seed(42)
data = {
'Age': np.random.randint(20, 60, 100),
'Income': np.random.randint(30000, 80000, 100),
'Fruit_Preference': np.random.choice(['Apple', 'Banana', 'Cherry'], 100),
'Is_Smoker': np.random.choice([True, False], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[21]:
   Age   Income   Fruit_Preference   Is_Smoker
0  58    53599    Banana             True
1  48    65222    Cherry             False
2  34    41837    Cherry             True
3  27    44039    Cherry             False
4  40    60818    Cherry             True
To compute the Gower Distance for the data points, we will:
1. Normalize numerical attributes to the range [0, 1].
2. Compute a dissimilarity matrix for each attribute.
3. Combine the dissimilarity matrices using a weighted average.
Let's proceed with these steps; a small hand-rolled sketch of the idea is shown below, before we switch to the gower package.
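The sketch below is a hand-rolled illustration of these steps for the first two rows of the sample data. The column ranges used for normalization are assumed round numbers, so the result only approximates what the gower package computes from the actual data.
# Hand-rolled Gower sketch for the first two rows (equal weights, assumed ranges)
age_range, income_range = 40, 50000   # assumed ranges for normalization

row_a = {'Age': 58, 'Income': 53599, 'Fruit_Preference': 'Banana', 'Is_Smoker': True}
row_b = {'Age': 48, 'Income': 65222, 'Fruit_Preference': 'Cherry', 'Is_Smoker': False}

d_age = abs(row_a['Age'] - row_b['Age']) / age_range              # range-normalized
d_income = abs(row_a['Income'] - row_b['Income']) / income_range  # range-normalized
d_fruit = 0.0 if row_a['Fruit_Preference'] == row_b['Fruit_Preference'] else 1.0  # mismatch
d_smoker = 0.0 if row_a['Is_Smoker'] == row_b['Is_Smoker'] else 1.0               # mismatch

gower_approx = (d_age + d_income + d_fruit + d_smoker) / 4
print(round(gower_approx, 3))  # roughly 0.62, close to the 0.6226 reported by the gower package below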
In [22]:
#!pip install gower
In [23]:
import gower
# Compute the Gower Distance matrix
gower_distance_matrix = gower.gower_matrix(df)
# Display a portion of the Gower Distance matrix
gower_distance_matrix[:5, :5]
Out[23]:
array([[0. , 0.6226343 , 0.46307787, 0.74686074, 0.40173846],
[0.6226343 , 0. , 0.45750707, 0.24128991, 0.32345995],
[0.46307787, 0.45750707, 0. , 0.30596074, 0.1340471 ],
[0.74686074, 0.24128991, 0.30596074, 0. , 0.41782996],
[0.40173846, 0.32345995, 0.1340471 , 0.41782996, 0. ]],
dtype=float32)
The Gower Distance matrix provides a pairwise distance between each data point in our dataset. Each value in the matrix represents the Gower Distance between two data
points, with smaller values indicating higher similarity.
Interpretation
¶
The Gower Distance is especially useful when we want to cluster or classify mixed-type
data. A smaller Gower Distance between two data points suggests that they are more similar in terms of both their numerical and categorical attributes.
Visualization
¶
To better understand the Gower Distance, let's visualize the distance matrix using a heatmap.
In [24]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
sns.heatmap(gower_distance_matrix, cmap='viridis', cbar=True)
plt.title("Gower Distance Heatmap")
plt.show()
The heatmap visually represents the Gower Distances between data points. Darker colors indicate smaller distances (higher similarity), while lighter colors indicate larger distances (lower similarity).
Conclusion
¶
The Gower Distance is a powerful metric for computing distances in mixed-type datasets. It provides a standardized way to compare data points with both numerical and categorical attributes, making it invaluable for clustering, classification, and other machine learning tasks on mixed-type data.
Exercise 9 Understanding the Jaccard Coefficient for Binary Data
¶
Introduction
¶
The Jaccard Coefficient, also known as the Jaccard Index or Jaccard Similarity, is a statistic used for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of two sets. For binary data,
it can be used to measure the similarity between two binary vectors.
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Where:
• A and B are two sets.
• $|A \cap B|$ is the size of the intersection of sets A and B.
• $|A \cup B|$ is the size of the union of sets A and B.
Problem Statement
¶
Given a dataset with binary attributes, compute the Jaccard Coefficient between pairs of data points to determine their similarity.
In [25]:
import numpy as np
import pandas as pd
# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
'Bought_Apple': np.random.choice([0, 1], 100),
'Bought_Banana': np.random.choice([0, 1], 100),
'Bought_Cherry': np.random.choice([0, 1], 100),
'Is_Vegetarian': np.random.choice([0, 1], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[25]:
   Bought_Apple   Bought_Banana   Bought_Cherry   Is_Vegetarian
0  0              0               0               0
1  1              1               1               0
2  0              1               0               0
3  0              1               0               1
4  0              1               1               1
To compute the Jaccard Coefficient for the data points, we will:
1. Calculate the intersection of the binary attributes.
2. Calculate the union of the binary attributes.
3. Use the Jaccard formula to compute the coefficient.
Let's proceed with these steps.
In [26]:
from sklearn.metrics import jaccard_score
# Compute the Jaccard Coefficient for the first two data points as an example
data_point_1 = df.iloc[0]
data_point_2 = df.iloc[1]
jaccard_coefficient = jaccard_score(data_point_1, data_point_2, average='macro')
jaccard_coefficient
Out[26]:
0.125
The computed Jaccard Coefficient provides a measure of similarity between the two binary vectors. A value of 1 indicates that the vectors are identical, while a value of 0 indicates no similarity.
Interpretation
¶
The Jaccard Coefficient is a measure of similarity between two binary vectors. It takes into account the presence and absence of attributes, making it a robust metric for binary data comparison.
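For context (an assumed aside, not part of the original notebook): the set-style Jaccard for binary vectors counts only positions where at least one vector is 1, whereas jaccard_score with average='macro' averages the per-class scores for the 0 class and the 1 class, which is why the two can differ.
# Set-style Jaccard for the first two rows, for comparison with jaccard_score
import numpy as np

a = np.array([0, 0, 0, 0])   # first sample row above
b = np.array([1, 1, 1, 0])   # second sample row above

intersection = np.sum((a == 1) & (b == 1))            # positions where both are 1
union = np.sum((a == 1) | (b == 1))                   # positions where at least one is 1
set_jaccard = intersection / union if union else 1.0  # convention for two all-zero vectors
print(set_jaccard)  # 0.0 here, versus 0.125 from average='macro' above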
Visualization
¶
To better understand the distribution of Jaccard Coefficients in our dataset, let's visualize the coefficients for all pairs of data points using a histogram.
In [27]:
jaccard_coefficients = []
# Compute Jaccard Coefficients for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = jaccard_score(df.iloc[i], df.iloc[j], average='macro')
        jaccard_coefficients.append(coef)
# Plotting the histogram
plt.hist(jaccard_coefficients, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Jaccard Coefficients")
plt.xlabel("Jaccard Coefficient")
plt.ylabel("Frequency")
plt.show()
The histogram visually represents the distribution of Jaccard Coefficients for all pairs of
data points in our dataset. This gives us an insight into how similar or dissimilar the data points are to each other.
Conclusion
¶
The Jaccard Coefficient is a powerful metric for comparing binary vectors. It provides a standardized way to measure the similarity between two sets, making it invaluable for tasks like clustering, classification, and other machine learning tasks on binary data.
Exercise 10 Understanding the Cosine Similarity for Binary Data
¶
Introduction
¶
Cosine similarity is a metric used to determine how similar two sets of data are. It measures the cosine of the angle between two vectors projected in a multi-
dimensional space. For binary data, it can be used to measure the similarity between two binary vectors.
$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$
Where:
• A and B are two vectors.
• $A \cdot B$ is the dot product of the vectors.
• $\|A\|$ and $\|B\|$ are the magnitudes (or lengths) of the vectors.
Problem Statement
In [28]:
import numpy as np
import pandas as pd
# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
'Bought_Apple': np.random.choice([0, 1], 100),
'Bought_Banana': np.random.choice([0, 1], 100),
'Bought_Cherry': np.random.choice([0, 1], 100),
'Is_Vegetarian': np.random.choice([0, 1], 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
Out[28]:
   Bought_Apple   Bought_Banana   Bought_Cherry   Is_Vegetarian
0  0              0               0               0
1  1              1               1               0
2  0              1               0               0
3  0              1               0               1
4  0              1               1               1
To compute the cosine similarity for the data points, we will:
1. Calculate the dot product of the binary attributes.
2. Calculate the magnitude of each binary vector.
3. Use the cosine similarity formula to compute the similarity.
Let's proceed with these steps.
In [29]:
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity for the first two data points as an example
data_point_1 = df.iloc[0].values.reshape(1, -1)
data_point_2 = df.iloc[1].values.reshape(1, -1)
cosine_sim = cosine_similarity(data_point_1, data_point_2)
cosine_sim[0][0]
Out[29]:
0.0
The computed cosine similarity provides a measure of similarity between the two binary vectors. A value close to 1 indicates that the vectors are very similar, while a value close to 0 indicates they are dissimilar. Here the result is 0.0 because the first customer's vector is all zeros, so it shares no non-zero attributes with the second.
Interpretation
¶
Cosine similarity is a measure of similarity between two non-zero vectors. By using the
cosine of the angle between them, we can determine how similar they are regardless of their size.
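A hand-rolled check of the formula (an assumed sketch, not the notebook's code) also makes the zero-vector caveat explicit: the first sample row is all zeros, its norm is zero, and the ratio is undefined, which is why the pair above is reported as 0.0.
# Cosine similarity from the dot-product formula, guarding the all-zero case
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return 0.0            # one vector is all zeros; the angle is undefined
    return float(a @ b / norm_product)

print(cosine_sim([0, 0, 0, 0], [1, 1, 1, 0]))  # 0.0, the first two sample rows
print(cosine_sim([0, 1, 0, 1], [0, 1, 1, 1]))  # ~0.82, sample rows 3 and 4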
Visualization
¶
To better understand the distribution of cosine similarities in our dataset, let's visualize
the similarities for all pairs of data points using a histogram.
In [30]:
cosine_similarities = []
# Compute cosine similarities for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = cosine_similarity(df.iloc[i].values.reshape(1, -1), df.iloc[j].values.reshape(1, -1))
        cosine_similarities.append(coef[0][0])
# Plotting the histogram
import matplotlib.pyplot as plt
plt.hist(cosine_similarities, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Cosine Similarities")
plt.xlabel("Cosine Similarity")
plt.ylabel("Frequency")
plt.show()
The histogram visually represents the distribution of cosine similarities for all pairs of data points in our dataset. This gives us an insight into how similar or dissimilar the data points are to each other.
Conclusion
¶
Cosine similarity is a powerful metric for comparing vectors, especially in high-
dimensional spaces. For binary data, it offers a robust way to measure the similarity between two sets of attributes, making it invaluable for tasks like clustering, classification, and other machine learning tasks on binary data.
Evaluating Clustering Methods
¶
Evaluating the results of clustering methods is critical to understand the quality and relevance of the clusters formed. Since clustering is unsupervised, assessing its effectiveness can be somewhat subjective. However, there are established metrics and
techniques to guide this evaluation, both when ground truth labels are available and when they aren't.
Internal Evaluation:
¶
Without ground truth labels, evaluate clustering based on the dataset's intrinsic properties.
Metrics:
¶
1. Silhouette Coefficient:
• Compares the similarity of data points to their own cluster against other clusters.
• Values range from -1 (incorrect clustering) to 1 (highly dense clustering), with 0 suggesting overlapping clusters.
2. Davies-Bouldin Index:
• A ratio of within-cluster and between-cluster distances.
• Lower values indicate better clustering.
3. Calinski-Harabasz Index:
• Compares between-cluster dispersion to within-cluster dispersion.
• Higher values indicate better-defined clusters.
4. Dunn Index:
• Ratio of the smallest distance between points in different clusters to the
largest intra-cluster distance.
• Higher values indicate better clustering.
Relative Evaluation:
¶
This involves comparing the results of clustering for different configurations or numbers of clusters.
Techniques:
¶
1. Elbow Method:
• Used with K-means to determine the optimal number of clusters.
• Plot the variance explained (or inertia) against the number of clusters. The "elbow" point, where the rate of decrease sharply changes, often indicates an optimal number of clusters.
Stability and Consistency:
¶
Evaluate the robustness of clusters by perturbing the dataset.
Techniques:
¶
1. Sub-sampling or Bootstrapping:
• Repeatedly sample subsets of the data and perform clustering.
• Examine the consistency of the clustering results.
2. Adding Noise:
• Introduce random noise to the data.
• Stable clusters should remain relatively unchanged.
Challenges and Considerations:
¶
1. Subjectivity: Without a definitive "correct" clustering, some evaluation aspects remain subjective.
2. Scale Sensitivity: Some metrics need data normalization or standardization.
3. Choice of Metric: Different metrics might give varied evaluations for the same clustering result.
In conclusion, while evaluating clustering methods, it is often beneficial to consider multiple metrics and, when possible, combine them with domain knowledge to get a comprehensive view of the clustering quality.
Exercise 11 Evaluating Clustering with Silhouette Coefficient
¶
Introduction
¶
The Silhouette Coefficient is a metric used to calculate the goodness of a clustering algorithm. Its value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
The formula for a single sample is:
$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
Where:
• $a(i)$ is the mean distance between the sample and all other points in the same cluster.
• $b(i)$ is the mean distance between the sample and all other points in the nearest cluster that the sample is not a part of.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Silhouette Coefficient. The goal is to understand how well the data has been clustered.
In [31]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and then evaluate the clustering using the Silhouette Coefficient.
The steps are as follows:
1. Apply KMeans clustering to the dataset.
2. Calculate the Silhouette Coefficient for the clustering.
3. Visualize the clusters.
4. Interpret the results.
In [32]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Silhouette Coefficient
sil_coeff = silhouette_score(X, predicted_clusters, metric='euclidean')
sil_coeff
Out[32]:
0.6819938690643478
The Silhouette Coefficient gives a perspective into the distance between the resulting clusters. More distant clusters lead to better clusterings.
Visualization
¶
Let's visualize the clusters formed by the KMeans algorithm.
In [33]:
plt.scatter(X[:, 0], X[:, 1], c=predicted_clusters, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("Clusters Formed by KMeans")
plt.show()
The visualization shows the clusters formed by the KMeans algorithm. The red 'X' markers represent the centroids of the clusters.
Interpretation
¶
The Silhouette Coefficient provides a succinct metric to measure how close each point in one cluster is to the points in the neighboring clusters. Its values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
In our exercise, the Silhouette Coefficient value suggests that the clustering is reasonably well done.
Conclusion
¶
The Silhouette Coefficient is an effective metric to evaluate the quality of clusters created by a clustering algorithm. It provides insight into the separation distance between the resulting clusters. More distant clusters lead to better clustering.
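As a possible follow-up (not in the original exercise), sklearn.metrics.silhouette_samples exposes the per-sample values behind the average, which can reveal individual points that sit near a cluster boundary.
# Per-sample silhouette values for the same KMeans clustering (assumed follow-up)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)

sample_scores = silhouette_samples(X, labels)
print("mean:", sample_scores.mean())            # matches silhouette_score on average
print("lowest 5:", np.sort(sample_scores)[:5])  # the least well-placed points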
Exercise 12 Evaluating Clustering with Davies-Bouldin Index
¶
Introduction
¶
The Davies-Bouldin Index (DBI) is a metric used to evaluate clustering algorithms. The index signifies the average 'similarity' ratio between each cluster and its most similar cluster. A lower Davies-Bouldin Index relates to a model with better separation between the clusters.
The formula for the Davies-Bouldin Index is:
$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$
Where:
• $s_i$ is the average distance of all points in cluster i to the centroid of cluster i.
• $d_{ij}$ is the distance between the centroids of clusters i and j.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Davies-Bouldin Index. The goal is to understand how well the data has been clustered.
In [34]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and then evaluate the clustering using the Davies-
Bouldin Index.
In [35]:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Davies-Bouldin Index
dbi = davies_bouldin_score(X, predicted_clusters)
dbi
Out[35]:
0.43756400782378396
The Davies-Bouldin Index gives a measure of the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering.
Interpretation
¶
The Davies-Bouldin Index provides a measure of the average similarity ratio of each cluster with its most similar cluster. Lower values of the index indicate better clustering, implying that the clusters are dense and well separated.
In our exercise, the Davies-Bouldin Index value suggests that the clustering has been done effectively, with a good separation between the clusters.
Conclusion
¶
The Davies-Bouldin Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and density of the resulting clusters. Lower values of the index are desirable as they indicate better clustering.
Exercise 13 Evaluating Clustering with Calinski-Harabasz Index
¶
Introduction
¶
The Calinski-Harabasz Index (CHI), also known as the Variance Ratio Criterion, is a metric used to evaluate clustering algorithms. The index is the ratio of the sum of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index relates to a model with better-defined clusters.
The formula for the Calinski-Harabasz Index is:
$CHI = \frac{\mathrm{tr}(B) / (k - 1)}{\mathrm{tr}(W) / (N - k)}$
Where:
• $B$ is the between-group dispersion matrix and $\mathrm{tr}(B)$ its trace.
• $W$ is the within-cluster dispersion matrix and $\mathrm{tr}(W)$ its trace.
• $k$ is the number of clusters.
• $N$ is the number of data points.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Calinski-Harabasz Index. The goal is to understand how well the data has been clustered.
In [36]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
In [37]:
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Calinski-Harabasz Index
chi = calinski_harabasz_score(X, predicted_clusters)
chi
Out[37]:
1210.0899142587818
The Calinski-Harabasz Index gives a measure of the ratio of the sum of between-
cluster dispersion to within-cluster dispersion. Higher values indicate better clustering.
Interpretation
¶
The Calinski-Harabasz Index provides a measure of the ratio of the sum of between-
cluster dispersion to within-cluster dispersion. Higher values of the index indicate better clustering, implying that the clusters are dense and well separated.
In our exercise, the Calinski-Harabasz Index value suggests that the clustering has been done effectively, with a good separation between the clusters.
Conclusion
¶
The Calinski-Harabasz Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and density of the resulting clusters. Higher values of the index are desirable as they indicate better clustering.
Exercise 14 Evaluating Clustering with Dunn Index
¶
Introduction
¶
The Dunn Index is a metric used to determine the compactness and separation of clusters. A higher Dunn Index indicates better clustering. The index is defined as the ratio between the minimum inter-cluster distances to the maximum intra-cluster diameter.
The formula for the Dunn Index is:
$DI = \frac{\min(\text{distance between clusters})}{\max(\text{diameter of clusters})}$
Where:
• The distance between two clusters is the smallest distance between a point in one cluster and a point in the other.
• The diameter of a cluster is the distance between its two furthest points.
Problem Statement
¶
Given a dataset, apply a clustering algorithm and evaluate its performance using the Dunn Index. The goal is to understand how well the data has been clustered.
In [38]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
In [39]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
import numpy as np
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4)
predicted_clusters = kmeans.fit_predict(X)
# Calculating the Dunn Index
def dunn_index(X, labels):
    pairwise_dists = pairwise_distances(X)
    min_intercluster_distance = np.min([pairwise_dists[labels == i][:, labels == j].min() for i in np.unique(labels) for j in np.unique(labels) if i != j])
    max_diameter = max([np.max(pairwise_distances(X[labels == i])) for i in np.unique(labels)])
    return min_intercluster_distance / max_diameter
di = dunn_index(X, predicted_clusters)
di
Out[39]:
0.20231427477727162
The Dunn Index provides a measure of the compactness and separation of the clusters. A higher Dunn Index indicates better clustering.
Interpretation
¶
The Dunn Index provides a measure of the compactness and separation of the clusters. A higher Dunn Index suggests that the clusters are compact and well-
separated from each other.
In our exercise, the Dunn Index value suggests that the clustering has been done effectively, with a good separation between the clusters.
Conclusion
¶
The Dunn Index is a valuable metric to evaluate the quality of clusters created by a clustering algorithm. It offers insight into the separation and compactness of the resulting clusters. Higher values of the index are desirable as they indicate better clustering.
Exercise 15 Validating Clustering using the Elbow Method
¶
Introduction
¶
The Elbow Method is a heuristic used in determining the number of clusters in a dataset. The idea is to run k-means clustering on the dataset for a range of values of k (e.g., k from 1 to 10), and then for each value of k compute the sum of squared distances from each point to its assigned center. When these overall dispersions are plotted against k values, the "elbow" of the curve represents an optimal value for k (a balance between precision and computational cost).
Problem Statement
¶
Given a dataset, we aim to find the optimal number of clusters into which the data may be clustered. The goal is to understand the concept of the Elbow Method and how
it can be used to determine the optimal number of clusters for a given dataset.
In [40]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset for a range of k values and compute the sum of squared distances for each value of k. We will then plot these values to visualize the "elbow" and determine the optimal number of clusters.
The steps are as follows:
1. Apply KMeans clustering to the dataset for a range of k values.
2. Compute the sum of squared distances for each k.
3. Plot the sum of squared distances against k values.
4. Determine the "elbow" or point of inflection on the curve.
In [41]:
from sklearn.cluster import KMeans
# Applying KMeans clustering for a range of k values
wcss = [] # Within-Cluster-Sum-of-Squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plotting the Elbow Method graph
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The Elbow Method graph shows the sum of squared distances (WCSS) for different values of k. As k increases, WCSS keeps decreasing, approaching zero as k approaches the number of data points. The location of the "elbow" on the curve is generally considered an indicator of the appropriate number of clusters.
Interpretation
¶
From the graph, we can observe that the "elbow" is formed when the number of clusters is around 4. This suggests that the optimal number of clusters for this dataset is 4.
Conclusion
¶
The Elbow Method is a powerful technique to determine the optimal number of clusters
for a dataset. It provides a visual representation of how the sum of squared distances changes with different numbers of clusters, helping in making an informed decision about the number of clusters to use.
Exercise 16 Validating Clustering using Cohesion
¶
Introduction
¶
Cohesion measures the closeness of the data points within the same cluster. It is the total distance of each sample to its cluster centroid. Lower values of cohesion indicate that the data points are closer to the centroids of their respective clusters, which implies better clustering.
Problem Statement
¶
Given a dataset, we aim to find the cohesion value after applying the KMeans clustering algorithm. The goal is to understand the concept of cohesion and how it can be used to evaluate the quality of clustering.
In [42]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and compute the cohesion value.
The steps are as follows:
1. Apply KMeans clustering to the dataset.
2. Compute the cohesion value.
3. Interpret the result.
In [43]:
from sklearn.cluster import KMeans
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)
# Computing the cohesion value
cohesion = kmeans.inertia_
cohesion
Out[43]:
212.00599621083478
Interpretation
¶
The cohesion value represents the sum of squared distances of samples to their closest cluster center. A lower cohesion value indicates that the data points are closer to the centroids of their respective clusters, suggesting better clustering. However, it's essential to balance cohesion with other metrics and the problem's context to
determine the best number of clusters.
Conclusion
¶
Cohesion is a valuable metric to evaluate the quality of clustering. It provides a measure of how close the data points are to their respective cluster centroids. By analyzing cohesion and other metrics, we can make informed decisions about the clustering process and its effectiveness.
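As a sanity check (an assumed sketch, not part of the original exercise), the cohesion value can be reproduced by hand as the sum of squared distances from each point to its assigned centroid, which is exactly what kmeans.inertia_ stores.
# Hand-computed cohesion should match kmeans.inertia_ up to floating-point error
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42).fit(X)

assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]  # centroid of each point's cluster
manual_cohesion = np.sum((X - assigned_centroids) ** 2)
print(manual_cohesion, kmeans.inertia_)                       # the two values agree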
Exercise 17 Validating Clustering using Separation Score
¶
Introduction
¶
Separation measures how distinct or well-separated the clusters are from each other. It is commonly summarized by the spread of the cluster centroids, for example their total variance or the distances between them. Higher values of separation indicate that the clusters are well-separated, which implies better clustering.
Problem Statement
¶
Given a dataset, we aim to find the separation score after applying the KMeans clustering algorithm. The goal is to understand the concept of separation and how it can be used to evaluate the quality of clustering.
In [44]:
# Generating a sample dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualizing the generated dataset
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()
We have generated a dataset with four clusters. Next, we will apply the KMeans clustering algorithm to this dataset and compute the separation score.
The steps are as follows:
1. Apply KMeans clustering to the dataset.
2. Compute the separation score.
3. Interpret the result.
In [45]:
from sklearn.cluster import KMeans
import numpy as np
# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)
# Computing the separation score
centroids = kmeans.cluster_centers_
separation = np.sum(np.var(centroids, axis=0))
separation
Out[45]:
8.66710533030799
Interpretation
¶
The separation score represents the variance between cluster centroids. A higher separation score indicates that the clusters are well-separated from each other, suggesting better clustering. However, it's essential to balance separation with other metrics and the problem's context to determine the best number of clusters.
Conclusion
¶
Separation is a valuable metric to evaluate the quality of clustering. It provides a measure of how distinct or well-separated the clusters are from each other. By analyzing separation and other metrics, we can make informed decisions about the clustering process and its effectiveness.
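The exercise above summarizes separation as the variance of the centroids; an alternative formulation (an assumed sketch, not the notebook's method) looks directly at the pairwise distances between centroids.
# Alternative separation summary: pairwise distances between KMeans centroids
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42).fit(X)

centroid_dists = pdist(kmeans.cluster_centers_)              # the 6 pairwise centroid distances
print("mean centroid distance:", centroid_dists.mean())
print("smallest centroid distance:", centroid_dists.min())   # the two closest clusters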
Revised Date: October 23, 2023
¶
In [ ]: