Homework 5

School: University of California, Berkeley
Course: MISC
Subject: Mathematics
Date: Jan 9, 2024
Type: pdf
Pages: 8

Uploaded by ConstableNeutronRam141

Homework 5: Cluster Analysis and Anomaly Detection

Question 1: Assume we have a simple dataset with 10 two-dimensional points (x, y).

Dataset: (2, 3), (3, 3), (3, 4), (4, 4), (7, 5), (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)

Use the K-means algorithm to group the data points into two clusters. The initial centroids are Centroid 1: (3, 3) and Centroid 2: (8, 8).

Solution:

Iteration 1: Euclidean distance from each point to the initial centroids.

Point      Centroid 1 (3, 3)   Centroid 2 (8, 8)   Cluster
(2, 3)     1.00                7.81                1
(3, 3)     0.00                7.07                1
(3, 4)     1.00                6.40                1
(4, 4)     1.41                5.66                1
(7, 5)     4.47                3.16                2
(9, 4)     6.08                4.12                2
(6, 8)     5.83                2.00                2
(8, 8)     7.07                0.00                2
(9, 9)     8.49                1.41                2
(8, 10)    8.60                2.00                2

Next centroids:

New Centroid 1 = ((2+3+3+4)/4, (3+3+4+4)/4) = (3, 3.5)
New Centroid 2 = ((7+9+6+8+9+8)/6, (5+4+8+8+9+10)/6) ≈ (7.8, 7.3)

Iteration 2: distance from each point to the updated centroids.

Point      Centroid 1 (3, 3.5)   Centroid 2 (7.8, 7.3)   New cluster
(2, 3)     1.12                  7.22                    1
(3, 3)     0.50                  6.44                    1
(3, 4)     0.50                  5.82                    1
(4, 4)     1.12                  5.03                    1
(7, 5)     4.27                  2.43                    2
(9, 4)     6.02                  3.51                    2
(6, 8)     5.41                  1.93                    2
(8, 8)     6.73                  0.73                    2
(9, 9)     8.14                  2.08                    2
(8, 10)    8.20                  2.71                    2

No point changes cluster in the second iteration, so the algorithm has converged.

Final clusters:
Cluster 1: (2, 3), (3, 3), (3, 4), (4, 4)
Cluster 2: (7, 5), (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)
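The two K-means steps used in Question 1 (assign each point to its nearest centroid, then recompute each centroid as the mean of its points) can be checked with a short script. This is a minimal sketch using only the Python standard library; the function names `assign`, `update`, and `kmeans` are illustrative choices, not part of the assignment, and the sketch does not handle a cluster that becomes empty.

```python
import math

# Dataset and initial centroids from Question 1.
points = [(2, 3), (3, 3), (3, 4), (4, 4), (7, 5),
          (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)]
initial_centroids = [(3, 3), (8, 8)]


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, k):
    """Return the mean of the points assigned to each of the k clusters."""
    centroids = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        # Assumes every cluster keeps at least one member.
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        centroids = update(points, labels, len(centroids))
    return labels, centroids


labels, centroids = kmeans(points, initial_centroids)
for p, lab in zip(points, labels):
    print(p, "-> cluster", lab + 1)
print("centroids:", centroids)
```

Labels are zero-based inside the code and printed one-based to match the cluster numbering used in the solution.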
Question 2: A data scientist plans to use DBSCAN with the minimum number of points set to 5. Identify each labeled point in the scatter plot as a border point, core point, or outlier. (Discuss it.)

Solution: A is a core point because it has at least the minimum number of points within its ε range. B and C are outliers: neither has the minimum number of points within ε, and neither lies within ε of a core point, so they cannot be border points.

Question 3: Determine whether each labeled point in the figure below is a core point, a boundary point, or an outlier, given ε = 2 and a minimum of 4 points for a core point. (Discuss it.)
Solution: For a point to be a core point, it must satisfy the min_points condition within the given ε. A is a core point because it has 4 samples within ε = 2. Point B has only 2 points within ε, fewer than the 4 required, so it is not a core point; it is classified as a boundary point, which requires that it lie within ε of a core point. Point C has no points within ε, so C is an outlier.

Question 4: What is the distance between the two clusters using centroid linkage?

Solution:
Centroid of cluster A = ((1+2+0)/3, (6+4+2)/3) = (1, 4)
Centroid of cluster B = ((6+8)/2, (8+4)/2) = (7, 6)

Distance between the two clusters:
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} = \sqrt{(1 - 7)^2 + (4 - 6)^2} = \sqrt{40} \approx 6.32$

Question 5: Create a dendrogram from the following figure. (The height of the dendrogram in your solution is not important.)
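The centroid-linkage computation in Question 4 can be checked numerically. The sketch below assumes the cluster members are the points implied by the sums in the solution, A = {(1, 6), (2, 4), (0, 2)} and B = {(6, 8), (8, 4)}; the figure is not reproduced here, so the pairing of x and y coordinates is an assumption. The helper name `centroid` is illustrative.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


# Assumed cluster members, read off the sums written in the solution.
cluster_a = [(1, 6), (2, 4), (0, 2)]
cluster_b = [(6, 8), (8, 4)]

centroid_a = centroid(cluster_a)
centroid_b = centroid(cluster_b)

# Centroid linkage: the distance between clusters is the Euclidean
# distance between their centroids.
linkage_distance = math.dist(centroid_a, centroid_b)

print("centroid A:", centroid_a)
print("centroid B:", centroid_b)
print("centroid-linkage distance:", round(linkage_distance, 2))
```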
Solution: [dendrogram figure not included]

Question 6: In real estate, outliers represent a special circumstance that drastically affects the price of a house. One possibility is that a home received multiple offers and one bidder submitted a high offer to guarantee that it would be accepted. Another possibility is that an unexpected event happened and the owner needs to sell quickly.
The figure below uses a dataset that contains 76 single-family homes with list price and square feet as features. The values for list price and square feet have been standardized because the units and ranges of the two variables differ.

a) How many clusters of houses are obtained when ε = 1 and min_samples = 12?
b) Is the 1,440 square foot house listed at $277,000 an outlier? Why?
c) Would decreasing ε most likely increase the number of houses identified as outliers? Why?

Solution:
a) 9
b) It is an outlier: it does not have min_samples points within ε, so it is not a core point, and it has no core point within ε, so it is not a border point either.
c) Yes. Decreasing ε would likely increase the number of houses identified as outliers. The figure shows many border points; with a smaller ε, some of them may no longer have a core point within ε, which makes them outliers.

Question 7: Suppose we apply DBSCAN to cluster the following dataset using Euclidean distance.
Given that minpoint = 3 and epsilon = 1, answer the following questions.

a. Label all points as "core points", "boundary points", and "noise".

Distance matrix (lower triangle):

     A     B     C     D     E     F     G     H     I     J     L     M
A    0
B    1.4   0
C    2.2   1     0
D    3.1   2     1     0
E    2.8   1.4   1     1.4   0
F    2.2   1     1.4   2.2   1     0
G    3.1   2     2.2   2.8   1.4   1     0
H    5.6   4.2   3.6   3.1   2.8   3.6   3.1   0
I    7.07  5.6   5     4.4   4.2   5     4.4   1.4   0
J    7.81  6.4   5.6   5     5     5.8   5.3   2.2   1     0
L    8.4   7.0   6.4   5.8   5.6   5.8   5.8   2.8   1.4   1     0
M    7.81  6.4   5.8   5.3   5     5.6   5     2.2   1     1.4   1     0

Neighbors within ε = 1:
A: none
B: C, F
C: B, D, E
D: C
E: C, F
F: B, E, G
G: F
H: none
I: J, M
J: I, L
L: J, M
M: I, L

With minpoint = 3 counting the point itself, a point is a core point if it has at least 2 neighbors within ε. A non-core point with a core point among its neighbors is a boundary point, and every other point is noise.

Core points: B, C, E, F, I, J, L, M
Boundary points: D, G
Noise: A, H

b. What is the clustering result?

Two clusters: B, C, D, E, F, G form one cluster, and I, J, L, M form the other.
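The labeling in Question 7 can be reproduced directly from the distance matrix. The sketch below is a hand-rolled illustration of the DBSCAN definitions, not a library implementation; it assumes that the minpoint count includes the point itself and that a neighbor is any other point at distance ≤ ε. Variable names such as `lower`, `EPS`, and `MIN_PTS` are illustrative.

```python
from collections import deque

names = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M"]

# Lower triangle of the Question 7 distance matrix: row i holds the
# distances from point i to every earlier point.
lower = [
    [],
    [1.4],
    [2.2, 1],
    [3.1, 2, 1],
    [2.8, 1.4, 1, 1.4],
    [2.2, 1, 1.4, 2.2, 1],
    [3.1, 2, 2.2, 2.8, 1.4, 1],
    [5.6, 4.2, 3.6, 3.1, 2.8, 3.6, 3.1],
    [7.07, 5.6, 5, 4.4, 4.2, 5, 4.4, 1.4],
    [7.81, 6.4, 5.6, 5, 5, 5.8, 5.3, 2.2, 1],
    [8.4, 7.0, 6.4, 5.8, 5.6, 5.8, 5.8, 2.8, 1.4, 1],
    [7.81, 6.4, 5.8, 5.3, 5, 5.6, 5, 2.2, 1, 1.4, 1],
]

EPS = 1.0
MIN_PTS = 3  # assumption: the count includes the point itself


def dist(i, j):
    """Look up the symmetric distance between points i and j."""
    return 0.0 if i == j else lower[max(i, j)][min(i, j)]


n = len(names)
neighbors = {i: [j for j in range(n) if j != i and dist(i, j) <= EPS]
             for i in range(n)}
core = {i for i in range(n) if len(neighbors[i]) + 1 >= MIN_PTS}
border = {i for i in range(n)
          if i not in core and any(j in core for j in neighbors[i])}
noise = set(range(n)) - core - border

# Grow one cluster per connected group of core points; border points
# join the cluster of a neighboring core point but do not expand it.
clusters, seen = [], set()
for start in sorted(core):
    if start in seen:
        continue
    members, queue = {start}, deque([start])
    seen.add(start)
    while queue:
        i = queue.popleft()
        for j in neighbors[i]:
            if j in members:
                continue
            members.add(j)
            if j in core:
                seen.add(j)
                queue.append(j)
    clusters.append(sorted(names[i] for i in members))

print("core:  ", sorted(names[i] for i in core))
print("border:", sorted(names[i] for i in border))
print("noise: ", sorted(names[i] for i in noise))
print("clusters:", clusters)
```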
Question 8: After performing anomaly detection, data miner A wants to find clusters of outliers. Data miner B claims that this does not make any sense and suggests that A re-read the definition of an anomaly. Do you think it is meaningful to cluster anomalies? Explain.

Clustering anomalies can be meaningful, but whether it is depends on the nature of the data and the goals of the analysis. Anomaly detection and clustering serve distinct purposes, and combining them can offer useful insights in some settings. Both perspectives have merit.

Data Miner A's perspective (clustering anomalies):
In some datasets, anomalies are not uniformly scattered but form distinct groups. Clustering them can separate different types of unusual behavior and give a more nuanced picture of the anomalous points. This is useful for understanding root causes, especially when different clusters correspond to different kinds of abnormal behavior.

Data Miner B's perspective (questioning the clustering of anomalies):
Anomalies, by definition, deviate significantly from the norm and are expected to be rare and distinct, so they may not share characteristics that clustering can pick up. The primary goal of anomaly detection is often to highlight and isolate unusual instances, and clustering may not be needed for that goal.

In summary, clustering anomalies is not always appropriate or necessary, but there are scenarios where it reveals additional structure within the anomalous points. Data miners should decide based on the specific context and goals of their analysis.
Question 9: Referring to the figure below, what is the optimal number of clusters for the dataset? Why?

Solution: The optimal number of clusters for the dataset is 3. This follows from the elbow method, which chooses the number of clusters from a plot of the within-cluster sum of squares (WCSS) against the number of clusters. The "elbow" is the point where the rate of decrease in WCSS changes sharply; beyond it, adding more clusters does not significantly reduce WCSS.
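To make the quantity behind the elbow plot concrete, the sketch below computes WCSS, the sum of squared distances from each point to the centroid of its cluster. The dataset behind the Question 9 figure is not available here, so the example reuses the Question 1 points as a stand-in and compares a single cluster with the two-cluster result from that question; the helper names `centroid` and `wcss` are illustrative.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def wcss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        total += sum(math.dist(p, c) ** 2 for p in members)
    return total


# Stand-in data: the Question 1 points and their final two clusters.
cluster_1 = [(2, 3), (3, 3), (3, 4), (4, 4)]
cluster_2 = [(7, 5), (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)]

wcss_k1 = wcss([cluster_1 + cluster_2])   # everything in one cluster
wcss_k2 = wcss([cluster_1, cluster_2])    # the two-cluster solution

print("WCSS with k = 1:", round(wcss_k1, 2))
print("WCSS with k = 2:", round(wcss_k2, 2))
```

An elbow plot repeats this calculation for each candidate number of clusters and looks for the value of k after which the drop in WCSS levels off.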