Homework5sol

School

University of California, Berkeley

Course

MISC

Subject

Mathematics

Date

Jan 9, 2024

Homework 5: Cluster Analysis and Anomaly Detection

Question 1: Assume we have a simple dataset with 10 two-dimensional points (x, y).

Dataset: (2, 3), (3, 3), (3, 4), (4, 4), (7, 5), (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)

Use the K-means algorithm to group the data points into two clusters. The initial centroids are Centroid 1: (3, 3) and Centroid 2: (8, 8).

Solution:

Iteration 1: Euclidean distance from each point to the initial centroids. Each point is assigned to the nearer centroid.

Point      Centroid 1 (3, 3)    Centroid 2 (8, 8)    Cluster
(2, 3)     1.00                 7.81                 1
(3, 3)     0.00                 7.07                 1
(3, 4)     1.00                 6.40                 1
(4, 4)     1.41                 5.66                 1
(7, 5)     4.47                 3.16                 2
(9, 4)     6.08                 4.12                 2
(6, 8)     5.83                 2.00                 2
(8, 8)     7.07                 0.00                 2
(9, 9)     8.49                 1.41                 2
(8, 10)    8.60                 2.00                 2

Let's calculate the new centroids:

Centroid 1: ((2+3+3+4)/4, (3+3+4+4)/4) = (3, 3.5)
Centroid 2: ((7+9+6+8+9+8)/6, (5+4+8+8+9+10)/6) = (47/6, 44/6) ≈ (7.8, 7.3)

Iteration 2: distances to the updated centroids, computed with the rounded Centroid 2 (7.8, 7.3).

Point      Centroid 1 (3, 3.5)  Centroid 2 (7.8, 7.3)  Cluster
(2, 3)     1.12                 7.22                   1
(3, 3)     0.50                 6.44                   1
(3, 4)     0.50                 5.82                   1
(4, 4)     1.12                 5.03                   1
(7, 5)     4.27                 2.44                   2
(9, 4)     6.02                 3.51                   2
(6, 8)     5.41                 1.93                   2
(8, 8)     6.73                 0.73                   2
(9, 9)     8.14                 2.08                   2
(8, 10)    8.20                 2.71                   2

No point changes cluster in the second iteration, so the algorithm has converged. So, the final clusters are:
Cluster 1: (2, 3), (3, 3), (3, 4), (4, 4)
Cluster 2: (7, 5), (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)

Question 2: A data scientist plans to use DBSCAN with the minimum number of points set to 5. Identify each labeled point in the scatter plot as a border point, core point, or outlier. (Discuss it.)

Solution: Treat the colored circles as the ε-neighborhoods of the points A, B, and C. A is a core point because its ε-neighborhood contains at least the minimum number of points. B and C are outliers: neither has enough points within ε to be a core point, and neither lies within ε of a core point, so neither qualifies as a border point.

Question 3: Determine whether each labeled point in the figure below is a core point, a boundary point, or an outlier, given ε = 2 and a minimum of 4 points for a core point. (Discuss it.)
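Going back to Question 1, the two hand-computed K-means iterations can be checked with a short script. This is a minimal sketch in plain Python, not the method used in the original solution: the function names `assign`, `update`, and `kmeans` are illustrative, and the sketch assumes no cluster ever becomes empty (true for this dataset and these initial centroids).

```python
from math import dist

# Dataset and initial centroids from Question 1.
points = [(2, 3), (3, 3), (3, 4), (4, 4), (7, 5),
          (9, 4), (6, 8), (8, 8), (9, 9), (8, 10)]
initial_centroids = [(3, 3), (8, 8)]


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]


def update(points, labels, k):
    """Recompute each centroid as the mean of its assigned points.

    Assumption: every cluster has at least one member.
    """
    centroids = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: no point switched cluster
            break
        labels = new_labels
        centroids = update(points, labels, len(centroids))
    return labels, centroids


labels, centroids = kmeans(points, initial_centroids)
print(labels)     # cluster index (0 or 1) per point, in dataset order
print(centroids)  # final centroids, unrounded
```

Labels 0 and 1 correspond to Cluster 1 and Cluster 2. The script keeps the second centroid unrounded at (47/6, 44/6), so its distances differ slightly from a table built on the rounded (7.8, 7.3), but the assignments are the same.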
Solution: For a point to be a core point, it must have at least min_points samples within the given ε. A is a core point because it has 4 samples within ε = 2. Point B has only 2 points within ε, fewer than the 4 required, so it is not a core point. It is labeled a boundary point, which by definition also requires it to lie within ε of a core point. Point C has no points within ε, so C is an outlier.

Question 4: What is the distance between the two clusters using centroid linkage?

Solution:

The centroid of cluster A = ((1+2+0)/3, (6+4+2)/3) = (1, 4)
The centroid of cluster B = ((6+8)/2, (4+4)/2) = (7, 4)

The distance between the two clusters is the Euclidean distance between the centroids: sqrt((7-1)^2 + (4-4)^2) = 6

Question 5: Create a dendrogram from the following figure. (The height of the dendrogram in your solution is not important.)
Solution: [Figure: dendrogram]

Question 6: In real estate, outliers represent a special circumstance that drastically affects the price of a house. One possibility is that a home received multiple offers and one bidder submitted a high offer to guarantee the offer is accepted. Another possibility is that an unexpected event happened and the owner needs to sell quickly. The figure below uses a dataset that contains 76 single-family homes with list price and square feet as features. The values for list price and square feet have been standardized, because the units and ranges of the two variables are different.
a) How many clusters of houses are obtained when ε = 1 and min_samples = 12?

b) Is the 1,440 square foot house listed at $277,000 an outlier? Why?

It is an outlier because it does not satisfy the min_samples condition to be a core point, and it has no core point within ε, so it cannot be a border point either.

c) Would decreasing ε most likely increase the number of houses identified as outliers? Why?

Yes. Many of the points are border points, and if we decrease ε, some of them may no longer have a core point within ε, which would make them outliers.

Question 7: Suppose we apply DBSCAN to cluster the following dataset using Euclidean distance. Given that minpoint = 3 and epsilon = 1, answer the following questions.
a. Label all points as “core points”, “boundary points” and “noise”.

Pairwise Euclidean distances (lower triangle):

      A     B     C     D     E     F     G     H     I     J     L     M
A     0
B     1.4   0
C     2.2   1     0
D     3.1   2     1     0
E     2.8   1.4   1     1.4   0
F     2.2   1     1.4   2.2   1     0
G     3.1   2     2.2   2.8   1.4   1     0
H     5.6   4.2   3.6   3.1   2.8   3.6   3.1   0
I     7.07  5.6   5     4.4   4.2   5     4.4   1.4   0
J     7.81  6.4   5.6   5     5     5.8   5.3   2.2   1     0
L     8.4   7.0   6.4   5.8   5.6   5.8   5.8   2.8   1.4   1     0
M     7.81  6.4   5.8   5.3   5     5.6   5     2.2   1     1.4   1     0

Neighbors of each point within epsilon = 1 (excluding the point itself):

A: none
B: C, F
C: B, D, E
D: C
E: C, F
F: B, E, G
G: F
H: none
I: J, M
J: I, L
L: J, M
M: I, L

A point counts itself toward minpoint, so a point with at least two neighbors is a core point. J has two neighbors (the table gives the distance from I to J as 1), so it is a core point.

Core points: B, C, E, F, I, J, L, M
Boundary points: D, G (not core, but within epsilon of the core points C and F, respectively)
Noise: A, H

b. What is the clustering result?

There will be two clusters: B, C, D, E, F, G form one cluster and I, J, L, M form the other. A and H are noise and belong to neither cluster.
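The labeling in Question 7 can be reproduced by applying the DBSCAN definitions directly to the distance table. The following is a minimal sketch in plain Python rather than a library implementation. The names `LABELS`, `LOWER`, `classify`, and `build_clusters` are illustrative, and the sketch assumes that a point counts itself toward minpoint (the convention under which B, with two neighbors, is a core point).

```python
# Lower-triangular distances copied from the Question 7 table:
# row i lists the distances from point i to every earlier point.
LABELS = list("ABCDEFGHIJLM")
LOWER = [
    [],
    [1.4],
    [2.2, 1],
    [3.1, 2, 1],
    [2.8, 1.4, 1, 1.4],
    [2.2, 1, 1.4, 2.2, 1],
    [3.1, 2, 2.2, 2.8, 1.4, 1],
    [5.6, 4.2, 3.6, 3.1, 2.8, 3.6, 3.1],
    [7.07, 5.6, 5, 4.4, 4.2, 5, 4.4, 1.4],
    [7.81, 6.4, 5.6, 5, 5, 5.8, 5.3, 2.2, 1],
    [8.4, 7.0, 6.4, 5.8, 5.6, 5.8, 5.8, 2.8, 1.4, 1],
    [7.81, 6.4, 5.8, 5.3, 5, 5.6, 5, 2.2, 1, 1.4, 1],
]


def classify(labels, lower, eps, min_pts):
    """Split points into core, border, and noise sets."""
    n = len(labels)

    def distance(i, j):
        return 0.0 if i == j else lower[max(i, j)][min(i, j)]

    neighbors = {
        labels[i]: {labels[j] for j in range(n)
                    if j != i and distance(i, j) <= eps}
        for i in range(n)
    }
    # Assumption: the point itself counts toward min_pts.
    core = {p for p, nb in neighbors.items() if len(nb) + 1 >= min_pts}
    border = {p for p, nb in neighbors.items() if p not in core and nb & core}
    noise = set(labels) - core - border
    return neighbors, core, border, noise


def build_clusters(labels, neighbors, core):
    """Grow one cluster per connected group of core points."""
    clusters, visited = [], set()
    for start in labels:
        if start not in core or start in visited:
            continue
        members, stack = set(), [start]
        while stack:
            p = stack.pop()
            if p in members:
                continue
            members.add(p)
            if p in core:  # only core points expand the cluster
                stack.extend(neighbors[p] - members)
        visited |= members
        clusters.append(members)
    return clusters


neighbors, core, border, noise = classify(LABELS, LOWER, eps=1, min_pts=3)
clusters = build_clusters(LABELS, neighbors, core)
print(sorted(core), sorted(border), sorted(noise))
print([sorted(c) for c in clusters])
```

Under these assumptions the sketch reports B, C, E, F, I, J, L, M as core points, D and G as boundary points, A and H as noise, and the two clusters {B, C, D, E, F, G} and {I, J, L, M}.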
Question 8: After performing anomaly detection, data miner A wants to find clusters of outliers. Data miner B claims that this does not make any sense and suggests that A re-read the definition of an anomaly. Do you think it is meaningful to cluster anomalies? Explain.

Solution: Data miner A might argue that clustering anomalies can be meaningful in certain scenarios. Anomalies, by definition, are data points that deviate significantly from the norm. While the primary goal of anomaly detection is to identify these deviant instances, clustering anomalies could provide additional insights. There may be different types or categories of anomalies, and grouping them into clusters can help reveal characteristics that the outliers share. This may lead to a deeper understanding of the underlying patterns or factors contributing to the anomalies.

Data miner B emphasizes the definition of an anomaly, suggesting that anomalies are inherently outliers and, therefore, that trying to find clusters among outliers might be contradictory. Anomalies are often considered rare and exceptional events, and attempting to group them into clusters might dilute the concept of what makes them anomalous in the first place.

In summary, whether it makes sense to cluster anomalies depends on the specific objectives of the analysis. If the goal is to understand different types of anomalies and their underlying patterns, clustering might be meaningful. However, if the focus is solely on identifying and isolating individual outliers, clustering might be less relevant and could complicate the interpretation of anomalies. Data miners should carefully consider the nature of the data and the goals of the analysis before deciding whether to cluster anomalies.

Question 9: Referring to the figure below, what is the optimal number of clusters for the dataset? Why?
Solution: For the figure given, the optimal number of clusters for the dataset is 3. This is based on the elbow method, which chooses the number of clusters from a plot of the within-cluster sum of squares (WCSS) against the number of clusters. The "elbow" is the point where the rate of decrease in WCSS changes sharply, so that adding more clusters no longer reduces WCSS significantly. In other words, it is where WCSS starts to show diminishing returns.
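One way to make "the rate of decrease changes sharply" concrete is to look for the largest second difference in the WCSS curve. The sketch below is only one of several possible elbow heuristics and is not taken from the original solution. The function name `elbow_k` is illustrative, and the WCSS values are made-up numbers chosen for demonstration, not values read from the figure.

```python
def elbow_k(wcss, k_start=1):
    """Return the k at which the WCSS curve bends most sharply.

    wcss[i] is the WCSS for k = k_start + i clusters. The bend at an
    interior k is measured by the second difference
    wcss[k-1] - 2*wcss[k] + wcss[k+1]; a large value means a steep
    drop before k followed by a shallow drop after it.
    """
    if len(wcss) < 3:
        raise ValueError("need WCSS for at least three consecutive k values")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(wcss) - 1):
        bend = wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_start + i, bend
    return best_k


# Hypothetical WCSS values for k = 1..6 (illustration only).
example_wcss = [1000, 600, 200, 170, 150, 140]
print(elbow_k(example_wcss))
```

With these hypothetical values the steep drops end at k = 3, which is the value the heuristic selects. On real data the elbow is often less pronounced, so the plot should still be inspected by eye.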