Week 9 Computer Tutorial
Dr Richi Nayak, r.nayak@qut.edu.au
Practical Topics:
1. Preparing data for clustering
2. Running K-means Clustering
3. Understanding and visualising a clustering model
4. Running the Agglomerative Clustering Algorithm
5. Running the Kprototypes Clustering
Part 1 - Reflective exercises
In this practical, you will be introduced to a new dataset, data preparation for clustering, and the clustering task.
Please reflect on the clustering-related theoretical concepts and answer the following questions.
Exercise 1: Clustering Introduction: Basics
1. In data mining, data objects that do not comply with the general behaviour or model of the data are called:
a. Clusters b. Centroids c. Outliers d. None of these.
2. Which of the following is a common use of unsupervised clustering?
a. detect outliers b. determine a best set of input attributes for supervised learning
c. determine if meaningful relationships can be found in a dataset d. All of a, b and c are common uses of unsupervised clustering.
3. Suppose you are a luxury automobile dealer. Your dealership is planning to sell a new model. Given the
following questions, which question could be answered by applying clustering on their customer dataset?
a. How much should you charge for the new BMW X6M? b. How likely is it that person X will buy the new BMW X5W? c. What types of customers have bought the silver BMW M5? List their features.
4. The K-Means algorithm terminates when
a. a user-defined minimum value for the summation of squared error differences between instances and their corresponding cluster centers is reached. b. the cluster centers for the current iteration are identical to the cluster centers for the previous iteration. c. the number of instances in each cluster for the current iteration is identical to the number of instances in each cluster of the previous iteration. d. the number of clusters formed for the current iteration is identical to the number of clusters formed in the previous iteration.
5. This clustering algorithm initially assumes that each data instance represents a single cluster.
a. agglomerative clustering b. conceptual clustering c. K-Means clustering d. expectation maximization
6. List the aspects on which k-median performs better and worse than k-means. Discuss a main challenge
common to both the K-means and K-medoids algorithms.
Exercise 2: Clustering process: proximity measures
1. A popular criterion for measure of closeness of two objects in clustering analysis is a distance function.
Name the other criterion that can be used to differentiate two objects.
2. Consider two clustering solutions with the intra- and inter-cluster similarity measures. Let the similarity
measure be in the range of [0,1] where 0 is lowest and 1 is highest. Which is the optimal solution?
Solution 1: Intra-cluster similarity = 0.99, inter-cluster similarity = 0.01
Solution 2: Intra-cluster similarity = 0.01, inter-cluster similarity = 0.99
3. List various distance measures, and discuss how they are used in clustering.
4. Consider a very small dataset with three observations (data points) S1, S2 and S3. Calculate the distance between 1) S1 and S2, and 2) S1 and S3. Use the dot product, Euclidean, and Manhattan distance measures.
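For reference, for two points a = (a1, ..., an) and b = (b1, ..., bn), the three measures are computed as follows (generic formulas; substitute the exercise's own observation values):
dot product: a . b = a1*b1 + ... + an*bn
Euclidean distance: d(a, b) = sqrt((a1 - b1)^2 + ... + (an - bn)^2)
Manhattan distance: d(a, b) = |a1 - b1| + ... + |an - bn|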
Exercise 3: Evaluation
1. Unsupervised evaluation can be internal or external. Which of the following is an internal method for
evaluating alternative clusterings produced by the K-Means algorithm?
a. Use a production rule generator to compare the rule sets generated for each clustering. b. Compute and compare class resemblance scores for the clusters formed by each clustering. c. Compare the sum of squared error differences between instances and their corresponding cluster centers for each alternative clustering. d. Create and compare the decision trees determined by each alternative clustering.
2. Explain inter-cluster and intra-cluster distances and their relationship when used to evaluate clustering results.
3. What are the different ways you could subgroup the following superheroes?
4. Suppose there are three clustering solutions with the following silhouette scores. Identify the best and the worst solution.
Solution 1: Silhouette score = 0.34
Solution 2: Silhouette score = -0.45
Solution 3: Silhouette score = 0.15
Exercise 4: Partition based clustering
1. Suppose a dataset consists of {0, 1, 3, 4, 100, 102, 106, 108}, which are points in one dimension. Perform K-means clustering on these points. Create three clusters by assigning each point to the nearest centroid. Show all the clusters and the total error calculated using the absolute distance measure (|a-b|), where a and b are two data points, for each set of centroids. You can choose three random numbers to start the process, or assume the initial centroids are given as {0.5, 3.5, 104}.
Repeat with the initial centroids given as {2, 101, 107}.
Repeat the above exercise for creating two clusters with centroids {2, 104}.
Repeat the above process of creating clusters but using the squared error ((a-b)^2). Analyse the difference.
2. Suppose there are 125 data points. Using the empirical method, determine the number of clusters.
3. Based on the graph shown below, determine the number of clusters using the elbow method.
4. Suppose a dataset consists of {(10, 23), (12, 30), (8, 32), (4, 15), (3, 17), (1, 10)}, which are points in two dimensions. Perform K-means clustering on these points, assuming k = 2.
Exercise 5: Hierarchical clustering
Suppose a dataset consists of 8 records {0, 1, 3, 4, 100, 102, 106, 108}, which are points in one dimension. Perform hierarchical clustering on these points. Use the following process to assign clusters: initially, each point is in a cluster by itself; at each step, merge the two clusters with the closest centroids, and continue until only two clusters remain. For simplicity, use the absolute distance measure (|a-b|), where a and b are two data points.
Part 2 - Practical exercises
This practical introduces you to clustering using Python. You will learn how to preprocess data for clustering, build clustering solutions, and evaluate/visualise the results. Unlike the predictive mining algorithms/models, a dataset used for clustering is unlabelled: it does not have the label information that is mandatory in predictive data mining. The clustering task assists in finding common groupings in the dataset. There exist multiple clustering algorithms, and a clustering algorithm is selected depending on the data types. Algorithms such as K-means, Agglomerative, K-modes, and K-prototypes are some of the commonly used ones. In this practical, we will learn to build K-means, Agglomerative and K-prototypes clustering models.
1. Preparing data for clustering
We will be using the Census2000 dataset, which contains a postal code-level summary of the 2000 United States Census. There are 7 variables in this dataset:
ID: Postal code of the region
LocX: Region longitude
LocY: Region latitude
MeanHHSz: Average household size in the region
MedHHInc: Median household income in the region
RegDens: Region population density percentile (1 = lowest density, 100 = highest density)
RegPop: Number of people in the region
There is no known target in this data; therefore, we will utilize an unsupervised learning method, clustering, to analyse it. The goal of the analysis is to group people into distinct subsets based on urbanization, household size, and income factors. These factors are common to matching commercial life-style and life-stage segmentation products (for example, see www.claritas.com or www.spectramarketing.com).
As in past practicals, we will use pandas to load the data and perform data preprocessing.
In [1]:
import pandas as pd
import numpy as np

# not skipping empty values, to demonstrate data preprocessing steps later
df = pd.read_csv('census2000.csv', na_filter=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33178 entries, 0 to 33177
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID        33178 non-null  object
 1   LocX      33178 non-null  float64
 2   LocY      33178 non-null  float64
 3   RegDens   33178 non-null  object
 4   RegPop    33178 non-null  int64
 5   MedHHInc  33178 non-null  int64
 6   MeanHHSz  33178 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.8+ MB

From the .info() output, you should notice that the RegDens variable type is set incorrectly. The output lists RegDens as object/categorical, while based on the dataset description given above, RegDens should be an interval/numerical variable. Run .describe() and .value_counts() on the Series to get more information.

In [2]:
# get more information from RegDens
print(df['RegDens'].describe())
print(df['RegDens'].value_counts())

count     33178
unique      101
top
freq       1013
Name: RegDens, dtype: object
      1013
10     322
44     322
25     322
11     322
      ...
3      321
1      321
74     321
45     321
42     321
Name: RegDens, Length: 101, dtype: int64

The output of these functions reveals the cause of the incorrect type: a number of empty strings in this Series. A solution is to replace them with nan to denote them as missing values and typecast the Series to the float data type.
In [3]:
# replace the empty strings in the series with nan and typecast to float
df['RegDens'] = df['RegDens'].replace('', np.nan).astype(float)

Visualisation is a great way to spot data problems within the dataset. Again, we will use seaborn and matplotlib for that purpose. Plot the distribution of the variables using distplot.
In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of RegDens
regdens_dist = sns.distplot(df['RegDens'].dropna())
plt.show()

# Distribution of MedHHInc
medhhinc_dist = sns.distplot(df['MedHHInc'].dropna())
plt.show()

# Distribution of MeanHHSz
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna())
plt.show()

(seaborn issues FutureWarnings here because distplot is deprecated in favour of displot/histplot; the plots are still produced and the warnings can be ignored.)
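If your seaborn version has already removed distplot, an equivalent plot can be produced with histplot. This is a sketch assuming seaborn 0.11 or later, not part of the original tutorial code:

# same distribution plot using the non-deprecated histplot API
sns.histplot(df['RegDens'].dropna(), kde=True)
plt.show()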
The last two distplots show anomalies (outliers) in MeanHHSz and MedHHInc. For both of these variables, there is a large number of very low-valued entries. Focus on MeanHHSz first. "Zoom in" on the distribution plot by increasing the number of bins.
In [5]:
# Distribution of MeanHHSz, with an increased number of bins. More bins = a more specific distplot
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna(), bins=100)
plt.show()

It is apparent that many of the records are valued close to zero, and logically it is unlikely for a household to have fewer than 1 member (there must be at least 1 person in a household). This suggests a data problem with this variable. As mentioned before, MedHHInc also contains some erroneous values. There is a chance that these anomalies are related. We can explore this relationship using FacetGrid.
In [6]:
# create a mask of erroneous MeanHHSz values
df['HasError_MeanHHSz'] = df['MeanHHSz'] < 1

# use FacetGrid to plot the distribution of MedHHInc when MeanHHSz is erroneous
g = sns.FacetGrid(df, col='HasError_MeanHHSz')
g = g.map(plt.hist, 'MedHHInc', bins=100)
plt.show()

The FacetGrid shows that erroneous data in MeanHHSz is correlated with erroneous data in MedHHInc. Based on this insight, we should eliminate all rows with erroneous MeanHHSz.

In [7]:
# before
print("Row # before dropping erroneous rows", len(df))

# a very easy way to drop rows with MeanHHSz values below 1
df = df[df['MeanHHSz'] >= 1]

# after
print("Row # after dropping erroneous rows", len(df))

Row # before dropping erroneous rows 33178
Row # after dropping erroneous rows 32079

Plot all three variables for a final check.
In [8]:
# Distribution of RegDens
regdens_dist = sns.distplot(df['RegDens'].dropna())
plt.show()

# Distribution of MedHHInc
medhhinc_dist = sns.distplot(df['MedHHInc'].dropna())
plt.show()

# Distribution of MeanHHSz
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna())
plt.show()
2. Running K-means Clustering
Once the data is prepared, we are ready to build a clustering model. Before building the model, we should set the objective of this clustering process. There are a number of good grouping objectives that could be applied to this dataset. The suburbs can be clustered based on location (LocX and LocY), demographic characteristics (RegDens, MedHHInc, MeanHHSz and RegPop), or both. As clustering suburbs based on geographical location is quite straightforward, we will focus on clustering based on demographic characteristics in this practical. Moreover, as in predictive mining, we do not use ID-like variables whose values are unique for each record, and we are not clustering on the location coordinates LocX and LocY; these fields do not add any value to the data mining process.

Thus, we will use MedHHInc, MeanHHSz and RegDens and drop the rest of the features. We will also drop RegPop as it is redundant with RegDens. RegPop is also highly influenced by suburb area size, information we do not have in this dataset. To compare regions based on their demographic information accurately, RegDens is more suitable.
Clustering is sensitive to inputs on different scales. Recall from the lecture that clustering uses a proximity/distance measure; the most common distance measure is Euclidean distance. With inputs on different scales, Euclidean distance favours features on the larger scale. Thus, we need to apply scaling before performing clustering.
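A small illustration of why scaling matters; the two records below are made up for this example and are not from the Census2000 data:

import numpy as np

# two regions differing by $10,000 in income and by 2 persons in household size
a = np.array([60000.0, 2.0])   # [MedHHInc, MeanHHSz]
b = np.array([50000.0, 4.0])
print(np.linalg.norm(a - b))   # ~10000.0002: the income difference completely dominates

# after standardising each feature (e.g. with StandardScaler), both differences are
# expressed in standard deviations and contribute comparably to the distance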
In [9]:
print(df['ID'].describe())

count     32079
unique    32079
top       00601
freq          1
Name: ID, dtype: object

In [10]:
from sklearn.preprocessing import StandardScaler

# take 3 variables and drop the rest
df2 = df[['MedHHInc', 'MeanHHSz', 'RegDens']]

# convert df2 to matrix
X = df2.to_numpy()

# scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

sklearn has many clustering algorithms implemented. In this practical, we will first focus on the most common clustering algorithm, K-Means. K-means starts by picking random points as the initial cluster centres (centroids). In each iteration of K-means, all data points are assigned to their closest centroid, and each centroid is then updated towards the mean of the points assigned to its cluster.

In your project space/iPython console, start by importing the K-Means clustering. Initialise the clustering function with the n_clusters hyperparameter (K) set to 3 and fit it to the dataset. As with many data mining models, K-Means clustering has an element of randomness, which is controlled by the random_state hyperparameter.

In clustering, we want to minimise the intra-cluster distances while maximising the inter-cluster distances. After the model is fitted, print out its inertia (the sum of squared distances of samples to their closest cluster centre/centroid) and the centroid locations.
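To make the assignment and update steps just described concrete, here is a minimal NumPy sketch of the K-means loop. It is illustrative only; in this practical we use sklearn's KMeans, which also handles details such as empty clusters and multiple restarts:

import numpy as np

def kmeans_sketch(data, k, n_iter=10, seed=42):
    rng = np.random.default_rng(seed)
    # pick k random points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    # inertia: sum of squared distances of samples to their closest centroid
    inertia = ((data - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia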
In [11]:
from sklearn.cluster import KMeans

# random state, we will use 42 instead of 10 for a change
rs = 42

# set the random state. different random state seeds might result in different centroid locations
model = KMeans(n_clusters=3, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 52450.59019715356
Centroid locations:
[1.31801745 0.91730431 0.75459506]
[-0.41317147 -0.08392096 -0.88327097]
[-0.17519304 -0.41742819 0.83816619]

The number of clusters is controlled by the n_clusters hyperparameter. Setting the K value is a subjective process due to the absence of label information. It usually depends upon domain information whether a small or large number of clusters is required for data understanding, or it can be a trial-and-error process; we explain a systematic process to set the K value later. A higher K will result in more centroids and clusters, which typically results in lower inertia and a finer-grained cluster solution. However, this may not be an indication of the right solution, as a very high value of K can create many small, meaningless clusters.

In [12]:
# set a different n_clusters
model = KMeans(n_clusters=8, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 27714.36231225028
Centroid locations:
[-0.27454857 -0.1577295  -0.00554968]
[1.11461835 0.31026514 0.77228009]
[-0.19642367 -1.05685288 1.14381746]
[-0.43545227 0.68390749 1.13766219]
[-0.23679633 0.51857153 -0.9360597 ]
[-0.3157633 3.24487675 0.14829235]
[3.42065995 0.55990691 0.97283184]
[-0.5722583 -0.68466511 -1.1917228 ]

3. Understanding and Visualising a Clustering Model
We will take a closer look at the generated clustering model. A common method to understand clustering results
is to visualise the distribution of variables in clusters. We have done this in a very limited way by printing the
values of centroids.
To gain a better view of how the clusters are spread out in the dataset, we can use seaborn's pairplot. Before
that, we will use the generated clustering model to assign each record in the dataset with a cluster ID.
In [13]:
model = KMeans(n_clusters=3, random_state=rs).fit(X)

# assign cluster ID to each record in X
# Ignore the SettingWithCopyWarning raised here, it does not apply to our case
y = model.predict(X)
df2['Cluster_ID'] = y

# how many records are in each cluster
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot the cluster distribution.
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
1    15319
2    10575
0     6185
Name: Cluster_ID, dtype: int64
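The SettingWithCopyWarning appears because df2 was created as a slice of df. If you prefer to silence it properly rather than ignore it, a minimal optional sketch (not required for this practical) is to make df2 an explicit copy before adding the column:

df2 = df2.copy()        # df2 now owns its own data rather than referencing a slice of df
df2['Cluster_ID'] = y   # the assignment no longer raises the warning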
Your clustering plots should also look similar, except the cluster IDs might not be the same. You may notice some variations, as there can be multiple solutions generated by K-means; that is totally fine and not something you can control even with random_state. This relates to the fact, discussed in the lecture, that the K-means clustering algorithm yields a local solution and every run may generate a different solution.
The pairplot shows us how different cluster members have different value distributions on different variables. Here is how to interpret the plots:
1. Look at the MeanHHSz vs RegDens plot (second row, third column) and we can see the difference between suburbs in clusters 1 and 2. Cluster 1 covers less densely populated suburbs with smaller households, while cluster 2 covers more crowded regions, still with small families.
2. For MedHHInc (first row, second column), the pairplot shows that cluster 0 covers regions with a higher median household income.
The visualisation helps us to profile the clusters as follows:
Cluster 0: Suburbs with large households and medium-high earnings.
Cluster 1: Sparsely populated suburbs with smaller, low-earning households.
Cluster 2: Densely populated suburbs with smaller, low-earning households.
While this pairplot is useful to provide overall cluster profiles, it can get very cluttered and hard to understand if there are more clusters. In addition, you might only want to understand a specific cluster, in which case a pairplot with all clusters might not be necessary.
Consider the following clustering model with K = 8.
In [14]:
# set a different n_clusters
model = KMeans(n_clusters=8, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 27714.36231225028
Centroid locations:
[-0.27454857 -0.1577295  -0.00554968]
[1.11461835 0.31026514 0.77228009]
[-0.19642367 -1.05685288 1.14381746]
[-0.43545227 0.68390749 1.13766219]
[-0.23679633 0.51857153 -0.9360597 ]
[-0.3157633 3.24487675 0.14829235]
[3.42065995 0.55990691 0.97283184]
[-0.5722583 -0.68466511 -1.1917228 ]
In [15]:
# again, ignore the SettingWithCopyWarning
y = model.predict(X)
df2['Cluster_ID'] = y

# how many in each
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
0    7356
7    6227
4    5155
1    4890
2    3986
3    2462
6    1026
5     977
Name: Cluster_ID, dtype: int64
As the number of clusters increases, the pairplot becomes more specific, cluttered and difficult to interpret.

Assume you would like to get insights on clusters 1 and 7 from the K-Means clustering solution. An alternative way to profile clusters is to plot their respective variable distributions against the distributions from all data. This method shows the characteristics of a cluster compared to the characteristics of the whole dataset. We will use distplot to visualise the variable distributions. Use the following code:
In [16]:
# prepare the columns and bin size. Increase the bin size to be more specific, but 20 is more than enough
cols = ['MedHHInc', 'MeanHHSz', 'RegDens']
n_bins = 20

# inspecting cluster 1 and 7
clusters_to_inspect = [1, 7]

for cluster in clusters_to_inspect:
    print("Distribution for cluster {}".format(cluster))

    # create subplots
    fig, ax = plt.subplots(nrows=3)
    ax[0].set_title("Cluster {}".format(cluster))

    for j, col in enumerate(cols):
        # create the bins
        bins = np.linspace(min(df2[col]), max(df2[col]), 20)

        # plot the distribution of the cluster using a histogram
        sns.distplot(df2[df2['Cluster_ID'] == cluster][col], bins=bins, ax=ax[j], norm_hist=True)

        # plot the overall distribution with a black line
        sns.distplot(df2[col], bins=bins, ax=ax[j], hist=False, color="k")

    plt.tight_layout()
    plt.show()

Output: "Distribution for cluster 1" and "Distribution for cluster 7", each followed by the three distribution plots. (The seaborn FutureWarnings about distplot can be ignored.)
** NOTE: Your cluster result should be similar, but the subgroups of instances may have been assigned different cluster IDs. **
Here, we plot the distributions of cluster 1 and cluster 7 against the distributions from all data. The black lines are the distributions from all records, while the light-blue histograms are for a specific cluster. These plots show us the key characteristics of the clusters, as follows:
1. Cluster 1: Slightly higher MedHHInc, right-leaning MeanHHSz and right-leaning RegDens. Suburbs in cluster 1 are suburbs with above-average household size and dense population.
2. Cluster 7: Slightly lower MedHHInc, left-leaning MeanHHSz and left-leaning RegDens, showing that suburbs in cluster 7 are suburbs with a small median household income, smaller families and sparse population.
Determining K

As noted earlier, K, or the number of clusters, is essential for the cluster building process. A smaller K is easier/faster to train and should show the general groupings of the dataset. A larger K results in finer-grained, more specific clusters, yet it is slower and could "overfit" the dataset. Therefore, the big question is: how do we determine the optimal K?

In many cases, K can be derived from the business question we are trying to answer with clustering. For example, given a dataset of customers, we might want to build three different marketing approaches. The logical answer is then to set K = 3, build the clusters and create the marketing plans based on the 3 segment profiles.

However, sometimes the business question does not provide a clue for setting the K value. In these cases, an alternative approach is to visually inspect your data points and guess a K value. Unfortunately, if the dataset is quite large, you will soon find this approach cumbersome.
Next, we explain the widely used elbow method to set the K value. In this method, a plot is drawn between the K values and the clustering error (in sklearn it is called inertia or cost). Typically, the K values are inversely correlated with the clustering error values, i.e. the error gets smaller as K gets larger: as K becomes larger, each cluster becomes smaller in size, reducing the intra-cluster distances. The main idea of the elbow method is to find the K after which the error no longer decreases substantially; this produces an "elbow" in the plot. The plot is drawn to estimate the minimal number of clusters that best accounts for the variability in the dataset. The variability is captured by comparing the error value obtained with a specific solution versus the error value obtained by clustering a uniformly distributed set of points.
Usually, the practice is to go with the minimal number of clusters that subgroups the dataset most effectively (unless you have been given a number, or the interpretation is meaningful with more clusters). Therefore, you select the cluster number at the first valley (i.e., elbow) in the chart, as it indicates the "local minimum" at which to choose the number. It is not a global minimum, as a chart may contain many valleys/peaks. A valley in the graph indicates that you were getting decreasing values with each additional cluster before the curve starts increasing again. This increase can be interpreted as "overfitting", so you choose the point before the model starts to overfit; the overfitting indicates that the cluster solution with that number of clusters fits the data worse than uniformly distributed points would. It is the same concept as with predictive models: you choose the model before overfitting starts happening. In the following graph, you would choose the K value at the first valley as the best clustering solution.
In [17]:
# list to save the clusters and cost
clusters = []
inertia_vals = []

# this whole process should take a while
for k in range(2, 15, 2):
    # train clustering with the specified K
    model = KMeans(n_clusters=k, random_state=rs, n_jobs=10)
    model.fit(X)

    # append model to cluster list
    clusters.append(model)
    inertia_vals.append(model.inertia_)

(sklearn prints FutureWarnings that the n_jobs parameter of KMeans is deprecated; these can be ignored.)
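One simple way to eyeball the elbow numerically, in addition to the plot in the next cell, is to look at how much the inertia drops at each step of K (a heuristic sketch, not part of the original tutorial):

# numpy (np) was imported earlier in the practical
drops = -np.diff(inertia_vals)               # decrease in inertia at each step of K
print(dict(zip(range(4, 15, 2), drops)))     # the elbow is roughly where the drop shrinks sharply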
In [18]:
# plot the inertia vs K values
plt.plot(range(2, 15, 2), inertia_vals, marker='*')
plt.show()

Here, the elbow is somewhere between 4 and 6. Either value can be selected as the optimal K.

While it is a good heuristic, the elbow method does not always yield an "obvious" K. In many cases, the error plot can be very smooth and show no distinct elbow. As an alternative, the silhouette score is commonly used. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, the clustering configuration is deemed high quality; if many data points have a low or negative value, the clustering configuration may have too many or too few clusters. However, the computation of the silhouette score is expensive and incurs additional overhead; for large datasets, it may not be feasible to compute the score for all objects/samples/records.

More info on silhouette: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29

In the underlying clustering problem, a decision has to be made between K = 4 and K = 6. We can use silhouette_score from sklearn, which returns the mean silhouette score over all samples, for both solutions.
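For reference, the silhouette value of a single sample i compares a(i), the mean distance from i to the other points in its own cluster, with b(i), the mean distance from i to the points in the nearest other cluster:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

The score reported by silhouette_score is the mean of s(i) over all samples. For large datasets, sklearn's silhouette_score also accepts a sample_size argument (together with random_state) so the score can be estimated from a random subset instead of every record, e.g. silhouette_score(X, labels, sample_size=10000, random_state=rs).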
In [19]:
from sklearn.metrics import silhouette_score

print(clusters[1])
print("Silhouette score for k=4", silhouette_score(X, clusters[1].predict(X)))

print(clusters[2])
print("Silhouette score for k=6", silhouette_score(X, clusters[2].predict(X)))

KMeans(n_clusters=4, n_jobs=10, random_state=42)
Silhouette score for k=4 0.33091719115444046
KMeans(n_clusters=6, n_jobs=10, random_state=42)
Silhouette score for k=6 0.25399659182558476

silhouette_score returns a mean silhouette score of 0.33 for K = 4 and 0.25 for K = 6. This shows that the clusters in the K = 4 solution are more appropriately matched to their own clusters than in the K = 6 solution. Therefore, we can choose K = 4 over K = 6 on the basis of this score.
In [20]:
# visualisation of the K=4 clustering solution
model = KMeans(n_clusters=4, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

y = model.predict(X)
df2['Cluster_ID'] = y

# how many in each
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot (an alpha value can be passed via plot_kws to assist with overlapping points)
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Sum of intra-cluster distance: 42564.197151149216
Centroid locations:
[-0.33314101 2.30743581 0.22102952]
[-0.40106113 -0.19147382 -0.89295077]
[1.80312021 0.4098886 0.84130225]
[-0.15468572 -0.37277263 0.79552382]
Cluster membership
1    14526
3    10853
2     4544
0     2156
Name: Cluster_ID, dtype: int64
4. Running the Agglomerative Clustering Algorithm

As an alternative to K-means clustering, which uses a centroid-based (or partitional) approach, agglomerative/hierarchical clustering is also commonly used to cluster a dataset. Agglomerative clustering starts from the bottom, assigning each data point to its own cluster, and then recursively merges the pair of clusters with the smallest linkage distance until the requested number of clusters remains.
Similar to KMeans, you need to import an agglomerative clustering algorithm using the sklearn.cluster
module.
In [21]:
from sklearn.cluster import AgglomerativeClustering

Once the algorithm is imported, you can build a model using the following code. You also need to specify K, the number of clusters; here, we will use K = 3. For visualisation purposes (later in this section), we will only build this model on 50 data points, but agglomerative clustering can handle many more data points just fine.
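Note that AgglomerativeClustering also takes a linkage parameter ('ward' by default; recent sklearn versions also offer 'complete', 'average' and 'single'), which controls how the distance between two clusters is measured when deciding which pair to merge next, for example:

agg_model = AgglomerativeClustering(n_clusters=3, linkage='ward')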
In [22]:
agg_model = AgglomerativeClustering(n_clusters=3)
agg_model.fit(X[:50])  # subset of X, only 50 data points

Out[22]:
AgglomerativeClustering(n_clusters=3)

Once the model is built, the dendrogram of this model can be visualised using the following code.

In [23]:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0] + 2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [24]:
plot_dendrogram(agg_model, labels=agg_model.labels_)
plt.show()

Cluster labels are presented on the X axis, with the last K = 3 joins at the top of the tree corresponding to the final clusters.
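The dendrogram above uses uniform placeholder distances, because the fitted model does not provide the real merge distances (as noted in the code comment). If you want merge heights that reflect the actual distances, one alternative is to build the linkage matrix with scipy directly; this is a sketch in addition to the tutorial code, not a replacement for it:

from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(X[:50], method='ward')   # scipy computes the merge distances itself
dendrogram(Z)
plt.show()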
5. Running the Kprototypes Clustering

Let us consider another dataset 'adult.csv' with three variables:
age: Age of the person
workclass: Category of the work
fnlwgt: A de-identified variable

If we have to cluster the adults based on these three attributes, Kmeans cannot be used, as the variable workclass is categorical. Kprototypes is a clustering method that can handle both numeric and categorical variables, so Kprototypes should be used to cluster this dataset instead of Kmeans. Next, we will load this new dataset and build a Kprototypes clustering model.
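Under the hood, K-prototypes combines the two variable types into one dissimilarity: squared Euclidean distance on the numeric attributes plus a weight (gamma) times the number of mismatched categorical attributes. A minimal sketch of that idea is shown below; it is illustrative only, not the kmodes library's internal code, and the library derives a default gamma from the numeric variables' spread if you do not supply one:

# mixed dissimilarity between a record and a cluster prototype
def mixed_dissimilarity(x_num, c_num, x_cat, c_cat, gamma=1.0):
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, c_num))     # squared Euclidean part
    categorical_part = sum(1 for a, b in zip(x_cat, c_cat) if a != b)  # simple matching part
    return numeric_part + gamma * categorical_part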
In [25]:
# Load the dataset
import pandas as pd

df = pd.read_csv(r'adult.csv')
df.info()
print(df['workclass'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11008 entries, 0 to 11007
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        11008 non-null  int64
 1   workclass  11008 non-null  object
 2   fnlwgt     11008 non-null  int64
dtypes: int64(2), object(1)
memory usage: 258.1+ KB
[' Private' ' Local-gov' ' Self-emp-not-inc' ' Federal-gov' ' State-gov'
 ' Self-emp-inc' ' Without-pay' ' Never-worked']

workclass is a categorical variable and has to be mapped to numeric values before using it in the model.

In [26]:
from sklearn.preprocessing import StandardScaler

# mapping each workclass category to a number (the exact codes for the last four categories are an assumption)
workclass_map = {' Private': 1, ' Local-gov': 2, ' Self-emp-not-inc': 3, ' Federal-gov': 4,
                 ' State-gov': 5, ' Self-emp-inc': 6, ' Without-pay': 7, ' Never-worked': 8}
df['workclass'] = df['workclass'].map(workclass_map)

In [27]:
# convert df to matrix
X = df.to_numpy()

# scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

The next step is to build a Kprototypes model. The kmodes library allows us to do this.
In [28]:
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

Unlike KMeans, KPrototypes does not support the calculation of inertia. However, the clustering cost, defined as the summed distance of all points to their respective cluster centroids, can be calculated and used to identify the optimal K. The cost_ attribute of KPrototypes returns this cost value.

In [29]:
# list to save the clusters and cost
clusters = []
cost_vals = []

# this whole process should take a while
for k in range(2, 10, 2):
    # train clustering with the specified K
    model = KPrototypes(n_clusters=k, random_state=rs, n_jobs=10)
    model.fit_predict(X, categorical=[1])

    # append model to cluster list
    clusters.append(model)
    cost_vals.append(model.cost_)

In [30]:
# plot the cost vs K values
plt.plot(range(2, 10, 2), cost_vals, marker='*')
plt.show()

By applying the elbow method to the above plot, the optimal value for K lies between 4 and 6. The silhouette score has to be calculated to find the optimal value.

Due to the presence of mixed data types (numeric and categorical), the calculation of the silhouette score for Kprototypes is different from KMeans. For Kprototypes, two silhouette scores, one for the numeric variables and one for the categorical variables, should be calculated separately and then averaged. We will first see how to calculate this value for K = 4.
In [31]:
X_num = [[row[0], row[2]] for row in X]  # variables of X with numeric datatype
X_cat = [[row[1]] for row in X]  # variables of X with categorical datatype

model = clusters[1]  # clusters[1] holds the K-prototypes model with K=4

from sklearn.metrics import silhouette_score

# Calculate the silhouette score for the numeric and categorical variables separately
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
print("Silscore for numeric variables: " + str(silScoreNums))

silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
print("Silscore for categorical variables: " + str(silScoreCats))

# Average the silhouette scores
silScore = (silScoreNums + silScoreCats) / 2
print("The avg silhouette score for k=4: " + str(silScore))

Silscore for numeric variables: 0.3343021534172589
Silscore for categorical variables: -0.09998949153471454
The avg silhouette score for k=4: 0.11715633094127217

Now your task is to identify the optimal K by calculating the silhouette score for K = 6 and K = 8. Then, visualise the optimal cluster using pairplot and distplot and interpret the results, similar to KMeans.

In [32]:
model = clusters[1]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=4: " + str(silScore))

The avg Silhouette score for k=4: 0.11715633094127217

In [33]:
model = clusters[2]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=6: " + str(silScore))

The avg Silhouette score for k=6: 0.07907669640648304

In [34]:
model = clusters[3]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=8: " + str(silScore))

The avg Silhouette score for k=8: 0.07049603177523696
In [35]:
import seaborn as sns
import matplotlib.pyplot as plt

model = clusters[1]
y = model.fit_predict(X, categorical=[1])
df['Cluster_ID'] = y

# how many records are in each cluster
print("Cluster membership")
print(df['Cluster_ID'].value_counts())

# pairplot the cluster distribution.
cluster_g = sns.pairplot(df, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
3    3452
0    3311
1    2860
2    1385
Name: Cluster_ID, dtype: int64
In [36]:
import pandas as pd
import numpy as np

# prepare the columns and bin size. Increase the bin size to be more specific, but 20 is more than enough
cols = ['age', 'workclass', 'fnlwgt']
n_bins = 20

clusters_to_inspect = [0, 1, 2, 3]

for cluster in clusters_to_inspect:
    print("Distribution for cluster {}".format(cluster))

    fig, ax = plt.subplots(nrows=3)
    ax[0].set_title("Cluster {}".format(cluster))

    for j, col in enumerate(cols):
        bins = np.linspace(min(df[col]), max(df[col]), 20)

        sns.distplot(df[df['Cluster_ID'] == cluster][col], bins=bins, ax=ax[j], norm_hist=True)
        sns.distplot(df[col], bins=bins, ax=ax[j], hist=False, color="k")

    plt.tight_layout()
    plt.show()

Output: "Distribution for cluster 0" through "Distribution for cluster 3", each followed by the three distribution plots. (The seaborn FutureWarnings about distplot and the bw parameter can be ignored.)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Fu
tureWarning: `distplot` is a deprecated function and will be removed in a fu
ture version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: Fu
tureWarning: The `bw` parameter is deprecated in favor of `bw_method` and `b
w_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new pa
rameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Fu
tureWarning: `distplot` is a deprecated function and will be removed in a fu
ture version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots). warnings.warn(msg, FutureWarning) Distribution for cluster 1 C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure
-level function with similar flexibility) or `histplot` (an axes-level fun
ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: FutureWarning: The `bw` parameter is deprecated in favor of `bw_method` an
d `bw_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new parameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure
-level function with similar flexibility) or `kdeplot` (an axes-level func
tion for kernel density plots). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure
-level function with similar flexibility) or `histplot` (an axes-level fun
ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Distribution for cluster 2
[Per-feature distribution plots for cluster 2 not shown; repeated FutureWarnings omitted.]
Distribution for cluster 3
[Per-feature distribution plots for cluster 3 not shown; repeated FutureWarnings omitted.]
End Notes
We learned how to build, tune, and explore clustering models, and we used visualisation to help explain the cluster/segment profiles produced by each model. The goal of cluster analysis is to identify distinct groupings of cases across a set of inputs without the presence of a target variable.
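As a compact recap of that workflow, the sketch below strings the main steps together. It is an illustrative outline under assumed names: the file customers.csv, the silhouette-based choice of k, and the Cluster column are examples for this sketch, not the tutorial's exact code or data.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Prepare the data: numeric inputs only, standardised, no target variable
df = pd.read_csv("customers.csv")
X = StandardScaler().fit_transform(df)

# Tune the number of clusters by comparing silhouette scores
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# Fit the final model and attach labels for cluster/segment profiling
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
df["Cluster"] = final.labels_
print(df.groupby("Cluster").mean())

From here, per-cluster distribution plots such as those shown earlier can be regenerated to visualise and explain each segment's profile.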