Week 9 Computer Tutorial
Dr Richi Nayak, r.nayak@qut.edu.au (mailto:r.nayak@qut.edu.au)

Practical Topics:
1. Preparing data for clustering
2. Running K-means Clustering
3. Understanding and visualising a clustering model
4. Running the Agglomerative Clustering Algorithm
5. Running the Kprototypes Clustering

Part 1 - Reflective exercises

In this practical, you will be introduced to a new dataset, data preparation for clustering, and the clustering task. Please reflect on the clustering-related theoretical concepts and answer the following questions.

Exercise 1: Clustering Introduction: Basics

1. In data mining, data objects that do not comply with the general behaviour or model of the data are called
   a. Clusters
   b. Centroids
   c. Outliers
   d. None of these

2. Which of the following is a common use of unsupervised clustering?
   a. detect outliers
   b. determine a best set of input attributes for supervised learning
   c. determine if meaningful relationships can be found in a dataset
   d. All of a, b and c are common uses of unsupervised clustering.

3. Suppose you are a luxury automobile dealer. Your dealership is planning to sell a new model. Which of the following questions could be answered by applying clustering to the customer dataset?
   a. How much should you charge for the new BMW X6M?
   b. How likely is person X to buy the new BMW X5W?
   c. What type of customers have bought the silver BMW M5? List their features.
4. The K-Means algorithm terminates when
   a. a user-defined minimum value for the summation of squared error differences between instances and their corresponding cluster centers is seen.
   b. the cluster centers for the current iteration are identical to the cluster centers for the previous iteration.
   c. the number of instances in each cluster for the current iteration is identical to the number of instances in each cluster of the previous iteration.
   d. the number of clusters formed for the current iteration is identical to the number of clusters formed in the previous iteration.

5. This clustering algorithm initially assumes that each data instance represents a single cluster.
   a. agglomerative clustering
   b. conceptual clustering
   c. K-Means clustering
   d. expectation maximization

6. List the aspects on which k-median performs better and worse than k-means. Discuss a main challenge common to both the K-means and K-medoids algorithms.

Exercise 2: Clustering process: proximity measures

1. A popular criterion for measuring the closeness of two objects in clustering analysis is a distance function. Name the other criterion that can be used to differentiate two objects.

2. Consider two clustering solutions with the intra- and inter-cluster similarity measures below. Let the similarity measure be in the range [0, 1], where 0 is lowest and 1 is highest. Which is the optimal solution?
   Solution 1: Intra-cluster similarity = 0.99, inter-cluster similarity = 0.01
   Solution 2: Intra-cluster similarity = 0.01, inter-cluster similarity = 0.99

3. List various distance measures, and discuss how they are used in clustering.

4. Consider a very small data set with three observations (or data points). Calculate the distance between 1) S1 and S2, and 2) S1 and S3. Use the dot product, Euclidean, and Manhattan distance measures.

Exercise 3: Evaluation

1. Unsupervised evaluation can be internal or external. Which of the following is an internal method for evaluating alternative clusterings produced by the K-Means algorithm?
   a. Use a production rule generator to compare the rule sets generated for each clustering.
   b. Compute and compare class resemblance scores for the clusters formed by each clustering.
   c. Compare the sum of squared error differences between instances and their corresponding cluster centers for each alternative clustering.
   d. Create and compare the decision trees determined by each alternative clustering.

2. Explain inter-cluster and intra-cluster distances and their relationship when used to evaluate clustering results.

3. What are the different ways you could subgroup the following superheroes?
4. Suppose there are three clustering solutions with the following silhouette scores. Identify the best and worst solutions.
   Solution 1: Silhouette score = 0.34
   Solution 2: Silhouette score = -0.45
   Solution 3: Silhouette score = 0.15

Exercise 4: Partition based clustering

1. Suppose a dataset consists of {0, 1, 3, 4, 100, 102, 106, 108}, which are points in one dimension. Perform K-means clustering on these points. Create three clusters by assigning each point to the nearest centroid. Show all the clusters and the total error calculated using the absolute distance measure (|a - b|), where a and b are two data points, for each set of centroids. You can choose three random numbers to start the process, or assume the initial centroids are given as {0.5, 3.5, 104}. Repeat with the initial centroids given as {2, 101, 107}. Repeat the above exercise creating two clusters with centroids {2, 104}. Repeat the above process of creating clusters but using the squared error ((a - b)^2). Analyse the difference. (A short Python sketch for checking your hand calculation appears after this exercise list.)

2. Suppose there are 125 data points. Using the empirical method, determine the number of clusters.

3. Based on the graph shown below, determine the number of clusters using the elbow method.

4. Suppose a dataset consists of {(10, 23), (12, 30), (8, 32), (4, 15), (3, 17), (1, 10)}, which are points in two dimensions. Perform K-means clustering on these points, assuming k = 2.
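For Exercise 4, question 1, the following minimal sketch (standard library only) mimics the assign/update loop so you can check your hand calculation. The data points and initial centroids come straight from the question; everything else is illustrative, not part of the original tutorial.

# A rough sketch for checking the 1-D K-means exercise by hand.
points = [0, 1, 3, 4, 100, 102, 106, 108]
centroids = [0.5, 3.5, 104]   # try {2, 101, 107} as well

for iteration in range(10):
    # assignment step: each point goes to the nearest centroid (absolute distance)
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)

    # update step: move each centroid to the mean of its cluster
    new_centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]

    # total error with the absolute distance measure |a - b|
    error = sum(abs(p - new_centroids[i]) for i, c in enumerate(clusters) for p in c)
    print("Iteration", iteration, "clusters:", clusters, "error:", round(error, 2))

    if new_centroids == centroids:   # stop when centroids no longer change
        break
    centroids = new_centroids

Swapping the error line to sum((p - new_centroids[i]) ** 2 for ...) gives the squared-error variant asked for in the last part of the question.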
Exercise 5: Hierarchical clustering

Suppose a dataset consists of 8 records {0, 1, 3, 4, 100, 102, 106, 108}, which are points in one dimension. Perform hierarchical clustering on these points. Use the following process to assign clusters. Initially, each point is in a cluster by itself. At each step, merge the two clusters with the closest centroids, and continue until only two clusters remain. For simplicity, use the absolute distance measure (|a - b|), where a and b are two data points.

Part 2 - Practical exercises

This practical introduces you to clustering using Python. You will learn how to preprocess data for clustering, build clustering solutions, and evaluate/visualise the results. Unlike the datasets used for predictive mining algorithms/models, a dataset used for clustering is unlabelled: it does not have the label information that is mandatory in predictive data mining. The clustering task assists in finding common labels in the dataset. There exist multiple clustering algorithms, and the choice depends on the data types. K-means, Agglomerative, K-modes, and K-prototypes are some of the commonly used algorithms. In this practical, we will learn to build K-means, Agglomerative and K-prototypes clustering models.

1. Preparing data for clustering

We will be using the Census2000 dataset, which contains the postal code-level summary of the 2000 United States Census. There are 7 variables in this dataset:

ID: Postal code of the region
LOCX: Region longitude
LOCY: Region latitude
MEANHHSZ: Average household size in the region
MEDHHINC: Median household income in the region
REGDENS: Region population density percentile (1 = lowest density, 100 = highest density)
REGPOP: Number of people in the region

There is no known target in this data, therefore we will use an unsupervised learning method, clustering, to analyse it. The goal of the analysis is to group people into distinct subsets based on urbanization, household size, and income factors. These factors are common to matching commercial life-style and life-stage segmentation products (for example, see www.claritas.com (http://www.claritas.com) or www.spectramarketing.com (http://www.spectramarketing.com)). As in past practicals, we will use pandas to load the data and perform data preprocessing.
In [1]:

import pandas as pd
import numpy as np

# not skipping empty values, to demonstrate data preprocessing steps later
df = pd.read_csv('census2000.csv', na_filter=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33178 entries, 0 to 33177
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID        33178 non-null  object
 1   LocX      33178 non-null  float64
 2   LocY      33178 non-null  float64
 3   RegDens   33178 non-null  object
 4   RegPop    33178 non-null  int64
 5   MedHHInc  33178 non-null  int64
 6   MeanHHSz  33178 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.8+ MB

From the .info() output, you should notice that the RegDens variable type is set incorrectly. The output lists RegDens as object/categorical, while based on the dataset description given above, RegDens should be an interval/numerical variable. Run .describe() and .value_counts() on the Series to get more information.

In [2]:

# get more information from RegDens
print(df['RegDens'].describe())
print(df['RegDens'].value_counts())

count     33178
unique      101
top
freq       1013
Name: RegDens, dtype: object
      1013
10     322
44     322
25     322
11     322
      ...
3      321
1      321
74     321
45     321
42     321
Name: RegDens, Length: 101, dtype: int64

The output of these functions reveals the cause of the incorrect type: a number of empty strings in this Series. A solution is to replace them with nan to denote them as missing values and typecast the Series into the float data type.
In [3]:

# replace the empty strings in the series with nan and typecast to float
df['RegDens'] = df['RegDens'].replace('', np.nan).astype(float)

Visualisation is a great way to spot data problems within the dataset. Again, we will use seaborn and matplotlib for that purpose. Plot the distribution of the variables using distplot.
In [4]:

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of RegDens
regdens_dist = sns.distplot(df['RegDens'].dropna())
plt.show()

# Distribution of MedHHInc
medhhinc_dist = sns.distplot(df['MedHHInc'].dropna())
plt.show()

# Distribution of MeanHHSz
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna())
plt.show()

(seaborn prints FutureWarnings here because distplot is deprecated in favour of displot/histplot; the warnings can be ignored.)
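Since distplot is deprecated in recent seaborn releases, here is a small sketch of the same three plots using histplot instead. This assumes your installed seaborn is version 0.11 or later and is an aside, not part of the original tutorial.

# Equivalent distribution plots with the non-deprecated histplot API
import seaborn as sns
import matplotlib.pyplot as plt

for col in ['RegDens', 'MedHHInc', 'MeanHHSz']:
    sns.histplot(df[col].dropna(), kde=True)   # histogram with a KDE overlay
    plt.show()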
The last two distplots show anomalies (outliers) in MeanHHSz and MedHHInc. For both of these variables, there is a large number of very low-valued entries. Focus on MeanHHSz first. "Zoom in" on the distribution plot by increasing the number of bins.
In [5]:

# Distribution of MeanHHSz, with an increased number of bins. More bins = a more detailed distribution plot
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna(), bins=100)
plt.show()

It is apparent that many of the records are valued close to zero, and logically it is unlikely for a household to have fewer than 1 member (there must be at least 1 person in a household). This suggests a data problem with this variable. As mentioned before, MedHHInc also contains some erroneous values. There is a chance that these anomalies are related. We can explore this relation using FacetGrid.
In [6]:

# create a mask of erroneous MeanHHSz values
df['HasError_MeanHHSz'] = df['MeanHHSz'] < 1

# use FacetGrid to plot the distribution of MedHHInc when MeanHHSz is erroneous
g = sns.FacetGrid(df, col='HasError_MeanHHSz')
g = g.map(plt.hist, 'MedHHInc', bins=100)
plt.show()

FacetGrid shows that erroneous data in MeanHHSz are correlated with erroneous data in MedHHInc. Based on this insight, we should eliminate all rows with erroneous MeanHHSz.

In [7]:

# before
print("Row # before dropping erroneous rows", len(df))

# a very easy way to drop rows with MeanHHSz values below 1
df = df[df['MeanHHSz'] >= 1]

# after
print("Row # after dropping erroneous rows", len(df))

Row # before dropping erroneous rows 33178
Row # after dropping erroneous rows 32079

Plot all three variables for a final check.
In [8]:

# Distribution of RegDens
regdens_dist = sns.distplot(df['RegDens'].dropna())
plt.show()

# Distribution of MedHHInc
medhhinc_dist = sns.distplot(df['MedHHInc'].dropna())
plt.show()

# Distribution of MeanHHSz
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna())
plt.show()
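As an extra numeric sanity check (not in the original notebook), you could also confirm that the cleaned variables now sit in sensible ranges; in particular, the minimum of MeanHHSz should now be at least 1.

# quick numeric summary of the three variables after cleaning
print(df[['RegDens', 'MedHHInc', 'MeanHHSz']].describe())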
2. Running K-means Clustering

Once the data is prepared, we are ready to build a clustering model. Before building the model, we should set the objective of this clustering process. There are a number of good grouping objectives that could be applied to this dataset. The suburbs can be clustered based on location (LocX and LocY), demographic characteristics (RegDens, MedHHInc, MeanHHSz and RegPop), or both. As clustering suburbs based on geographical location is quite straightforward, we will focus on clustering based on demographic characteristics in this practical. Moreover, as in predictive mining, we do not use ID-like variables, whose values are unique for each record, nor location variables such as LocX and LocY; these fields do not add any value to the data mining process. Thus, we will use MedHHInc, MeanHHSz and RegDens and drop the rest of the features. We will also drop RegPop as it is redundant with RegDens. RegPop is also highly influenced by suburb area size, information we do not have in this dataset. To compare regions based on their demographic information accurately, RegDens is more suitable.

Clustering is sensitive to inputs on different scales. Recall from the lecture that clustering uses a proximity/distance measure. The most common distance measure is Euclidean distance. With inputs on different scales, Euclidean distance favours features on a larger scale. Thus, we need to apply scaling before
performing clustering.

In [9]:

print(df['ID'].describe())

count     32079
unique    32079
top       00601
freq          1
Name: ID, dtype: object

The ID variable is unique for every record, which confirms it should not be used as a clustering input.

In [10]:

from sklearn.preprocessing import StandardScaler

# take 3 variables and drop the rest
df2 = df[['MedHHInc', 'MeanHHSz', 'RegDens']]

# convert df2 to matrix
X = df2.to_numpy()

# scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

sklearn has many clustering algorithms implemented. In this practical, we will first focus on the most common clustering algorithm, K-Means. K-means starts by picking random points as the initial cluster centers (centroids). In each iteration of K-means, all data points are assigned to the closest centroid, and each centroid is then updated to the mean of its cluster. In your project space/iPython console, start by importing the K-Means clustering. Initialise the clustering function with an n_clusters hyperparameter (K) of 3 and fit it to the dataset. Like many data mining models, K-Means clustering has an element of randomness, which is controlled by the random_state hyperparameter. In clustering, we want to minimise the intra-cluster distance while maximising the inter-cluster distances. After the model is fitted, print out its inertia (the sum of squared distances of samples to their closest cluster center/centroid) and the centroid locations.
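To make the assign/update cycle described above concrete, here is a rough NumPy sketch of a few K-means passes on the scaled matrix X. It is illustrative only (it does not handle empty clusters or multiple restarts); the sklearn implementation used below takes care of initialisation, convergence checks and restarts for you.

# Illustrative sketch of the K-means assign/update loop (not the sklearn implementation)
import numpy as np

rng = np.random.default_rng(42)
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random points as initial centroids

for _ in range(10):
    # assignment step: Euclidean distance from every point to every centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # update step: each centroid moves to the mean of the points assigned to it
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

    if np.allclose(new_centroids, centroids):  # stop when centroids stabilise
        break
    centroids = new_centroids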
In [11]:

from sklearn.cluster import KMeans

# random state, we will use 42 instead of 10 for a change
rs = 42

# set the random state. different random state seeds might result in different centroid locations
model = KMeans(n_clusters=3, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 52450.59019715356
Centroid locations:
[1.31801745 0.91730431 0.75459506]
[-0.41317147 -0.08392096 -0.88327097]
[-0.17519304 -0.41742819 0.83816619]

The number of clusters is controlled by the n_clusters hyperparameter. Setting the K value is a subjective process due to the absence of label information. It usually depends on domain knowledge whether a small or large number of clusters is required for data understanding, or it can be a trial-and-error process. We explain a systematic process for setting the K value later. A higher K will result in more centroids and clusters, which typically results in lower inertia and a finer-grained cluster solution. However, this may not be an indication of the right solution, as a very high value of K can create many small, meaningless clusters.

In [12]:

# set a different n_clusters
model = KMeans(n_clusters=8, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 27714.36231225028
Centroid locations:
[-0.27454857 -0.1577295 -0.00554968]
[1.11461835 0.31026514 0.77228009]
[-0.19642367 -1.05685288 1.14381746]
[-0.43545227 0.68390749 1.13766219]
[-0.23679633 0.51857153 -0.9360597]
[-0.3157633 3.24487675 0.14829235]
[3.42065995 0.55990691 0.97283184]
[-0.5722583 -0.68466511 -1.1917228]

3. Understanding and Visualising a Clustering Model
We will take a closer look at the generated clustering model. A common method to understand clustering results is to visualise the distribution of variables in clusters. We have done this in a very limited way by printing the values of centroids. To gain a better view of how the clusters are spread out in the dataset, we can use seaborn's pairplot. Before that, we will use the generated clustering model to assign each record in the dataset with a cluster ID.
In [13]:

model = KMeans(n_clusters=3, random_state=rs).fit(X)

# assign cluster ID to each record in X
# Ignore the SettingWithCopyWarning, it does not apply to our case here
y = model.predict(X)
df2['Cluster_ID'] = y

# how many records are in each cluster
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot the cluster distribution
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
1    15319
2    10575
0     6185
Name: Cluster_ID, dtype: int64

(pandas raises a SettingWithCopyWarning on the df2['Cluster_ID'] = y assignment; as the comment notes, it can be ignored here.)
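The SettingWithCopyWarning appears because df2 is a slice of df. If you prefer to silence it properly rather than ignore it, one option (an aside, not part of the original notebook) is to make df2 an explicit, independent copy when it is first created:

# make df2 an independent copy so later column assignments do not warn
df2 = df[['MedHHInc', 'MeanHHSz', 'RegDens']].copy()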
Your clustering plots should look similar, except the cluster IDs might not be the same. You may notice some variations, as there can be multiple solutions generated by K-means. That is totally fine, and not something you can control even with random_state. This relates to the fact, discussed in the lecture, that the K-means clustering algorithm yields a local solution, and every run may generate a different solution.

The pairplot shows us how different cluster members have different value distributions on different variables. Here is how to interpret the plots:

1. Look at the MeanHHSz and RegDens plot (second row, third column) and you can see the difference between suburbs in clusters 1 and 2. Cluster 1 covers less densely populated suburbs with smaller households, while cluster 2 covers more crowded regions, still with small families.
2. For MedHHInc (first row, second column), the pairplot shows that cluster 0 covers regions with a higher median household income.

The visualisation helps us to profile the clusters as follows:

Cluster 0: Suburbs with large households and medium-high earnings.
Cluster 1: Sparsely populated suburbs with smaller, low-earning households.
Cluster 2: Densely populated suburbs with smaller, low-earning households.

While this pairplot is useful to provide overall cluster profiles, it can get very cluttered and hard to understand if there are more clusters. In addition, you might only want to understand a specific cluster, so a pairplot with all clusters might not be necessary. Consider the following clustering model with K = 8.
In [14]:

# set a different n_clusters
model = KMeans(n_clusters=8, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 27714.36231225028
Centroid locations:
[-0.27454857 -0.1577295 -0.00554968]
[1.11461835 0.31026514 0.77228009]
[-0.19642367 -1.05685288 1.14381746]
[-0.43545227 0.68390749 1.13766219]
[-0.23679633 0.51857153 -0.9360597]
[-0.3157633 3.24487675 0.14829235]
[3.42065995 0.55990691 0.97283184]
[-0.5722583 -0.68466511 -1.1917228]
In [15]:

# again, ignore the SettingWithCopyWarning
y = model.predict(X)
df2['Cluster_ID'] = y

# how many in each
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
0    7356
7    6227
4    5155
1    4890
2    3986
3    2462
6    1026
5     977
Name: Cluster_ID, dtype: int64
As the number of clusters increases, the pairplot becomes more specific but also more confusing and difficult to interpret. Assume you would like to get insights on clusters 1 and 7 from this K-Means clustering solution. An alternative way to profile clusters is to plot their respective variable distributions against the distributions of all data. This method shows the characteristics of a cluster compared to the characteristics of the whole dataset. We will use distplot to visualise the variable distributions. Use the following code:
In [16]:

# prepare the columns and bin size. Increase the bin count to be more specific, but 20 is more than enough
cols = ['MedHHInc', 'MeanHHSz', 'RegDens']
n_bins = 20

# inspecting cluster 1 and 7
clusters_to_inspect = [1, 7]

for cluster in clusters_to_inspect:
    print("Distribution for cluster {}".format(cluster))

    # create subplots
    fig, ax = plt.subplots(nrows=3)
    ax[0].set_title("Cluster {}".format(cluster))

    for j, col in enumerate(cols):
        # create the bins
        bins = np.linspace(min(df2[col]), max(df2[col]), 20)
        # plot distribution of the cluster using a histogram
        sns.distplot(df2[df2['Cluster_ID'] == cluster][col], bins=bins, ax=ax[j], norm_hist=True)
        # plot the overall distribution with a black line
        sns.distplot(df2[col], bins=bins, ax=ax[j], hist=False, color="k")

    plt.tight_layout()
    plt.show()

Distribution for cluster 1

(seaborn prints further FutureWarnings about distplot here; they can be ignored.)
Distribution for cluster 7
NOTE: Your cluster result should be similar, but the subgroups of instances may have been assigned different cluster IDs.

Here, we plot the distributions of cluster 1 and cluster 7 against the distributions of all data. The black lines are the distributions over all records, while the light-blue lines are for the specific cluster. These plots show us the key characteristics of the clusters, as follows:

1. Cluster 1: Slightly higher MedHHInc, right-leaning MeanHHSz and right-leaning RegDens. Suburbs in cluster 1 are suburbs with above-average household size and dense population.
2. Cluster 7: Slightly lower MedHHInc, left-leaning MeanHHSz and left-leaning RegDens, showing that suburbs in cluster 7 have small median household incomes, smaller families and sparse population.

Determining K

As noted earlier, K, the number of clusters, is essential to the cluster building process. A smaller K is easier/faster to train and should show the general groupings of the dataset. A larger K results in finer-grained, more specific clusters, yet it is slower and could "overfit" the dataset. Therefore, the big question is: how do we determine the optimal K?

In many cases, K can be derived from the business question we are trying to answer with clustering. For example, given a dataset of customers, we would like to build three different marketing approaches; the logical answer is then to set K = 3, build the clusters and create the marketing plans based on the 3 segment profiles. However, sometimes the business question does not provide a clue for setting the K value. For these cases, an alternative approach is to visually inspect your data points and guess a K value. Unfortunately, if the dataset is quite large, you will soon find this approach cumbersome.

Next, we explain the widely used elbow method to set the K value. In this method, a plot is drawn between the K values and the clustering error (in sklearn it is called inertia or cost). Typically, the K values are inversely correlated with the clustering error values, i.e. the error gets smaller as K gets larger. As K becomes larger, each cluster becomes smaller in size, reducing the intra-cluster distances. The main idea of the elbow method is to find the K after which the error stops decreasing sharply. This produces an "elbow" effect. The plot is drawn to estimate the minimal number of clusters that best accounts for the variability in the dataset. The variability is captured by comparing the error value obtained with a specific solution versus the error value obtained by clustering a uniformly distributed set of points.
Usually, the practice is to go with the minimal number of clusters that subgroups the dataset most effectively (unless you have been given a number, or the interpretation is meaningful with more clusters). Therefore, you select the cluster number at the first valley (i.e. elbow) in the chart, as it indicates a "local minimum" at which to choose the number. It is not a global minimum, as a chart may contain many valleys/peaks. A valley/peak in the graph indicates that you were getting decreasing values with each additional cluster before the values start increasing again. This increase can be interpreted as "overfitting": the cluster solution you are fitting to the data with that number of clusters fits worse than uniformly distributed points. It is the same concept as for predictive models: you choose the model just before overfitting starts happening. In the following graph, you would choose the K at the elbow as the best clustering solution.
In [17]:

# list to save the clusters and cost
clusters = []
inertia_vals = []

# this whole process should take a while
for k in range(2, 15, 2):
    # train clustering with the specified K
    model = KMeans(n_clusters=k, random_state=rs, n_jobs=10)
    model.fit(X)

    # append model to cluster list
    clusters.append(model)
    inertia_vals.append(model.inertia_)

(sklearn prints FutureWarnings because the n_jobs parameter of KMeans is deprecated; they can be ignored.)
In [18]:

# plot the inertia vs K values
plt.plot(range(2, 15, 2), inertia_vals, marker='*')
plt.show()

Here, the elbow is somewhere between 4 and 6. Either value can be selected as the optimal K. While being a good heuristic, the elbow method does not always yield an "obvious" K. In many cases, the error plot can be very smooth and show no distinct K. As an alternative, the silhouette score is commonly used. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, then the clustering configuration is deemed high quality. If many data points have a low or negative value, then the clustering configuration may have too many or too few clusters. However, the computation of the silhouette score is an expensive process and incurs additional overhead. On large datasets, it may not be feasible to compute the score for all objects/samples/records. More info on silhouette (https://en.wikipedia.org/wiki/Silhouette_%28clustering%29)

For the clustering problem at hand, a decision has to be made between K = 4 and K = 6. We can use silhouette_score from sklearn, which returns the mean silhouette score over all samples, for both solutions.

In [19]:

from sklearn.metrics import silhouette_score

print(clusters[1])
print("Silhouette score for k=4", silhouette_score(X, clusters[1].predict(X)))

print(clusters[2])
print("Silhouette score for k=6", silhouette_score(X, clusters[2].predict(X)))

KMeans(n_clusters=4, n_jobs=10, random_state=42)
Silhouette score for k=4 0.33091719115444046
KMeans(n_clusters=6, n_jobs=10, random_state=42)
Silhouette score for k=6 0.25399659182558476

silhouette_score returns a mean silhouette score of 0.33 for K = 4 and 0.25 for K = 6. This shows that the K = 4 clusters are more appropriately matched to their own clusters than the K = 6 clusters. Therefore, we could choose K = 4
over K = 6 on the basis of this score.
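Rather than scoring only two candidate solutions, you could also loop over every model saved in the clusters list from the elbow step. This is a straightforward extension of the code above, shown here as a sketch; on roughly 32,000 rows each silhouette computation takes a noticeable amount of time.

# silhouette score for every saved K-Means model (K = 2, 4, ..., 14)
from sklearn.metrics import silhouette_score

for k, model in zip(range(2, 15, 2), clusters):
    score = silhouette_score(X, model.predict(X))
    print("K =", k, "silhouette =", round(score, 4))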
In [20]:

# visualisation of the K=4 clustering solution
model = KMeans(n_clusters=4, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

y = model.predict(X)
df2['Cluster_ID'] = y

# how many in each
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Sum of intra-cluster distance: 42564.197151149216
Centroid locations:
[-0.33314101 2.30743581 0.22102952]
[-0.40106113 -0.19147382 -0.89295077]
[1.80312021 0.4098886 0.84130225]
[-0.15468572 -0.37277263 0.79552382]
Cluster membership
1    14526
3    10853
2     4544
0     2156
Name: Cluster_ID, dtype: int64
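Besides the pairplot, a compact numeric profile of the final K = 4 solution can be produced with a simple groupby. This is an optional addition; because df2 holds the original (unscaled) values, the means come out in interpretable units.

# mean of each input variable per cluster, plus cluster sizes
profile = df2.groupby('Cluster_ID')[['MedHHInc', 'MeanHHSz', 'RegDens']].mean()
profile['Size'] = df2['Cluster_ID'].value_counts().sort_index()
print(profile)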
4. Running the Agglomerative Clustering Algorithm

As an alternative to K-means clustering, which uses a centroid-based (or partitional) approach, agglomerative/hierarchical clustering is also commonly used. Agglomerative clustering starts from the bottom and assigns each data point to its own cluster. It then recursively merges the pair of clusters that minimises the linkage distance between clusters. Similar to KMeans, you need to import the agglomerative clustering algorithm from the sklearn.cluster module.

In [21]:

from sklearn.cluster import AgglomerativeClustering

Once the algorithm is imported, you can build a model using the following code. You also need to specify K, the number of clusters; here, we will use K = 3. For visualisation purposes (later in this section), we will only build this model on 50 data points, but agglomerative clustering can handle many more data points just fine.
In [22]:

agg_model = AgglomerativeClustering(n_clusters=3)
agg_model.fit(X[:50])  # subset of X, only 50 data points

Out[22]: AgglomerativeClustering(n_clusters=3)

Once the model is built, the dendrogram of this model can be visualised using the following code.

In [23]:

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0] + 2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [24]:

plot_dendrogram(agg_model, labels=agg_model.labels_)
plt.show()

Cluster labels are presented on the X axis, and the last joins at the top of the tree correspond to the merge into the final K = 3 clusters.
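The dendrogram above uses uniform placeholder distances because AgglomerativeClustering does not expose merge distances by default. If you want real merge heights, scipy's linkage function computes them directly. As a sketch (names are illustrative), the code below runs it on the one-dimensional points from Exercise 5 in Part 1, so you can check your hand-drawn hierarchy; the 'centroid' linkage option roughly matches the exercise's closest-centroids rule.

# hierarchical clustering of the Part 1, Exercise 5 points with real merge distances
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

pts = np.array([0, 1, 3, 4, 100, 102, 106, 108], dtype=float).reshape(-1, 1)
Z = linkage(pts, method='centroid')   # merge the clusters with the closest centroids
dendrogram(Z, labels=[str(int(p)) for p in pts.ravel()])
plt.show()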
5. Running the Kprototypes Clustering

Let us consider another dataset, 'adult.csv', with three variables:

age: Age of the person
workclass: Category of the work
fnlwgt: A de-identified variable

If we have to cluster the adults based on these three attributes, Kmeans cannot be used, as the variable workclass is categorical. Kprototypes is a clustering method that can handle both numeric and categorical variables. Therefore, Kprototypes should be used to cluster this dataset instead of Kmeans. Next, we will load this new dataset and build a Kprototypes clustering model.

In [25]:

# Load the dataset
import pandas as pd

df = pd.read_csv(r'adult.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11008 entries, 0 to 11007
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        11008 non-null  int64
 1   workclass  11008 non-null  object
 2   fnlwgt     11008 non-null  int64
dtypes: int64(2), object(1)
memory usage: 258.1+ KB

workclass is a categorical variable and has to be mapped to numeric values before using it in the model.

In [26]:

print(df['workclass'].unique())

[' Private' ' Local-gov' ' Self-emp-not-inc' ' Federal-gov' ' State-gov' ' Self-emp-inc' ' Without-pay' ' Never-worked']

In [27]:

from sklearn.preprocessing import StandardScaler

# mapping (the remaining categories are numbered in the order given by unique();
# the exact codes do not matter for a categorical variable)
workclass_map = {' Private': 1, ' Local-gov': 2, ' Self-emp-not-inc': 3, ' Federal-gov': 4,
                 ' State-gov': 5, ' Self-emp-inc': 6, ' Without-pay': 7, ' Never-worked': 8}
df['workclass'] = df['workclass'].map(workclass_map)

# convert df to matrix
X = df.to_numpy()

# scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

The next step is to build a Kprototypes model. The kmodes library allows us to do this.
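Note that the code above scales the integer-coded workclass column together with the numeric ones, even though Kprototypes will treat column 1 as categorical. A slightly cleaner variant, shown here only as a sketch of an alternative (the name X_alt is illustrative, not part of the tutorial), scales the genuinely numeric columns and leaves the category codes untouched; the resulting matrix would be passed to KPrototypes with categorical=[1] in the same way.

# scale only the numeric columns (age, fnlwgt); keep workclass codes as-is
import numpy as np
from sklearn.preprocessing import StandardScaler

num_scaled = StandardScaler().fit_transform(df[['age', 'fnlwgt']])
X_alt = np.column_stack([num_scaled[:, 0], df['workclass'].to_numpy(), num_scaled[:, 1]])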
In [28]:

from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

Unlike KMeans, KPrototypes does not support the calculation of inertia. However, the clustering cost, defined as the sum of the distances of all points to their respective cluster centroids, can be calculated and used to identify the optimal K. The cost_ attribute of KPrototypes returns this cost value.

In [29]:

# list to save the clusters and cost
clusters = []
cost_vals = []

# this whole process should take a while
for k in range(2, 10, 2):
    # train clustering with the specified K
    model = KPrototypes(n_clusters=k, random_state=rs, n_jobs=10)
    model.fit_predict(X, categorical=[1])

    # append model to cluster list
    clusters.append(model)
    cost_vals.append(model.cost_)

In [30]:

# plot the cost vs K values
plt.plot(range(2, 10, 2), cost_vals, marker='*')
plt.show()

By applying the elbow method to the above plot, the optimal value for K lies between 4 and 6. The silhouette score has to be calculated to find the optimal value. Due to the presence of mixed data types (numeric and categorical), the calculation of the silhouette score for Kprototypes is different from KMeans. For Kprototypes, two silhouette scores, one for the numeric variables and one for the categorical variables, should be calculated separately and then averaged. We will first see how to calculate this value for K = 4.
In [31]:

X_num = [[row[0], row[2]] for row in X]   # variables of X with numeric datatype
X_cat = [[row[1]] for row in X]           # variables of X with categorical datatype
model = clusters[1]                       # clusters[1] holds the K-prototypes model with K=4

In [32]:

from sklearn.metrics import silhouette_score

# Calculate the silhouette score for the numeric and categorical variables separately
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
print("Silscore for numeric variables: " + str(silScoreNums))
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
print("Silscore for categorical variables: " + str(silScoreCats))

# Average the silhouette scores
silScore = (silScoreNums + silScoreCats) / 2
print("The avg silhouette score for k=4: " + str(silScore))

Silscore for numeric variables: 0.3343021534172589
Silscore for categorical variables: -0.09998949153471454
The avg silhouette score for k=4: 0.11715633094127217

In [33]:

model = clusters[1]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=4: " + str(silScore))

The avg Silhouette score for k=4: 0.11715633094127217

Now your task is to identify the optimal K by calculating the silhouette score for K = 6 and K = 8. Then, visualise the optimal cluster solution using pairplot and distplot and interpret the results, similar to KMeans.

In [34]:

model = clusters[2]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=6: " + str(silScore))

model = clusters[3]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=8: " + str(silScore))

The avg Silhouette score for k=6: 0.07907669640648304
The avg Silhouette score for k=8: 0.07049603177523696
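Since the same three lines are repeated for each K, a small helper function (a sketch; the name mixed_silhouette is illustrative) keeps the comparison tidy. It simply wraps the same calls used above and loops over the KPrototypes models saved in the clusters list (K = 2, 4, 6, 8).

def mixed_silhouette(model, X, X_num, X_cat):
    # average of the numeric (Euclidean) and categorical (Hamming) silhouette scores
    labels = model.fit_predict(X, categorical=[1])
    num_score = silhouette_score(X_num, labels, metric='euclidean')
    cat_score = silhouette_score(X_cat, labels, metric='hamming')
    return (num_score + cat_score) / 2

for k, m in zip(range(2, 10, 2), clusters):
    print("K =", k, "avg silhouette =", mixed_silhouette(m, X, X_num, X_cat))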
In [35]:

import seaborn as sns
import matplotlib.pyplot as plt

model = clusters[1]
y = model.fit_predict(X, categorical=[1])
df['Cluster_ID'] = y

# how many records are in each cluster
print("Cluster membership")
print(df['Cluster_ID'].value_counts())

# pairplot the cluster distribution
cluster_g = sns.pairplot(df, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
3    3452
0    3311
1    2860
2    1385
Name: Cluster_ID, dtype: int64
In [36]:

import pandas as pd
import numpy as np

# prepare the columns and bin size. Increase the bin count to be more specific, but 20 is more than enough
cols = ['age', 'workclass', 'fnlwgt']
n_bins = 20

clusters_to_inspect = [0, 1, 2, 3]

for cluster in clusters_to_inspect:
    print("Distribution for cluster {}".format(cluster))

    fig, ax = plt.subplots(nrows=3)
    ax[0].set_title("Cluster {}".format(cluster))

    for j, col in enumerate(cols):
        bins = np.linspace(min(df[col]), max(df[col]), 20)
        sns.distplot(df[df['Cluster_ID'] == cluster][col], bins=bins, ax=ax[j], norm_hist=True)
        sns.distplot(df[col], bins=bins, ax=ax[j], hist=False, color="k")

    plt.tight_layout()
    plt.show()

Distribution for cluster 0

(seaborn again prints FutureWarnings about distplot and the bw parameter; they can be ignored.)
C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Fu tureWarning: `distplot` is a deprecated function and will be removed in a fu ture version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: Fu tureWarning: The `bw` parameter is deprecated in favor of `bw_method` and `b w_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new pa rameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Fu tureWarning: `distplot` is a deprecated function and will be removed in a fu ture version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots). warnings.warn(msg, FutureWarning) Distribution for cluster 1 C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `histplot` (an axes-level fun ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: FutureWarning: The `bw` parameter is deprecated in favor of `bw_method` an d `bw_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new parameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `kdeplot` (an axes-level func tion for kernel density plots). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `histplot` (an axes-level fun ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
FutureWarning: The `bw` parameter is deprecated in favor of `bw_method` an d `bw_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new parameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `kdeplot` (an axes-level func tion for kernel density plots). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `histplot` (an axes-level fun ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: FutureWarning: The `bw` parameter is deprecated in favor of `bw_method` an d `bw_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new parameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `kdeplot` (an axes-level func tion for kernel density plots). warnings.warn(msg, FutureWarning) Distribution for cluster 2 C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure -level function with similar flexibility) or `histplot` (an axes-level fun ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: FutureWarning: The `bw` parameter is deprecated in favor of `bw_method` an d `bw_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new parameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619:
Distribution for cluster 3

[Distribution plots for cluster 3; the same FutureWarnings are repeated.]
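The FutureWarnings above come from calling seaborn's now-deprecated `distplot` with the old `bw` argument. Below is a minimal sketch of the same per-cluster distribution plotting written against the current seaborn API (`histplot`/`kdeplot`); the DataFrame `df`, the cluster-label column, and the feature list are illustrative assumptions rather than the tutorial's actual variable names.

# Minimal sketch (assumed names: df, cluster_col, feature_cols) of per-cluster
# distribution plots using the non-deprecated seaborn functions.
import matplotlib.pyplot as plt
import seaborn as sns

def plot_cluster_distributions(df, cluster_col, feature_cols):
    """Plot each feature's distribution separately for every cluster label."""
    for label, group in df.groupby(cluster_col):
        print(f"Distribution for cluster {label}")
        fig, axes = plt.subplots(1, len(feature_cols),
                                 figsize=(4 * len(feature_cols), 3), squeeze=False)
        for ax, col in zip(axes[0], feature_cols):
            # histplot + kdeplot replace distplot; bw_method=1.5 mirrors the old bw=1.5
            sns.histplot(group[col], stat="density", ax=ax)
            sns.kdeplot(group[col], bw_method=1.5, ax=ax)
            ax.set_title(col)
        plt.tight_layout()
        plt.show()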
End Notes

We learned how to build, tune, and explore clustering models, and we used visualisation to explain the cluster/segment profiles produced by each model. The goal of cluster analysis is to identify distinct groupings of cases across a set of inputs without the presence of a target variable.
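For reference, a minimal end-to-end sketch of that workflow (scale the inputs, fit K-Means for several values of k, keep the best-scoring model, and return labels for profiling) might look like the following. The DataFrame and column names are assumptions, and choosing k by silhouette score is just one common tuning approach, not necessarily the one used in this tutorial.

# Minimal sketch (assumed names: df, feature_cols) of building and tuning a
# K-Means model, then returning labels that can be profiled and visualised.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def fit_kmeans(df, feature_cols, k_values=range(2, 8), random_state=0):
    """Scale the inputs, pick k by silhouette score, and return cluster labels."""
    X = StandardScaler().fit_transform(df[feature_cols])
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_k, best_score, best_labels = k, score, model.labels_
    print(f"Selected k={best_k} (silhouette={best_score:.3f})")
    return best_labels

The returned labels could then be attached to the DataFrame (e.g. df['cluster'] = labels) and passed to a profiling routine such as the plotting sketch shown earlier.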