Week 9 Computer Tutorial
Dr Richi Nayak, r.nayak@qut.edu.au
Practical Topics:
1. Preparing data for clustering
2. Running K-means Clustering
3. Understanding and visualising a clustering model
4. Running the Agglomerative Clustering Algorithm
5. Running the Kprototypes Clustering
Part 1 - Reflective exercises
In this practical, you will be introduced to a new dataset, data preparation for clustering, and the clustering task.
Please reflect on the clustering-related theoretical concepts and answer the following questions.
Exercise 1: Clustering Introduction: Basics
1. In data mining, data objects that do not comply with the general behaviour or model of the data are called:
a. Clusters b. Centroids c. Outliers d. None of these.
2. Which of the following is a common use of unsupervised clustering?
a. detect outliers b. determine a best set of input attributes for supervised learning
c. determine if meaningful relationships can be found in a dataset d. All of a, b and c are common uses of unsupervised clustering.
3. Suppose you are a luxury automobile dealer. Your dealership is planning to sell a new model. Given the
following questions, which question could be answered by applying clustering on their customer dataset?
a. How much should you charge for the new BMW X6M? b. How likely is it that person X will buy the new BMW X5W? c. What types of customers have bought the silver BMW M5? List their features.
4. The K-Means algorithm terminates when
a. a user-defined minimum value for the summation of squared error differences between instances and their corresponding cluster centers is reached. b. the cluster centers for the current iteration are identical to the cluster centers for the previous iteration. c. the number of instances in each cluster for the current iteration is identical to the number of instances in each cluster of the previous iteration. d. the number of clusters formed for the current iteration is identical to the number of clusters formed in the previous iteration.
5. This clustering algorithm initially assumes that each data instance represents a single cluster.
a. agglomerative clustering b. conceptual clustering c. K-Means clustering d. expectation maximization
6. List the aspects on which k-median performs better and worse than k-means. Discuss a main challenge
common to both the K-means and K-medoids algorithms.
Exercise 2: Clustering process: proximity measures
1. A popular criterion for measure of closeness of two objects in clustering analysis is a distance function.
Name the other criterion that can be used to differentiate two objects.
2. Consider two clustering solutions with the intra- and inter-cluster similarity measures. Let the similarity
measure be in the range of [0,1] where 0 is lowest and 1 is highest. Which is the optimal solution?
Solution 1: Intra-cluster similarity = 0.99, inter-cluster similarity = 0.01
Solution 2: Intra-cluster similarity = 0.01, inter-cluster similarity = 0.99
3. List various distance measures, and discuss how they are used in clustering.
4. Consider a very small dataset with three observations (data points) S1, S2 and S3. Calculate the distance between 1) S1 and S2, and 2) S1 and S3. Use the dot product, Euclidean, and Manhattan distance measures.
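For reference, for two points a = (a1, ..., an) and b = (b1, ..., bn), the three measures are computed as follows (generic formulas; substitute the exercise's own observation values):
dot product: a . b = a1*b1 + ... + an*bn
Euclidean distance: d(a, b) = sqrt((a1 - b1)^2 + ... + (an - bn)^2)
Manhattan distance: d(a, b) = |a1 - b1| + ... + |an - bn|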
Exercise 3: Evaluation
1. Unsupervised evaluation can be internal or external. Which of the following is an internal method for
evaluating alternative clusterings produced by the K-Means algorithm?
a. Use a production rule generator to compare the rule sets generated for each clustering. b. Compute and compare class resemblance scores for the clusters formed by each clustering. c. Compare the sum of squared error differences between instances and their corresponding cluster centers for each alternative clustering. d. Create and compare the decision trees determined by each alternative clustering.
2. Explain inter-cluster and intra-cluster distances and their relationship when used to evaluate clustering results.
3. What are the different ways you could subgroup the following superheroes?
4. Suppose there are three clustering solutions with the following silhouette scores. Identify the best and the worst solution.
Solution 1: Silhouette score = 0.34
Solution 2: Silhouette score = -0.45
Solution 3: Silhouette score = 0.15
Exercise 4: Partition based clustering
1. Suppose a dataset consists of {0, 1, 3, 4, 100, 102, 106, 108}, which are points in one dimension. Perform K-means clustering on these points. Create three clusters by assigning each point to the nearest centroid. Show all the clusters and the total error calculated using the absolute distance measure (|a-b|), where a and b are two data points, for each set of centroids. You can choose three random numbers to start the process, or assume the initial centroids are given as {0.5, 3.5, 104}.
Repeat with the initial centroids given as {2, 101, 107}.
Repeat the above exercise for creating two clusters with centroids {2, 104}.
Repeat the above process of creating clusters but using the squared error ((a-b)^2). Analyse the difference.
2. Suppose there are 125 data points. Using the empirical method, determine the number of clusters.
3. Based on the graph shown below, determine the number of clusters using the elbow method.
4. Suppose a dataset consists of {(10, 23), (12, 30), (8, 32), (4, 15), (3, 17), (1, 10)}, which are points in two dimensions. Perform K-means clustering on these points, assuming k = 2.
Exercise 5: Hierarchical clustering
Suppose a dataset consists of 8 records {0, 1, 3, 4, 100, 102, 106, 108}, which are points in one dimension. Perform hierarchical clustering on these points. Use the following process to assign clusters: initially, each point is in a cluster by itself; at each step, merge the two clusters with the closest centroids, and continue until only two clusters remain. For simplicity, use the absolute distance measure (|a-b|), where a and b are two data points.
Part 2 - Practical exercises
This practical introduces you to clustering using Python. You will learn how to preprocess data for clustering, build clustering solutions, and evaluate/visualise the results. Unlike the predictive mining algorithms/models, a dataset used for clustering is unlabelled: it does not have the label information that is mandatory in predictive data mining. The clustering task assists in finding common groupings in the dataset. There exist multiple clustering algorithms, and a clustering algorithm is selected depending on the data types. Algorithms such as K-means, Agglomerative, K-modes, and K-prototypes are some of the commonly used ones. In this practical, we will learn to build K-means, Agglomerative and K-prototypes clustering models.
1. Preparing data for clustering
We will be using the Census2000 dataset, which contains a postal code-level summary of the 2000 United States Census. There are 7 variables in this dataset:
ID: Postal code of the region
LocX: Region longitude
LocY: Region latitude
MeanHHSz: Average household size in the region
MedHHInc: Median household income in the region
RegDens: Region population density percentile (1 = lowest density, 100 = highest density)
RegPop: Number of people in the region
There is no known target in this data; therefore, we will utilize an unsupervised learning method, clustering, to analyse it. The goal of the analysis is to group people into distinct subsets based on urbanization, household size, and income factors. These factors are common to matching commercial life-style and life-stage segmentation products (for example, see www.claritas.com or www.spectramarketing.com).
As in past practicals, we will use pandas to load the data and perform data preprocessing.
In [1]:
import pandas as pd
import numpy as np

# not skipping empty values, to demonstrate data preprocessing steps later
df = pd.read_csv('census2000.csv', na_filter=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33178 entries, 0 to 33177
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID        33178 non-null  object
 1   LocX      33178 non-null  float64
 2   LocY      33178 non-null  float64
 3   RegDens   33178 non-null  object
 4   RegPop    33178 non-null  int64
 5   MedHHInc  33178 non-null  int64
 6   MeanHHSz  33178 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.8+ MB

From the .info() output, you should notice that the RegDens variable type is set incorrectly. The output lists RegDens as object/categorical, while based on the dataset description given above, RegDens should be an interval/numerical variable. Run .describe() and .value_counts() on the Series to get more information.

In [2]:
# get more information from RegDens
print(df['RegDens'].describe())
print(df['RegDens'].value_counts())

count     33178
unique      101
top
freq       1013
Name: RegDens, dtype: object
      1013
10     322
44     322
25     322
11     322
      ...
3      321
1      321
74     321
45     321
42     321
Name: RegDens, Length: 101, dtype: int64

The output of these functions reveals the cause of the incorrect type: a number of empty strings in this Series. A solution is to replace them with nan to denote them as missing values and typecast the Series to the float data type.
In [3]:
# replace the empty strings in the series with nan and typecast to float
df['RegDens'] = df['RegDens'].replace('', np.nan).astype(float)

Visualisation is a great way to spot data problems within the dataset. Again, we will use seaborn and matplotlib for that purpose. Plot the distribution of the variables using distplot.
In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of RegDens
regdens_dist = sns.distplot(df['RegDens'].dropna())
plt.show()

# Distribution of MedHHInc
medhhinc_dist = sns.distplot(df['MedHHInc'].dropna())
plt.show()

# Distribution of MeanHHSz
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna())
plt.show()

(seaborn issues FutureWarnings here because distplot is deprecated in favour of displot/histplot; the plots are still produced and the warnings can be ignored.)
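If your seaborn version has already removed distplot, an equivalent plot can be produced with histplot. This is a sketch assuming seaborn 0.11 or later, not part of the original tutorial code:

# same distribution plot using the non-deprecated histplot API
sns.histplot(df['RegDens'].dropna(), kde=True)
plt.show()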
The last two distplots show anomalies (outliers) in MeanHHSz and MedHHInc. For both of these variables, there is a large number of very low-valued entries. Focus on MeanHHSz first. "Zoom in" on the distribution plot by increasing the number of bins.
In [5]:
# Distribution of MeanHHSz, with an increased number of bins. More bins = a more specific distplot
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna(), bins=100)
plt.show()

It is apparent that many of the records are valued close to zero, and logically it is unlikely for a household to have fewer than 1 member (there must be at least 1 person in a household). This suggests a data problem with this variable. As mentioned before, MedHHInc also contains some erroneous values. There is a chance that these anomalies are related. We can explore this relationship using FacetGrid.
In [6]:
# create a mask of erroneous MeanHHSz values
df['HasError_MeanHHSz'] = df['MeanHHSz'] < 1

# use FacetGrid to plot the distribution of MedHHInc when MeanHHSz is erroneous
g = sns.FacetGrid(df, col='HasError_MeanHHSz')
g = g.map(plt.hist, 'MedHHInc', bins=100)
plt.show()

The FacetGrid shows that erroneous data in MeanHHSz is correlated with erroneous data in MedHHInc. Based on this insight, we should eliminate all rows with erroneous MeanHHSz.

In [7]:
# before
print("Row # before dropping erroneous rows", len(df))

# a very easy way to drop rows with MeanHHSz values below 1
df = df[df['MeanHHSz'] >= 1]

# after
print("Row # after dropping erroneous rows", len(df))

Row # before dropping erroneous rows 33178
Row # after dropping erroneous rows 32079

Plot all three variables for a final check.
In [8]:
# Distribution of RegDens
regdens_dist = sns.distplot(df['RegDens'].dropna())
plt.show()

# Distribution of MedHHInc
medhhinc_dist = sns.distplot(df['MedHHInc'].dropna())
plt.show()

# Distribution of MeanHHSz
meanhhsz_dist = sns.distplot(df['MeanHHSz'].dropna())
plt.show()
2. Running K-means Clustering
Once the data is prepared, we are ready to build a clustering model. Before building the model, we should set the objective of this clustering process. There are a number of good grouping objectives that could be applied to this dataset. The suburbs can be clustered based on location (LocX and LocY), demographic characteristics (RegDens, MedHHInc, MeanHHSz and RegPop), or both. As clustering suburbs based on geographical location is quite straightforward, we will focus on clustering based on demographic characteristics in this practical. Moreover, as in predictive mining, we do not use ID-like variables whose values are unique for each record, and we are not clustering on the location coordinates LocX and LocY; these fields do not add any value to the data mining process.

Thus, we will use MedHHInc, MeanHHSz and RegDens and drop the rest of the features. We will also drop RegPop as it is redundant with RegDens. RegPop is also highly influenced by suburb area size, information we do not have in this dataset. To compare regions based on their demographic information accurately, RegDens is more suitable.
Clustering is sensitive to inputs on different scales. Recall from the lecture that clustering uses a proximity/distance measure; the most common distance measure is Euclidean distance. With inputs on different scales, Euclidean distance favours features on the larger scale. Thus, we need to apply scaling before performing clustering.
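A small illustration of why scaling matters; the two records below are made up for this example and are not from the Census2000 data:

import numpy as np

# two regions differing by $10,000 in income and by 2 persons in household size
a = np.array([60000.0, 2.0])   # [MedHHInc, MeanHHSz]
b = np.array([50000.0, 4.0])
print(np.linalg.norm(a - b))   # ~10000.0002: the income difference completely dominates

# after standardising each feature (e.g. with StandardScaler), both differences are
# expressed in standard deviations and contribute comparably to the distance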
In [9]:
print(df['ID'].describe())

count     32079
unique    32079
top       00601
freq          1
Name: ID, dtype: object

In [10]:
from sklearn.preprocessing import StandardScaler

# take 3 variables and drop the rest
df2 = df[['MedHHInc', 'MeanHHSz', 'RegDens']]

# convert df2 to matrix
X = df2.to_numpy()

# scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

sklearn has many clustering algorithms implemented. In this practical, we will first focus on the most common clustering algorithm, K-Means. K-means starts by picking random points as the initial cluster centres (centroids). In each iteration of K-means, all data points are assigned to their closest centroid, and each centroid is then updated towards the mean of the points assigned to its cluster.

In your project space/iPython console, start by importing the K-Means clustering. Initialise the clustering function with the n_clusters hyperparameter (K) set to 3 and fit it to the dataset. As with many data mining models, K-Means clustering has an element of randomness, which is controlled by the random_state hyperparameter.

In clustering, we want to minimise the intra-cluster distances while maximising the inter-cluster distances. After the model is fitted, print out its inertia (the sum of squared distances of samples to their closest cluster centre/centroid) and the centroid locations.
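To make the assignment and update steps just described concrete, here is a minimal NumPy sketch of the K-means loop. It is illustrative only; in this practical we use sklearn's KMeans, which also handles details such as empty clusters and multiple restarts:

import numpy as np

def kmeans_sketch(data, k, n_iter=10, seed=42):
    rng = np.random.default_rng(seed)
    # pick k random points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    # inertia: sum of squared distances of samples to their closest centroid
    inertia = ((data - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia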
In [11]:
from sklearn.cluster import KMeans

# random state, we will use 42 instead of 10 for a change
rs = 42

# set the random state. different random state seeds might result in different centroid locations
model = KMeans(n_clusters=3, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 52450.59019715356
Centroid locations:
[1.31801745 0.91730431 0.75459506]
[-0.41317147 -0.08392096 -0.88327097]
[-0.17519304 -0.41742819 0.83816619]

The number of clusters is controlled by the n_clusters hyperparameter. Setting the K value is a subjective process due to the absence of label information. It usually depends upon domain information whether a small or large number of clusters is required for data understanding, or it can be a trial-and-error process; we explain a systematic process to set the K value later. A higher K will result in more centroids and clusters, which typically results in lower inertia and a finer-grained cluster solution. However, this may not be an indication of the right solution, as a very high value of K can create many small, meaningless clusters.

In [12]:
# set a different n_clusters
model = KMeans(n_clusters=8, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 27714.36231225028
Centroid locations:
[-0.27454857 -0.1577295  -0.00554968]
[1.11461835 0.31026514 0.77228009]
[-0.19642367 -1.05685288 1.14381746]
[-0.43545227 0.68390749 1.13766219]
[-0.23679633 0.51857153 -0.9360597 ]
[-0.3157633 3.24487675 0.14829235]
[3.42065995 0.55990691 0.97283184]
[-0.5722583 -0.68466511 -1.1917228 ]

3. Understanding and Visualising a Clustering Model
We will take a closer look at the generated clustering model. A common method to understand clustering results
is to visualise the distribution of variables in clusters. We have done this in a very limited way by printing the
values of centroids.
To gain a better view of how the clusters are spread out in the dataset, we can use seaborn's pairplot. Before
that, we will use the generated clustering model to assign each record in the dataset with a cluster ID.
In [13]:
model = KMeans(n_clusters=3, random_state=rs).fit(X)

# assign cluster ID to each record in X
# Ignore the SettingWithCopyWarning raised here, it does not apply to our case
y = model.predict(X)
df2['Cluster_ID'] = y

# how many records are in each cluster
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot the cluster distribution.
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
1    15319
2    10575
0     6185
Name: Cluster_ID, dtype: int64
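The SettingWithCopyWarning appears because df2 was created as a slice of df. If you prefer to silence it properly rather than ignore it, a minimal optional sketch (not required for this practical) is to make df2 an explicit copy before adding the column:

df2 = df2.copy()        # df2 now owns its own data rather than referencing a slice of df
df2['Cluster_ID'] = y   # the assignment no longer raises the warning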
Your clustering plots should also look similar, except the cluster IDs might not be the same. You may notice some variations, as there can be multiple solutions generated by K-means; that is totally fine and not something you can control even with random_state. This relates to the fact, discussed in the lecture, that the K-means clustering algorithm yields a local solution and every run may generate a different solution.
The pairplot shows us how different cluster members have different value distributions on different variables. Here is how to interpret the plots:
1. Look at the MeanHHSz vs RegDens plot (second row, third column) and we can see the difference between suburbs in clusters 1 and 2. Cluster 1 covers less densely populated suburbs with smaller households, while cluster 2 covers more crowded regions, still with small families.
2. For MedHHInc (first row, second column), the pairplot shows that cluster 0 covers regions with a higher median household income.
The visualisation helps us to profile the clusters as follows:
Cluster 0: Suburbs with large households and medium-high earnings.
Cluster 1: Sparsely populated suburbs with smaller, low-earning households.
Cluster 2: Densely populated suburbs with smaller, low-earning households.
While this pairplot is useful to provide overall cluster profiles, it can get very cluttered and hard to understand if there are more clusters. In addition, you might only want to understand a specific cluster, in which case a pairplot with all clusters might not be necessary.
Consider the following clustering model with K = 8.
In [14]:
# set a different n_clusters
model = KMeans(n_clusters=8, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

Sum of intra-cluster distance: 27714.36231225028
Centroid locations:
[-0.27454857 -0.1577295  -0.00554968]
[1.11461835 0.31026514 0.77228009]
[-0.19642367 -1.05685288 1.14381746]
[-0.43545227 0.68390749 1.13766219]
[-0.23679633 0.51857153 -0.9360597 ]
[-0.3157633 3.24487675 0.14829235]
[3.42065995 0.55990691 0.97283184]
[-0.5722583 -0.68466511 -1.1917228 ]
In [15]:
# again, ignore the SettingWithCopyWarning
y = model.predict(X)
df2['Cluster_ID'] = y

# how many in each
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
0    7356
7    6227
4    5155
1    4890
2    3986
3    2462
6    1026
5     977
Name: Cluster_ID, dtype: int64
As the number of clusters increases, the pairplot becomes more specific, cluttered and difficult to interpret.

Assume you would like to get insights on clusters 1 and 7 from the K-Means clustering solution. An alternative way to profile clusters is to plot their respective variable distributions against the distributions from all data. This method shows the characteristics of a cluster compared to the characteristics of the whole dataset. We will use distplot to visualise the variable distributions. Use the following code:
In [16]:
# prepare the columns and bin size. Increase the bin size to be more specific, but 20 is more than enough
cols = ['MedHHInc', 'MeanHHSz', 'RegDens']
n_bins = 20

# inspecting cluster 1 and 7
clusters_to_inspect = [1, 7]

for cluster in clusters_to_inspect:
    print("Distribution for cluster {}".format(cluster))

    # create subplots
    fig, ax = plt.subplots(nrows=3)
    ax[0].set_title("Cluster {}".format(cluster))

    for j, col in enumerate(cols):
        # create the bins
        bins = np.linspace(min(df2[col]), max(df2[col]), 20)

        # plot the distribution of the cluster using a histogram
        sns.distplot(df2[df2['Cluster_ID'] == cluster][col], bins=bins, ax=ax[j], norm_hist=True)

        # plot the overall distribution with a black line
        sns.distplot(df2[col], bins=bins, ax=ax[j], hist=False, color="k")

    plt.tight_layout()
    plt.show()

Output: "Distribution for cluster 1" and "Distribution for cluster 7", each followed by the three distribution plots. (The seaborn FutureWarnings about distplot can be ignored.)
** NOTE: Your cluster result should be similar, but the subgroups of instances may have been assigned different cluster IDs. **
Here, we plot the distributions of cluster 1 and cluster 7 against the distributions from all data. The black lines are the distributions from all records, while the light-blue histograms are for a specific cluster. These plots show us the key characteristics of the clusters, as follows:
1. Cluster 1: Slightly higher MedHHInc, right-leaning MeanHHSz and right-leaning RegDens. Suburbs in cluster 1 are suburbs with above-average household size and dense population.
2. Cluster 7: Slightly lower MedHHInc, left-leaning MeanHHSz and left-leaning RegDens, showing that suburbs in cluster 7 are suburbs with a small median household income, smaller families and sparse population.
Determining K

As noted earlier, K, or the number of clusters, is essential for the cluster building process. A smaller K is easier/faster to train and should show the general groupings of the dataset. A larger K results in finer-grained, more specific clusters, yet it is slower and could "overfit" the dataset. Therefore, the big question is: how do we determine the optimal K?

In many cases, K can be derived from the business question we are trying to answer with clustering. For example, given a dataset of customers, we might want to build three different marketing approaches. The logical answer is then to set K = 3, build the clusters and create the marketing plans based on the 3 segment profiles.

However, sometimes the business question does not provide a clue for setting the K value. In these cases, an alternative approach is to visually inspect your data points and guess a K value. Unfortunately, if the dataset is quite large, you will soon find this approach cumbersome.
Next, we explain the widely used elbow method to set the K value. In this method, a plot is drawn between the K values and the clustering error (in sklearn it is called inertia or cost). Typically, the K values are inversely correlated with the clustering error values, i.e. the error gets smaller as K gets larger: as K becomes larger, each cluster becomes smaller in size, reducing the intra-cluster distances. The main idea of the elbow method is to find the K after which the error no longer decreases substantially; this produces an "elbow" in the plot. The plot is drawn to estimate the minimal number of clusters that best accounts for the variability in the dataset. The variability is captured by comparing the error value obtained with a specific solution versus the error value obtained by clustering a uniformly distributed set of points.
Usually, the practice is to go with the minimal number of clusters that subgroups the dataset most effectively (unless you have been given a number, or the interpretation is meaningful with more clusters). Therefore, you select the cluster number at the first valley (i.e., elbow) in the chart, as it indicates the "local minimum" at which to choose the number. It is not a global minimum, as a chart may contain many valleys/peaks. A valley in the graph indicates that you were getting decreasing values with each additional cluster before the curve starts increasing again. This increase can be interpreted as "overfitting", so you choose the point before the model starts to overfit; the overfitting indicates that the cluster solution with that number of clusters fits the data worse than uniformly distributed points would. It is the same concept as with predictive models: you choose the model before overfitting starts happening. In the following graph, you would choose the K value at the first valley as the best clustering solution.
In [17]:
# list to save the clusters and cost
clusters = []
inertia_vals = []

# this whole process should take a while
for k in range(2, 15, 2):
    # train clustering with the specified K
    model = KMeans(n_clusters=k, random_state=rs, n_jobs=10)
    model.fit(X)

    # append model to cluster list
    clusters.append(model)
    inertia_vals.append(model.inertia_)

(sklearn prints FutureWarnings that the n_jobs parameter of KMeans is deprecated; these can be ignored.)
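One simple way to eyeball the elbow numerically, in addition to the plot in the next cell, is to look at how much the inertia drops at each step of K (a heuristic sketch, not part of the original tutorial):

# numpy (np) was imported earlier in the practical
drops = -np.diff(inertia_vals)               # decrease in inertia at each step of K
print(dict(zip(range(4, 15, 2), drops)))     # the elbow is roughly where the drop shrinks sharply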
In [18]:
# plot the inertia vs K values
plt.plot(range(2, 15, 2), inertia_vals, marker='*')
plt.show()

Here, the elbow is somewhere between 4 and 6. Either value can be selected as the optimal K.

While it is a good heuristic, the elbow method does not always yield an "obvious" K. In many cases, the error plot can be very smooth and show no distinct elbow. As an alternative, the silhouette score is commonly used. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, the clustering configuration is deemed high quality; if many data points have a low or negative value, the clustering configuration may have too many or too few clusters. However, the computation of the silhouette score is expensive and incurs additional overhead; for large datasets, it may not be feasible to compute the score for all objects/samples/records.

More info on silhouette: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29

In the underlying clustering problem, a decision has to be made between K = 4 and K = 6. We can use silhouette_score from sklearn, which returns the mean silhouette score over all samples, for both solutions.
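For reference, the silhouette value of a single sample i compares a(i), the mean distance from i to the other points in its own cluster, with b(i), the mean distance from i to the points in the nearest other cluster:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

The score reported by silhouette_score is the mean of s(i) over all samples. For large datasets, sklearn's silhouette_score also accepts a sample_size argument (together with random_state) so the score can be estimated from a random subset instead of every record, e.g. silhouette_score(X, labels, sample_size=10000, random_state=rs).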
In [19]:
from sklearn.metrics import silhouette_score

print(clusters[1])
print("Silhouette score for k=4", silhouette_score(X, clusters[1].predict(X)))

print(clusters[2])
print("Silhouette score for k=6", silhouette_score(X, clusters[2].predict(X)))

KMeans(n_clusters=4, n_jobs=10, random_state=42)
Silhouette score for k=4 0.33091719115444046
KMeans(n_clusters=6, n_jobs=10, random_state=42)
Silhouette score for k=6 0.25399659182558476

silhouette_score returns a mean silhouette score of 0.33 for K = 4 and 0.25 for K = 6. This shows that the clusters in the K = 4 solution are more appropriately matched to their own clusters than in the K = 6 solution. Therefore, we can choose K = 4 over K = 6 on the basis of this score.
In [20]:
# visualisation of the K=4 clustering solution
model = KMeans(n_clusters=4, random_state=rs)
model.fit(X)

# sum of intra-cluster distances
print("Sum of intra-cluster distance:", model.inertia_)

print("Centroid locations:")
for centroid in model.cluster_centers_:
    print(centroid)

y = model.predict(X)
df2['Cluster_ID'] = y

# how many in each
print("Cluster membership")
print(df2['Cluster_ID'].value_counts())

# pairplot (an alpha value can be passed via plot_kws to assist with overlapping points)
cluster_g = sns.pairplot(df2, hue='Cluster_ID', diag_kind='hist')
plt.show()

Sum of intra-cluster distance: 42564.197151149216
Centroid locations:
[-0.33314101 2.30743581 0.22102952]
[-0.40106113 -0.19147382 -0.89295077]
[1.80312021 0.4098886 0.84130225]
[-0.15468572 -0.37277263 0.79552382]
Cluster membership
1    14526
3    10853
2     4544
0     2156
Name: Cluster_ID, dtype: int64
4. Running the Agglomerative Clustering Algorithm

As an alternative to K-means clustering, which uses a centroid-based (or partitional) approach, agglomerative/hierarchical clustering is also commonly used to cluster a dataset. Agglomerative clustering starts from the bottom, assigning each data point to its own cluster, and then recursively merges the pair of clusters with the smallest linkage distance until the requested number of clusters remains.
Similar to KMeans, you need to import an agglomerative clustering algorithm using the sklearn.cluster
module.
In [21]:
from sklearn.cluster import AgglomerativeClustering

Once the algorithm is imported, you can build a model using the following code. You also need to specify K, the number of clusters; here, we will use K = 3. For visualisation purposes (later in this section), we will only build this model on 50 data points, but agglomerative clustering can handle many more data points just fine.
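Note that AgglomerativeClustering also takes a linkage parameter ('ward' by default; recent sklearn versions also offer 'complete', 'average' and 'single'), which controls how the distance between two clusters is measured when deciding which pair to merge next, for example:

agg_model = AgglomerativeClustering(n_clusters=3, linkage='ward')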
In [22]:
agg_model = AgglomerativeClustering(n_clusters=3)
agg_model.fit(X[:50])  # subset of X, only 50 data points

Out[22]:
AgglomerativeClustering(n_clusters=3)

Once the model is built, the dendrogram of this model can be visualised using the following code.

In [23]:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0] + 2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [24]:
plot_dendrogram(agg_model, labels=agg_model.labels_)
plt.show()

Cluster labels are presented on the X axis, with the last K = 3 joins at the top of the tree corresponding to the final clusters.
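The dendrogram above uses uniform placeholder distances, because the fitted model does not provide the real merge distances (as noted in the code comment). If you want merge heights that reflect the actual distances, one alternative is to build the linkage matrix with scipy directly; this is a sketch in addition to the tutorial code, not a replacement for it:

from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(X[:50], method='ward')   # scipy computes the merge distances itself
dendrogram(Z)
plt.show()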
5. Running the Kprototypes Clustering

Let us consider another dataset 'adult.csv' with three variables:
age: Age of the person
workclass: Category of the work
fnlwgt: A de-identified variable

If we have to cluster the adults based on these three attributes, Kmeans cannot be used, as the variable workclass is categorical. Kprototypes is a clustering method that can handle both numeric and categorical variables, so Kprototypes should be used to cluster this dataset instead of Kmeans. Next, we will load this new dataset and build a Kprototypes clustering model.
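Under the hood, K-prototypes combines the two variable types into one dissimilarity: squared Euclidean distance on the numeric attributes plus a weight (gamma) times the number of mismatched categorical attributes. A minimal sketch of that idea is shown below; it is illustrative only, not the kmodes library's internal code, and the library derives a default gamma from the numeric variables' spread if you do not supply one:

# mixed dissimilarity between a record and a cluster prototype
def mixed_dissimilarity(x_num, c_num, x_cat, c_cat, gamma=1.0):
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, c_num))     # squared Euclidean part
    categorical_part = sum(1 for a, b in zip(x_cat, c_cat) if a != b)  # simple matching part
    return numeric_part + gamma * categorical_part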
In [25]:
# Load the dataset
import pandas as pd

df = pd.read_csv(r'adult.csv')
df.info()
print(df['workclass'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11008 entries, 0 to 11007
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        11008 non-null  int64
 1   workclass  11008 non-null  object
 2   fnlwgt     11008 non-null  int64
dtypes: int64(2), object(1)
memory usage: 258.1+ KB
[' Private' ' Local-gov' ' Self-emp-not-inc' ' Federal-gov' ' State-gov'
 ' Self-emp-inc' ' Without-pay' ' Never-worked']

workclass is a categorical variable and has to be mapped to numeric values before using it in the model.

In [26]:
from sklearn.preprocessing import StandardScaler

# mapping each workclass category to a number (the exact codes for the last four categories are an assumption)
workclass_map = {' Private': 1, ' Local-gov': 2, ' Self-emp-not-inc': 3, ' Federal-gov': 4,
                 ' State-gov': 5, ' Self-emp-inc': 6, ' Without-pay': 7, ' Never-worked': 8}
df['workclass'] = df['workclass'].map(workclass_map)

In [27]:
# convert df to matrix
X = df.to_numpy()

# scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

The next step is to build a Kprototypes model. The kmodes library allows us to do this.
In [28]:
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

Unlike KMeans, KPrototypes does not support the calculation of inertia. However, the clustering cost, defined as the summed distance of all points to their respective cluster centroids, can be calculated and used to identify the optimal K. The cost_ attribute of KPrototypes returns this cost value.

In [29]:
# list to save the clusters and cost
clusters = []
cost_vals = []

# this whole process should take a while
for k in range(2, 10, 2):
    # train clustering with the specified K
    model = KPrototypes(n_clusters=k, random_state=rs, n_jobs=10)
    model.fit_predict(X, categorical=[1])

    # append model to cluster list
    clusters.append(model)
    cost_vals.append(model.cost_)

In [30]:
# plot the cost vs K values
plt.plot(range(2, 10, 2), cost_vals, marker='*')
plt.show()

By applying the elbow method to the above plot, the optimal value for K lies between 4 and 6. The silhouette score has to be calculated to find the optimal value.

Due to the presence of mixed data types (numeric and categorical), the calculation of the silhouette score for Kprototypes is different from KMeans. For Kprototypes, two silhouette scores, one for the numeric variables and one for the categorical variables, should be calculated separately and then averaged. We will first see how to calculate this value for K = 4.
In [31]:
X_num = [[row[0], row[2]] for row in X]  # variables of X with numeric datatype
X_cat = [[row[1]] for row in X]  # variables of X with categorical datatype

model = clusters[1]  # clusters[1] holds the K-prototypes model with K=4

from sklearn.metrics import silhouette_score

# Calculate the silhouette score for the numeric and categorical variables separately
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
print("Silscore for numeric variables: " + str(silScoreNums))

silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
print("Silscore for categorical variables: " + str(silScoreCats))

# Average the silhouette scores
silScore = (silScoreNums + silScoreCats) / 2
print("The avg silhouette score for k=4: " + str(silScore))

Silscore for numeric variables: 0.3343021534172589
Silscore for categorical variables: -0.09998949153471454
The avg silhouette score for k=4: 0.11715633094127217

Now your task is to identify the optimal K by calculating the silhouette score for K = 6 and K = 8. Then, visualise the optimal cluster using pairplot and distplot and interpret the results, similar to KMeans.

In [32]:
model = clusters[1]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=4: " + str(silScore))

The avg Silhouette score for k=4: 0.11715633094127217

In [33]:
model = clusters[2]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=6: " + str(silScore))

The avg Silhouette score for k=6: 0.07907669640648304

In [34]:
model = clusters[3]
silScoreNums = silhouette_score(X_num, model.fit_predict(X, categorical=[1]), metric='euclidean')
silScoreCats = silhouette_score(X_cat, model.fit_predict(X, categorical=[1]), metric='hamming')
silScore = (silScoreNums + silScoreCats) / 2
print("The avg Silhouette score for k=8: " + str(silScore))

The avg Silhouette score for k=8: 0.07049603177523696
In [35]:
import seaborn as sns
import matplotlib.pyplot as plt

model = clusters[1]
y = model.fit_predict(X, categorical=[1])
df['Cluster_ID'] = y

# how many records are in each cluster
print("Cluster membership")
print(df['Cluster_ID'].value_counts())

# pairplot the cluster distribution.
cluster_g = sns.pairplot(df, hue='Cluster_ID', diag_kind='hist')
plt.show()

Cluster membership
3    3452
0    3311
1    2860
2    1385
Name: Cluster_ID, dtype: int64
In [36]:
import pandas as pd
import numpy as np

# prepare the columns and bin size. Increase the bin size to be more specific, but 20 is more than enough
cols = ['age', 'workclass', 'fnlwgt']
n_bins = 20

clusters_to_inspect = [0, 1, 2, 3]

for cluster in clusters_to_inspect:
    print("Distribution for cluster {}".format(cluster))

    fig, ax = plt.subplots(nrows=3)
    ax[0].set_title("Cluster {}".format(cluster))

    for j, col in enumerate(cols):
        bins = np.linspace(min(df[col]), max(df[col]), 20)

        sns.distplot(df[df['Cluster_ID'] == cluster][col], bins=bins, ax=ax[j], norm_hist=True)
        sns.distplot(df[col], bins=bins, ax=ax[j], hist=False, color="k")

    plt.tight_layout()
    plt.show()

Output: "Distribution for cluster 0" through "Distribution for cluster 3", each followed by the three distribution plots. (The seaborn FutureWarnings about distplot and the bw parameter can be ignored.)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Fu
tureWarning: `distplot` is a deprecated function and will be removed in a fu
ture version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: Fu
tureWarning: The `bw` parameter is deprecated in favor of `bw_method` and `b
w_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new pa
rameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Fu
tureWarning: `distplot` is a deprecated function and will be removed in a fu
ture version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots). warnings.warn(msg, FutureWarning) Distribution for cluster 1 C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure
-level function with similar flexibility) or `histplot` (an axes-level fun
ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699: FutureWarning: The `bw` parameter is deprecated in favor of `bw_method` an
d `bw_adjust`. Using 1.5 for `bw_method`, but please see the docs for the new parameters and update your code. warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure
-level function with similar flexibility) or `kdeplot` (an axes-level func
tion for kernel density plots). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure
-level function with similar flexibility) or `histplot` (an axes-level fun
ction for histograms). warnings.warn(msg, FutureWarning) C:\Users\bsthi\anaconda3\lib\site-packages\seaborn\distributions.py:1699:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Distribution for cluster 2
[Per-feature distribution plots for cluster 2 not shown; repeated FutureWarnings omitted.]
Distribution for cluster 3
[Per-feature distribution plots for cluster 3 not shown; repeated FutureWarnings omitted.]
End Notes
We learned how to build, tune, and explore clustering models, and we used visualisation to help explain the cluster/segment profiles produced by each model. The goal of cluster analysis is to identify distinct groupings of cases across a set of inputs without the presence of a target variable.
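As a compact recap of that workflow, the sketch below strings the main steps together. It is an illustrative outline under assumed names: the file customers.csv, the silhouette-based choice of k, and the Cluster column are examples for this sketch, not the tutorial's exact code or data.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Prepare the data: numeric inputs only, standardised, no target variable
df = pd.read_csv("customers.csv")
X = StandardScaler().fit_transform(df)

# Tune the number of clusters by comparing silhouette scores
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# Fit the final model and attach labels for cluster/segment profiling
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
df["Cluster"] = final.labels_
print(df.groupby("Cluster").mean())

From here, per-cluster distribution plots such as those shown earlier can be regenerated to visualise and explain each segment's profile.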