Python_H - Jupyter Notebook
School: Northeastern University
Course: 2510
Subject: Industrial Engineering
Date: Dec 6, 2023
Pages: 5
Uploaded by qwertykeyboard27
11/10/23, 12:48 PM
Python_H - Jupyter Notebook
localhost:8889/notebooks/Python_H.ipynb
1/5
In [1]:
The CLV data gives the average net income per week (in US dollars) that a sample of high school
students earned from part-time work during the summer. It also shows how much they spent
online on goods and services each week (also in US dollars). Amazon.com, the
marketplace for much of this online spending, wants to analyze this data for potential market
segmentation strategies.
In [2]:
In [3]:
Out[2]:
   INCOME  SPEND
0     233    150
1     250    187
2     204    172
3     236    178
4     354    163
Out[3]:
           INCOME       SPEND
count  303.000000  303.000000
mean   245.273927  149.646865
std     48.499412   22.905161
min    126.000000   71.000000
25%    211.000000  133.500000
50%    240.000000  153.000000
75%    274.000000  166.000000
max    417.000000  202.000000
#Load the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#read the data from the file and display the first five rows
dataset = pd.read_csv('CLV.csv')
dataset.head()
#descriptive statistics of the dataset
dataset.describe()
In [4]:
/Users/keneshakhurana/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
Out[4]:
<seaborn.axisgrid.PairGrid at 0x15a61e7d0>
#visualize the raw data
sns.pairplot(dataset)
In [5]:
In [6]:
In [7]:
/Users/keneshakhurana/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_agglomerative.py:1005: FutureWarning: Attribute `affinity` was deprecated in version 1.2 and will be removed in 1.4. Use `metric` instead
  warnings.warn(
#Create a dendrogram of the data using the Ward method
#A horizontal line drawn across the dendrogram can help choose the number of clusters
import scipy.cluster.hierarchy as sch
dend = sch.dendrogram(sch.linkage(dataset,method="ward"))
plt.title("Dendrogram")
plt.xlabel('Customer')
plt.ylabel('Euclidean Distance')
plt.show()
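The horizontal-line idea mentioned in the comments can be sketched as follows. This is a minimal illustration on synthetic points, not the CLV data: the cluster centers, the noise scale, and the names `points`, `n_target`, and `cut_height` are all assumptions made for the example. It picks a cut height between two successive merge distances, draws it on the dendrogram, and uses `scipy.cluster.hierarchy.fcluster` to count the clusters that the cut produces.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Synthetic stand-in for the INCOME/SPEND columns (hypothetical values)
rng = np.random.default_rng(0)
centers = np.array([[180, 120], [240, 150], [300, 170], [380, 160]])
points = np.vstack([c + rng.normal(scale=5, size=(25, 2)) for c in centers])

# Ward linkage, as in the dendrogram cell
Z = sch.linkage(points, method="ward")

# Column 2 of Z holds the merge distances in increasing order.
# A cut placed between the 4th-largest and 3rd-largest merge distance
# undoes the last three merges and so leaves four clusters.
n_target = 4
cut_height = (Z[-n_target, 2] + Z[-(n_target - 1), 2]) / 2

# Flat cluster labels for that cut height
labels = sch.fcluster(Z, t=cut_height, criterion="distance")
n_found = len(np.unique(labels))

# Dendrogram with the horizontal cut line drawn on it
fig, ax = plt.subplots()
sch.dendrogram(Z, ax=ax, no_labels=True)
ax.axhline(y=cut_height, color="red", linestyle="--")
ax.set_title("Dendrogram")
ax.set_xlabel("Customer")
ax.set_ylabel("Euclidean Distance")
```

Each vertical branch that the dashed line crosses corresponds to one cluster, so moving the line up or down changes the cluster count that `fcluster` reports.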
#Try first with 4 clusters
#Fitting Hierarchical Clustering to the dataset
#Number of clusters = 4 and uses Euclidean distances
#(affinity is deprecated since scikit-learn 1.2 in favor of metric; see the FutureWarning output)
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=4,affinity='euclidean') #initiali
y_hc = hc.fit_predict(dataset) #fits the data
#Adds the clusters to the original dataset in a column titled Label
dataset['Label'] = y_hc
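The FutureWarning in the output says `affinity` is being replaced by `metric`. One way to sidestep the rename, sketched here under assumptions, is to leave the distance argument out altogether: Euclidean distance is the default, and Ward linkage (also the default) only works with Euclidean distance. The `demo` frame below is synthetic and only borrows the column names; `hc_demo` and `label_counts` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for CLV.csv (hypothetical values, same column names)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "INCOME": rng.normal(245, 48, size=60).round(),
    "SPEND": rng.normal(150, 23, size=60).round(),
})

# No affinity/metric argument: Euclidean is the default distance,
# so this form does not depend on which name the installed version accepts.
hc_demo = AgglomerativeClustering(n_clusters=4, linkage="ward")

# Fit on the feature columns only, so a label column added later
# is never fed back into a second clustering run by accident.
demo["Label"] = hc_demo.fit_predict(demo[["INCOME", "SPEND"]])

# Number of customers assigned to each cluster
label_counts = demo["Label"].value_counts().sort_index()
```

Selecting the feature columns explicitly is a design choice: it would let a second run with a different `n_clusters` reuse the same frame instead of re-reading the CSV.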
In [8]:
In [9]:
In [10]:
Out[8]:
<Axes: xlabel='INCOME', ylabel='SPEND'>
/Users/keneshakhurana/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_agglomerative.py:1005: FutureWarning: Attribute `affinity` was deprecated in version 1.2 and will be removed in 1.4. Use `metric` instead
  warnings.warn(
#Generate scatterplot to show data and how it was clustered
#Data colored based on Label values, color set by palette
#legend = full ensures entire legend is displayed
sns.scatterplot(x='INCOME',y='SPEND',hue='Label',palette='Set1',legend='f
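The scatterplot call above is cut off in the printout, so its remaining arguments are not known. As a hedged illustration of what a complete call of this kind can look like, the sketch below passes a frame through the `data` parameter and sets `legend="full"`, as the comment describes. The frame `plot_df` and its values are synthetic assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic frame with the same column names as the notebook (hypothetical values)
rng = np.random.default_rng(2)
plot_df = pd.DataFrame({
    "INCOME": rng.normal(245, 48, size=40),
    "SPEND": rng.normal(150, 23, size=40),
    "Label": rng.integers(0, 4, size=40),
})

# hue colors each point by its cluster label; legend="full" lists every label value
ax = sns.scatterplot(
    data=plot_df,
    x="INCOME",
    y="SPEND",
    hue="Label",
    palette="Set1",
    legend="full",
)
```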
#Try next with 7 clusters (replace n_clusters=4 with n_clusters=7)
dataset1 = pd.read_csv('CLV.csv')
from sklearn.cluster import AgglomerativeClustering
hc1 = AgglomerativeClustering(n_clusters=7,affinity='euclidean') #initial
y_hc1 = hc1.fit_predict(dataset1) #fits the data
#Adds the clusters to the original dataset in a column titled Label2
dataset1['Label2'] = y_hc1
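Because agglomerative clustering builds a single merge tree, the 7-cluster solution is a refinement of the 4-cluster one when both use the same linkage and data: each of the seven clusters lies entirely inside one of the four. The sketch below checks this with a cross-tabulation. It runs on synthetic data, and the names `features`, `labels4`, `labels7`, and `overlap` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the INCOME/SPEND columns (hypothetical values)
rng = np.random.default_rng(3)
features = pd.DataFrame({
    "INCOME": rng.normal(245, 48, size=80),
    "SPEND": rng.normal(150, 23, size=80),
})

# Same data and default Ward linkage, two different cut levels of one tree
labels4 = AgglomerativeClustering(n_clusters=4).fit_predict(features)
labels7 = AgglomerativeClustering(n_clusters=7).fit_predict(features)

# Rows: 4-cluster labels, columns: 7-cluster labels, cells: customer counts
overlap = pd.crosstab(
    pd.Series(labels4, name="Label"),
    pd.Series(labels7, name="Label2"),
)

# For every 7-cluster label, count how many 4-cluster labels it overlaps with
parents_per_child = (overlap > 0).sum(axis=0)
```

A table like `overlap` shows which of the broader segments were split when moving from four to seven clusters, which can help decide whether the finer segmentation adds anything useful.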
In [11]:
In [ ]:
Out[11]:
<Axes: xlabel='INCOME', ylabel='SPEND'>
#Generate scatterplot to show data and how it was clustered
#Data colored based on Label2 values, color set by palette
#legend = full ensures entire legend is displayed
sns.scatterplot(x='INCOME',y='SPEND',hue='Label2',palette='Set1',legend=
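For the market-segmentation goal stated at the top, a per-cluster summary is a natural companion to the scatterplot. The notebook does not include this step; the sketch below is one possible way to do it with `groupby`, using a tiny hand-made frame. The values and the name `labelled` are hypothetical.

```python
import pandas as pd

# Tiny hand-made example with a cluster label column (hypothetical values)
labelled = pd.DataFrame({
    "INCOME": [200, 210, 300, 310, 400],
    "SPEND": [120, 130, 160, 170, 150],
    "Label2": [0, 0, 1, 1, 2],
})

# One row per cluster: size, mean weekly income, mean weekly online spend
profile = labelled.groupby("Label2").agg(
    customers=("INCOME", "size"),
    mean_income=("INCOME", "mean"),
    mean_spend=("SPEND", "mean"),
)
```

Applied to the real labelled data, a table like `profile` would describe each segment by its typical income and spend, which is easier to act on than the raw label column.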