Clustering_assignment_F23_Arun_Pratap

November 28, 2023

0.1 Student name: Arun Pratap Tomar

0.2 UNT ID: 11652000

0.3 Assignment: Clustering

0.3.1 Part 1: Data Wrangling (60 pts)

6 pts for each subtask except for the first one.

[34]: import pandas as pd
      import numpy as np
      from google.colab import files

      # Upload the CSV in Colab, then load it into a DataFrame
      uploaded = files.upload()
      df = pd.read_csv("Credit card transactions - India - Simple.csv")
      df.info()

<IPython.core.display.HTML object>
Saving Credit card transactions - India - Simple.csv to Credit card transactions - India - Simple (3).csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26052 entries, 0 to 26051
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   index      26052 non-null  int64
 1   City       26052 non-null  object
 2   Date       26052 non-null  object
 3   Card Type  26052 non-null  object
 4   Exp Type   26052 non-null  object
 5   Gender     26052 non-null  object
 6   Amount     26052 non-null  int64
dtypes: int64(2), object(5)
memory usage: 1.4+ MB

[35]: # Remove the auxiliary "index" column from df
      df.drop('index', axis=1, inplace=True)

[36]: df.head(5)
[36]:                     City       Date  Card Type Exp Type Gender  Amount
      0           Delhi, India  29-Oct-14       Gold    Bills      F   82475
      1  Greater Mumbai, India  22-Aug-14   Platinum    Bills      F   32555
      2       Bengaluru, India  27-Aug-14     Silver    Bills      F  101738
      3  Greater Mumbai, India  12-Apr-14  Signature    Bills      F  123424
      4       Bengaluru, India   5-May-15       Gold    Bills      F  171574

[37]: df.columns = df.columns.str.lower()
      df['city'] = df['city'].str.replace(', India', '')
      df.head(5)

[37]:              city       date  card type exp type gender  amount
      0           Delhi  29-Oct-14       Gold    Bills      F   82475
      1  Greater Mumbai  22-Aug-14   Platinum    Bills      F   32555
      2       Bengaluru  27-Aug-14     Silver    Bills      F  101738
      3  Greater Mumbai  12-Apr-14  Signature    Bills      F  123424
      4       Bengaluru   5-May-15       Gold    Bills      F  171574

[38]: # Convert the 'date' column from string to datetime
      df['date'] = pd.to_datetime(df['date'])
      df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26052 entries, 0 to 26051
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   city       26052 non-null  object
 1   date       26052 non-null  datetime64[ns]
 2   card type  26052 non-null  object
 3   exp type   26052 non-null  object
 4   gender     26052 non-null  object
 5   amount     26052 non-null  int64
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 1.2+ MB

[39]: '''
      Visualize amount spent on each exp type, the color channel is on gender
      '''
      import seaborn as sns
      sns.barplot(data=df, x='exp type', y='amount', hue='gender')

[39]: <Axes: xlabel='exp type', ylabel='amount'>
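Since seaborn's barplot aggregates with the mean by default, the heights of the bars above can be reproduced numerically. A minimal cross-check sketch, assuming the df from the cells above:

      # Mean amount per ('exp type', 'gender') pair; these are the bar heights
      bar_heights = df.groupby(['exp type', 'gender'])['amount'].mean().unstack('gender')
      print(bar_heights.round(2))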
[40]: # City with the lowest average transaction amount
      citySpending = df.groupby('city')['amount'].mean()
      citySpendingSorted = citySpending.sort_values()
      print(str.title('\'' + citySpendingSorted.head(1).index.tolist()[0] + '\''))

'Bahraich'

[41]: # City with the highest average transaction amount
      print(str.title('\'' + citySpendingSorted.tail(1).index.tolist()[0] + '\''))

'Thodupuzha'

[42]: # Sum of amounts across all card types on Fuel, gender = female,
      # from the beginning of the data until 2014-05-05
      fuel_Fm_df = df[(df['exp type'] == 'Fuel') & (df['gender'] == 'F') &
                      (df['date'] <= '2014-05-05')]
      fuel_Fm_total = fuel_Fm_df['amount'].sum()
      print(fuel_Fm_total)

138077354
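The boolean-mask filter above can also be cross-checked with DataFrame.query. A small sketch, assuming the same df (backticks are needed because the column name 'exp type' contains a space):

      # Same filter expressed with query(); the cutoff date is passed via the @variable syntax
      cutoff = pd.Timestamp('2014-05-05')
      fuel_fm_check = df.query("`exp type` == 'Fuel' and gender == 'F' and date <= @cutoff")
      print(fuel_fm_check['amount'].sum())  # should reproduce the sum printed above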
[43]: # Drop the 'date' column, then keep only the cities that account for
      # more than 0.8% of all transactions
      df.drop('date', axis=1, inplace=True)
      city_percent = df['city'].value_counts(normalize=True)
      filter_cities = city_percent[city_percent > 0.008]
      sort_cities = filter_cities.sort_values(ascending=False)
      print(sort_cities)

Bengaluru         0.136343
Greater Mumbai    0.134078
Ahmedabad         0.134001
Delhi             0.133656
Hyderabad         0.030094
Chennai           0.029710
Kolkata           0.029671
Kanpur            0.029326
Lucknow           0.029134
Jaipur            0.028865
Surat             0.028750
Pune              0.028673
Name: city, dtype: float64

[44]: # Collapse the remaining (rare) cities into a single 'Other' category
      city_count = df['city'].value_counts(normalize=True)
      mask = city_count.lt(0.008)
      df['city'] = df['city'].apply(lambda x: 'Other' if x in city_count[mask].index else x)
      city_count = df['city'].value_counts(normalize=True)
      print(city_count)

Other             0.227698
Bengaluru         0.136343
Greater Mumbai    0.134078
Ahmedabad         0.134001
Delhi             0.133656
Hyderabad         0.030094
Chennai           0.029710
Kolkata           0.029671
Kanpur            0.029326
Lucknow           0.029134
Jaipur            0.028865
Surat             0.028750
Pune              0.028673
Name: city, dtype: float64

The dataset has four categorical columns: "city", "card type", "exp type", and "gender". "City", "exp type", and "gender" have no natural ordering among their categories, so they are converted to numeric values with one-hot encoding. "Card type", on the other hand, has a natural tier ordering, so the next cell encodes it ordinally (with the order Signature < Silver < Gold < Platinum used in the code) and one-hot encodes the remaining three columns.
[45]: from sklearn.preprocessing import OrdinalEncoder

      pd.set_option('display.max_columns', None)
      # Ordinal-encode the card tiers, one-hot encode the other categorical columns
      en_cd = OrdinalEncoder(categories=[['Signature', 'Silver', 'Gold', 'Platinum']], dtype=int)
      df['card type'] = en_cd.fit_transform(df[['card type']])
      df = pd.get_dummies(df, columns=['city', 'exp type', 'gender'])
      df.head(5)

[45]:    card type  amount  city_Ahmedabad  city_Bengaluru  city_Chennai  \
      0          2   82475               0               0             0
      1          3   32555               0               0             0
      2          1  101738               0               1             0
      3          0  123424               0               0             0
      4          2  171574               0               1             0

         city_Delhi  city_Greater Mumbai  city_Hyderabad  city_Jaipur  city_Kanpur  \
      0           1                    0               0            0            0
      1           0                    1               0            0            0
      2           0                    0               0            0            0
      3           0                    1               0            0            0
      4           0                    0               0            0            0

         city_Kolkata  city_Lucknow  city_Other  city_Pune  city_Surat  \
      0             0             0           0          0           0
      1             0             0           0          0           0
      2             0             0           0          0           0
      3             0             0           0          0           0
      4             0             0           0          0           0

         exp type_Bills  exp type_Entertainment  exp type_Food  exp type_Fuel  \
      0               1                       0              0              0
      1               1                       0              0              0
      2               1                       0              0              0
      3               1                       0              0              0
      4               1                       0              0              0

         exp type_Grocery  exp type_Travel  gender_F  gender_M
      0                 0                0         1         0
      1                 0                0         1         0
      2                 0                0         1         0
      3                 0                0         1         0
      4                 0                0         1         0
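A quick sanity check on the encoding can be run right after the cell above. This is a sketch under the assumption that df now holds the encoded frame shown in the output:

      # Every row should carry exactly one city dummy and one exp-type dummy
      print(df.filter(like='city_').sum(axis=1).eq(1).all())
      print(df.filter(like='exp type_').sum(axis=1).eq(1).all())
      # Card tiers should now be ordinal codes 0-3
      print(sorted(df['card type'].unique()))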
0.3.2 Part 2: Clustering (40 pts)

10 pts for each subtask

[46]: # Use sklearn to split df into df_seen and df_unseen
      from sklearn.model_selection import train_test_split

      df_seen, df_unseen = train_test_split(df, test_size=0.2, random_state=42)

      # Print the shape of the training and testing sets
      print("Shape of df_seen:", df_seen.shape)
      print("Shape of df_unseen:", df_unseen.shape)

Shape of df_seen: (20841, 23)
Shape of df_unseen: (5211, 23)

[47]: from sklearn.cluster import KMeans

      kmeans = KMeans(n_clusters=4, random_state=42)
      kmeans.fit(df_seen)
      labels = kmeans.predict(df_unseen)
      print(labels[:5])

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
[0 0 1 1 0]

The two predictions are identical to one another. Note first that scikit-learn's KMeans already defaults to init='k-means++', so with the same random_state cells [47] and [48] run exactly the same configuration, which by itself explains the matching labels. More generally, KMeans with random initialization and KMeans++ can still produce the same clustering for the following reasons:

Random initialization: plain KMeans picks its initial cluster centers at random, while KMeans++ spreads them out; in some situations both nevertheless converge to comparable solutions and clustering outcomes.

Similar data structure: when the clusters are poorly separated or lie very close to one another, there may be little difference between the partitions found by KMeans and KMeans++.

Few clusters: with only a few clusters, the difference between the two initialization techniques may not have a significant effect on the final clustering.

Compared to plain KMeans, however, KMeans++ initialization typically yields better clusterings, and the advantage grows when there are many clusters: because KMeans++ chooses more representative initial centers, there is less chance of getting stuck in a poor local optimum. KMeans++ is also expected to converge faster, which helps with large datasets or many clusters. Since its advantage is not evident on every dataset, it is advisable to try both initializations for a given problem and compare the outcomes.

[48]: from sklearn.cluster import KMeans

      kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
      kmeans.fit(df_seen)
      labels = kmeans.predict(df_unseen)
      print(labels[:5])

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
[0 0 1 1 0]
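Matching the first five labels does not by itself prove that the two runs found the same partition, and the cells above reuse the single kmeans variable. A sketch of a more explicit comparison, keeping the two fitted models in separate (hypothetical) variables and using init='random' so that a genuinely random initialization is compared against k-means++; adjusted_rand_score is invariant to permutations of the cluster IDs:

      from sklearn.cluster import KMeans
      from sklearn.metrics import adjusted_rand_score

      km_random = KMeans(n_clusters=4, init='random', n_init=10, random_state=42).fit(df_seen)
      km_pp = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42).fit(df_seen)

      # ARI = 1.0 means the two label assignments describe the same partition,
      # even if the numeric cluster IDs differ between runs
      print(adjusted_rand_score(km_random.predict(df_unseen), km_pp.predict(df_unseen)))
      print(km_random.inertia_, km_pp.inertia_)  # within-cluster sum of squares of each fit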
In clustering analysis, choosing the number of clusters is a crucial step because it strongly influences the accuracy and interpretability of the results. The ideal number of clusters depends on the dataset, the clustering algorithm, and the problem domain, so there is no one-size-fits-all answer. Several statistically grounded techniques exist for choosing it:

Elbow method: plot the within-cluster sum of squares (WCSS) against the number of clusters and look for the "elbow" point, where the rate of decrease in WCSS starts to level off; beyond this point, additional clusters yield little extra information.

Silhouette analysis: measure how well each data point fits into its assigned cluster; the number of clusters that maximizes the average silhouette width over all data points is taken as the best choice.

Gap statistic: compare the within-cluster variation for different values of k with its expected value under a null reference distribution of the data; the k that maximizes the gap statistic is chosen.

Hierarchical clustering: a dendrogram illustrating the hierarchical relationships between data points can be cut at the level where clusters start to become too small or too large, which gives a suitable number of clusters.

The following code uses the elbow method: it plots the within-cluster sum of squares for each value of k, and the "elbow" point, where the rate of decrease in WCSS levels off, indicates a reasonable number of clusters.

[49]: from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt
      import numpy as np

      # Fit k-means for k = 1..10 and record the inertia (WCSS) of each fit
      wcss = []
      for i in range(1, 11):
          kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
          kmeans.fit(df)
          wcss.append(kmeans.inertia_)

      plt.plot(range(1, 11), wcss)
      plt.title('Elbow Method')
      plt.xlabel('Number of clusters')
      plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
      plt.show()
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
(the same FutureWarning is emitted once for each of the ten KMeans fits in the loop)
[49]: (Elbow Method plot: within-cluster sum of squares versus number of clusters, k = 1 to 10)
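Only the elbow method is coded above; silhouette analysis, the second technique listed, can be sketched with a similar loop. The sketch below assumes the same encoded df and subsamples it, since silhouette_score computes pairwise distances and is expensive on all 26,052 rows; the k with the highest mean silhouette width would be preferred:

      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score

      # Mean silhouette width for k = 2..10 (the silhouette is undefined for k = 1)
      sample = df.sample(5000, random_state=42)
      for k in range(2, 11):
          km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
          labels = km.fit_predict(sample)
          print(k, round(silhouette_score(sample, labels), 4))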