ICA #13 - Clustering Using Python

What to submit: a single Word/PDF file with answers for the questions in Part 7 (Try it yourself).

Before you start

You'll need two files to do this exercise: Clustering.ipynb (the Jupyter Notebook file) and Census2000.csv (the data file, adapted from a SAS Enterprise Miner sample data set). Both files can be found on the course site. Download both files and save them to the folder where you keep your Python files.

Part 1: Look at the Data File

1) Open the Census2000.csv data file in Excel. If it warns you, that's ok. Just click "Yes" and/or "OK."

This is the raw data for our analysis. The data file contains 32,038 rows of census data for regions across the United States. Each row represents a region. In this type of input data file for a cluster analysis, each row represents an observation, and each column describes a characteristic of the observation. We will use this data set to create groups (clusters) of similar regions, using these descriptor variables as dimensions. A region should be more similar to the other regions in its cluster than to regions in any other cluster.

For the Census2000 data set, here is the complete list of variables:

Variable | Description
RegionID | postal code of the region (unique identifier for each region)
RegionLongitude | region longitude
RegionLatitude | region latitude
RegionDensityPercentile | region population density percentile (1 = lowest density, 100 = highest density)
RegionPopulation | number of people in the region
MedianHouseholdIncome | median household income in the region
AverageHouseholdSize | average household size in the region

2) Close the Census2000.csv file. If it asks you to save the file, choose "Don't Save".
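If you prefer to peek at the data in Python rather than Excel, a minimal sketch like the following works (assuming pandas is installed; the file name comes from the exercise):

```python
import pandas as pd

# Load the raw census data and take a quick look at it.
df = pd.read_csv("Census2000.csv")
print(df.shape)    # expect (32038, 7): one row per region, one column per variable
print(df.head())   # the first few regions
print(df.dtypes)   # the type of each descriptor variable
```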
Part 2: Explore the Clustering.ipynb Script

1) Open the Clustering.ipynb file in Jupyter. This notebook contains the Python code that performs the clustering analysis.

2) Look at cell 2. It contains the parameters for the clustering analysis. Here's a rundown:

Variable Name | Value | Description
INPUT_FILENAME | Census2000.csv | The data is contained in Census2000.csv
TEST_CLUSTERS | 10 | The number of clusters to test when looking for the optimal number of clusters
NUM_CLUSTER | 5 | The number of clusters to generate for the solution
MAX_ITERATION | 500 | The maximum number of times the algorithm should refine its clustering effort before stopping
RANDOM_STATE | 5 | Random seed, so that results are reproducible across runs
COLUMNS_FOR_ANALYSIS | ["RegionDensityPercentile", "MedianHouseholdIncome", "AverageHouseholdSize"] | The variables to be included in the analysis (check the first row of the data set, or the table above, for variable names within the Census2000 data set)

3) Look at cell 1. These lines install (when needed) and load the packages that perform the clustering analysis and visualization.

Part 3: Execute the Clustering.ipynb Script

1) Run each cell in the Jupyter Notebook file. After some of the cells, you should see an output/visualization.

Part 4: Reading Plots

1) After cell 5 in the Python file, you'll see histograms for the three variables used to cluster the cases. Recall that these variables were specified in cell 2 of the script using the COLUMNS_FOR_ANALYSIS variable.
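For reference, cell 2's parameter definitions probably look something like this sketch, reconstructed from the table above (the actual notebook may format them differently):

```python
# Parameters for the clustering analysis (see the rundown table above).
INPUT_FILENAME = "Census2000.csv"
TEST_CLUSTERS = 10          # how many cluster counts to test for the elbow plot
NUM_CLUSTER = 5             # number of clusters in the final solution
MAX_ITERATION = 500         # cap on k-means refinement iterations
RANDOM_STATE = 5            # random seed, for reproducible results
COLUMNS_FOR_ANALYSIS = ["RegionDensityPercentile",
                        "MedianHouseholdIncome",
                        "AverageHouseholdSize"]
```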
We can see that RegionDensityPercentile, MedianHouseholdIncome, and AverageHouseholdSize each have a right-skewed distribution.

2) Now look at the line graph after cell 11 in the Python file. This shows the total within-cluster sum of squares error (i.e., total within-cluster SSE) as the number of clusters increases. As we would expect, the total error decreases as the data is split into more clusters. We can also see that the marginal benefit of each additional cluster shrinks as the number of clusters grows. The biggest additional benefit is from going from 2 to 3 clusters (of course, you'd see a big benefit going from 1 to 2; a single cluster is really not clustering at all!). Things really start to flatten out around 8 to 10 clusters, which suggests that we probably wouldn't want to create a solution with more clusters than that.

It can be difficult to find the optimal k from a graph like this, but often it is clear. The basic idea is this: if the plot looks like an arm (which means there is a significant drop in the
marginal reduction in SSE after that point), then the elbow of the arm is the optimal k. In cell 12 of this file, we manually calculate the optimal number of clusters with the elbow method.
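Cells 11 and 12 likely compute something like the following sketch, which fits k-means for each candidate k and plots the total within-cluster SSE (sklearn exposes it as inertia_). The name X_scaled stands in for the standardized data prepared earlier in the notebook:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for k = 1 .. TEST_CLUSTERS and record the total
# within-cluster SSE for each k.
ks = range(1, TEST_CLUSTERS + 1)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, max_iter=MAX_ITERATION,
                random_state=RANDOM_STATE, n_init=10)
    km.fit(X_scaled)
    wss.append(km.inertia_)

# The "elbow" of this curve suggests the optimal k.
plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Total within-cluster SSE")
plt.show()
```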
3) From cell 2 of our script, we know that we specified our solution to have 5 clusters.

4) Now look at the output after cell 16. This table characterizes the centroids (cluster averages) of the resulting clusters. For example, the first column shows the centroid for each cluster with respect to RegionDensityPercentile. As we can see, Cluster 1 has the lowest average RegionDensityPercentile, and Clusters 3 and 5 have the highest. The centroids are compared against the population average (which, for standardized data, is always zero). We can see from the first column that Clusters 3, 4, and 5 have an average RegionDensityPercentile higher than the population average, and Clusters 1 and 2 have an average RegionDensityPercentile lower than the population average.
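A sketch of how a centroid table like cell 16's can be produced, again assuming the standardized data is in X_scaled (the notebook's actual code may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Fit the final 5-cluster solution using the parameters from cell 2.
kmeans = KMeans(n_clusters=NUM_CLUSTER, max_iter=MAX_ITERATION,
                random_state=RANDOM_STATE, n_init=10).fit(X_scaled)

# One row per cluster, one column per analysis variable,
# values in standardized units.
centroids = pd.DataFrame(kmeans.cluster_centers_,
                         columns=COLUMNS_FOR_ANALYSIS,
                         index=[f"Cluster {i + 1}" for i in range(NUM_CLUSTER)])
print(centroids)
```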
Part 5: Reading Output

1) Reading summary statistics. Cell 4 shows the summary statistics for each variable.

2) Now look at the statistics after cell 8, after the data has been normalized and missing data has been dropped (left), and after cell 9, after outliers have been removed (right). As you can see, we started with 32,038 observations. After removing the cases with missing data, we were left with 31,951 observations. And after removing the outliers, we were left with 30,892 observations.

3) Look again at the centroids for each cluster after cell 16. Specifically, these are the standardized cluster means. The cluster means are normalized values because the original data was normalized before it was clustered. This was done in cells 7 and 8.
The StandardScaler() function

The StandardScaler() function is used to normalize (standardize) data. It first "centers" each column by subtracting the column mean from the corresponding column, and then "rescales" each column by dividing each (centered) column by its standard deviation.

Why do we need to standardize the data? It is important to standardize the data so that it is all on the same scale. For example, a typical value for household income is going to be much larger than a typical value for household size, and its variance will therefore be larger. By normalizing, we can be sure that each variable has the same influence on the composition of the clusters.

How do we interpret standardized values? For standardized values, "0" is the average value for that variable in the population. Look at cell 9 of the Python file; it shows the summary statistics for the standardized data. After normalization, the mean of each variable is close to 0 and its standard deviation is close to 1.
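Putting these preprocessing steps together, cells 7 through 9 plausibly look something like the sketch below. Note that the |z| > 3 outlier rule here is an assumption for illustration, not necessarily the notebook's exact rule:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Keep only the analysis columns and drop rows with missing values.
df = pd.read_csv(INPUT_FILENAME)
X = df[COLUMNS_FOR_ANALYSIS].dropna()

# Center each column to mean 0 and rescale to standard deviation 1.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Assumed outlier rule: drop rows where any standardized value is more
# than 3 standard deviations from the mean.
X_scaled = X_scaled[(X_scaled.abs() <= 3).all(axis=1)]

print(X_scaled.describe())  # means ~0, standard deviations ~1
```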
Now let's look at the normalized values for cluster (group) 1. As seen, the averages of RegionDensityPercentile (-1.121), MedianHouseholdIncome (-0.560), and AverageHouseholdSize (-0.514) for cluster 1 are all negative, and thus below the population average (i.e., 0). In other words, the regions in cluster 1 are less dense, have lower incomes, and have fewer people per household than the overall population. Contrast that with cluster (group) 5: Cluster 5 has a higher-than-average RegionDensityPercentile (0.874). In other words, these regions are denser than the overall population.

4) Within-Cluster SSE (Cohesion) and Between-Cluster SSE (Separation)

We want to better understand the "quality" of the clusters. Let's look at the within-cluster sum of squares error (i.e., within-cluster SSE). The within-cluster SSE measures cohesion – how similar the observations within a cluster are to each other. Cell 19 contains that statistic for each cluster. The values are presented in order, so 4931.736 is the cohesion for cluster 1, 6773.003 is the cohesion for cluster 2, etc. We can use this to compare the cohesion of this set of clusters to another set of clusters we will create later using the same data.

IMPORTANT: Generally, we want higher cohesion (i.e., observations within a cluster should be tightly grouped); that means a smaller within-cluster SSE. So, the smaller these WSS values are, the higher the cohesion, and the better the clusters.
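Cell 19's per-cluster WSS can be computed with a sketch like this, assuming a fitted kmeans model and the standardized data in X_scaled:

```python
import numpy as np

# Within-cluster SSE (cohesion) for each cluster: the sum of squared
# distances from each observation to its own cluster's centroid.
X = np.asarray(X_scaled)
for k in range(kmeans.n_clusters):
    members = X[kmeans.labels_ == k]
    wss_k = ((members - kmeans.cluster_centers_[k]) ** 2).sum()
    print(f"Cluster {k + 1}: WSS = {wss_k:.3f}")
```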
5) Finally, look at the between-cluster sum of squares error (i.e., between-cluster SSE). The between-cluster SSE measures separation – how different the clusters are from each other (cluster 1 vs. cluster 2, cluster 1 vs. cluster 3, etc.). Cell 20 contains that statistic. We are interested in the average BSS, which gives us the average difference between clusters. We can use this to compare the separation of this set of clusters to another set of clusters we will create later.

IMPORTANT: Generally, we want higher separation (i.e., different clusters should be well separated); that means a higher between-cluster SSE. So, the larger the average BSS value is, the higher the separation, and the better the clusters.
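One common way to compute BSS, sketched below, weights each centroid's squared distance from the grand mean by its cluster size (this is the BSS term in the decomposition TSS = WSS + BSS); cell 20 may compute or average it slightly differently:

```python
import numpy as np

# Between-cluster SSE (separation): each centroid's squared distance
# from the grand mean of the data, weighted by cluster size.
X = np.asarray(X_scaled)
grand_mean = X.mean(axis=0)

bss = []
for k in range(kmeans.n_clusters):
    n_k = int((kmeans.labels_ == k).sum())
    bss_k = n_k * ((kmeans.cluster_centers_[k] - grand_mean) ** 2).sum()
    bss.append(bss_k)
    print(f"Cluster {k + 1}: BSS = {bss_k:.3f}")

print(f"Average BSS: {np.mean(bss):.3f}")
```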
Part 6: Comparing Two Sets of Clustering Results

Now we're going to create another set of clusters (10 clusters instead of 5) and examine the WSS and BSS to understand the tradeoff between the number of clusters, cohesion, and separation.

1) Return to the Clustering.ipynb file in Jupyter Notebook.
2) Look at cell 2 and change the NUM_CLUSTER value from 5 to 10.
3) Re-run each cell.
4) You'll notice now, in the cluster means section (cell 16), there are 10 clusters. Cluster 4 has the highest median household income, while Cluster 3 has the highest average household size. Because these values are normalized, you aren't looking at the actual values (e.g., the number of people in an average household), but they do let you compare clusters to each other.

Most importantly, we can compare the WSS (left) and BSS (right) statistics for this new set of clusters to our previous configuration of 5 clusters (cells 19 and 20). We can see that the WSS error ranges from 1312.907 (cluster 2) to 2027.685 (cluster 6).

5) Compare the 10-cluster solution to our 5-cluster solution, where WSS ranges from 2731.722 (cluster 3) to 6773.003 (cluster 2). The WSS error is clearly lower for our 10-cluster solution; the clusters in the 10-cluster solution have higher cohesion than those in the 5-cluster solution. This makes sense: if we put our observations into more clusters, we'd expect those clusters to (1) be smaller and (2) contain observations that are more similar to each other.

However, we can see that the separation is lower (i.e., worse) in our 10-cluster solution. For the 10-cluster solution, the average BSS is 5459.778; for the 5-cluster solution, the average BSS was 9060.387. This means the clusters in the 10-cluster solution have lower separation than those in the 5-cluster solution. This also makes sense: if we have more clusters using the same data, we'd expect those clusters to be closer together.
How many clusters should I choose?

So, our 10-cluster solution (compared to the 5-cluster solution) has (1) higher cohesion (good) but (2) lower separation (bad). How do we decide which one is better? As you might expect, there's no single answer, but the general principle is to obtain a solution with the fewest clusters of the highest quality.

A solution with fewer clusters is appealing because it is simpler. Take our census example: it is easier to explain the composition of five segments of population regions than ten. Also, when separation is lower you'll have a more difficult time coming up with meaningful distinctions between clusters; the means for each variable will get more and more similar across clusters.

However, too few clusters can also be meaningless. You may get higher separation, but the cohesion will be lower: there is so much variance within each cluster (WSS error) that the average variable value doesn't really describe the observations in that cluster. To see how that works, let's take a hypothetical list of six exam scores:

100, 95, 90, 25, 20, 15

If these were all in a single cluster, the mean exam score would be 57.5. But none of those values is even close to that score; the closest we get is 32.5 points away (90 and 25). If we created two clusters:

100, 95, 90 AND 25, 20, 15

then our cluster averages would be 95 (group 1) and 20 (group 2). Now the scores in each group are much closer to their group means – no more than 5 points away. (A quick runnable version of this example appears after the list below.)

So here's what you can do:
1) Choose solutions with the fewest possible clusters of high quality.
2) But also make sure the cluster means describe distinct groups.
3) Make sure that the range of values on each variable within a cluster is not too large to be useful.
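Here is the exam-score example as a quick check: one cluster yields the unhelpful mean of 57.5, while two clusters recover group means of 95 and 20.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six hypothetical exam scores, as a one-column matrix for k-means.
scores = np.array([[100.0], [95.0], [90.0], [25.0], [20.0], [15.0]])
print(f"One cluster: mean = {scores.mean():.1f}")  # 57.5, close to nobody

# Two clusters: each group mean is within 5 points of every member.
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(scores)
for k in range(2):
    members = scores[km.labels_ == k].ravel()
    print(f"Cluster {k + 1}: scores = {members.tolist()}, "
          f"mean = {members.mean():.1f}")
```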
Part 7: Try it yourself

Use the Clustering.ipynb script and the same Census2000.csv dataset to create a set of 7 clusters.

1) How do the characteristics (RegionDensityPercentile, MedianHouseholdIncome, and AverageHouseholdSize) of cluster 3 compare to the population as a whole?
a. For RegionDensityPercentile and MedianHouseholdIncome, cluster 3 doesn't seem to be that far off from the other clusters. However, its AverageHouseholdSize is clearly much larger than the other clusters'.

2) What is the range of WSS error for those 7 clusters (lowest to highest)?
a. Cluster 3: 2343.260748453353
b. Cluster 4: 2549.1156928598703
c. Cluster 7: 2610.99799429565
d. Cluster 2: 2793.113684239379
e. Cluster 5: 3050.787257554455
f. Cluster 1: 3730.2035865641046
g. Cluster 6: 4211.275483993797

3) Is the cohesion generally higher or lower than in the 5-cluster solution?
a. The cohesion is generally higher with 7 clusters: the WSS values are smaller than in the 5-cluster solution, and smaller WSS means higher cohesion.

4) What is the average BSS error for those 7 clusters?
a. Average BSS: 7158.791131552202

5) Is the separation higher or lower than in the 10-cluster solution?
a. The separation is higher (average BSS of 7158.791 vs. 5459.778 for the 10-cluster solution).

6) Based on the analyses we have done so far, use your own words to summarize how the number of clusters can affect WSS error (cohesion) and BSS error (separation).
a. As the number of clusters increases, the WSS error decreases, so cohesion improves: each cluster's observations sit closer to their centroid. At the same time, the average BSS decreases, so separation worsens: the clusters themselves sit closer together. Choosing the number of clusters therefore means trading off cohesion against separation.