Shaun McKellar Jr
Syracuse University
Apr 3, 2024
IST 707 Applied Machine Learning
HW 4: Clustering. Use Clustering to Solve a Mystery in History

Introduction

The Federalist Papers are a collection of 85 essays aimed at persuading the people of New York to support the adoption of the newly proposed U.S. Constitution. These essays, authored by Alexander Hamilton, James Madison, and John Jay, were initially published anonymously under the pseudonym "Publius" in New York newspapers during 1787 and 1788. Although a bound version of the essays appeared in 1788, it wasn't until the 1818 edition, printed by Jacob Gideon, that the true authors were disclosed. The Federalist Papers hold immense significance as a key resource for interpreting the original intentions behind the Constitution. Among these 85 essays, Alexander Hamilton is credited with writing 51, James Madison with 15, John Jay with 5, and Hamilton and Madison collaborated on 3. However, there is ongoing debate about the authorship of the remaining 11 essays. Historians have long grappled with the question of whether these essays should be attributed to Hamilton or Madison.

About the Data

The Federalist Papers data set was used to conduct this analysis. It initially contained 85 rows and 72 columns. Each row referred to a paper written by one of the authors, and 70 of the columns each represented a word used within the papers; the value in each cell was that word's relative frequency within a particular document. The remaining two columns held the author's name and the file's name (the paper in question). The data set contained no missing values, but data cleansing and transformation were still necessary. The columns containing the author's name and the file's name were not needed for the clustering analyses, but keeping the file's name as the row label was necessary to identify a particular observation.
However, the file names were very long, which could decrease utility when attempting to identify observations from the clustering analyses and graphs. In this R code, a series of data preprocessing and exploratory steps were carried out. To begin with, several R libraries were loaded, such as wordcloud, quanteda, arules, and ggplot2, providing tools for text mining, data analysis, and visualization. The working directory was set to a specific location on the desktop, ensuring that R could locate and save files. The "Federalist Papers" dataset was loaded from a CSV file, and a backup copy called "FederalistPapers_Orig" was created to preserve the original data. The dataset was then explored using the View function to interactively examine its contents, and a check for missing values was conducted to ensure data completeness. To prepare the text data for analysis, thresholds for term frequency were set to filter out overly common and extremely rare words. Additionally, a list of stop words, including common English words, was defined to exclude them from the analysis. Furthermore, a summary of the "Federalist Papers" dataset was generated to gain insights into its structure and content. Lastly, available transformations were inspected. These preprocessing steps are crucial in text mining and natural language processing projects, as they lay the foundation for meaningful analysis by addressing data quality, term frequency, and stop words to focus on relevant patterns and insights within the text data.

Model/Results
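Although the assignment itself used R, the frequency-threshold and stop-word filtering just described can be illustrated with a short Python sketch. The stop-word list, thresholds, and document below are hypothetical, not values taken from the actual analysis:

```python
# Illustrative sketch (not the original R code): keep terms that are not stop
# words and whose relative frequency falls inside a chosen band.
STOP_WORDS = {"the", "and", "of", "to", "a"}   # hypothetical stop-word list
MIN_FREQ, MAX_FREQ = 0.001, 0.05               # hypothetical frequency thresholds

def filter_terms(term_freqs):
    """Drop stop words, overly common terms, and extremely rare terms."""
    return {
        term: freq
        for term, freq in term_freqs.items()
        if term not in STOP_WORDS and MIN_FREQ <= freq <= MAX_FREQ
    }

doc = {"the": 0.09, "upon": 0.004, "whilst": 0.002, "zebra": 0.0001}
print(filter_terms(doc))  # {'upon': 0.004, 'whilst': 0.002}
```

The same two-sided filtering idea applies regardless of toolkit: very common words carry little discriminative signal unless deliberately retained, and very rare words add noise.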
The centroid values on these dimensions should ideally be distant from each other to effectively differentiate the clusters. In the context of k-Means clustering, attributes like "a," "and," "as," "be," "by," "in," "is," "of," "that," and "the" are deemed the most valuable for this clustering process.
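For readers unfamiliar with k-Means, the underlying procedure (Lloyd's algorithm: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points) can be sketched on a single made-up word-frequency dimension. This is an illustrative Python analogue, not the R run from the analysis:

```python
# Illustrative Lloyd's-algorithm k-means on one hypothetical word-frequency
# dimension (not the R k-Means run described in this report).
def kmeans_1d(xs, centroids, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for x in xs:
            # assignment step: each point goes to its nearest centroid
            idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            groups[idx].append(x)
        # update step: centroid = mean of assigned points (kept if group is empty)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Hypothetical relative frequencies of one function word across six papers:
freqs = [0.010, 0.012, 0.011, 0.030, 0.031, 0.029]
print(kmeans_1d(freqs, [0.0, 0.05]))  # centroids settle near 0.011 and 0.030
```

Well-separated centroid values like these are exactly what makes an attribute useful for differentiating clusters.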
The graph above demonstrates a common method for determining the optimal number of clusters in unsupervised learning, particularly for algorithms like k-Means, which require the user to specify the number of clusters (k) before the algorithm is run. The analysis of clustering in this context reveals several important insights. Firstly, there is a notable drop in the silhouette score when transitioning from 2 to 3 clusters, indicating that the data does not naturally align with 3 clusters as well as it does with 2. Following this drop, as the number of clusters increases from 3 to 10, the silhouette scores stabilize, showing only a slight decreasing trend. This plateau signifies that expanding the number of clusters beyond 3 does not yield significant improvements in the clustering structure.
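The silhouette coefficient behind this comparison can be sketched in pure Python on made-up one-dimensional data. In this toy example the clean two-cluster split scores far higher than a forced three-way split, mirroring the drop described above:

```python
# Illustrative silhouette computation (one-dimensional points, absolute-value
# distance); the data and labelings are made up. Assumes every cluster has at
# least two points.
def silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b), where a is the mean distance to the
    point's own cluster and b is the mean distance to the nearest other cluster."""
    n = len(points)
    scores = []
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        a = sum(abs(points[i] - points[j]) for j in same) / len(same)
        b = min(
            sum(abs(points[i] - points[j]) for j in range(n) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(round(silhouette(pts, [0, 0, 0, 1, 1, 1]), 3))  # 0.973: clean 2-way split
print(round(silhouette(pts, [0, 0, 1, 1, 2, 2]), 3))  # 0.318: forced 3-way split
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below indicate points sitting between clusters, which is why the drop from k=2 to k=3 is informative.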
The bar chart presented above visually represents the outcomes of a cluster analysis applied to resolve the authorship question surrounding the Federalist Papers. Upon examining the chart, I made several observations. Firstly, the majority of the papers are attributed to Hamilton and are distributed across various clusters. Secondly, Madison's papers are also identifiable, though they are notably fewer than those attributed to Hamilton. Jay, on the other hand, has the fewest papers linked to him, aligning with historical records that indicate he authored only a limited number of papers. Lastly, the disputed papers, indicated as "dispt," appear to be dispersed among different clusters.
In this section of R code, another Hierarchical Agglomerative Clustering (HAC) analysis is performed on the "FederalistPapers" dataset, using a different distance metric and linkage method. The code calculates the distance matrix "distance3" using the Manhattan distance method, which measures the absolute differences between data points along each dimension. This distance matrix is then used for hierarchical clustering with the complete linkage method, creating the hierarchical structure of the data. Although difficult to see, the dendrogram shows clustering results similar to those of the k-Means analysis. HAC was also performed using average linkage (as opposed to complete linkage), and similar results were achieved. The key takeaway from the HAC is that the majority of disputed papers were clustered on branches containing Madison's papers.

Conclusion

Therefore, despite instances of imperfect clustering, the data suggests that Madison was the author of the 11 disputed papers. Thus, this brings the distribution of the 85 papers to 51
written by Hamilton, 26 written by Madison, 5 written by Jay, and 3 written by both Hamilton and Madison.

The mystery surrounding the authorship of the disputed Federalist Papers remains unsolved through my analysis. While the clustering results offered valuable insights and suggest potential authorship patterns, they do not provide conclusive evidence. Confirming the findings would require further research employing advanced text analysis techniques, as well as additional historical evidence. Nonetheless, I do believe Hamilton wrote most of the papers.

The primary difference between the two clustering exercises lies in the methods and distance metrics used. The k-Means clustering relied on Euclidean distance and silhouette scores, whereas the Hierarchical Agglomerative Clustering (HAC) used Manhattan distance and complete linkage. The choice of these metrics can influence the clustering results. However, both exercises showed similar patterns regarding the disputed papers' proximity to Madison's work. The most significant takeaway from this exercise is that text analysis and clustering methods can shed light on historical authorship questions. It underscores the complexity of the authorship attribution challenge and the limitations of relying solely on computational approaches to solve historical mysteries.
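As a closing illustration, the HAC configuration compared above (Manhattan distance with complete linkage) can be sketched in miniature. This is a hypothetical Python analogue on made-up two-dimensional points, not the original R analysis:

```python
# Illustrative HAC with Manhattan distance and complete linkage, in pure
# Python (the report's analysis used R); the points below are made up.
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hac_complete(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain;
    complete linkage defines cluster distance as the maximum pairwise distance."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(manhattan(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i is unaffected
    return clusters

pts = [(0.0, 0.1), (0.1, 0.0), (1.0, 1.1), (1.1, 1.0)]
print(hac_complete(pts, 2))  # [[0, 1], [2, 3]]
```

Cutting the merge process at a chosen number of clusters is the programmatic equivalent of cutting the dendrogram at a given height, which is how the disputed papers' placement on Madison-dominated branches was read off.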