Ceara Stewart
IST 707 HW 4

Introduction

It is taught in history class that great minds such as Alexander Hamilton, James Madison, and John Jay had a part to play in the creation of the Constitution of the United States of America. The Federalist Papers, a series of eighty-five essays written by Hamilton, Madison, and Jay, urged the citizens of New York to ratify the Constitution. The essays were originally published under the pen name "Publius," and only later was it revealed that Hamilton, Madison, and Jay were the writers. Not every essay has a clear author, however: eleven of the essays were claimed by both Hamilton and Madison, and no definitive author has been established for them. This paper looks at a dataset of the eighty-five essays from The Federalist Papers and aims to use clustering methods to shed light on who wrote those eleven disputed essays.

Data and its Source

The data for this paper consists of the eighty-five essays from The Federalist Papers: fifty-one written by Alexander Hamilton, fifteen written by James Madison, three co-authored by Hamilton and Madison, five written by John Jay, and eleven disputed between Hamilton and Madison. Each essay is represented by the frequencies of a set of function words, expressed as the percentage of the essay's words that each function word accounts for. This paper uses both R and Weka to run the clustering algorithms k-Means, expectation maximization (EM), and hierarchical agglomerative clustering (HAC) to try to solve the mystery of who wrote those eleven disputed papers.

Data Exploration and Data Cleaning

The Federalist Papers data is provided as both an Excel CSV file and an ARFF file, both titled "fedPapers85". For Weka, the data cleaning is relatively simple. After opening the fedPapers85 ARFF file in Weka, I checked for numerical data and used the numeric-to-nominal filter to convert the numeric data to nominal form so that the clustering algorithms could run. When running the different algorithms, it was important to ignore the "author" column so that it does not influence the clusters. We do this because we are trying to cluster the function words within the authored papers, to see whether a pattern emerges that can then be used to compare the authored papers to the disputed papers based on the word features. When visualizing the results, each point represents a paper, so when we compare the author to various word features, we can see the cluster assigned to each paper.

For R, the data cleaning process was a bit more intensive. First it was necessary to load the following packages so that the required commands and functions would run: "readr", "dplyr", "cluster", "fpc", "mclust", and "ggplot2". I read in the "fedPapers85" CSV file using the "read.csv()" function and named it "fedPapers85". Within the "read.csv()" call I set the "na.strings" argument so that any missing data values would be flagged automatically. For each clustering algorithm I ran within R, I removed the "author" column and saved the result as a new data frame using "dataset %>% select(-c(column))". As with Weka, this was done so that the clusters are built from the word features only. I then made the papers the row names using "rownames(dataset) <- dataset[,1]" followed by "dataset[,1] <- NULL". When running each clustering algorithm I also converted numeric values to nominal values using the "as.factor" function so that the algorithms would cluster correctly.
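As a minimal sketch of this R preparation step, assuming the CSV sits in the working directory as "fedPapers85.csv" and that its first two columns are "author" and "filename" (the file path and column names are assumptions based on the description above):

```r
# Packages used throughout the analysis
library(readr)
library(dplyr)
library(cluster)
library(fpc)
library(mclust)
library(ggplot2)

# Read the data; na.strings flags empty cells as missing values
fedPapers85 <- read.csv("fedPapers85.csv", na.strings = c("", "NA"))

# Remove the author column and save the result as a new data frame
dataset <- fedPapers85 %>% select(-c(author))

# Use the first remaining column (the paper filename) as the row names
rownames(dataset) <- dataset[, 1]
dataset[, 1] <- NULL
```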
Output and Analysis

I first used Weka to run my analysis. I started with the k-Means algorithm, choosing three clusters for the sake of simplicity, 500 iterations, and a set seed of 10. I also chose to use the Euclidean distance. The number of clusters, iterations, seed, and distance type were kept constant across all algorithms. After ignoring the "author" column, I ran k-Means and visualized the results. The k-Means algorithm is built around the centroid of each cluster, the cluster's center of gravity, for which the mode is used in the case of nominal data. When looking at the cluster output, the value reported for each word feature within each cluster can therefore be read as the mode of that word feature. When comparing the author, and then the filename, against different word features using Weka's visualization feature, all three clusters are heavily present for Hamilton, while for Madison most word features fall in the first cluster. Below is a screenshot from Weka with author on the x-axis and the word feature 'to' on the y-axis.

k-Means: author vs 'to'
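The same configuration can be sketched in R for comparison. This is an illustration rather than the Weka run itself: it keeps the word-feature columns numeric, since base R's kmeans() requires numeric input, and it mirrors the cluster count, iteration limit, and seed described above.

```r
# k-Means with the settings described above: 3 clusters, up to 500
# iterations, seed of 10; kmeans() uses Euclidean distance by default.
set.seed(10)
kmeansResult <- kmeans(dataset, centers = 3, iter.max = 500)

# Centroid (mean word frequency) per cluster, and a cross-tab of the
# known author labels against the cluster assigned to each paper
kmeansResult$centers
table(fedPapers85$author, kmeansResult$cluster)
```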
Next, I ran the expectation maximization (EM) algorithm in Weka. The EM algorithm uses log-likelihoods to estimate the distribution of the latent cluster variables. When comparing each feature word against author, Hamilton is primarily made up of cluster two. Clusters one and three are not as prevalent in this run, and, compared to the k-Means results, Madison now has no clear cluster assignment. Another interesting development in this model is that the co-authored (HM) papers and Jay's papers are mostly made up of points from cluster two. We also now see that the disputed papers contain points mainly from cluster two. Below is a Weka screenshot of author versus the word feature 'no'.

EM: author vs 'no'
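For comparison on the R side, the mclust package loaded earlier fits mixture models by EM. The sketch below is not the Weka EM configuration; it is a hedged stand-in that fits a three-component Gaussian mixture to the numeric word features and cross-tabulates the resulting classification against the known authors.

```r
# EM-based mixture clustering with mclust (Gaussian mixture, 3 components)
library(mclust)
set.seed(10)
emResult <- Mclust(dataset, G = 3)

# Summary of the fitted model and the papers' cluster assignments by author
summary(emResult)
table(fedPapers85$author, emResult$classification)
```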
The last algorithm I ran in Weka was hierarchical agglomerative clustering (HAC). HAC produces a set of nested clusters organized as a hierarchical tree, which records the sequence of merges or splits. After trying various linkage types, I found the centroid method, which measures the distance between cluster centroids, to produce the best results and to spread the word features across more of the clusters. As with the k-Means algorithm, Hamilton and his papers are mainly composed of points from cluster two, while Madison and his papers are mainly composed of points from cluster one. Looking across the various word features, the disputed papers show a more even make-up of cluster one and cluster two. Below is a screenshot of author versus the word feature 'have'.

HAC: author vs 'have'

I used R to compare against the results found using Weka and to confirm the assumptions that will be presented in the conclusion. I only ran the k-Means and HAC algorithms within R, as their visualizations within Weka showed similar clustering outcomes. As in Weka, I ran three clusters for k-Means within R, which produced the cluster plot below. Two of the clusters overlap and contain most of the paper points, while the remaining cluster covers a larger area but contains fewer paper points. To visualize the clusters by author, I had R construct a bar graph in which each bar represents an author and the clusters are broken down by color. Papers written by Hamilton fall only in clusters two and three, while papers written by Madison fall within all three clusters, with the majority in cluster two. The disputed papers encompass all three clusters, but cluster one makes up a significantly smaller proportion than clusters two and three.

k-Means: Cluster Plot

k-Means: Bar Graph
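A sketch of how these two R visualizations might be produced, reusing the kmeansResult and dataset objects from the earlier sketch. The clusplot() call from the cluster package and the ggplot2 bar chart are assumptions, since the original plotting code is not shown in the text.

```r
# Cluster plot: papers projected onto the first two principal components,
# colored by their k-Means cluster
clusplot(dataset, kmeansResult$cluster, color = TRUE, shade = TRUE,
         labels = 0, lines = 0, main = "k-Means: Cluster Plot")

# Bar graph: one bar per author, filled by cluster membership
plotData <- data.frame(author = fedPapers85$author,
                       cluster = factor(kmeansResult$cluster))
ggplot(plotData, aes(x = author, fill = cluster)) +
  geom_bar() +
  labs(title = "k-Means: Bar Graph", x = "Author", y = "Number of papers")
```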
For the HAC evaluation in R, I chose to use the Euclidean distance, as that is the distance I used within Weka. Below is the cluster dendrogram R produced. From the dendrogram, you can see that the first cluster is made up only of papers written by Jay. This matches what was seen when running the k-Means algorithm in R, where papers written by Jay appear only in cluster one. The second cluster is made up of three papers written by Hamilton, one disputed paper, and five papers written by Madison. Cluster three contains the remaining papers.

HAC: Euclidean Distance Dendrogram
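A minimal sketch of that HAC run, again reusing the dataset object. The Euclidean distance matches the text above; the centroid linkage is an assumption carried over from the Weka choice, since the linkage actually used in R is not stated.

```r
# Pairwise Euclidean distances between papers, then agglomerative clustering
distMatrix <- dist(dataset, method = "euclidean")
hacResult <- hclust(distMatrix, method = "centroid")

# Dendrogram with paper filenames as leaf labels, then a three-cluster cut
plot(hacResult, main = "HAC: Euclidean Distance Dendrogram", cex = 0.5)
hacClusters <- cutree(hacResult, k = 3)
table(fedPapers85$author, hacClusters)
```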
Conclusion

Based on the data presented, my assumption would be that both Hamilton and Madison have claims to the disputed essays, but that Madison has claim to more of those disputed essays than Hamilton. The k-Means run within Weka suggests that Madison wrote most of the disputed essays, as most of the essays written by Madison belong to the same cluster as most of the disputed essays. The EM run within Weka suggests that Hamilton and Madison each have claim to different disputed essays, as their individual papers belong to all three clusters, as do the disputed essays. The HAC run in Weka likewise suggests that Hamilton and Madison have claim to different disputed essays, much like the EM results. The k-Means run within R suggests that Madison has claim to the disputed essays, as his essays fall within the same cluster proportions as the disputed essays. The HAC dendrogram suggests that Madison has claim to most of the disputed essays: if you follow the disputed essays up the branches, you can see that they stem from the same branches as most of the essays Madison wrote. When we consider the essays co-authored by Hamilton and Madison, most algorithm runs split these joint papers between the cluster mainly containing essays written by Hamilton and the cluster mainly containing essays written by Madison, which is the expected outcome. Another constant among all the algorithm runs is that the papers written by Jay occupy only one of the clusters.

Of all the algorithms run, I chose to base my assumption on the HAC dendrogram produced in R and the k-Means run produced in Weka. This is because of the extremely high SSE produced by the k-Means algorithm, and because the k-Means method is relatively efficient, as seen in how Hamilton's essays fell mainly in one cluster while Madison's essays fell mainly in another. The HAC dendrogram in R corresponds readily to meaningful taxonomies, so the branch splits of the tree can be interpreted as meaningful. Both the k-Means run in Weka and the HAC run in R led me to the conclusion that most of the disputed essays were written by Madison. Despite that conclusion, I think a better way to approach this problem would be to analyze only the disputed essays and the essays written by Hamilton and Madison. Since the k-Means algorithm and the various HAC linkage types are sensitive to outliers and noise, removing the jointly authored essays and the essays written by Jay would reduce noise and the skew introduced by outliers, and should produce more accurate results.
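As a hedged sketch of that proposed follow-up, the data could be filtered to just the Hamilton, Madison, and disputed papers before re-running the clustering. The author label values ("Hamilton", "Madison", "dispt") and the "filename" column name are assumptions about how the fedPapers85 file encodes them.

```r
# Keep only the Hamilton, Madison, and disputed papers
# (label values and column names are assumptions about fedPapers85)
subsetPapers <- fedPapers85 %>%
  filter(author %in% c("Hamilton", "Madison", "dispt"))

# Word features only, then re-run k-Means with two clusters,
# one for each candidate author
subsetFeatures <- subsetPapers %>% select(-c(author, filename))
set.seed(10)
kmeansSubset <- kmeans(subsetFeatures, centers = 2, iter.max = 500)
table(subsetPapers$author, kmeansSubset$cluster)
```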