In a previous blogpost, I introduced the project A Republic of Emails, where we created a dataset of the 30k Hillary Clinton Emails by scraping Wikileaks. Now that we have the data, we can start exploring with what I like to call the W-questions: What is the collection about? Where do described events take place? When did these events occur? Who are the actors involved? In this second blogpost, we will look at what the emails from the Hillary Clinton corpus are about. I will describe how we prepared the data to analyse a) the raw text, b) normalised text, and c) entities in the text (named entity recognition). Finally, we will look at a small subset of the emails using Voyant Tools. For all the steps I will point to the respective scripts on our GitHub so you can reproduce the project.
What we have now
In the previous blogpost we explained how we scraped the Wikileaks email collection. We created a JSON and a CSV file containing the metadata, and stored the contents of the emails in individual TXT files, grouped in folders of 1,000 each. The JSON and CSV files contain the following metadata: From, To, Date, Subject, src (the location on the hard disk), url (the location on Wikileaks), and a randomly generated ID.
Below, we will work with the TXT files to look at what the emails are about.
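As a rough illustration of how the metadata and the TXT files relate, the sketch below reads a CSV with the fields listed above and looks up where one email's text is stored. The sample rows, the addresses, the paths, and the exact column names are placeholders I made up for the example; the real file may be laid out differently.

```python
import csv
import io

# Hypothetical two-row sample; the real file and its exact column names may differ.
SAMPLE_CSV = """ID,From,To,Date,Subject,src,url
a1b2,sender@example.org,recipient@example.org,2010-01-01,Statement,emails/0/8.txt,https://example.org/8
c3d4,other@example.org,recipient@example.org,2010-01-02,Schedule,emails/0/9.txt,https://example.org/9
"""


def load_metadata(handle):
    """Return a dict mapping each email ID to its metadata row."""
    return {row["ID"]: row for row in csv.DictReader(handle)}


metadata = load_metadata(io.StringIO(SAMPLE_CSV))

# The src field tells us which TXT file holds the body of this email.
print(metadata["a1b2"]["src"])
```

With the real file you would pass an opened file handle instead of the in-memory sample.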
Processing the data
We will look at the emails mostly by counting words and looking at the most frequently used terms. However, words can vary in spelling or form even when they refer to the same thing. First, to combine similar words for counting, we will normalise the text. Second, we will extract named entities from the texts to look at just those. The scripts for the steps below can all be found on our GitHub.
In order to normalise the text, we do the following. Take for example the sentence “FYI we are putting out the following statement” from email 8.
- Tokenising all the words to remove punctuation and spacing: FYI,we,are,putting,out,the,following,statement
- Lowercasing: fyi,we,are,putting,out,the,following,statement
- Removal of numeric values (not in this example): fyi,we,are,putting,out,the,following,statement
- Stemming based on UEA-Lite: fyi,we,are,put,out,the,follow,statement
- Removal of stopwords: fyi,put,follow,statement
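The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the script from our repository: the stopword list is a tiny made-up sample, and `toy_stem` is a hypothetical stand-in that only strips a trailing "ing", where a real run would use a proper stemmer.

```python
import re

# Tiny illustrative stopword list; a real run would use a much longer one.
STOPWORDS = {"we", "are", "out", "the", "a", "an", "and", "of", "to", "in"}


def toy_stem(token):
    """Very rough stand-in for a real stemmer: strips a trailing 'ing'."""
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        # Undo consonant doubling, e.g. "putt" -> "put".
        if stem[-1] == stem[-2] and stem[-1] not in "ls":
            stem = stem[:-1]
        return stem
    return token


def normalise(text):
    """Tokenise, lowercase, drop numbers, stem, and remove stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # tokenising
    tokens = [t.lower() for t in tokens]              # lowercasing
    tokens = [t for t in tokens if not t.isdigit()]   # removing numeric values
    tokens = [toy_stem(t) for t in tokens]            # stemming (toy version)
    return [t for t in tokens if t not in STOPWORDS]  # removing stopwords


print(normalise("FYI we are putting out the following statement"))
# ['fyi', 'put', 'follow', 'statement']
```

Stemming happens before stopword removal here, following the order of the list above.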
See the screenshots below of the same email before and after normalisation.
Here I described this process as separate steps, but since we have over 30k emails, we obviously want to do this quickly and automatically. We have published a script to do just that on the GitHub repository. For a step-by-step guide to the text normalisation of the emails, see https://github.com/C2DH/A-Republic-of-Emails.
Named Entity Recognition
We are also interested in finding specific things that are mentioned in the emails. Notably, we are interested in which people, organisations, and locations are mentioned. To do so, we first need to extract all the terms that refer to these things, which can be done by named entity recognition (NER). Put very simply, NER is an algorithm that’s trained to recognise words that indicate specific things, such as people, organisations, and locations. The Stanford NER tool can also recognise money, percentages, dates, and times. We will extract all these things, but in the end only look at the first three types. See below a screenshot of the NER-tagged email 8.
For easier analysis, we then made a copy of the CSV file we already had, and added columns for the named entities of the types people, organisations, and locations, with individual entities separated by commas. More on these named entities will follow in future blogposts when we look at the who and where questions. First, let’s return to the texts and normalised texts and see what’s in there.
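To give an idea of how tagged tokens can be turned into such comma-separated columns, here is a minimal sketch. It assumes the tagger output has already been parsed into (token, tag) pairs with the labels PERSON, ORGANIZATION, and LOCATION; the column names and the example sentence are made up for illustration and are not taken from our CSV or from the corpus.

```python
from itertools import groupby

# Map tagger labels to hypothetical column names for the extended CSV.
COLUMNS = {
    "PERSON": "people",
    "ORGANIZATION": "organisations",
    "LOCATION": "locations",
}


def entities_to_columns(tagged_tokens):
    """Group consecutive tokens with the same tag into entities per column."""
    found = {column: [] for column in COLUMNS.values()}
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag in COLUMNS:
            entity = " ".join(token for token, _ in group)
            if entity not in found[COLUMNS[tag]]:  # keep each entity once
                found[COLUMNS[tag]].append(entity)
    return {column: ", ".join(names) for column, names in found.items()}


# Hypothetical tagged sentence, not taken from the corpus.
tagged = [
    ("Jane", "PERSON"), ("Doe", "PERSON"), ("visited", "O"), ("the", "O"),
    ("United", "ORGANIZATION"), ("Nations", "ORGANIZATION"),
    ("in", "O"), ("Geneva", "LOCATION"),
]
row = entities_to_columns(tagged)
print(row)
```

One limitation of this simple grouping is that two different people mentioned directly after each other would be merged into a single entity.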
To see what the emails are about, we used Voyant Tools. At first we tried putting 1,000 emails in a ZIP file to upload to Voyant, but Voyant struggled to display so many files, even though each text is very short. Instead, we put 1,000 emails in a single text file, so that Voyant sees them as one long text, which works much better. I have not yet tried how Voyant responds to 30 such TXT files.
As an assignment, we instead let students compare emails 6000-6999 with emails 7000-7999. An issue students encountered was that the emails contain many words that are not of interest, so they had to spend quite some time adding stopwords before the word clouds became meaningful; the word clouds in turn helped them see which words to add to the stopword list. The other features of Voyant, however, helped students to move between distant and close reading of the emails. For example, two students [1] compared the use of “Egypt” and “Mubarak” in the two corpora using the word trends feature, finding that the use of “Egypt” followed a similar pattern, but “Mubarak” almost disappeared in the second corpus.
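The workaround of merging a batch of emails into one text file before uploading it to Voyant could look roughly like the sketch below. The function name and the folder and file names in the usage comment are hypothetical, and I assume here that each batch of emails sits in its own folder of TXT files.

```python
from pathlib import Path


def merge_emails(folder, output_file):
    """Concatenate every .txt file in `folder` into one text file.

    Returns the number of emails that were merged.
    """
    # Collect the paths first, so the output file is never picked up itself.
    paths = sorted(Path(folder).glob("*.txt"))
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n\n")  # blank line between emails
    return len(paths)


# Hypothetical usage, with made-up folder and file names:
# merge_emails("emails/6000", "emails_6000-6999.txt")
```

Note that the files are sorted by name, so the emails end up in alphabetical rather than numerical order of their file names.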
Wenceslas Schommer, Charlie Dentzer, and another student chose to use the words in context feature with the aim of interpreting how a topic is discussed. They looked at how the emails mentioned “Assad” and noticed that the language is very clearly opposed to his regime.
Finally, students Johann Cox and Otilia Tira decided to analyse who is included in the emails by analysing the word links with “CC”, and from this saw that Jacob Sullivan and Cheryl Mills are the key collaborators, while Huma Abedin was more prominent in the first batch than in the second.
In conclusion, students found Voyant to be rather time-intensive and not always very user-friendly. This can in part be explained by the data: because the email headers are part of the emails, the data contains a lot of noise that has to be removed manually, even after text normalisation. What the above examples do show is that the emails within Voyant were engaging enough for students to move beyond the word cloud (or Cirrus, as it’s called in Voyant) and play with the data to find new perspectives by switching between distant and close reading. However, for this switching between distant and close reading, students preferred the raw texts to the normalised texts, which were much more difficult to read, especially due to the tokenisation: all the words were on a single line in the Voyant reader. This shows an interesting tension: while normalised texts may be preferred for distant reading, the raw texts are required for close reading.
In the next assignment, students will use Google Spreadsheets to create timeline charts of the emails sent, to approach the W-question of when. More on that in the following blogpost.
Although this post is published on my personal blog, and I am the teacher of the Doing Digital History course, the project was developed in collaboration with Catherine (Kate) Jones and Daniele Guido at the University of Luxembourg. Without Kate’s great ideas and Daniele’s coding skills I would not have been able to create the A Republic of Emails project. I also thank Johann Cox, Otilia Tira, Wenceslas Schommer, and the other students for making such great reports with the emails and allowing me to publish some examples here on the blog.
[1] I have asked students whether they would allow me to use their results in a blogpost and whether they would like to be acknowledged by name or remain anonymous. As a result, I mention some students by name and others not.