This year I will teach the Doing Digital History course for the History master at the University of Luxembourg for the second time. Just like last year, students will ask several W-questions. What is the collection about? Where do the described events take place? When did these events occur? Who are the actors involved? In contrast with last year, when we had a different collection each week, this year students will work with a single collection to experiment with throughout the course. In a series of blogposts I will describe the collection that the students will be exploring and the methods and tools that will be used to conduct close and distant reading. If you have feedback to further improve our ideas, please comment. If you wish to reproduce the project for your own courses, the blogposts should allow just that. As a reference to the historical Republic of Letters, I like to call this project A Republic of Emails.
Description of the collection
To do a proper class project, we needed a collection that is
- large enough to benefit from digital methods,
- diverse enough to see different things, yet homogeneous enough to make claims about the entire collection, and
- available in digital, machine-readable, form.
Unfortunately, very few historical collections tick all three boxes. I am personally a fan of using letters, because they offer a very clear distinction between data and metadata: you have the contents of the letters on the one hand, and the sender, receiver, locations, and dates on the other, attributes that can all be approached with the four W-questions described above. However, as far as I know, no online collection of letters exists that ticks the three boxes and can easily be used in class with different tools. Reading an analysis of Hacking Team’s emails made me realise how well suited emails would be for the assignments, and after some pondering we came up with Hillary Clinton’s email archive hosted by WikiLeaks. For more background on where this collection comes from, read the Wikipedia page, but what is important for now is that the emails weren’t leaked as a result of a hack, but released by the US government in response to a FOIA request (see https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy#Freedom_of_Information_lawsuits). The collection contains 30,322 emails and attachments, totalling 50,547 pages, from the period 30 June 2010 to 12 August 2013, with a total of 7,570 emails sent by Hillary Clinton (25%). As you can see, this fits our three requirements perfectly.
Creation: Scraping WikiLeaks
It appears the emails were released as a batch of PDFs, which WikiLeaks published as webpages with the PDF attached. Since WikiLeaks provides the email ID in the URL in consecutive form from 1 to 30,322, it proved quite easy to scrape the pages. To scrape them, we gratefully used the sandcrawler.js library created by Guillaume Plique from Sciences Po’s médialab. At first we tried scraping all emails into a single JSON file and a CSV file with the metadata columns plus a Contents column. However, after 19,000 emails the files were over 40 MB and my laptop stalled. As a result, we opted to put the contents in separate TXT files per email, in folders per 1,000 emails. The JSON and CSV files thus contain only the metadata of the emails. We’re not entirely sure whether we should publish the resulting dataset (CSV, JSON, and TXT files), as we don’t know who ‘owns’ the data; if you have any ideas about that, please let me know in the comments. The least we can share is the scraping method, so that anyone with an internet connection and access to a command line can scrape the collection themselves.
To install the script and learn how to use it, see https://github.com/C2DH/A-Republic-of-Emails. The GitHub repository is a work in progress, so if you have any feedback, or get stuck in the process, let us know and we will update the documentation accordingly.
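Independently of that script, the storage layout described above (one TXT file per email, in folders per 1,000 emails) can be illustrated with a minimal Python sketch. This is not the sandcrawler.js scraper from the repository. The URL pattern, the folder naming scheme, and the function names are all assumptions made for illustration, and no page is actually fetched here.

```python
from pathlib import Path

# Assumption: URL pattern of a single email page; verify it against the live site.
BASE_URL = "https://wikileaks.org/clinton-emails/emailid/{email_id}"
TOTAL_EMAILS = 30322  # collection size reported in this post
FOLDER_SIZE = 1000    # one folder per 1,000 emails


def email_url(email_id: int) -> str:
    """Build the page URL for a consecutive email ID."""
    return BASE_URL.format(email_id=email_id)


def content_path(root: Path, email_id: int) -> Path:
    """Return root/<first>-<last>/<id>.txt (hypothetical naming scheme)."""
    first = ((email_id - 1) // FOLDER_SIZE) * FOLDER_SIZE + 1
    last = min(first + FOLDER_SIZE - 1, TOTAL_EMAILS)
    return root / f"{first:05d}-{last:05d}" / f"{email_id}.txt"


def save_content(root: Path, email_id: int, text: str) -> Path:
    """Write the text of one email to its own TXT file, creating folders as needed."""
    path = content_path(root, email_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Keeping the contents out of the JSON and CSV files means the metadata stays small enough to open in a spreadsheet, while each email text can still be located from its ID alone.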
Some first impressions
Without fully analysing the collection yet (I will do so in subsequent blogposts), there are a couple of things that stand out.
First, the emails have been marked by the US Department of State to indicate fully unclassified or partially unclassified information. Redacted text is shown visually in the PDF, but not in the web version; see for example email 580 and the PDF. More annoying to us, however, is that the web version contains the annotation of the US Department of State and the release date: “UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05776006 Date: 08/31/2015”. This is now part of our dataset, even though it was not part of the original emails. Is this necessary context of the emails, or is it noise in the data?
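One way to postpone that decision is to separate the annotation from the text while keeping both. The following Python sketch assumes that every annotation follows the exact wording of the example quoted above; other variants may well exist in the collection, and the function name is hypothetical.

```python
import re

# Assumption: all annotations follow the wording of the example quoted above.
STAMP = re.compile(
    r"UNCLASSIFIED\s+U\.S\.\s+Department\s+of\s+State\s+"
    r"Case\s+No\.\s+(?P<case>\S+)\s+"
    r"Doc\s+No\.\s+(?P<doc>\S+)\s+"
    r"Date:\s+(?P<date>\d{2}/\d{2}/\d{4})"
)


def split_stamp(text):
    """Return (text without annotations, list of annotation fields)."""
    stamps = [match.groupdict() for match in STAMP.finditer(text)]
    cleaned = STAMP.sub("", text).strip()
    return cleaned, stamps
```

Storing the extracted case number, document number, and release date as extra metadata would let students analyse the emails with or without the annotation.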
Second, the emails supposedly range from 30 June 2010 to 12 August 2013. Email 1 is easily identified as the first, since its subject is “TEST” and it is clearly a test of whether the email server works. However, when sorting the spreadsheet by date I found some entries outside this range. For example, email 6995 is from 27 March 1996, and appears to be a (badly) OCR’d PDF attachment. It is, however, not clear when this was sent as an attachment, and the accompanying email is not identified. More disconcerting is that a total of 9,474 emails are from before this TEST email. Did Hillary Clinton import old email to her new email server, or does this indicate something else? For now all we know is that the provenance of the emails is rather imprecise.
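The same check can be made without a spreadsheet. The Python sketch below assumes a metadata CSV with a column named Date in YYYY-MM-DD format. The column name, the date format, and the function names are assumptions and will need adjusting to the actual file.

```python
import csv
from datetime import date, datetime

RANGE_START = date(2010, 6, 30)  # stated start of the collection
RANGE_END = date(2013, 8, 12)    # stated end of the collection


def load_rows(csv_path):
    """Read the metadata CSV into a list of dictionaries."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def dates_outside_range(rows, date_column="Date", fmt="%Y-%m-%d"):
    """Split rows into those dated before the range, after it, or unparseable."""
    before, after, unparsed = [], [], []
    for row in rows:
        try:
            sent = datetime.strptime(row[date_column].strip(), fmt).date()
        except (KeyError, ValueError, AttributeError):
            unparsed.append(row)  # missing or malformed date
            continue
        if sent < RANGE_START:
            before.append(row)
        elif sent > RANGE_END:
            after.append(row)
    return before, after, unparsed
```

Rows with unparseable dates are kept in a separate list rather than dropped, since they are themselves a hint about the quality of the metadata.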
Finally, what counts as an email turns out to be not so simple. For example, email 123 shows a conversation between Hillary Clinton and Huma Abedin. Is this one email, or a series of emails? Our scraper currently regards only the top From, To, Date, and Subject as metadata, and all the rest as content. Should we try to separate all emails-within-emails, be they conversations or forwarded messages? This would have made the scraping a whole lot more complex, so we chose to keep the structure of the WikiLeaks publication, but we should consider this collection a somewhat ‘quick and dirty’ one.
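To give an idea of what separating emails-within-emails could involve, here is a naive Python sketch. It is not what our scraper does. It assumes that every embedded message starts with a line beginning with "From:", which will miss messages quoted in other formats and will misfire on body text that happens to start a line that way.

```python
import re

# Heuristic (assumption): an embedded message starts at a line beginning with "From:".
HEADER_START = re.compile(r"^From:\s", re.MULTILINE)


def split_thread(content):
    """Split the content of one scraped email into candidate messages."""
    starts = [match.start() for match in HEADER_START.finditer(content)]
    if not starts:
        stripped = content.strip()
        return [stripped] if stripped else []
    parts = []
    lead = content[:starts[0]].strip()
    if lead:
        parts.append(lead)  # text before the first embedded header
    bounds = starts + [len(content)]
    for begin, end in zip(bounds, bounds[1:]):
        parts.append(content[begin:end].strip())
    return parts
```

Even this simple rule raises new questions, such as which date and recipients each part should inherit, which illustrates why we kept the WikiLeaks structure for now.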
As these first impressions show, the collection was created in a somewhat ‘quick and dirty’ fashion. This might make the collection unsuitable for research, but it actually makes it all the more interesting for my course, as students will be confronted with these very same questions during their assignments. In their first assignment they will scratch the surface of the collection by looking at possible word trends using Bookworm. More on that will come in a later blogpost. If you have any comments on the project regarding how we should share the data, pedagogical ideas, or other interesting things we can do with it, let us know in the comments!
Although this post is published on my personal blog, and I am the teacher of the Doing Digital History course, the project was developed in collaboration with Catherine (Kate) Jones and Daniele Guido at the University of Luxembourg. Without Kate’s great ideas and Daniele’s coding skills I would not have been able to create the A Republic of Emails project.