For the journal Internet Histories I had the pleasure of reading Ian Milligan’s recent book History in the age of abundance? How the web is transforming historical research (McGill-Queen’s University Press, 2019). In this book, Milligan discusses the necessity of studying the web for historical research, as well as the problems that the abundance of web sources introduces. In my review, I focus on how for Milligan this abundance is both a promise and a pitfall for historians. You can read the review in the journal here, or read my self-archived copy below.
We cannot properly understand the lives of people in the 1990s if we cannot study what they did online. Yet while the 1990s are already becoming a period of historical interest, historians are lagging behind in acquiring the skills necessary to study this period. The problem is not just that historians need to learn how to read HTML documents or how to follow hyperlinks; the real issue is that the online traces of people are so vast and complex that traditional methods fall short. Milligan’s central thesis is that the abundance of web archives requires historians to think differently about studying a period. The traces people have left and still leave online, in blogs, comments, discussion boards, or otherwise, provide a rich view of what people thought, did, or desired, one that is unavailable for other historical periods. Yet this abundance simultaneously introduces complexities that Milligan discusses and tries to alleviate throughout the book.
The traces that people leave online are to a large degree ephemeral; websites are born, edited, and disappear. Yet an increasing number of institutions, most notably the Internet Archive, have taken up the task of preserving as much of the web as they can. Where the Internet Archive aims to preserve a wide selection of the web, national libraries are increasingly working to preserve their national web, insofar as that can be discerned, establishing archives of hundreds of terabytes to petabytes of data. Yet while their task is to preserve historical traces, this is perhaps the only respect in which they resemble traditional archives. Archiving the web is to a large degree done automatically via web scrapers. Due to the sheer size, traditional archival practices of appraisal are not feasible, nor can archivists provide detailed provenance and context for sources. As such, historians can no longer trust the archivist to ensure that the archive is of adequate quality. Although they are the largest collections there are, one cannot assume that web archives are representative, for who is online and who gains enough online traction to be selected for preservation is strongly biased by people’s socio-economic background. Consequently, too little is collected to adequately represent an entire society. At the same time, too much is collected with respect to individuals’ private lives. Whether a blog should be considered a publication for all to see and preserve in the future is a difficult matter that raises ethical issues. Milligan provides examples of his own traces left online when he was a teenager, unaware that those messages might someday be stored in databases and made discoverable through search engines that did not exist at the time he wrote them. For web archives, abundance means it is difficult, or impossible, to draw the lines where too little is preserved or too much.
Yet in the case of privacy concerns, abundance may be as much a problem at the time of preservation as a way out at the time of analysis. Given the sheer scale of historical data available, the historian’s task may be not so much to describe individual lives as to ‘zoom out’ and provide an analysis at the aggregate level. To this end, Milligan introduces what he and his collaborators term the FAAV cycle, consisting of:
- Filter – select the information that is of interest, based on metadata (e.g. date) or content (keywords).
- Analyse – investigate matters of interest that can be found in the selected dataset (e.g. hyperlinks or keywords).
- Aggregate – count hyperlinks or keywords.
- Visualize – present the data in tables, network graphs, or timelines.
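To make the four steps concrete, here is a minimal sketch in Python. It is my own illustration and not code from Milligan’s book: the page records, field names, and function names are all hypothetical, and a real web archive would be far larger and would need dedicated tooling rather than an in-memory list.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical records standing in for archived pages (invented, not real data).
PAGES = [
    {"year": 1996, "html": '<p>My cat page</p><a href="http://example.com/cats">more</a>'
                           '<a href="http://example.net/pets">pets</a>'},
    {"year": 1998, "html": '<p>Cat photos</p><a href="http://example.com/photos">photos</a>'},
    {"year": 2003, "html": '<p>cat blog</p><a href="http://example.com/late">late</a>'},
    {"year": 1997, "html": '<p>Train timetables</p><a href="http://example.org/trains">trains</a>'},
]


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def filter_pages(pages, start, end, keyword):
    """Filter: keep pages within a year range whose HTML mentions the keyword."""
    return [p for p in pages
            if start <= p["year"] <= end and keyword.lower() in p["html"].lower()]


def analyse(page):
    """Analyse: extract the host name of every hyperlink on one page."""
    parser = LinkExtractor()
    parser.feed(page["html"])
    hosts = [urlparse(link).netloc for link in parser.links]
    return [host for host in hosts if host]


def aggregate(pages):
    """Aggregate: count how often each host is linked to across all pages."""
    counts = Counter()
    for page in pages:
        counts.update(analyse(page))
    return counts


def visualize(counts):
    """Visualize: render the counts as a plain-text table, most linked first."""
    return "\n".join(f"{host:<20}{n:>5}" for host, n in counts.most_common())


selected = filter_pages(PAGES, 1995, 1999, "cat")
print(visualize(aggregate(selected)))
```

The sketch only counts link targets for a keyword-filtered subset of pages; the point is the shape of the cycle, in which each step narrows or summarises the output of the previous one, rather than any particular method.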
The potential use of web archives and their analysis through this FAAV cycle is demonstrated through a case study in chapter 5. Milligan analyses Geocities in the period 1995–1999, and investigates whether this could rightly be called a ‘community’. With approximately 7 million users and 186 million web documents to consider, traditional methods of close reading quickly fall short of answering his question. Instead, he combines close reading of a small sample of pages with statistics over the larger set. He investigates how images were used on web pages, how pages link to one another, and how certain users (identified by user names) behave in the network. He furthermore performs text analysis, through topic modelling, of certain groups of pages and guest books. Through this, he concludes that Geocities users indeed acted like a community.
While the case study demonstrates what can be done with web archives, Milligan does not truly solve the problems of abundance. Rather, he appears not entirely sure how to tackle the abundance of web archives either. This becomes apparent at two points where he argues in opposing directions. First, Milligan argues on page 120 that historians tend to emphasise the textual content over the form of source material, a position that he considers insufficient to investigate the visually rich nature of the web. Yet contemporary methods cannot properly process this visually rich material, and on page 124 he instead argues that ‘deforming the primary source to generate simple plain text’ is a much more fruitful avenue. Even this plain text is beyond the computational capacity of most historians to analyse. Milligan consequently argues for using just the metadata of pages (the page description, keywords, author, date of creation, and date of web scrape) together with the hyperlinks present on a page, a rather big step away from the visually rich nature of the web. Second, Milligan argues on page 124 (quoting Andrew Jackson from the British Library) that there is no meaningful way to rank archived web pages, as historical sources, from more to less important. This limits the applicability of search engines and instead requires more flexible approaches such as the FAAV cycle. Yet later in the book he argues that historians need relevance ranking to make sense of web archives, and criticises existing platforms for not providing such functionality properly (pp. 148–150). In these seeming self-contradictions, the book shows how Milligan himself still struggles with the abundance of web archives. Even if it is theoretically desirable to preserve the multimedia experiences of the web and give each page equal attention, we simply cannot.
In conclusion, Milligan convincingly argues that studying the 1990s will require a different set of practices to make sense of all the source material. This set of practices is both technical, requiring developments in methods and computational tools, and theoretical, requiring historians to think about the ethical issues of peering into people’s lives in ways not feasible before. The book thereby provides a clear introduction for historians of the 1990s who want to learn how to expand their source material with web archives, how to study these, and what ethical questions to consider. Milligan does not provide ready-made answers, as he too struggles with the abundance of web archives. He explicitly provides as little technical detail of methods as possible, fearing that any computational matter will be outdated within a short time. Yet one might hope that his other concerns will also become outdated in time, as historians figure out how to analyse web archives in appropriate ways.