Last week I was at ECHIC 2018 in Leuven, which focused on infrastructures in the humanities. The conference was small, which allowed closely integrated discussions in a single-session format, and it brought together participants with backgrounds in the humanities and a large number of librarians. This variety of backgrounds, and the shared concern over infrastructural problems of sustainable data storage and access, was in itself an interesting demonstration of my paper’s point that digital humanities brings infrastructure into focus. In this blog post, I want to draw out a bit of the debate around the main question of the conference: do the humanities require their own research infrastructures?
At the very start of the conference, Tom Willaert (KU Leuven) provoked the audience with the statement “aren’t we all data scientists in some way, so that we can share infrastructures with the sciences?” He was met with skepticism and the reply that the humanities do need specific infrastructures. However, what makes the research infrastructures of the humanities specific enough to require separation was not answered in technical terms. Instead, scholars such as keynote speaker Jane Ohlmeyer (Trinity College) and Sally Chambers (DARIAH, GhentCDH) emphasized the political need to distinguish, so that the humanities do not disappear in funding and political debate as part of the larger sciences, but “keep their seat at the table”.
As part of this debate, I explored the relation between digital humanities and the development of research infrastructures in my paper Infrastructure As Afterthought. I problematized this relation as ambivalent: while many DH projects are arguably about infrastructure, scholars in these projects do not necessarily feel their real research is about the infrastructure. I further explored whether this is because the idea that DH is about infrastructure is simply wrong, or because what the concept “infrastructure” refers to is too narrow. In the debate afterwards, the question came up whether different models of infrastructures would lead to different practices of digital humanities. I therefore want to use this blog post to review some models that were discussed and presented at the conference and identify some opportunities and limitations of these different models. If I have missed possible models, or possible advantages and limitations, I welcome comments below.
Models of DH infrastructures
|Type||Data||Tools||As done by||Advantages||Limitations|
|Integrated system||C||C||Most DH projects||►Specificity||►Seldom sustainable|
|VRE||D||C||CLARIAH, DARIAH, Parthenos||►User-friendly||►Expensive to develop|
|API||C||D||KB, Europeana, most digital collections||►Specificity||►Requires technological expertise|
|Metadata||D||D||CLARIN, LOD, TEI||►Cheap||►Often an afterthought|
|No infrastructure||D||D||DMI, DIRT||►Cheap||►Lack of findability|
C = centralized, D = distributed.
The first model is the integrated system approach, where an infrastructure is created that contains both a dataset, or several harmonized datasets, and accompanying tools for exploring and analyzing those datasets. This seems to me the model of most DH projects, and the size and scope of the infrastructure differ per project. The costs of implementation are spread across many smaller projects, distributing workload and risk. The main advantage is that any project can make its infrastructure as specific as it desires. The main limitation is that these infrastructures are seldom sustainable: once project funding runs out, systems tend to become outdated or go offline. Finally, adoption of these infrastructures by the scholarly community is not trivial.
The second model is the Virtual Research Environment (VRE) approach, where an infrastructure is created that integrates several tools, and users can apply this VRE to different datasets. This model is favoured by larger-scale infrastructure projects that aim at a large group of scholars. The main advantages are the strong end-user perspective, a system that scholars become aware of due to its size, and tools that are user-friendly. The main disadvantages are that development is expensive, and that tools need to be generic enough to serve this larger user group. This final point is especially difficult in the humanities, where either scholars do not really know what they need from infrastructure,[1] or scholars propose too many specific requirements that are hard to generalize.[2] Finally, data needs to be fitted into the infrastructure, which either becomes the responsibility of the user, or requires the developers of the infrastructure to invest time and money in this process. For example, the Dutch CLARIAH VRE is envisioned to provide access to distributed data, but this data is centralized to increase performance. As a result, data might be outdated, and when data is improved within the VRE (e.g. by correcting OCR errors), this is not fed back into the original datasets.
The third model is the API approach, where an infrastructure is created that makes a digital dataset accessible to anyone who wants to build a tool for it. This seems to be the model of most cultural heritage institutions. The main advantage is that this allows very specific tools to be developed for different users, and the dataset itself is sustainable even when tools are not. The main limitation is that APIs demand more technological expertise than most scholars possess, and are thus not particularly user-friendly.[3]
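To give a sense of the expertise an API asks of a scholar, here is a minimal sketch in Python of the two steps involved in even a simple search: encoding a query into a request URL, and extracting something readable from the structured response. Everything specific in it is an assumption for illustration: the endpoint `api.example.org`, the parameter names, and the shape of the response do not correspond to the API of KB, Europeana, or any other real collection. The response is a made-up string, so the sketch runs without a network connection.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; not the address of any real collection API.
BASE_URL = "https://api.example.org/search"


def build_query_url(term, start_year, end_year, rows=10):
    """Encode the search parameters into a request URL."""
    params = {"q": term, "from": start_year, "to": end_year, "rows": rows}
    return BASE_URL + "?" + urlencode(params)


def extract_titles(response_text):
    """Pull the item titles out of a JSON response body."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload.get("items", [])]


# A made-up response body, standing in for what a server might return.
SAMPLE_RESPONSE = (
    '{"total": 2, "items": ['
    '{"title": "Letter, 1641"}, {"title": "Pamphlet, 1642"}]}'
)

url = build_query_url("rebellion", 1641, 1642)
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

Even this toy version assumes familiarity with URL parameters, JSON, and a programming language, which is roughly the threshold the API model sets for its users.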
The fourth model is the Metadata approach, where the infrastructure consists of a data standard, so that distributed datasets and tools can in principle talk with one another. Well-known applications of this are the Linked Open Data (LOD) and TEI approaches, and in Europe CLARIN has adopted a similar approach. This is a relatively cheap approach, as developers only need to connect their tools or data to a certain metadata standard, while remaining as specific for their research purposes as they desire. A limitation is that these metadata are often a bit of an afterthought, and a real infrastructure is hard to discern. That is, in several LOD and CLARIN projects I’ve observed that data was specific to the project, and only at the very end was it quickly mapped to DBpedia or to the CLARIN metadata standard. Very few research projects actually work across more than two connected datasets, either due to a lack of computational power, or a lack of sufficiently rich distributed data.
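The kind of end-of-project mapping described above can be sketched as a simple crosswalk from project-specific field names to a shared vocabulary. This is only an illustration under my own assumptions: the project fields and the sample record are invented, and I use three Dublin Core element names (`dc:title`, `dc:creator`, `dc:date`) as a stand-in for whatever standard a project would actually target. It is not how CLARIN or any LOD project implements its mapping.

```python
# Hypothetical project-specific field names mapped to Dublin Core element names.
CROSSWALK = {
    "letter_title": "dc:title",
    "author_name": "dc:creator",
    "written_on": "dc:date",
}


def map_record(record, crosswalk=CROSSWALK):
    """Split a record into fields that map to the shared standard and the rest."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in crosswalk:
            mapped[crosswalk[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped


# An invented record from an imaginary correspondence project.
record = {
    "letter_title": "Letter to a friend",
    "author_name": "Anonymous",
    "written_on": "1641-10-23",
    "wax_seal_colour": "red",
}

mapped, unmapped = map_record(record)
print(mapped)
print(unmapped)
```

In this sketch, the field that is most specific to the project (`wax_seal_colour`) has no counterpart in the crosswalk and so stays outside the shared layer, which is one way a late mapping can leave the connected data thinner than the project data.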
Finally, there is the approach of no infrastructure, simply having distributed tools and datasets. This approach is relatively cheap as there is no need for top-down funding of infrastructure, and it allows developers to make their tools and data as specific as they desire. A limitation is that tools are not necessarily easy to discover, although DIRT and DMI maintain lists of tools. The main limitation is perhaps that these tools are seldom very user-friendly, and scholars will need to find and learn new tools from different developers for every research problem.
As can be seen, there is not one model of infrastructures, and these different models are not mutually exclusive but can be, and usually are, combined. To me, it appears the main decision is whether scholars should act as end-users (requiring user-friendly tools that provide functionalities desired by scholars) or more as data scientists (requiring scholars to learn to program and design their own functionalities and workflows). In the first scenario, it is obvious that the humanities need their own research infrastructures, but it is not obvious how granular these should be: will this lead to infrastructures for historians in general, for intellectual historians, or maybe even for intellectual historians of the early modern period? The more specific and granular, the less obvious the utility of a shared infrastructure. Moreover, it remains a question whether more generic large-scale infrastructures such as CLARIAH, DARIAH and PARTHENOS will succeed in establishing a user community or will prove a “dead end”.[4] At the other end, it is a matter of debate whether scholars should strive for the technological expertise that is required to act more as data scientists.
Very little research exists that investigates why scholars opt for one infrastructure model over another, or whether certain models are more effective than others. With millions of euros spent on all these different models, and the clear political agenda embedded in them, this would be a very interesting research question to investigate.
[1] As Joke Daems presented in her talk “Understanding the infrastructural needs of researchers working on digital text analysis” at ECHIC2018.
[2] Kemman, M., & Kleppe, M. (2015). User Required? On the Value of User Research in the Digital Humanities. In J. Odijk (Ed.), Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands (pp. 63–74). Linköping University Electronic Press.
[3] Edmond, J., & Garnett, V. (2015). APIs and Researchers: The Emperor’s New Clothes? International Journal of Digital Curation, 10(1), 287–297. http://doi.org/10.2218/ijdc.v10i1.369
[4] van Zundert, J. (2012). If you build it, will we come? Large scale digital infrastructures as a dead end for digital humanities. Historical Social Research / Historische Sozialforschung, 37(3), 165–186.