At the Digital Humanities 2013 conference, 159 papers and 52 posters will be presented. All abstracts have been made available online, and, fortunately, it’s possible to browse the material by several metadata elements: conference-related ones such as room and date, or paper-related ones such as author and affiliation. I was particularly interested in browsing by keyword, as I hoped to gain an overview of the papers available, as well as quick access to material relevant to my research.
This is where it got interesting: keywords and topics were apparently free-text input. As a result, “Cultural Heritage” (freq: 1) and “cultural heritage” (freq: 2) are treated as different keywords, leading to a rather large list of keywords that is not very informative. Fortunately, the organizers have normalized the keywords to solve this, combining these two into a single “cultural heritage” (freq: 3).
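This kind of normalization can be sketched in a few lines. A minimal version, assuming a list of (keyword, frequency) pairs and assuming the normalization rules are simply trimming whitespace and lowercasing (the organizers’ actual rules may be more elaborate):

```python
from collections import Counter

def normalize(keyword):
    # Assumed normalization rules: trim surrounding whitespace, lowercase.
    return keyword.strip().lower()

def merge_keywords(pairs):
    # Sum the frequencies of keywords that normalize to the same form.
    merged = Counter()
    for keyword, freq in pairs:
        merged[normalize(keyword)] += freq
    return dict(merged)

raw = [("Cultural Heritage", 1), ("cultural heritage", 2)]
print(merge_keywords(raw))  # {'cultural heritage': 3}
```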
Keyword frequencies
Still, the list of keywords is enormous, which got me wondering how much we can actually learn from them. To get better insight, I’ve created a CSV file of the normalized keywords page that can be downloaded here (CSV). I was mainly interested in how often keywords are reused. There are 754 keywords in this set with 2234 total occurrences, describing 236 items (papers, posters, workshops, panels and plenaries), an average of 9.4 keywords per item. Each keyword is used 3.0 times on average, which sounds reasonable. However, 541 keywords (71.8%) occur only once, and the top 10 keywords account for 347 occurrences, still only 15.5% of the total. In other words, there is little reuse of keywords.
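For reproducibility, the statistics above can be computed with a short script. A sketch, assuming the CSV has `keyword` and `frequency` columns (the actual column names in the downloadable file may differ):

```python
import csv
import io

def keyword_stats(csv_text):
    # Read per-keyword frequencies from CSV text.
    freqs = [int(row["frequency"]) for row in csv.DictReader(io.StringIO(csv_text))]
    total = sum(freqs)
    return {
        "distinct": len(freqs),                              # number of distinct keywords
        "total": total,                                      # total keyword occurrences
        "avg_reuse": total / len(freqs),                     # mean uses per keyword
        "singletons": sum(1 for f in freqs if f == 1),       # keywords used exactly once
        "top10_share": sum(sorted(freqs, reverse=True)[:10]) / total,
    }

# Toy data; run it on the real CSV to reproduce the figures in the text.
sample = "keyword,frequency\ncultural heritage,3\nother,13\ndesign,1\n"
print(keyword_stats(sample))
```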

How should keywords be used? If the goal is to categorize papers, general (prescribed) keywords might be more useful, limited to a small set of categories. On the other hand, to describe a paper specifically, keywords should perhaps indeed be freely chosen by the author; this makes them more useful for keyword search than for category browsing. However, if keyword search is the goal, what do these keywords add to full-text search? Moreover, keywords are generally more informative in context than individually: what can you learn from keywords such as “design” (freq: 1), “genre” (freq: 1) or “other” (freq: 13) without the rest of their context? These are questions authors might want to consider before assigning keywords to their papers.
Improving keywords
Can we improve the keywords given to papers by reducing the cognitive burden on authors? Several solutions exist. The first, as hinted at above, is to provide a list of prescribed keywords or an ontology of keywords from which authors can choose. However, this seems undesirable, as it limits the author’s creativity; sometimes terms are not yet present in a research domain. Another idea is to extract keywords from the text through named-entity detection [1]. This might not solve the issue, however, as a wide variety of keywords may still be extracted, adding little to either full-text search or categorization. Moreover, it still does not help link papers to related papers that use different terminology. Alternatively, we might look at how tagging is improved in social web services. Why not include keyword recommendations in paper submission software? This can be done by recommending keywords based on author-chosen keywords, as already available in Flickr [3], or by retrieving tags from similar papers [4]. If I noticed anything while browsing for related literature, it is that a lot of this research already exists.
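The recommendation idea can be sketched with simple co-occurrence counts, in the spirit of the Flickr work [3]. The toy data and scoring below are my own; the cited systems use much richer statistics and larger collections:

```python
from collections import Counter

def build_cooccurrence(papers):
    # papers: list of keyword lists, one per paper.
    # For each keyword, count how often every other keyword appears with it.
    cooc = {}
    for keywords in papers:
        for kw in keywords:
            counter = cooc.setdefault(kw, Counter())
            for other in keywords:
                if other != kw:
                    counter[other] += 1
    return cooc

def recommend(chosen, cooc, n=3):
    # Aggregate co-occurrence counts over the author-chosen keywords,
    # then suggest the strongest co-occurring keywords not already chosen.
    scores = Counter()
    for kw in chosen:
        scores.update(cooc.get(kw, Counter()))
    for kw in chosen:
        scores.pop(kw, None)
    return [kw for kw, _ in scores.most_common(n)]

papers = [
    ["topic modeling", "digital humanities", "text mining"],
    ["text mining", "topic modeling"],
    ["digital humanities", "cultural heritage"],
]
cooc = build_cooccurrence(papers)
print(recommend(["topic modeling"], cooc))  # ['text mining', 'digital humanities']
```

A submission form could call something like `recommend` each time the author adds a keyword, nudging them toward terms the community already uses without forbidding new ones.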
I hope paper submission software such as EasyChair or ConfTool (the one used for DH2013) will take note of this research; it might make paper keywords much more useful. Not only would this be easier for authors, it would also make papers more discoverable for searchers.
References
[1] de Rooij, O., Vishneuski, A., & de Rijke, M. (2012). xTAS: Text Analysis in a Timely Manner. DIR 2012: 12th Dutch-Belgian Information Retrieval Workshop.
[3] Sigurbjörnsson, B., & Van Zwol, R. (2008, April). Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th international conference on World Wide Web (pp. 327-336). ACM.
[4] Sood, S., Owsley, S., Hammond, K. J., & Birnbaum, L. (2007). TagAssist: Automatic Tag Suggestion for Blog Posts. ICWSM’07.
Hi Max,
fun topic to research – both from the angle of:
– personal archiving (I reckon I use too many different keywords – hence never retrieve what I’m looking for);
– personal branding & impact measurement for academic gearing towards general marketing rules: Mailchimp has an experimental feature where you can ‘research’ the popularity of certain words in the header of your newsletter / mail. Not sure if they published any studies supporting this feature, though.
Kind regards, e.