Who? Investigating the social entities in a corpus

Max Kemman
University of Luxembourg
December 6, 2016


Doing Digital History: Introduction to Tools and Technology

Where assignment

How is the assignment going so far? Any questions about the tools?


  • Final assignment
  • Networks
  • From Hermeneutics to Data to Networks
  • Next time

Final assignment

  1. Analyse the 30k emails with the W-questions, or specify a subselection
  2. Reflect upon your analysis

W questions

  1. What?
    • What are the emails about?
    • How does this change over time?
  2. Where?
    • Where are the locations mentioned in the emails?
    • What does this say about the (inter)national perspective of the writer(s)?
  3. When?
    • When were the emails sent?
    • How do the emails change over time?
  4. Who?
    • Who are the emails sent from & to?
    • Who are the people mentioned in the emails, and how do they relate to the writer & reader?
    • What does this say about the social perspective of the writer?

Can you come up with more W questions?

The report

Work in groups of three or four (in a group of three: discuss three W questions)

Include a link to your Google Sheet (via the Share button) or other sources

Hand in the assignment in HTML, include your name and a decent profile photo

3000-5000 words, in English


Grading of the course

  • Weekly assignments (40%)
  • Final group project (60%)

Grading of the final assignment

  • 1pt for the HTML
  • 1pt for CSS
  • 2pts for documentation of your process
  • 4pts for discussion of the W questions
  • 2pts for critical reflection


Send in your assignment before 20 January 2017 23:59 (tentative)

Send them to max.kemman@uni.lu as usual: I will confirm your submission


Our final W question

Historical research incorporates:

  • What - what happened?
  • Where - where did this happen?
  • When - when did this happen?
  • Who - who was involved?

How to describe the people

Given a corpus, there are multiple ways of describing people

  • A list of all the people
  • Biographies
  • Classes of people
  • Genealogies
  • Networks of people

What is a network?

Two components:

  1. Actors - the people - represented as nodes
  2. Relations - the connections - represented as edges

(Images and information based on Martin Grandjean's tutorial)

What is a network?

Attributes of nodes:

  1. Label
    • Here: Name
  2. Colour
    • Here: Gender
  3. Size
    • Number of connections
    • Not in the data, but derived
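A minimal sketch of node attributes, assuming a tiny invented network (the names, genders and edges below are hypothetical). Label and gender are stored in the data; size is derived by counting connections.

```python
from collections import Counter

# Hypothetical example data: names and genders are invented for illustration.
nodes = {
    "Alice": {"gender": "f"},
    "Bob": {"gender": "m"},
    "Carol": {"gender": "f"},
}
# Each edge is a (source, target) pair.
edges = [("Alice", "Bob"), ("Alice", "Carol")]


def node_sizes(edges):
    """Derive a size for each node: the number of edges that touch it."""
    sizes = Counter()
    for source, target in edges:
        sizes[source] += 1
        sizes[target] += 1
    return sizes


sizes = node_sizes(edges)
for name, attrs in nodes.items():
    # Label and gender come from the data, size is computed.
    print(name, attrs["gender"], sizes[name])
```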

What is a network?

Attributes of edges:

  1. Label
  2. Colour
  3. Size
  4. Direction
    • Networks can be directed or undirected
    • Here: directed
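The difference between directed and undirected edges can be sketched as follows. The function name `neighbours` and the example edges are assumptions made for this illustration.

```python
def neighbours(edges, directed=True):
    """Map each node to the set of nodes it is connected to.

    In a directed network an edge (a, b) only connects a to b;
    in an undirected network it also connects b to a.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())
        if not directed:
            adjacency[target].add(source)
    return adjacency


# Hypothetical edges, invented for illustration.
edges = [("Alice", "Bob"), ("Bob", "Carol")]
as_directed = neighbours(edges, directed=True)
as_undirected = neighbours(edges, directed=False)
```

Read as directed, Bob only points to Carol; read as undirected, Bob is connected to both Alice and Carol.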

Reading a network

Imagine the connection here means "likes"

  • John likes many people, but no one likes John
  • Everybody likes Diana, but Diana doesn't like anyone
  • There are no 2 people who like each other
  • Everyone is connected
    • No isolated nodes
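The same reading can be done on an edge list. This is a sketch with a hypothetical edge list that matches the statements above; John and Diana are from the slide, while Mary and Peter are invented stand-ins for the other people.

```python
# Hypothetical "likes" edges as (who, likes whom) pairs.
edges = [
    ("John", "Diana"),
    ("John", "Mary"),
    ("John", "Peter"),
    ("Mary", "Diana"),
    ("Peter", "Diana"),
]
nodes = sorted({name for edge in edges for name in edge})

# Out-degree: how many people someone likes. In-degree: how many like them.
out_degree = {name: 0 for name in nodes}
in_degree = {name: 0 for name in nodes}
for source, target in edges:
    out_degree[source] += 1
    in_degree[target] += 1

# Pairs who like each other: the edge exists in both directions.
edge_set = set(edges)
mutual = sorted((a, b) for a, b in edge_set if a < b and (b, a) in edge_set)

# Isolated nodes have no edges at all. Every node here comes from the edge
# list, so this is empty by construction; with a separate list of people
# it could find someone.
isolated = [n for n in nodes if in_degree[n] == 0 and out_degree[n] == 0]
```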

Types of network

  1. Graphs - a web of relations including cycles
  2. Trees - no cycles
  3. Bipartite - 2 sets of nodes with links between the sets but not within each set
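These types can be checked on an undirected edge list. The sketch below is one possible approach, not taken from the tutorial: a connected network is a tree when it has exactly one edge fewer than it has nodes, and a network is bipartite when its nodes can be split into two sets with no link inside a set. All function names and example networks are assumptions.

```python
from collections import deque


def adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_tree(edges):
    """True if the network is connected and has no closed loops."""
    adj = adjacency(edges)
    if not adj:
        return False
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    connected = len(seen) == len(adj)
    unique_edges = {frozenset(edge) for edge in edges}
    return connected and len(unique_edges) == len(adj) - 1


def is_bipartite(edges):
    """True if the nodes can be split into two sets with no link inside a set."""
    adj = adjacency(edges)
    side = {}
    for start in adj:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in side:
                    side[nb] = 1 - side[node]
                    queue.append(nb)
                elif side[nb] == side[node]:
                    return False
    return True


# Hypothetical example networks.
triangle = [("A", "B"), ("B", "C"), ("C", "A")]
star = [("A", "B"), ("A", "C"), ("A", "D")]
square = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
```

The triangle is neither a tree nor bipartite, the star is both, and the square is bipartite but not a tree.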

Analysing the network

Four types of centrality measures

  1. Degree centrality - the number of connections
  2. Closeness centrality - closeness to the entire network
  3. Betweenness centrality - bridges
  4. Eigenvector centrality - connection to well-connected nodes
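The first two measures can be sketched in a few lines for an undirected, connected network. This is an illustration under stated assumptions: the example network is invented, and closeness is computed here as (number of other nodes) / (sum of shortest-path distances to them), which is one common definition.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def distances_from(adj, start):
    """Shortest-path length (in steps) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def degree_centrality(adj):
    """Number of connections per node."""
    return {node: len(nbs) for node, nbs in adj.items()}


def closeness_centrality(adj):
    """Reachable nodes divided by the total distance to them."""
    result = {}
    for node in adj:
        dist = distances_from(adj, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical network: a chain A-B-C, with D and E both attached to C.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
```

In this example C has both the most connections and the shortest total distance to everyone else.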

Central nodes

  1. Which node has the most connections?
  2. Which node is the closest to the entire network?
  3. Which node acts as a bridge between different communities?
  4. Which node is connected to well-connected nodes?
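To find a bridge, betweenness centrality counts how many shortest paths between other nodes pass through a node. The sketch below uses Brandes' algorithm for unweighted, undirected networks; the example network (two triangles joined through a node X) is invented for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected adjacency map."""
    score = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search, counting shortest paths from source.
        order = []
        predecessors = {node: [] for node in adj}
        paths = {node: 0 for node in adj}
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    paths[nb] += paths[node]
                    predecessors[nb].append(node)
        # Walk back from the farthest nodes, passing path credit to predecessors.
        credit = {node: 0.0 for node in adj}
        for node in reversed(order):
            for pred in predecessors[node]:
                credit[pred] += paths[pred] / paths[node] * (1 + credit[node])
            if node != source:
                score[node] += credit[node]
    # Each undirected pair was counted once from each end.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical network: communities {A, B, C} and {D, E, F}, bridged by X.
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "X"), ("X", "D"),
]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

betweenness = betweenness_centrality(adj)
bridge = max(betweenness, key=betweenness.get)
```

X has only two connections, so it scores low on degree, yet every path between the two communities runs through it.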

Besides nodes, we see communities

A network of letter writers

For historical research, letters are an interesting corpus for network analysis

We (usually) know:

  1. Sender
  2. Location of the sender
  3. Receiver
  4. Location of the receiver
  5. Date of the letter
  6. Contents of the letter


For example, ePistolarium or Six Degrees of Francis Bacon

From Hermeneutics to Data to Networks

The following slides are based on Marten Düring's tutorial From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources

Available from http://programminghistorian.org/lessons/creating-network-diagrams-from-historical-sources

Structured data

As mentioned, we can show letters (or emails) as a network

  • Nodes: senders & receivers
  • Edges: the sending of a letter
  • Attribute of nodes: location

An Excel sheet of metadata of letters is what we call structured data
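A sketch of how such a sheet maps onto a network, assuming it has been exported as CSV. The column names and the three rows are hypothetical; each row is one letter.

```python
import csv
import io
from collections import Counter

# Hypothetical letter metadata, standing in for an exported spreadsheet.
SHEET = """sender,sender_location,receiver,receiver_location,date
Alice,Paris,Bob,Berlin,1650-01-12
Alice,Paris,Bob,Berlin,1650-03-02
Bob,Berlin,Carol,Leiden,1650-04-20
"""


def letters_to_network(text):
    """Return (edges, locations) from CSV text with the columns above.

    edges counts letters per (sender, receiver) pair;
    locations keeps the last location seen for each person.
    """
    edges = Counter()
    locations = {}
    for row in csv.DictReader(io.StringIO(text)):
        edges[(row["sender"], row["receiver"])] += 1
        locations[row["sender"]] = row["sender_location"]
        locations[row["receiver"]] = row["receiver_location"]
    return edges, locations


edges, locations = letters_to_network(SHEET)
```

Keeping a count per pair turns repeated correspondence into an edge size. Keeping only the last location is a simplification: people who moved would need one location per letter.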

But what if the data is unstructured?

Anything goes

When the data does not itself define the relations, we can define the relations we are interested in ourselves

For example: besides people, nodes can be a film, a place, a job title, a point in time, or a venue

Likewise, besides direct connections, edges can represent how two theaters are connected by a film shown in both of them, or by co-ownership, geographical proximity, or being in business in the same year

The nature of the nodes and edges thus depends on your research interests
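The theater example can be sketched as follows, assuming a hypothetical list of screenings (film and theater names are invented). Two theaters get an edge when at least one film was shown in both.

```python
from itertools import combinations

# Hypothetical screenings as (film, theater) pairs.
screenings = [
    ("Film A", "Rex"),
    ("Film A", "Odeon"),
    ("Film A", "Scala"),
    ("Film B", "Odeon"),
    ("Film B", "Scala"),
]


def connect_theaters(screenings):
    """Map each pair of theaters to the set of films shown in both."""
    theaters_by_film = {}
    for film, theater in screenings:
        theaters_by_film.setdefault(film, set()).add(theater)
    shared = {}
    for film, theaters in theaters_by_film.items():
        # Sorting gives every pair one fixed order, e.g. ("Odeon", "Rex").
        for a, b in combinations(sorted(theaters), 2):
            shared.setdefault((a, b), set()).add(film)
    return shared


shared = connect_theaters(screenings)
```

The same data could instead produce film-to-film edges (films shown in the same theater); which one to build depends on the research question.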

Network Data Extraction

It is more difficult to extract network data from unstructured text

The challenge is to systematize text interpretation

The data will not represent the full complexity of the source, but acts as a model of the relationships you are interested in

Any data you produce will only be as clear as your coding scheme

Developing a coding scheme

First task: decide who should be part of the network, and which relations between actors are to be coded

Questions to ask:

  1. Which aspects of relationships between two actors are relevant?
  2. Who is part of the network? Who is not?
  3. Which attributes matter?
  4. What do you aim to find?

Düring's research

Marten Düring's PhD concerned the covert support networks during WWII

Three research questions:

  1. To what extent can social relationships help explain why ordinary people took the risks associated with helping?
  2. How did such relationships enable people to provide these acts of help given that only very limited resources were available to them?
  3. How did social relationships help Jewish refugees to survive in the underground?

Case study: first-person narrative of Ralph Neuman, a Jewish survivor of the Holocaust.
PDF: http://bit.ly/neumantext

His answers to develop his coding scheme

  1. Which aspects of relationships between two actors are relevant?
    • Any action which directly contributed to the survival of persecuted persons in hiding
  2. Who is part of the network? Who is not?
    • Anyone who is mentioned as a helper, involved in helping activities, or involved in activities which aimed to suppress helping behaviour
  3. Which attributes matter?
    • Concerning edges: Rough categorizations of: Form of help, intensity of relationships, duration of help, time of help, time of first meeting (both coded in 6-month steps).
    • Concerning nodes: Mainly racial status according to National Socialist legislation.
  4. What do you aim to find?
    • A deeper understanding of who helps whom how, and discovery of patterns in the data that correspond to network theory

Creating our own coding scheme

What do we know we will need to describe?

  • Nodes: givers & recipients of help
  • Relations: help given
  • Attributes: ?

Let's create a Google Sheet with columns Giver and Recipient

Consider the sentence: Alice gave Paul some food for the road. What can we describe?

Another sentence: In September 1944 Paul stayed at his friend Alice’s place; they had met around Easter the year before

We need at least two columns describing the attributes

Coding the sample sentence

In September 1944 Paul stayed at his friend Alice’s place; they had met around Easter the year before


Notice that instead of text, the data contain numbers: easier to process afterwards

Notice the 99: this represents an unknown value

What if we have multiple values? For example:
In September 1944 Paul stayed at his friend Alice’s place; Alice gave Paul forged documents for the road

Solution: Make another row to describe the second relation
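A sketch of how the two-relation sentence could be coded, one row per relation. The numeric codes for the form of help and the way dates are turned into 6-month steps (counted from an assumed start year of 1933) are invented for this illustration and are not Düring's actual code book; only the use of 99 for unknown values follows the slides.

```python
UNKNOWN = 99  # value used when the source does not state something

# Hypothetical code book: these numbers are invented for this sketch.
FORM_OF_HELP = {"accommodation": 1, "food": 2, "forged documents": 3}


def half_year_step(year, month, start_year=1933):
    """Number a date in 6-month steps; start_year is an assumption."""
    half = 1 if month <= 6 else 2
    return (year - start_year) * 2 + half


def code_relation(giver, recipient, form, helped=None, met=None):
    """Turn one act of help into a row of numbers.

    helped and met are (year, month) tuples, or None when unknown.
    """
    return {
        "giver": giver,
        "recipient": recipient,
        "form_of_help": FORM_OF_HELP.get(form, UNKNOWN),
        "date_of_help": half_year_step(*helped) if helped else UNKNOWN,
        "first_meeting": half_year_step(*met) if met else UNKNOWN,
    }


# "In September 1944 Paul stayed at his friend Alice's place;
#  Alice gave Paul forged documents for the road"
rows = [
    code_relation("Alice", "Paul", "accommodation", helped=(1944, 9)),
    code_relation("Alice", "Paul", "forged documents", helped=(1944, 9)),
]
```

This sentence does not say when Alice and Paul first met, so both rows carry 99 in that column.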

Describing the actors

Now we know that Alice helped Paul, but what can we tell about these people?

Remember: Düring was interested in help given to Jews, and in self-help

In a new sheet, we can describe the actors
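With two sheets it is easy to forget an actor. A small sketch of a consistency check, assuming the relations sheet has Giver and Recipient columns and the actor sheet is keyed by name; the attribute column `status_code` is a hypothetical placeholder.

```python
UNKNOWN = 99  # value for attributes the source does not state

# Rows from the relations sheet (only the two name columns are needed here).
relations = [
    {"giver": "Alice", "recipient": "Paul"},
    {"giver": "Alice", "recipient": "Paul"},
]

# The actor sheet: one entry per person, with a hypothetical attribute column.
actors = {
    "Alice": {"status_code": UNKNOWN},
}


def missing_actors(relations, actors):
    """List everyone who appears in a relation but not in the actor sheet."""
    mentioned = set()
    for row in relations:
        mentioned.update((row["giver"], row["recipient"]))
    return sorted(mentioned - set(actors))


missing = missing_actors(relations, actors)
```

Here Paul appears in the relations but has not been described yet.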

Coding all sources

Unfortunately, the source will rarely contain sentences like Person A is connected to Persons B, C and D through relation X at time Y

So, a lot of close reading is required

Moreover, when reading more sources, you will discover more actors and connections of interest, expanding your codes and forcing you to go back and update earlier coded sources

Let's try

Let's try with the case study: http://bit.ly/neumantext

Look up p. 15, "Living underground", and code the first three paragraphs

To Networks

Now that we have structured data, we can create a network

This is for next week!

For next time

13 December

Who? Investigating the social entities in a corpus

Reading: (see Moodle)