University of Luxembourg
October 4, 2015
Doing Digital History: Introduction to Tools and Technology
Why would we want to write for the web?
Can we write an HTML document?
What is a library?
What is an archive?
Is a digital library or archive more than a database?
Terras - Digitisation and Digital Resources in the Humanities
What are the 8 things Terras describes?
Enhancing of digital images: Google Art Project
Collection of geographically dispersed material: Europeana
Terras describes 3 stages of digitisation, what are they?
Terras describes three forms of material:
(all slides concerning digitisation of text kindly provided by eCodicology - Hannah Busch)
If you thought text was hard...
Photos kindly provided by NISV - made by Marco Hofsté
After digitising the film, it needs to be synchronised with the audio
Two characteristics of interest
Re-keying vs OCR?
Re-keying: manual transcription
OCR (Optical Character Recognition): computer interprets each letter
OCR is not perfect
Letters change: s / ſ / f
OCR quality depends on
Speech to text
Bush - As We May Think
Too much information out there
Compression for storage is not enough: need to be able to consult it
Not just extraction, but selection
Searching libraries and archives?
In non-digital archives & libraries, distinction between:
Metadata is used to find the object
Indexing: data sorted alphabetically or numerically
Alphabetical list with pointers to the location
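A minimal Python sketch of such an index, assuming a tiny catalogue. The titles and shelf marks are invented for illustration; a real finding aid would be far richer.

```python
# Hypothetical catalogue: title -> shelf location (invented example data).
catalogue = {
    "Letters of a Merchant": "Shelf C-12",
    "Annals of the Abbey": "Shelf A-03",
    "Maps of the Moselle": "Shelf B-07",
}

def build_index(entries):
    """Return (title, location) pairs sorted alphabetically by title."""
    return sorted(entries.items(), key=lambda pair: pair[0].lower())

def look_up(index, title):
    """Return the location for a title, or None if the title is not indexed."""
    for indexed_title, location in index:
        if indexed_title.lower() == title.lower():
            return location
    return None

index = build_index(catalogue)
for title, location in index:
    print(f"{title}: {location}")
```

The index only helps if you already know the title (the metadata); it says nothing about what is inside the object.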
Full-text search: the contents used to find the object: meta/data?
Keyword search: term frequency-inverse document frequency
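To make the idea concrete, here is a small Python sketch of one common variant of tf-idf (relative term frequency multiplied by the logarithm of the inverse document frequency). The three "documents" are invented, and real search engines use more refined weighting and tokenisation.

```python
import math
from collections import Counter

# Invented mini-collection; each string stands for one document.
documents = [
    "the archive holds letters and letters",
    "the library holds books",
    "the archive digitises film",
]

def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / words in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenised = [doc.split() for doc in docs]
    # Count in how many documents each term appears at least once.
    doc_freq = Counter(term for words in tokenised for term in set(words))
    n_docs = len(docs)
    scores = []
    for words in tokenised:
        counts = Counter(words)
        scores.append({
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(documents)
# Show the terms of the first document, highest score first.
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

A word such as "the" appears in every document, so its score is zero; "letters" is frequent in one document and absent elsewhere, so it scores highest there.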
Bush: human mind works by association
Memex: tying items together
Keyword search: Google PageRank
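The intuition behind PageRank (a page is important if important pages link to it) can be sketched in Python as follows. This is a simplified textbook version on an invented four-page graph, not Google's actual implementation; the damping factor of 0.85 and the number of iterations are conventional assumptions.

```python
# Invented link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # A page passes its rank on, split equally over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {value:.3f}")
```

Page C receives the most links and ends up ranked highest; page D, to which nobody links, ends up lowest.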
Linked Data / Semantic Web
Keyword search: Google Knowledge Graph (example)
Search in video?
(slides from AXES project - Martijn Kleppe)
About 10% digitized
In Europeana: 12% of digitized material
Estimated cost of digitising 100%: €100 billion
Does a digital archive reflect this?
Keyword search: no order, limited context
No authentic documents
Full-text search works, but limited by imperfections of OCR
Audiovisual search is starting to get interesting
With these millions of objects, Terras states simple access tools are not enough
Can we research the digital library or archive as a whole?
During this course we will use a collection of letters
How are letters different from other texts (Dobson)?
Data & Metadata
What is the letter about?
Why did the author write this letter?
What are the letters about?
Are there differences between the letters?
Who are the senders and receivers?
Do we find a community?
What kind of subjects are covered in the collection?
Are there differences in time?
Who are the senders and receivers?
Do we find communities of people writing one another?
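As a first impression of how a computer could approach these questions, the Python sketch below counts who writes to whom and groups people who are connected through correspondence. The names are invented, and treating every connected group as a "community" is a deliberately crude assumption; proper network analysis uses more careful methods.

```python
from collections import Counter, defaultdict

# Invented (sender, receiver) pairs, one per letter.
letters = [
    ("Anna", "Ben"),
    ("Ben", "Anna"),
    ("Anna", "Clara"),
    ("David", "Eva"),
    ("Eva", "David"),
]

# How often does each sender write to each receiver?
pair_counts = Counter(letters)

def communities(pairs):
    """Group people who are linked, directly or indirectly, by letters."""
    neighbours = defaultdict(set)
    for sender, receiver in pairs:
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)
    seen = set()
    groups = []
    for person in sorted(neighbours):
        if person in seen:
            continue
        # Collect everyone reachable from this person.
        group = set()
        queue = [person]
        while queue:
            current = queue.pop()
            if current in group:
                continue
            group.add(current)
            queue.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

groups = communities(letters)
print(pair_counts)
print(groups)
```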
To do such research with a computer, we need a lot of letters in digital form
As we just saw, digitisation is not trivial
Can we use digital-born letters?
Some more background: https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy
Let's try with one email: https://wikileaks.org/clinton-emails/emailid/2
Let's try another one: https://wikileaks.org/clinton-emails/emailid/123
What is an email? Is it the same as a letter?
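One way to look at this question: an email has a fixed technical structure, with header fields (metadata) followed by a body (data). The Python sketch below uses the standard library `email` module on an invented message; it is not one of the emails from the collection.

```python
from email import message_from_string

# Invented example message: header lines, an empty line, then the body.
raw = """From: sender@example.org
To: receiver@example.org
Subject: Meeting tomorrow
Date: Mon, 05 Oct 2015 09:30:00 +0200

Dear colleague,
can we meet tomorrow at ten?
"""

message = message_from_string(raw)

# Metadata: who, to whom, about what, when.
metadata = {name: message[name] for name in ("From", "To", "Subject", "Date")}
# Data: the text of the message itself.
body = message.get_payload()

print(metadata)
print(body)
```

A paper letter carries similar information (sender, addressee, date), but nothing forces the writer to put it in fixed, machine-readable fields.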
Can we do this for 30,322 emails?
We 'scraped' WikiLeaks automatically to get all the emails
Because of the size, we separated the content from the metadata and saved these per 1,000:
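A sketch of what such a split could look like in Python. This is not the script used for the course: the field names (`id`, `from`, `to`, `subject`, `body`), the JSON format and the file names are all assumptions, and the records are invented placeholders.

```python
import json
import tempfile
from pathlib import Path

def split_record(record):
    """Separate one email into (metadata, content). Field names are assumed."""
    metadata = {key: value for key, value in record.items() if key != "body"}
    content = {"id": record["id"], "body": record["body"]}
    return metadata, content

def batches(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def save_in_batches(records, out_dir, size=1000):
    """Write metadata and content to separate numbered JSON files."""
    out_dir = Path(out_dir)
    written = []
    for number, batch in enumerate(batches(records, size), start=1):
        pairs = [split_record(record) for record in batch]
        parts = (
            ("metadata", [meta for meta, _ in pairs]),
            ("content", [content for _, content in pairs]),
        )
        for kind, rows in parts:
            # Hypothetical naming scheme, e.g. metadata_001.json
            path = out_dir / f"{kind}_{number:03d}.json"
            path.write_text(json.dumps(rows), encoding="utf-8")
            written.append(path)
    return written

# Invented placeholder records standing in for scraped emails.
records = [
    {"id": i, "from": "sender@example.org", "to": "receiver@example.org",
     "subject": f"Message {i}", "body": "..."}
    for i in range(1, 2501)
]

with tempfile.TemporaryDirectory() as tmp:
    files = save_in_batches(records, tmp)
    file_names = [path.name for path in files]

print(file_names)
```

With batches of 1,000, a collection of 30,322 emails would need 31 files of each kind, the last one only partly filled.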
Is our database complete?
Does it matter?
Reading: (see Moodle)