Digital Libraries & Archives

Max Kemman
University of Luxembourg
October 4, 2015

Doing Digital History: Introduction to Tools and Technology

Recap from last time

Why would we want to write for the web?

Can we write an HTML document?

Today

  • Libraries & Archives
  • Turning the "analog signal" into a "digital signal"
  • Turning the "digital signal" into machine-readable data
  • Making the machine-readable data searchable
  • Current state of the art
  • A Digital Archive of Letters
  • Next time

Libraries & Archives

What is a library?

What is an archive?

Aspects of an archive

  • Provenance
    • Respect des fonds
    • Respect de l'ordre (original order)
  • Context
  • Historical sensation?

What is a digital library/archive?

  • Content collected on behalf of users
  • Institution
  • Service

Is a digital library or archive more than a database?

Borgman, C. L. (1999). What are digital libraries? Competing visions. Information Processing and Management, 35(3), 227–243.

Reasons for digitising

Terras - Digitisation and Digital Resources in the Humanities

What are the 8 reasons Terras describes?

  1. Access
  2. Search
  3. Reinstate out of print materials
  4. Display material in inaccessible formats
  5. Enhancing of digital images
  6. Conserve fragile objects
  7. Integration into teaching materials
  8. Collection of geographically dispersed material

Reasons for digitising

Enhancing of digital images: Google Art Project

Collection of geographically dispersed material: Europeana

What is digitisation?

Terras describes 3 stages of digitisation, what are they?

  • Turning the "analog signal" into a "digital signal"
  • Turning the "digital signal" into machine-readable data
  • Making the machine-readable data searchable

Turning the "analog signal" into a "digital signal"

Terras describes three forms of material:

  • Text
  • Sound and moving images
  • 3D objects

Text

  1. Digital photography
    • Grazer Büchertisch
    • Wolfenbütteler Buchspiegel
    • Multispectral photography
  2. Scan
    • Flatbed scanner
    • Overhead scanner

(all slides concerning digitisation of text kindly provided by eCodicology - Hannah Busch)

Digital photography

Grazer Büchertisch

Digital photography

Wolfenbütteler Buchspiegel

Digital photography

Multispectral Imaging

Scan

Flatbed scanner

Overhead scanner

Requirements for digital images

  • Resolution in DPI (dots per inch): minimum of 300
  • RGB colour space
  • TIFF format
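
To get a feel for what the 300 DPI minimum means in practice, the sketch below computes the pixel dimensions and uncompressed size of one scanned page. The A4 page size (210 x 297 mm) and 3 bytes per pixel (8 bits per RGB channel) are assumptions for illustration, not requirements from the slides.

```python
MM_PER_INCH = 25.4

def scan_size(width_mm, height_mm, dpi=300, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one scan."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel

# Assumed example: an A4 page at the minimum resolution.
w, h, size = scan_size(210, 297)
print(f"{w} x {h} pixels, about {size / 1e6:.0f} MB uncompressed")
```

A single uncompressed page already takes tens of megabytes, which is one reason storage is a real cost factor in digitisation projects.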

Audio and moving images

If you thought text was hard...

Photos kindly provided by NISV - made by Marco Hofsté

Audio and moving images

After digitising the film, it needs to be synchronised with the audio

3D objects

Two characteristics of interest

  • Setting
    • Tabletop
    • Tripod
    • Handheld
  • Light
    • Laser
    • White

Turning the "digital signal" into machine-readable data

Re-keying vs OCR?

Re-keying: manual transcription

OCR (Optical Character Recognition): computer interprets each letter

Optical Character Recognition difficulties

OCR is not perfect

Letters change: s / ſ / f
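
One possible way to deal with the long s in transcribed text is to normalise it before searching. The sketch below is a minimal illustration under a strong assumption: the long s was recognised correctly as "ſ". It cannot repair cases where the OCR engine misread the long s as "f". The function name is hypothetical.

```python
LONG_S = "\u017f"  # the long s character 'ſ'

def normalise_long_s(text):
    """Replace every long s with a modern round s."""
    return text.replace(LONG_S, "s")

# A search for "Congress" will now also find the historical spelling.
print(normalise_long_s("Congre\u017fs"))
```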

OCR difficulties

OCR quality depends on

  • Quality of the original document: letters and pages
  • Quality of the image
  • Type of script: OCR is not possible for handwritten material

Handwritten material

(Monk project)

Audio and visual material (simplified)

Speech to text

Keyframes

Edge detection

Making the machine-readable data searchable

Bush - As We May Think

Too much information out there

Compression for storage is not enough: need to be able to consult it

Not just extraction, but selection

Selecting material

Searching libraries and archives?

In non-digital archives & libraries, distinction between:

  • Data - the object
  • Metadata - the description of the object

Metadata is used to find the object

Indexing: data sorted alphabetically or numerically

Index

Alphabetical list with pointers to location
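
The idea of an index can be sketched in a few lines: for every word, keep a list of the documents in which it occurs. The mini-collection and the function name below are invented for illustration; a real search engine would also handle punctuation, word forms and ranking.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sort the words alphabetically, like the index at the back of a book.
    return {word: sorted(ids) for word, ids in sorted(index.items())}

documents = {  # hypothetical mini-collection
    1: "letter from the archive",
    2: "the library catalogue",
    3: "archive and library",
}
index = build_index(documents)
print(index["archive"])
```

Looking up a word is now a single step, instead of reading every document again for each search.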

Full-text search: the contents are used to find the object. Is that data or metadata?

Keyword search: term frequency–inverse document frequency (tf-idf)
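
A minimal sketch of one common variant of this weighting: a term scores high for a document when it is frequent in that document but rare in the collection as a whole. Several variants of the formula exist; the one below (relative frequency times the natural logarithm of the inverse document frequency) and the tiny corpus are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Score how characteristic `term` is for `doc` within `corpus`.

    `doc` is a list of words; `corpus` is a list of such documents.
    """
    tf = Counter(doc)[term] / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [  # hypothetical mini-corpus
    ["the", "archive", "of", "letters"],
    ["the", "library"],
    ["the", "letters", "of", "the", "library"],
]
print(tf_idf("the", corpus[0], corpus))      # occurs everywhere
print(tf_idf("archive", corpus[0], corpus))  # occurs in one document only
```

A word such as "the" occurs in every document and therefore gets a score of zero, while "archive" is distinctive for the first document.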

Association of documents

Bush: human mind works by association

Memex: tying items together

Web: hyperlinks!

Keyword search: Google PageRank
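
The basic idea behind PageRank can be sketched as follows: a page is important when important pages link to it. This is a simplified textbook version, not Google's actual implementation; the damping factor of 0.85, the number of iterations and the four-page web are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """`links` maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page divides its rank equally over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web of four pages; every link target is also a key.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page C receives links from three other pages and ends up with the highest rank, even though no page content was inspected at all.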

Association of documents/objects

Linked Data / Semantic Web
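
Linked Data describes objects as simple statements of the form subject, predicate, object. The sketch below imitates this with plain Python tuples rather than a real Semantic Web library; all identifiers and predicate names are hypothetical.

```python
triples = [  # hypothetical statements, for illustration only
    ("letter:1", "hasSender", "person:anna"),
    ("letter:1", "hasReceiver", "person:ben"),
    ("letter:2", "hasSender", "person:ben"),
    ("person:anna", "livesIn", "place:luxembourg"),
]

def match(pattern, triples):
    """Return triples matching (subject, predicate, object); None is a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Who sent which letter?
print(match((None, "hasSender", None), triples))
```

Because every statement has the same shape, statements from different collections can in principle be combined and queried together.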

https://www.youtube.com/embed/TJfrNo3Z-DU

Keyword search: Google Knowledge Graph

Audiovisual material

Similarity search

Content search?

Audiovisual material

Search in video?

(slides from AXES project - Martijn Kleppe)

Current state of the art

Heritage digitised in Europe

About 10% digitised

In Europeana: 12% of digitised material

Estimated cost of digitising 100%: €100 billion

Aspects of an archive

  • Provenance
    • Respect des fonds
    • Original order
  • Context
  • Historical sensation?

Does a digital archive reflect this?

Keyword search: no order, limited context

No authentic documents

Search

Full-text search works, but limited by imperfections of OCR

Audiovisual search is starting to get interesting

Search

With these millions of objects, Terras states that simple access tools are not enough

Can we research the digital library or archive as a whole?

A Digital Archive of Letters

During this course we will use a collection of letters

How are letters different from other texts (Dobson)?

Data & Metadata

  • Content of the letters
  • Sender
  • Receiver
  • Date
  • Location
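
The data and metadata listed above could be represented as one record per letter. This is only a sketch of one possible structure; the class name, the field names and the example letter are invented for illustration.

```python
import datetime
from dataclasses import dataclass

@dataclass
class Letter:
    content: str            # data: the text of the letter
    sender: str             # metadata from here on
    receiver: str
    date: datetime.date
    location: str

# Hypothetical example record, invented for illustration.
letter = Letter(
    content="Dear Ben, thank you for your last letter ...",
    sender="Anna",
    receiver="Ben",
    date=datetime.date(1900, 1, 15),
    location="Luxembourg",
)
print(letter.sender, "->", letter.receiver, letter.date.isoformat())
```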

A single letter

What is the letter about?

Why did the author write this letter?

A set of letters

What are the letters about?

Are there differences between the letters?

Who are the senders and receivers?

Do we find a community?

A whole lot of letters

What kind of subjects are covered in the collection?

Are there differences in time?

Who are the senders and receivers?

Do we find communities of people writing to one another?
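
Questions about senders, receivers and communities can be explored with very little code once the metadata is available. The sketch below counts letters per pair and groups people who are connected by at least one letter. Treating such connected groups as "communities" is a deliberately simple assumption, and the names are invented.

```python
from collections import Counter

def correspondence_counts(letters):
    """Count letters per (sender, receiver) pair."""
    return Counter(letters)

def communities(letters):
    """Group people into sets connected by at least one letter."""
    groups = []
    for sender, receiver in letters:
        # Find existing groups that already contain one of the two people.
        linked = [g for g in groups if sender in g or receiver in g]
        merged = {sender, receiver}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return groups

letters = [  # hypothetical (sender, receiver) pairs
    ("Anna", "Ben"), ("Ben", "Anna"), ("Ben", "Clara"), ("Dora", "Emil"),
]
print(correspondence_counts(letters).most_common(1))
print(communities(letters))
```

With this toy data there are two groups: Anna, Ben and Clara on the one hand, Dora and Emil on the other.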

Digital letters

To do such research with a computer, we need a lot of letters in digital form

As we just saw, digitisation is not trivial

Can we use digital-born letters?

A Republic of Emails

  • Hillary Clinton used her own email server for government business
  • When this was discovered, she was made to disclose her emails, and the government had to release them in response to a FOIA request
  • WikiLeaks then hosted the emails on their website: https://wikileaks.org/clinton-emails/
  • We have 30,322 emails & attachments, 50,547 pages, from the period 30 June 2010 to 12 August 2013
  • A total of 7,570 emails were sent by Hillary Clinton (25%)
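
The 25% can be checked directly from the two totals given above:

```python
total_emails = 30_322
sent_by_clinton = 7_570

share = sent_by_clinton / total_emails
print(f"{share:.1%}")
```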

Some more background: https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy

Creating a database of emails

Let's try with one email: https://wikileaks.org/clinton-emails/emailid/2

Let's try another one: https://wikileaks.org/clinton-emails/emailid/123

What is an email? Is it the same as a letter?

Can we do this for 30,322 emails?

Creating a database of emails

We 'scraped' WikiLeaks automatically to get all the emails

Because of the size, we separated the content from the metadata and saved these in folders of up to 1,000 emails each:

Folder  #items    Folder  #items    Folder  #items    Folder  #items
f-0     999       f-10    1,000     f-20    1,000     f-30    323
f-1     1,000     f-11    1,000     f-21    1,000
f-2     1,000     f-12    998       f-22    1,000
f-3     1,000     f-13    997       f-23    1,000
f-4     1,000     f-14    998       f-24    1,000
f-5     1,000     f-15    1,000     f-25    998
f-6     1,000     f-16    1,000     f-26    1,000
f-7     1,000     f-17    1,000     f-27    1,000
f-8     1,000     f-18    1,000     f-28    999
f-9     1,000     f-19    998       f-29    999
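
A minimal sketch of how the email pages and folders could be addressed. The URL pattern follows the two example links above. The rule that an email lands in folder "f-" plus its id divided by 1,000 (rounded down) is an assumption inferred from the folder sizes, not a description of the actual scraping script. The function names are hypothetical, and no download is performed here.

```python
BASE_URL = "https://wikileaks.org/clinton-emails/emailid/"

def email_url(email_id):
    """Build the address of one email page."""
    return f"{BASE_URL}{email_id}"

def folder_for(email_id, batch_size=1000):
    """Assumed rule: ids 1-999 go to f-0, 1000-1999 to f-1, and so on."""
    return f"f-{email_id // batch_size}"

print(email_url(123), folder_for(123))
print(folder_for(30322))
```

Under this assumption f-0 can hold at most 999 emails (ids 1 to 999) and f-30 at most 323 (ids 30000 to 30322), which matches the counts listed for those two folders.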

Current state of the database

Is our database complete?
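
One quick check is to add up the items per folder and compare the sum with the 30,322 emails & attachments mentioned earlier. The counts below are copied from the folder table as listed; whether every missing item is really a missing email is not something this sketch can tell.

```python
items_per_folder = (
    [999] + [1000] * 9                                             # f-0 to f-9
    + [1000, 1000, 998, 997, 998, 1000, 1000, 1000, 1000, 998]     # second column
    + [1000, 1000, 1000, 1000, 1000, 998, 1000, 1000, 999, 999]    # third column
    + [323]                                                        # f-30
)

expected = 30_322
collected = sum(items_per_folder)
print(collected, expected - collected)
```

The folders add up to 30,309 items, 13 fewer than expected, so a small number of items appear to be missing.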

Does it matter?

For next time

11 October

Big Data

Reading: (see Moodle)

  • Wallach, H. (2014). Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency. Medium.
  • Hitchcock, T. (2014). Big Data, Small Data and Meaning. Historyonics.