University of Luxembourg
October 11, 2015
Doing Digital History: Introduction to Tools and Technology
What are the aspects of an archive?
What are the three steps of digitisation?
What is the difference between data & metadata?
What meta/data do we have of letters?
Last week we discussed digital libraries/archives
Europeana contains about 53M digital objects
Is this big data?
Term has rhetorical function: "that which is given prior to argument" (Gitelman, 2014)
Common description: "raw data"
But creating data requires vast amount of work (as we saw last week)
Interpretive work into creating data
Metaphors used to describe big data give different interpretations (Awati & Shum, 2014)
'Classic' definition by the three V's (Volume, Velocity, Variety):
Another definition: too much data to handle
Andrew Prescott (2015):
What is the difference between "lots of data" and "big data"? (Lagoze, 2014)
Or, does History have big data?
From the definitions so far:
Is our collection of Hillary Clinton emails 'big data'?
Some say History/Humanities do not have big data
BUT, why are we concerned with big data, but not with particle physics? (Wallach, 2014)
What are the 2 reasons she gives?
Here maybe History/Humanities do have interest in big data
Another definition of big data (Mayer-Schönberger & Cukier, 2014)
Let's discuss these features
"N" refers to the number of observations in the sample
Sample: a group that represents the entire population
So N=ALL refers to measuring everything, rather than a representative smaller group
A difference between "a lot of data" and "all data"
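The difference between sampling and N=ALL can be sketched in a few lines of Python. The population values here are synthetic placeholders, not real archival data:

```python
import random

# Hypothetical population: 10,000 numeric observations
# (e.g., word counts per document in a digitised archive).
population = list(range(10_000))

# Traditional approach: measure a representative sample (N = 200).
random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(population, 200)
sample_mean = sum(sample) / len(sample)

# "N = ALL" approach: measure every observation.
population_mean = sum(population) / len(population)

print(f"sample mean (N=200): {sample_mean:.1f}")
print(f"population mean (N=ALL): {population_mean:.1f}")
```

A well-drawn sample approximates the population mean, which is why sampling worked in an age when measuring everything was impossible.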
Remember Rosenzweig from week 1:
"The injunction of traditional historians to look at 'everything' cannot survive in a digital era in which 'everything' has survived" (Rosenzweig, 2003)
If big data is merely a quantitative difference, what's the interest?
But, a quantitative difference can lead to a qualitative difference (Mayer-Schönberger, 2014)
Rather than focusing on a very short timespan, see development over ages
Big data has Variety
A heterogeneous dataset
Too much data to manually check
Mayer-Schönberger & Cukier: size makes up for messiness
Exactness is from the age of scarce information
The noise can be smoothed out
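Mayer-Schönberger & Cukier's claim that size makes up for messiness can be illustrated with a toy simulation; the "true value" and the noise model here are invented for the sketch:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

true_value = 10.0  # the quantity our messy measurements try to capture

# A single noisy measurement can be off by as much as ±1.
one_reading = true_value + random.uniform(-1, 1)

# Averaging many noisy measurements smooths the noise out.
many_readings = [true_value + random.uniform(-1, 1) for _ in range(10_000)]
smoothed = sum(many_readings) / len(many_readings)

print(f"single reading: {one_reading:.3f}")
print(f"average of 10,000 readings: {smoothed:.3f}")
```

Any one data point is unreliable, but across ten thousand of them the random errors largely cancel out.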
One way of trying to get someone to look at the data
Need to trust anonymous people
With N=ALL, big data = reality, right?
But (big) data incorporates choices of what to measure
Twitter/Facebook are biased reflections of the world
Big data word-pairs (MIT Technology Review)
The average person is a fiction
Hitchcock: it is the exceptions we are interested in!
Wallach agrees: use the granularity of big data to study minorities & exceptions
How do we discover the minorities & exceptions of interest?
To repeat: we cannot look at all cases individually
Some statistical analysis is required
Correlation: two variables show a statistical relation
Causation: one variable explains the second
A nice example is Google Flu Trends:
Important to remember: correlation does not equal causation
The keyword searches do not cause the flu!
Sometimes you don't know which variable comes first
Maybe a third variable explains the two measured ones
Does the correlation mean anything?
Google Flu Trends later found not to produce accurate results
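Correlation can be quantified with Pearson's r. The weekly figures below are invented to make the point: a perfect correlation between search volume and flu cases says nothing about which causes which, or whether a third variable (such as winter) drives both:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    return cov / (var_x * var_y) ** 0.5

# Invented weekly figures: flu-related searches and reported flu cases.
searches = [10, 20, 30, 40, 50]
cases = [12, 24, 36, 48, 60]

print(pearson_r(searches, cases))  # 1.0: perfectly correlated,
                                   # yet the searches do not cause the flu
```

An r of 1.0 (or -1.0) is a perfect statistical relation; interpreting what the relation means remains the researcher's job.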
Find a correlation yourself:
We cannot only use the statistics, we need to interpret them
But still we do not want to manually check all the possible correlations
Wallach describes herself as machine learning researcher
A simple introduction to machine learning (Geitgey, 2014)
Rather than being told what to do, the computer learns what to do
Provide enough example answers and it learns to give new answers
The computer figures out how to go from the data to the answer
Or beat masters at chess or Go
No given answer
Are there patterns? Outliers?
What do pregnant women buy?
How are sentences translated to different languages? (MIT Technology Review)
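"Provide enough answers to learn to give a new answer" is the learning-from-examples setting. A minimal sketch is a one-nearest-neighbour classifier; the training examples and labels below are made up for illustration:

```python
def nearest_neighbour(train, point):
    """Classify `point` with the label of the closest training example.
    `train` is a list of ((x, y), label) pairs."""
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    closest = min(train, key=lambda example: sq_dist(example[0], point))
    return closest[1]

# Made-up training data: the "answers" the computer learns from.
train = [
    ((1, 1), "letter"),
    ((1, 2), "letter"),
    ((8, 8), "photograph"),
    ((9, 8), "photograph"),
]

# New, unlabelled observations: the computer produces the answer itself.
print(nearest_neighbour(train, (2, 1)))  # letter
print(nearest_neighbour(train, (7, 9)))  # photograph
```

Nobody told the program the rule separating "letter" from "photograph"; it follows entirely from the labelled examples, which is also why biased examples produce biased answers.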
Issues of biased algorithms:
"We have no idea how these predictions are made"
Often criticism of algorithm, but where does bias come from?
How does this require a rethinking of scholarship?
Ways of reasoning (Dixon, 2012)
Rens Bod: discovery of patterns with tools is Humanities 2.0
Hermeneutic interpretation of these patterns is Humanities 3.0
Fickers: context more interesting than the data
What is the context of each datapoint?
Hitchcock - contextualize using the big data
If content is king, context is its crown
Your search keywords make sense in your context
Remember from week 1: what does this tweet mean as part of 31M?
Or actually: what does this tweet mean outside of Twitter?
Hitchcock describes the macroscope, quoting Katy Börner:
Macroscopes provide a "vision of the whole," helping us "synthesize" the related elements and detect patterns, trends, and outliers while granting access to myriad details. Rather than make things larger or smaller, macroscopes let us observe what is at once too great, slow, or complex for the human eye and mind to notice and comprehend.
If today we have a public dialogue that gives voice to the traditionally excluded and silenced – women, and minorities of ethnicity, belief and dis/ability – it is in no small part because we now have beautiful histories of small things. In other words, it has been the close and narrow reading of human experience that has done most to give voice to people excluded from ‘power’ by class, gender and race.
Hitchcock argues for interchange of close and distant reading
Distant reading? That's the next lecture