Last active September 25, 2015 10:50
Beyond Methods of Mining: a workshop on doing historical research using digital data, Utrecht University, 14-15 September 2015

Live notes, so an incomplete, partial record of what actually happened.

Tags: beyondmining


My asides in {}



The interest of this conference is history and analysing historical data with digital tools. Trying not to exaggerate the value of digital tools. Want to interrogate how historians are going about the job.


James Baker (University of Sussex): Acts of being in proxies for prints. People in the British Museum catalogue of Political and Personal Satire, 1770-1830

Deck on Slideshare

Code, data and viz on Github

'proxy data' (= metadata, catalogue descriptions as semi-structured data) creating new methodological problems #beyondmining @j_w_baker

— Steven Claeyssens (@sclaeyssens) September 14, 2015

QA Q: Is the way you make gender distinctions problematic? Would it be better to separate out parts of speech from descriptions and work on each separately?

Ian Gregory (Lancaster): Spatial Humanities: texts, GIS places and public health in 19th-century Britain

Interested in how to do GIS with text. How do you go about getting a corpus into a GIS? And once you achieve it, what can you get out of it?

Geo-parsing. Identify candidate place names using context. Then match them up to a gazetteer. Can do it at Lancaster for corpora up to a billion words...
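{A minimal sketch of the gazetteer-matching step, not the Lancaster pipeline. The toy gazetteer, its coordinates, and the function names are my own illustrative assumptions, and picking out capitalised words is only a crude stand-in for identifying candidates from context.}

```python
import re

# Hypothetical toy gazetteer: place name -> (latitude, longitude).
# A real gazetteer would be far larger and would need disambiguation.
GAZETTEER = {
    "Bedford": (52.136, -0.467),
    "London": (51.507, -0.128),
    "Manchester": (53.480, -2.243),
    "Liverpool": (53.408, -2.992),
}


def candidate_place_names(text):
    """Return capitalised words as candidate place names.

    Assumption: capitalisation is used here in place of real
    context-based candidate detection.
    """
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def geoparse(text, gazetteer=GAZETTEER):
    """Keep only candidates found in the gazetteer, in order of appearance."""
    return [
        (name, gazetteer[name])
        for name in candidate_place_names(text)
        if name in gazetteer
    ]


sample = "Cholera was reported in London and later in Liverpool."
matches = geoparse(sample)
```

Here "Cholera" is picked up as a candidate but dropped because it is not in the gazetteer, which is the point of the second step.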

Three approaches: exploratory spatial, thematic, textual/statistical comparisons

GIS good at making patterns, but not good at explaining patterns.

What is the Registrar General saying about Bedford? Weather, because there is a weather station there...

What is the Registrar General saying about London? Illness, public health emerge, water supply.

What is the Registrar General saying about Manchester and Liverpool? Little coherence, and water gets a negative score (so it appears less often than we might expect)

Kulldorff scan statistic: a measure that suggests a positive or negative relationship compared to words appearing at random in the corpus. Research shows that the Registrar General was obsessed with waterborne diseases more than we would expect.

Is this justified? Bias in his interest in London towards waterborne diseases, given the public health problems they had (other cities with worse problems are talked about less, in particular northern cities...)

We then can ask: what is being said about a place? what places are associated with particular themes?

And do this by combining corpora and quantitative data

QA A: this doesn't extrapolate to health generally, but an interest in health from one source. Q: how do you select close reading without bias? A: worked on tools to select at random things to close read
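{A minimal sketch of selecting passages at random for close reading, assuming the passages are already held in a list. The function name and the seed parameter are my own, not the speaker's tool; the seed is there only so that a selection can be reproduced.}

```python
import random


def sample_for_close_reading(passages, k, seed=None):
    """Pick up to k passages at random, without replacement.

    Passing the same seed returns the same selection, so the choice
    of what was read closely can be documented and repeated.
    """
    rng = random.Random(seed)
    k = min(k, len(passages))
    return rng.sample(passages, k)


# Hypothetical passage identifiers standing in for real text.
passages = [f"passage {i}" for i in range(100)]
selection = sample_for_close_reading(passages, k=5, seed=2015)
```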

Amelia Joulain Jay (Lancaster): Exploring Victorian British attitudes towards France and Russia. The Era, 1840-1899

Affiliated both to Spatial Humanities and

What is said about France and Russia in c19th British Newspapers?

The Data

Chose The Era because OCR quality looked good. Weekly Sunday paper. Leading theatrical journal, published by the Licensed Victuallers' Association (lots on the catering trade). 1838-1900. 3000 issues. 377m words.

Why France and Russia

Two main British rivals in the 19th century {what about Prussia?}

Peaks and troughs of frequencies and collocates seem to show co-occurrence between places and war. But then 'war' is co-occurring less than 10-20% of the time. So something else is going on.

Lists of countries. Cultural relationships between countries stressed. Lots of advertisements (addresses).

But then newspapers are changing too, with more advertisements over the period. Meaning that there are more collocates to do with location over time. Break down by France and Russia and it is clear that Russia tends to be referred to in relation to people, personifications (people of, minister of)

Random approach - taking random collocates and reading them.

Anti-Russian sentiment in the historiography appears not to be there.

Tessa Hauswedell (University College London) Reporting the Empire. The Pall Mall Gazette, 1870-1900.

Historiography: growth in reports on Empire, increasingly fervent tone in reporting. This study attempts to move beyond The Times archive and look at the imperialisation of the press. Using the Pall Mall Gazette, a clubland newspaper.

India becomes gradually less important in the PMGZ. Spikes for Egypt and South Africa around events.

Are we just finding the obvious?

{is it an accident that we are all talking about feeling our way into the data? Would we - doing the same from a qualitative, reading some stuff perspective - be given talks that were so provisional, starting point, not declaring a clear interpretation?}

British Empire not overly discussed in terms of power. But a sense that discussions are changing tone during the period: towards resources, defence of power, reaffirmation of strength.

M. Erdem Kabadayi (Istanbul Bilgi University) and Murat Güvenç (Kadir Has University): Revisiting an old yet unsettled research question in Ottoman historiography: was there an ethno religious division of labour in the multi-ethnic, multi religious Empire?

Historiography: tends to suggest there was an ethnic division of labour.

Source: Source of Incoming Yielding Assets. Coding occupations and standardising incomes.

Multiple correspondence analysis. Way of finding patterns in coded data.

Jane Winters (Institute of Historical Research): What is digital history?

Good or ill, consciously or not, digital is the way we do research.

Practice changing: scholars more likely to seek out hard to find things in digital sources; simultaneous research and writing up more common; weighting of one source over others shifting.

Wide variations in how digital sources have influenced behaviour. Early modern historians most obviously influenced - C18 London most digitised time/place? Little joy for 20th centuryists.

Six developments shaping research practice

  1. Aggregation. Eg Connected Histories. Supporting use and reuse through search. But little visibility of processes underlying the search. Obviously benefits come through cross searching.

#BeyondMining @jfwinters proposes big data approaches as a corrective to the default search-box on digital archives for #dhist research

— Max Kemman (@MaxKemman) September 14, 2015
  2. Big Data. AHRC define this as digitised sources too big for a scholar to use with common software tools. Web archives a big problem for historians (and something they'll need to get to grips with!). But special characteristics: patchy capture, reconstructions fill gaps when we don't have pages. Digging Into Linked Parliamentary Data

New interface by M. Marx (a.o.) to search Dutch and UK parliamentary papers showcased by @jfwinters #beyondmining

— Steven Claeyssens (@sclaeyssens) September 14, 2015
  3. Dissemination. Should we be wrangling our data/viz into book form? Shouldn't we be thinking about the other end?

  4. Crowdsourcing. We need to go beyond data slaves!

  5. Spatial Turn. Locating London's Past. Virtual Paul's Cross.

  6. Beyond text. How do we do audio-visual without textual metadata? Tim Sherratt's Face Depot about accessibility as well as understanding sources.

We should not forget people. We need to still explore history from the bottom up.


corpus linguistics assumes a very structuralist notion of language. Let's not forget insights from post-structuralism 1/2 #BeyondMining

— Melvin Wevers (@melvinwevers) September 14, 2015

We need to be aware of the biases produced by strict categorizations. Language is much more fleeting 2/2 #BeyondMining #derrida

— Melvin Wevers (@melvinwevers) September 14, 2015

@j_w_baker I think we have to go beyond deconstruction and be more generative, albeit not naively positivist #beyondmining

— Melvin Wevers (@melvinwevers) September 14, 2015



Paul van Trigt (Utrecht University): Microhistory and Big Data. Rewriting a History of Disability by Mixed Methods

Can we use big data to reinvigorate microhistory?

C20 disability research. Logic of care dominant approach to people with disabilities in the Netherlands. This logic normalised.

Using mining of large datasets (searching Delpher, using ngrams) to test hypotheses that come out of the microhistory (where otherwise secondary lit would be used)

Terms are slippery. Mining concepts. Taking umbrella terms to represent the whole has advantages and challenges.

Maarten van den Bos (Utrecht University): Unsupervised walks. Youth, mass culture and the changing future of society in Dutch public discourse, 1945-1965

Key post-war discourse of young people wanting to walk unsupervised. Establishment concerned that loss of values, belief et al was creating a formless youth - urge to 'protect' youth from the dangerous impact of modern mass culture, the threat to morals from dancing. Youth collocates with crime, police, et cetera. With fears.

{bouncing between close reading and ngrams/collocation et al}

Dino Mujadzevic (Ruhr University Bochum): Measuring Turkish influence in Bosnia? Corpus-assisted history of media discourses on Turkey in Bosnia and Herzegovina (2002-2014)

How were media representations of Turkey used to frame Bosnian public opinion on Turkish foreign policy towards Bosnia?

Method from Reinhard Koselleck (conceptual history) ~ as contexts change, so do meanings of words

Used printed media archive (many newspapers in there). Corpus includes 20k articles containing mention of Turkey.

Collocation analysis. 5L5R. Set cut off for MI score and non-interesting words.
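{A minimal sketch of collocation with a 5L5R window and an MI cut-off, not the speaker's actual tooling. The MI formula used, log2 of observed over expected with expected = f(node) x f(collocate) x span / N, is one common variant and an assumption on my part. The sample sentence, stopword list and function names are hypothetical.}

```python
import math
from collections import Counter


def collocates(tokens, node, window=5):
    """Count words within `window` tokens left and right of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


def mi_scores(tokens, node, window=5, min_mi=None, stopwords=frozenset()):
    """Score each collocate by mutual information (assumed variant).

    Words in `stopwords` are skipped, and collocates scoring below
    `min_mi` are dropped when a cut-off is given.
    """
    n = len(tokens)
    freq = Counter(tokens)
    span = 2 * window  # 5L5R gives a span of 10 positions
    scores = {}
    for word, observed in collocates(tokens, node, window).items():
        if word in stopwords or word == node:
            continue
        expected = freq[node] * freq[word] * span / n
        mi = math.log2(observed / expected)
        if min_mi is None or mi >= min_mi:
            scores[word] = mi
    return scores


# Hypothetical toy text; a real corpus would be thousands of articles.
tokens = "turkey is a friend of bosnia and turkey is a partner".split()
boring = frozenset({"is", "a", "of", "and"})
scores = mi_scores(tokens, "turkey", stopwords=boring)
kept = mi_scores(tokens, "turkey", min_mi=0.0, stopwords=boring)
```

On a text this small the scores mean little; the point is only to show where the window, the cut-off and the list of non-interesting words enter the calculation.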

Two clear categories emerge: pro- and anti- Turkish. This isn't surprising.

Conclusions: pro-Turkish discourse more dominant. Likely to be concentrated in Sarajevo-based media. Letters to editors played a key role in critical discourse (published against the grain of editorial comment)

QA Q: this is a static representation rather than change over time. A: yes, overall picture for now, annual sub-corpora next.

Hermione Giffard (Utrecht University): Mining Newspapers. A Plea for Significance

Implied equality of all the documents when in corpora. What is relevance? Well in searches, TF-IDF. Is this measure of significance appropriate for historical research? (stood the test of CS values, simplicity, efficiency, reproducibility; here less common words used to differentiate between things) Decision making taken away. Are we sacrificing our values in favour of simplicity? {What are those values, on one hand being critical (which I agree we need more than ever) and on other our judgement, which you imply is different from TF-IDF... So, it strikes me that this is a good opportunity to think about what those 'values' are, values that rest on canons constructed through the lens of totalising, hegemonic 'truths' - that craft is not art, that science fiction is not good literature, that non-Western traditions are exotic. TF-IDF is just another way of building canons, no? And so it seems to me that an effective way of tackling the 'problem' of search and loss of control is by doing what we've been doing to canons since the linguistic turn, that is tearing them apart}
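{A minimal sketch of what TF-IDF computes, to make the point about less common words concrete. This is one textbook variant (term frequency as a share of document length, times the natural log of number of documents over document frequency), not the ranking used by any particular search interface. The three toy documents are my own.}

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy documents.
docs = [
    "the empire and the press",
    "the press and the theatre",
    "the theatre and the stage",
]
scores = tf_idf(docs)
```

A word that appears in every document ("the") scores zero however often it occurs, while a word unique to one document ("empire") scores highest there. That is the decision the measure takes on the researcher's behalf.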


Judging from the Q&As at #BeyondMining, whether a term's use & meaning is stable throughout time seems the biggest challenge for #dhist

— Max Kemman (@MaxKemman) September 15, 2015

Some admin...

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Exceptions: embeds to and from external sources, and direct quotations from speakers
