
## A place full of data

Notes for a talk I gave at the Web Archives as Big Data event, Institute of Historical Research, 3 December 2014.

The following text represents my notes rather than precisely what was said on the day and should be taken in that spirit.

Slides: http://www.slideshare.net/drjwbaker/2014-12-ihrwacbigdatatalk

Notes: https://gist.github.com/drjwbaker/c0799eefc99fe8ad7e3e


### Talk

SLIDE In The Government Machine, Jon Agar described how, in the wake of the 1901 census, civil servants at the General Register Office faced a data management problem. Huge sheets containing more than 5,000 compartments were being used to generate data on occupations by sex, age, and marital status, stretching the enumerative capabilities of humans, paper, and pens to the limit. For the 1911 census, government required population data on the working classes that would satisfy both the welfarists advocating for the alleviation of poverty - and hence, they thought, violence - through social expenditure and the eugenicists looking to halt a perceived degeneration of the British race. To accommodate these demands, the General Register Office opted for a punched-card system that could handle both the increased complexity of the data it received and the changing queries government wanted to ask.

SLIDE We're all here today because historians, a group with which I resolutely self-identify, have a data management problem of their own. An undergraduate starting her History degree this year was born in 1996, a year after the public deployment of the WWW and the very year that the public web began its inexorable growth. The British Library has been archiving the web for some years, and in April 2013 non-print legal deposit legislation gave us the right to grab a copy of every web page published in the UK web domain. The volume of this capture is turning us, decisively, into a place of data as much as a place of books and physical stuff. As historians, if the digitised newspaper and book archives we now have available to us were unthinkable just a decade ago, the size of this web corpus is simply staggering. There are of course access challenges - legislation yokes captured representations of the web to print paradigms. And yet even were better access available to the data, the ability of the historical profession (venerable examples we've seen today excepted) to manipulate, let alone navigate, that data is questionable. The macroscopes we need to put in context both an item in the corpus and the corpus with respect to an item may be in the hands of GCHQ, but - for the most part - they are not in ours. And so for those historians who start their undergraduate degrees this year, are captivated by contemporary history, and go on to undertake doctoral study on some post-1996 subject in - say - 2018, both their skills and the tools at their disposal may well remain far from up to the task.

SLIDE Now a major difference between the problem Agar describes and the problem historians face is structural context. One of the profound arguments that Agar makes is that public bodies such as the General Register Office were mechanised before the introduction of machines, and that the introduction of the latter energised the former. He writes: 'The imaginary construction of organizations as like a machine was prior to "real" mechanization - prior to but not proceeding, since systematic management enabled technological change, and, vice versa, technological change reinforced the position of systematic management' (144). In short, government mechanised - it solved its 1911 census problems with machines - because it already saw itself as a machine. Whilst the case could be made that university management increasingly see themselves as in charge of businesses that are machines, rather than public bodies and learning environments, the Institute of Historical Research, History Departments up and down the land, and I'll wager almost all professional historians both see their research as far from mechanised and do research that is far from mechanised: a few filing cabinets, stacks of ruled notepaper, Excel macros, 'if this then that' pushes, grep commands, Zotero harvesters, and API crawls aside. All of which is to say that, using the 1911 General Register Office census case as a foil, the present situation is hardly the right context for the mechanisation the historical profession might need to describe, facet, filter, mine, and embrace billions of web pages.

SLIDE And of course there are strong arguments against mechanisation, not least Tim Hitchcock's recent keynote 'Big Data, Small Data and Meaning', which rejects both the mechanisation of the humanities and research that uses big data only at the scale and scope of distance. SLIDE Nathan Jurgenson, in an excellent essay in The New Inquiry, warns that rationalist fantasies of big data - that it can provide a disinterested picture of reality, of human phenomena, a narrative of objectivity, a view from nowhere - are promulgated by a tech industry oblivious to and ignorant of the biases in its data capture. Jurgenson hopes, for the sake of data science - a research area he values and wishes to save from its current big-data-gasm - that the buzz around big data is soon pierced. Here historians of science and technology again offer useful additional framing. SLIDE David Edgerton's Shock of the Old is a particularly compelling and enduring counterpoint to the solutionist, self-aggrandising and innovation-laden language of Silicon Valley or Silicon Roundabout. In his use-centred history of things, Edgerton eschews the word 'technology' for its loaded association with novelty and innovation and reminds us of some simple truths: most innovations fail, futurology has been with us for some time, SLIDE and more pervasive than rich-world technologies are [quote] ''creole' technologies, technologies transplanted from their place of origin finding uses on a greater scale elsewhere', usually in a rich-world to poor-world transfer. Edgerton prompts us to think of shanty towns built by many hands over many years as opposed to the planned cities of Le Corbusier; we might think of old motorcycle engines turned into generators, or the people across West Africa reliant on all that we incontinent consumers waste. SLIDE I often wonder what it would take to stop folks talking about an all-encompassing conglomerate of 'big data' and to talk instead about the large-scale 'creole' creation, use, and repurposing of data by citizens - surely as our web archives expand that category of data will be big by any measure.

SLIDE Anyhow, I digress. The point is that for these reasons and many more I'm hesitant to use 'big data' as an appropriate nomenclature for the British Library's digital collections: the thousands of geo-located historic maps, the 1 million public domain book illustrations on Flickr https://www.flickr.com/photos/britishlibrary , the 3.5 million sound recordings, the 21 million digital images we provide to ECCO, the 57 million catalogue records, the 2 billion plus webpages. What I care about are the practical problems that confront the historian by virtue of now having evidence available at a hitherto unimaginable scale, at the scale of a British Library full of data; practical problems I commend the IHR for taking the initiative on with respect to web archives in particular. To make the best of this data, historians are going to need to let a little more mechanisation into their way of working, into the profession's outlooks, and certainly into the training they offer to young historians, and I posit that one way of doing this might be to observe and engage with the daily mechanisation of their bread and butter: the manuscript source.

SLIDE Off-piste as this may sound, I want to spend the rest of this talk on a totally different collection type to web archives for three reasons: first, personal digital archives are the contemporary counterpoint to the public web; second, like web archives they represent an enormous looming challenge for both collecting institutions and for researchers; and third, they contain enormous amounts of data but data of a very different character to web archives.

The British Library has a small but growing collection of personal digital archives. These include those of the poet Wendy Cope, the author Hanif Kureishi, the evolutionary biologist John Maynard Smith, Donald Michie - a pioneer of Artificial Intelligence - and many more. Often acquired as part of mixed-media bequests, these collections of materials are held across internal memory, external storage - from floppies to flash drives - data dumps, email archives, and - I'm sure soon - clouds connected to smartphones. SLIDE As an aside on the latter, Tim recently argued that ...

> “The greatest disaster for the historical record wasn’t the computer. It was the invention of the telephone.” —@TimHitchcock at #rrchnm20
>
> — Dan Cohen (@dancohen) November 15, 2014

... I can assure you that the locked-down, cloud-enabled iPhone is just as bad, if not worse.

SLIDE If you've encountered this category of collection before you probably haven't thought of it as 'big' data. For these personal digital archives can be - and in many cases are - presented to the public using models of access that are far from rich, at-scale data: hard drives are navigated, 'proper' documents are exported, PDFs are made, readers get to read them as though they were paper; in some cases on printouts just like a 'real' manuscript source (think of those filing cabinets of email printouts once ubiquitous in offices...). This is, however, no more than a fudge to comply with data protection, for in fact massive amounts of data are created actively and passively by the 'author' in their use of these media and by the collecting institutions as these media are examined, analysed, and stored.

SLIDE At the British Library we use forensic software (in many cases originally developed for the purpose of criminal investigations) to curate, to explore, to generate data, to understand. On one hand this gives us the potential to create a bootable version of someone's machine, allowing for qualitative and experiential interaction with a historic computing environment personal to an individual. More appropriate to the present discussion, software is used to capture every bit on every disk: so every file on every disk, forensic metadata for every file, and - if we're allowed... - every deleted file on every disk. We get file trees for every disk: traces of the system - or non-system - by which a person organised their use of a machine or a backup storage device. We get the dates all the files on those media were created, modified, and last accessed: traces of patterns of use, working hours, holidays. SLIDE We get hash values - the output of cryptographic hash functions, forming a digital signature of a file - and use them as a means of comparing differences between documents (much like bioinformatics sequence analysis).

SLIDE All this is too much to be hand sorted, hand read, and hand curated; it quickly becomes massive amounts of data, hundreds of thousands of lines of data for even a small hard drive by current standards. In the age of the other massive data we are talking about today, the web archive, I believe this category of source to be particularly important for the reason that it complements our web archives - it counterpoints the vast, expansive web-archive type of big with a big on a microscopic scale, at the level of bit size, use dates, file trees, downloads folders. As historians, I'm sure you all see how these collections will help scholars of the post-1995 era smooth out the selection bias of web archives, for these archives are - for the most part - private, personal, local, rough, and mundane. But of course in the look and feel of the things, the hardware, the files, the bit streams, this all could not be further removed from the boxes of manuscripts historians know so well. The personal digital archive is an actor-enabled socio-technical mechanisation of our most beloved source category at such a scale that data - big or no - and the need to prepare for some mechanisation surely cannot escape the attention of the profession for much longer. As web archive folks know, a tidal wave of historical data isn't coming; it is here. And if you feel the point isn't getting across to the profession, perhaps the transformation the manuscript has undergone is the stick - or the carrot? - that web archives need to prove that they are not the complicated digital exception to the physical archival rule.


Some admin...

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Exceptions: embeds to and from external sources
