Hard disks as archives of everyday life, 'Born digital big data and approaches for history and the humanities', School of Advanced Study (University of London), 8 June 2016

Hard disks as archives of everyday life

Notes for a talk I gave at 'Born digital big data and approaches for history and the humanities', School of Advanced Study (University of London), 8 June 2016

The following text represents my notes rather than precisely what was said on the day and should be taken in that spirit.


On Slideshare:


The premise of this talk is that the paper archive has been replaced by the hard disk, a new format that requires historians to think and act afresh. In just 25 years most people in Britain and worldwide have come to create information in a new way. Given that for historians the use of paper archives is essential to conceptions of good scholarship, this talk explores some of the adjustments to our methods that are required to work with hard disks as archives.

By training I'm an historian of eighteenth-century Britain. I write very normal histories, but increasingly over the last five or six years I have written code that interrogates data to help me understand the past. Recently my research interests have moved. Alongside my work on eighteenth-century prints I now also undertake research into contemporary history. I'm particularly interested in knowledge organisation from the 1990s onwards and the intersection or otherwise between the business ideal for how we organised our digital stuff, the reality of how we organised our digital stuff, and scholarly perspectives offered by Human Computer Interaction, itself 'disrupted' by the rapid growth in people having the same interactions with computation at home and at work (see Susanne Bødker).

This all started during my time as a Digital Curator at the British Library. One of my jobs there was to manage the Personal Digital Archives programme, work that collected, preserved, and stabilised the floppies, flash drives, zip drives, and hard disks given to the library as part of collection acquisitions. This was fascinating and challenging archival work. And yet it also got me thinking about my home profession, History. Because, if you think about it, between the release of Windows 3.0 in 1990 and the release of the iPhone in 2007, the personal computer dominated social and cultural interaction with computation. During this time the paper archive was replaced by the hard disk. Most people in Britain and worldwide came to create and organise information in a new way. If the digitised collections now available to historians were unthinkable to historians working in 1990, the born digital archives created since 1990 are staggering, counted in billions of documents, media, and software.
A single hard disk from this period can contain an archive of considerable scale and complexity, requiring the adoption of new methods to find and retrieve relevant documents (in short, a directory listing is too big for Excel to handle so you'll need a command line). Using hard disks as archives also requires a familiarity with new sources that can help scholars ask new questions: browser caches, file use metadata, downloads folders. Beyond scale and novelty, these hard disks raise ethical and legal questions, particularly in cases where information created by someone other than the depositor is present. How do we treat editorial drafts sent for review in confidence? How do we treat caches of mp3 files downloaded illegally but legitimised by behavioural norms? (to paraphrase the subtitle of a recent book on the topic, what happens when we study an entire generation that has committed the same crime?)
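
To make the point about scale concrete, here is a minimal sketch of how a directory listing might be produced without ever loading it into a spreadsheet. It assumes the disk (or a copy of it) is already mounted read-only at some folder; the function name `write_listing` and the CSV column names are my own illustrative choices, not part of any archival tool.

```python
import csv
import os
from datetime import datetime, timezone


def write_listing(root, out_csv):
    """Walk `root` and write one CSV row per file; return the number of files listed.

    Rows are streamed to disk one at a time, so the listing can run to
    millions of entries without being held in memory.
    """
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8",
              errors="backslashreplace") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "size_bytes", "modified_utc"])
        for folder, _subfolders, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(folder, name)
                try:
                    info = os.stat(full_path)
                except OSError:
                    # Unreadable entries are skipped rather than halting the walk.
                    continue
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow([
                    os.path.relpath(full_path, root),
                    info.st_size,
                    modified.isoformat(),
                ])
                count += 1
    return count
```

The output file should be written somewhere outside `root`, both so that it does not list itself and so that nothing is added to the material being studied. The modification times recorded are only as trustworthy as the clock of the machine that wrote them.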

I'm minded that this work is important because although many colleagues today will talk about web archives - for which much wonderful work has already been undertaken - the bread and butter of much humanities research isn't the published material that web archives represent, but rather archives of unpublished correspondence, diaries, administrative records, and the like. For historians in particular, the use of paper archives is essential to conceptions of good scholarship; the techniques, conventions, and rituals through which historians research history and validate authority are tied to paper. These are unsuited to working with hard disks containing published and unpublished born-digital documents, media, and software. As Peter Beal writes with respect to seventeenth-century scribes, 'each manuscript is peculiar: it is physically and ontologically unique': this doesn't follow for born-digital manuscripts. Equally, John Bender and Michael Marrinan describe in Regimes of Description: In the Archive of the Eighteenth Century the ways in which archivists describe objects, but what changes when everything in the archive has already been given a precise name and a place in the archive? It is imperative, therefore, that historians are able to adjust their techniques, conventions, and rituals to exploit the opportunities and avoid the pitfalls of research with hard disks, because rich, complex, and global interpretations of contemporary history could be written with these sources, and yet few historians have the computational and methodological skillset either to undertake this work or to prepare young historians for it.

In short, if using computational methods may have been optional for my research into the history of late-Georgian satirical printing, it isn't for the period after 1990. Most undergraduates starting university this September will have been born in 1998. When that undergraduate gets to start their PhD in, say, 2021, they will see the late-1990s as very legitimately the past and a period ripe for historical study.

What is required for us and them to undertake research with hard disks?

If we are using an archive held by a memory institution, we will likely be dealing with a forensic capture: that is, an exact replica of the data held on the disk (operating system, file system, software, documents, et cetera) made into an independent digital file. This is 'digital forensics', a suite of software approaches imported into archival practice from the world of law enforcement. Now, the memory institution capturing the disk, faced with the volume of data on just one disk, is likely to have described the collection only at disk level - taking Greene & Meissner's 'More Product, Less Process' approach in order to deliver some access with the limited resources they have. In order, then, to isolate areas of interest, we as researchers will need to be able to query a directory listing with simple computational search tools. Once interesting objects have been isolated, to open and read them we may need an emulator, for software companies usually care little for backwards compatibility. Emulation-as-a-service in the archives sector is making huge strides - as we'll no doubt hear about - but we will likely still need to grapple with unfamiliar software tools rather than having everything presented to us. But the payoff from emulation has huge potential: gaining us access not only to documents but to whole computer environments.
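
As a rough illustration of what querying a directory listing with simple computational search tools can look like, the sketch below filters a plain-text listing (one path per line) with a regular expression and tallies file extensions. The function names and the sample paths are hypothetical, invented for this example; a real listing would come from the holding institution or from a forensic tool.

```python
import re
from collections import Counter
from pathlib import PureWindowsPath


def search_listing(lines, pattern):
    """Yield every path in the listing that matches `pattern` (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        path = line.strip()
        if path and regex.search(path):
            yield path


def count_extensions(lines):
    """Tally lower-cased file extensions to get a first overview of a disk."""
    counts = Counter()
    for line in lines:
        path = line.strip()
        if path:
            # PureWindowsPath accepts both back and forward slashes.
            counts[PureWindowsPath(path).suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample listing, for illustration only.
    listing = [
        "C:\\My Documents\\Letters\\draft1.DOC",
        "C:\\My Documents\\Music\\track01.mp3",
        "C:\\WINDOWS\\Temporary Internet Files\\index.dat",
    ]
    print(list(search_listing(listing, r"\.doc$")))
    print(count_extensions(listing).most_common())
```

Because both functions consume the listing line by line, they can be pointed at an open file handle as readily as at a short list, which is what makes them usable on listings far too long for a spreadsheet.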

As Matthew G. Kirschenbaum writes:

Gaining access to someone else's computer is therefore like finding a master key to their house, with the freedom to open the cabinets, cupboards, and desk drawers, to peek at family photo albums, to see what's recently been playing on the stereo or TV, even to sift through what's been left behind in the trash (Track Changes, 215)

We know from captures of Salman Rushdie's computers at Emory University, for example, that he - to quote Kirschenbaum again - 'first opened his first file for his first draft of The Moor's Last Sigh on April 20, 1992, at 11:58 in the morning' (229). We don't know, however, whether Rushdie set the clock on his computer correctly, and so - to quote from Kirschenbaum's excellent book Track Changes one final time:

A scholar working with such materials must be conversant in the antiquarian cants of vanished operating systems, file formats, and emulators, just as we expect an early modernist doing book history to know something of signatures and collation formulas (233)

The best way to become conversant in these 'cants' is to do some digital forensics work ourselves, and if we are using a hard disk not held by a memory institution, we may have to. In the same way that in the age of A: and B: drives you couldn't just put a floppy disk in your computer without sliding that little tag that kept it write-protected, you can't just turn on an old computer and start browsing: the act of booting it up adds new data to the archive with fresh date stamps, thus compromising its authenticity. You need hardware (a thing called a write-blocker) and digital forensics software. Open source digital forensic tools aimed at archivists and scholars have made huge strides in recent years thanks largely to the efforts of the BitCurator project led by the University of North Carolina at Chapel Hill. Nevertheless, in terms of our method, unfamiliar software tools will again be required: a world away from just opening a box and reading.
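
One small habit that follows from this concern with authenticity is recording a checksum of a disk image once it has been captured, so that any later change to the file can be detected. The sketch below is a generic illustration using Python's standard library, not a description of how BitCurator or any particular forensic tool does it; the function name `sha256_of_file` is my own. Opening a file read-only in software is no substitute for a hardware write-blocker at the point of capture.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is opened read-only and read in chunks, so an image of many
    gigabytes can be hashed without loading it into memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Computing the digest at capture and again before each research session, and comparing the two, gives a simple check that the image being studied is still the image that was captured.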

I hope in the not too distant future to run some workshops on using BitCurator to capture and preserve hard disks as archives and to begin analysing the born-digital materials they contain. For it isn't super hard and it isn't unachievable. As a convenor of the IHR Digital History seminar I've been heartened to hear tales of 'ordinary' historians achieving through collaboration, tenacity, and small steps what they had previously thought could only be achieved with a big team, computer scientists, and a big grant. The hard disk is one of those things 'ordinary' historians - and indeed, citizens preserving their own family histories - need to figure out.

Some admin...

<img alt="Creative Commons Licence" style="border-width:0" src="" />
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Exceptions: quotations and embeds to and from external sources
