drjwbaker/2015-01-29_Bern-talk.md

##Removing black boxes: exposing software to researchers
Notes for a talk I gave at Scholarship in Software, Software as Scholarship: From Genesis to Peer Review, Universität Bern, 29 January 2015
The following text represents my notes rather than precisely what was said on the day and should be taken in that spirit.
Attendance funded by fellowship from the Software Sustainability Institute.
Slides: http://www.slideshare.net/drjwbaker/removing-black-boxes-exposing-software-to-researchers
Notes: https://gist.github.com/drjwbaker/3d54d6712a0a124b0530

###Talk
In light of Willard's talk 'removing' is perhaps a bit strong here, but hey...
SLIDE SSI fellowship. Themes: reproducibility, skills, accreditation for work on and with research software.
In the humanities domain there are, as we know, complex challenges around software sustainability. For if it is true that humanists rely on software, increasingly software developed by their community, many if not most don't value using it and their nascent systems of credit for good software development and reuse practice are fragile. And so to advance humanities research in an era of digital libraries, key stakeholders in the humanities must ascribe the same value to the development of and experimentation with research software as they do to traditional practices such as literature surveys, source critique, and written publications.
The activities of the BL Digital Research team - whether releasing data, making bots, experimenting with services, or hosting open days - often advocate for humanists to value time spent developing and experimenting with software by starkly unsheathing services yoked to print paradigms of their black boxes, by exposing humanists to the software that underpins their research.
In this paper, I just want to work through two examples of these practice-based activities.

###Personal Digital Archives
SLIDE Personal digital archives interest me as a historian-cum-curator for four reasons: first, they are the contemporary counterpoint to the public web; second, they represent an enormous looming challenge for both collecting institutions and for researchers; third, they contain enormous amounts of data at a microscopic level; and fourth, they represent the almost total mechanisation of the historian's bread and butter: the manuscript source.
The British Library has a small but growing collection of personal digital archives. These include archives for the poet Wendy Cope, the author Hanif Kureishi, the evolutionary biologist John Maynard Smith, Donald Michie (a pioneer of Artificial Intelligence), and many more. Often acquired as part of mixed media bequests, these collections of materials are held across internal memory, external storage - from floppies to flash drives -, data dumps, email archives, and - I'm sure soon - clouds connected to smartphones. SLIDE As an aside on the latter, Tim recently argued that ...
“The greatest disaster for the historical record wasn’t the computer. It was the invention of the telephone.” —@TimHitchcock at #rrchnm20
— Dan Cohen (@dancohen) November 15, 2014
... I can assure you that the locked down, cloud enabled iPhone is just as bad, if not worse.
SLIDE If you've encountered this category of collection before you probably haven't thought of it as heavy with software. For these personal digital archives can be - and in many cases are - presented to the public using models of access that are far from rich, at-scale data: hard-drives are navigated, 'proper' documents are exported, PDFs are made, readers get to read them as though they were paper; in some cases on printouts just like a 'real' manuscript source (think of those filing cabinets of email printouts once ubiquitous in offices...). This is, however, no more than a fudge to comply with data protection, for in fact massive amounts of data are created actively and passively by the 'author' in their use of these media and by the collecting institutions as these media are examined, analysed, and stored.
SLIDE At the British Library we use forensic software (in many cases originally developed for the purpose of criminal investigations) to curate, to explore, to generate data, to understand. On one hand this gives us the potential to create a bootable version of someone's machine, allowing for qualitative and experiential interaction with a historic computing environment personal to an individual (including, for example, being able to rerun scientific experiments in the software environment they were originally run in). More appropriate to the present discussion, software is used to capture every bit on every disk: so every file on every disk and forensic metadata for every file and - if we're allowed... - for every deleted file on every disk. We get file trees for every disk: traces of the system - or non-system - by which a person organised their use of a machine or a backup storage device. We get the dates on which all the files on those media were created, modified, and last accessed: traces of patterns of use, working hours, holidays. SLIDE We get hash values - the outputs of cryptographic functions that form a digital signature of a file - and use them as a means of comparing difference between documents (much like bioinformatics sequence analysis).
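To give a feel for the kind of data described above, here is a minimal sketch in Python. It is not the forensic software used at the British Library: it assumes an ordinary mounted copy of a disk rather than a forensic image, it cannot see deleted files, and it records only modified and last-accessed times because creation time is not exposed consistently across operating systems. The function names `sha256_of`, `survey`, and `duplicates` are illustrative inventions.

```python
import hashlib
import os
from datetime import datetime, timezone


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def survey(root):
    """Walk a directory tree and return one metadata record per file."""
    records = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            info = os.stat(path)
            records.append({
                "path": os.path.relpath(path, root),  # position in the file tree
                "size": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
                "sha256": sha256_of(path),
            })
    return records


def duplicates(records):
    """Group file paths by hash, keeping only hashes seen more than once."""
    by_hash = {}
    for record in records:
        by_hash.setdefault(record["sha256"], []).append(record["path"])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `duplicates(survey(root))` on a folder lists byte-identical copies wherever they sit in the tree. A cryptographic hash of this kind only tells us whether two files are exactly the same; measuring how far two drafts differ would need a different technique.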
All this is too much to be hand sorted, hand read, and hand curated; it quickly becomes massive amounts of data, hundreds of thousands of lines of data even for hard-disks that are small by current standards. In the look and feel of the things, the hardware, the files, the bit streams, this all could not be further removed from the boxes of manuscripts historians know so well. In short, the personal digital archive is an actor-enabled socio-technical mechanisation of the humanist's, and in particular the curator-cum-historian's, most beloved source category at scale.
This mechanisation causes cultural, organisational, and social challenges. At the British Library, work with this category of material has been ongoing for over a decade and can be characterised as specialist R&D. Since Spring 2014, in an effort to bring our work with this material towards business as usual, the Digital Research team have made efforts to distribute the knowledge and skills required to capture and preserve our born-digital collections to a wider group of curatorial colleagues. A notable by-product of this work is that it has confronted these colleagues - and will, as we push the work forward and outward, confront researchers - with complex, vast, and diverse forensic captures of personal digital media, not just small collections of print-like born-digital documents. In doing so, we have attempted to confront them with the software decisions that must be made during the initial arrangement and interpretation of these collections:

- whether to sort or not to sort, to cherry pick or to view holistically, to treat as paper (at the level of 'interesting' documents) or as a representation of a life (I always say that downloads folders and browser caches are probably the most valuable materials in these collections to future researchers);
- whether these software decisions are driven by the urge to mimic existing processes, whether these decisions are fit for purpose, and whether new processes might better fit the medium;
- and, conversely, whether, although tools create a sense of forcing the hand, using software to curate these collections can be just as idiosyncratic as the curation of paper collections, while leaving more traces, more documentation, more sense of a procedure being followed.


###The British Library Big Data Experiment
SLIDE For my second example I will step away from this core collection management scenario and into the realm of experimentation. In December 2013 the British Library Digital Research Team released onto Flickr Commons one million digital images clipped from circa 65,000 digitised 17th, 18th, and 19th century books, usually referred to as the Microsoft Books (because Microsoft did the original digitising). We knew very little about these images apart from the books they appear in and where in those books they appear.
As these images, indeed the whole dataset, are in the Public Domain, they are ideal for experimentation. SLIDE In Spring 2014, and thanks to a little match making after we were awarded a Microsoft Azure for Research Grant, we established the British Library Big Data Experiment, a project that has become an ongoing collaboration between British Library Digital Research and UCL Department of Computer Science (UCLCS), facilitated by UCL Centre for Digital Humanities (UCLDH), that enables and engages students in computer science with humanities research issues and the cultural heritage sector as part of their core assessed work.
In June 2014 the first British Library Big Data Experiment team was convened with a group dissertation project, work submitted in fulfilment of MScs in Software Systems Engineering and Computer Science respectively, that used the images from the Microsoft Books collection and the text and metadata from the books they came from to underpin the design of a research-oriented web-based service, with the students working in close consultation with Humanities researchers who may have wished to use the capabilities of such a system. SLIDE The final, then public, output (http://blpublicdomain.azurewebsites.net/), now out of service, represented an attempt to capture the complex and multi-faceted needs of humanities researchers SLIDE whilst at the same time offering unconventional services such as SLIDE OCR text previews, SLIDE bulk download of text based on metadata queries, and word frequency lists; things that prodded and poked at software decisions that lingered under the hood of graphical user interfaces.
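A word frequency list is a good example of a service that looks neutral but hides software decisions. The sketch below is not the students' implementation, which is not described here; it is a minimal illustration in Python, and the function name `word_frequencies` and the tokenising rule are assumptions. Lower-casing, what counts as a word, and how apostrophes are handled are all choices made in one line of code.

```python
import re
from collections import Counter


def word_frequencies(text, top=None):
    """Count lower-cased word tokens in a string of OCR text.

    A token is a run of the letters a-z, optionally followed by an
    apostrophe and more letters. Accented characters, digits, and
    hyphenated forms are therefore split or dropped: a decision the
    reader of the resulting list never sees.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(top)


sample = "The ship left the harbour; the harbour was quiet."
print(word_frequencies(sample, top=2))
# prints [('the', 3), ('harbour', 2)]
```

Changing the regular expression, or choosing not to lower-case, changes the list. With OCR text, recognition errors and period spelling would also spread one word across several entries.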
So although primarily a pedagogical project, there are clear secondary benefits. And following this successful pilot, the British Library Big Data Experiment has become a platform for collaborative work, including machine learning and mobile app development teams in Autumn/Winter 2014/2015.
SLIDE The app project was finished this month and SLIDE uses a Draw Something-like game design to SLIDE feed structured tags about the image collections back to us. SLIDE The time spent guessing each set, SLIDE the necessity to use hints, and any failures all feed into an evolving confidence value for semantic links between images, confidence values we will feed back to the research community to evaluate and use. https://picaguess.herokuapp.com
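These notes do not record how the app turns those signals into a confidence value, so the following Python sketch is purely hypothetical: the function names, the equal weighting of speed and hints, the 60 second limit, the three-hint cap, and the update rate are all invented for illustration. It shows only the general shape of the idea, in which each round of play nudges a running score up or down.

```python
def round_evidence(seconds_taken, hints_used, solved, time_limit=60.0, max_hints=3):
    """Score one round of guessing between 0 and 1 (hypothetical rule).

    A failed round counts as no evidence for the link. A solved round
    scores higher the faster it was solved and the fewer hints it needed.
    """
    if not solved:
        return 0.0
    speed = max(0.0, 1.0 - seconds_taken / time_limit)
    independence = max(0.0, 1.0 - hints_used / max_hints)
    return 0.5 * speed + 0.5 * independence


def update_confidence(current, evidence, rate=0.2):
    """Move the running confidence a fraction of the way towards new evidence."""
    return current + rate * (evidence - current)


confidence = 0.5  # assumed neutral starting point
for seconds, hints, solved in [(15, 0, True), (50, 2, True), (60, 3, False)]:
    confidence = update_confidence(confidence, round_evidence(seconds, hints, solved))
```

Whatever rule is actually used, every constant in it is a judgement about what counts as agreement between players, which is the kind of decision worth exposing to the researchers who receive the values.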
SLIDE The as yet unfinished machine learning project takes two approaches to clustering images. One has the machine interpret a semantic idea - at the moment 'graph' - to cluster and connect book illustrations. The second is more interesting, for it takes various cruder, more abstract training sets and has the machine, the algorithm, produce categories based on these visual characteristics. We will then ask the human, the researcher, to interpret the connections made. Whilst the first approach is linguistically comprehensible it is inevitably incomplete, for any visually odd graphs are unlikely to be found. The second approach on the other hand is always 'correct' and yet difficult to interpret for non-algorithmic, linguistic comprehension - and may, as a result, seem to miss things. By eventually publishing sets of images based on these approaches, we hope to provoke the research community to consider curatorial judgements made by these approaches, judgements not dependent on human generated catalogue metadata for order but rather on a machine under our instruction doing what it does best - taking signals, ignoring 'meaning' we might ascribe, and forging connections.

###Conclusions
SLIDE Neither this work with Personal Digital Archives nor the Big Data Experiment is perhaps aimed at catching the attention of people in this room, but both do encourage important stakeholders to think critically about the role of software in research or curation. Both these projects, we hope, provoke the research community to consider judgements made by software, to investigate the constraints of using software, and to embrace its potential to make intellectual contributions.

Some admin...


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Exceptions: embeds to and from external sources