
Background

The Paul Mellon Centre digitised 250 volumes of the Exhibition Catalogue for the Royal Academy Summer Exhibition, from 1769 to 2018, and commissioned in-depth scholarly articles for each year of the exhibition to coincide with the 250th anniversary of the Summer Exhibition.

The resulting website can be found at: https://chronicle250.com.


Digirati were asked to develop the website from designs by Strick and Williams and to provide the supporting infrastructure for the site using the Digital Library Cloud Service (DLCS).

Requirements:

  • Each catalogue should be available online using the IIIF Image and Presentation APIs. See https://iiif.io for details.
  • Each catalogue should have searchable full text.
  • Exhibitors should be identified in the catalogue text and linked back, via hotlinks on the images, to a searchable Index on the main Chronicle250 site.
  • Index entries for a given Exhibitor should link to all occurrences of that artist in the corpus of Exhibition catalogues.
  • Pages for each year with rich scholarly articles.
  • Index entries for authors and artworks.
  • Thematic indexes across the curated per-year articles.

In building the site Digirati:

  • Provided performant versions of the digitised catalogues and illustrations with deep zoom functionality and support for open APIs (https://iiif.io) using the DLCS.
  • Created OCR for these images, including 18th-century catalogues with historic typefaces.
  • Identified exhibitors within the catalogue text and associated exhibitors with regions of images to create hotlinks between the catalogue and the index.
  • Provided a usable search experience both within an individual catalogue and across catalogues.
  • Created a usable index of Exhibitors.
  • Brought the content -- catalogues, indexes, scholarly articles -- together following Strick and Williams' design brief to create the Chronicle250 site.

The Solution

A more detailed technical version of this information can be found here.

DLCS

Had we started from scratch, with no existing infrastructure and no existing code base, the Chronicle250 project could have been very costly in terms of both time and budget.

However, Digirati provide a hosted, cloud-based service, the DLCS, designed to run as a multi-tenant service shared by users who may be unable, or may not wish, to run their own image-hosting infrastructure. The DLCS uses the IIIF APIs and is based around open standards, so new projects can easily be built on top of it. The DLCS can also be optionally enhanced with additional services that enrich content with tags, transcriptions, and search.

The use of the DLCS was a key requirement for this project, as its existence made many of the core functions required for the site achievable without a large amount of infrastructure work or basic software development. Development time, and thus budget, could be concentrated on front-end development and on enhancements to existing DLCS services around annotation and natural language processing, rather than on core image hosting or text processing and indexing functionality.

The DLCS provides services which:

  • Transcode images to JPEG 2000. (Multi-tenant)
  • Generate static thumbnails at multiple resolutions. (Multi-tenant)
  • Provide a scalable IIIF Image API service (see the example request after this list). (Multi-tenant)
  • Provide basic IIIF Presentation APIs for create, read, update and delete of IIIF collections, sequences, manifests, and canvases. (Project specific)
  • Create OCR text from a IIIF Image API source. (Project specific)
  • Normalise OCR to a standard common format (to ensure the DLCS is OCR-engine agnostic). (Project specific)
  • Provide OCR text as OA annotations (for display in IIIF Presentation API 2.x clients). (Project specific)
  • Do named entity extraction from controlled vocabularies, or from standard NER models. (Project specific)
  • Store W3C and OA web annotations in an annotation server. (Project specific)
  • Index W3C and OA annotations alongside OCR text and provide IIIF Content Search API services. (Project specific)
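
To give a concrete sense of what the Image API service provides: a IIIF Image API request is just a parameterised URL. A minimal sketch in Python follows; the base URL, helper name, and identifier are hypothetical examples, not actual Chronicle250 endpoints.

    # Minimal sketch: composing a IIIF Image API 2.x request URL.
    # The base URL and identifier are hypothetical examples.
    BASE = "https://dlcs.example.org/iiif-img/chronicle250"

    def iiif_image_url(identifier, region="full", size="full",
                       rotation="0", quality="default", fmt="jpg"):
        """Compose {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}"""
        return f"{BASE}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

    # A deep-zoom tile: a 1024x1024 region of the source image, scaled to 512px wide.
    print(iiif_image_url("catalogue-1769-p1", region="0,0,1024,1024", size="512,"))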

For Chronicle250 we were able to use the shared multi-tenant services as-is, and then customise the project-specific services to provide the enhancements we needed to identify, link, and index exhibitors in the digitised versions of the exhibition catalogues.

OCR Services

The catalogues for Chronicle250 span 250 years of Royal Academy exhibitions, which introduces particular demands around OCR quality, as the historic typefaces used are typically not OCR'd well by off-the-shelf open-source OCR engines like Tesseract or Ocropus. In addition, segmentation of images into blocks, paragraphs, and lines is also difficult, because the text is often quite heavily skewed, with bleedthrough from verso pages, and uneven kerning introduces erroneous whitespace throughout.

(Image: page 1 of the 1769 catalogue)

We evaluated a number of OCR engines, including:

  • Tesseract and Ocropus (for open source, locally hosted engines)
  • Microsoft Azure Cognitive Services
  • ABBYY SDK
  • Google Vision Document Text Detection

The DLCS already had integrations for Google Vision and Tesseract, and we found that Google Vision scored well compared to the other cloud-based services from Microsoft and ABBYY, and scored significantly higher than Tesseract. A range of typefaces is used throughout the 250 years of catalogues, so training Tesseract on glyphs from particular catalogue years would not have scaled well across the entire project, and would have introduced significant additional demands on staff time for results unlikely to exceed those of the cloud services, which could be used immediately.

We were able to use the existing DLCS OCR services as-is to do text extraction and normalisation of OCR text, without significant customisation for this project.
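
As an illustration, text extraction with Google Vision's document text detection follows roughly the pattern below (a minimal sketch using the google-cloud-vision Python client; the DLCS integration wraps this in its own pipeline, and the filename is a hypothetical example).

    # Minimal sketch of Google Vision document text detection.
    # Requires the google-cloud-vision client library and credentials.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("catalogue-1769-p1.jpg", "rb") as f:  # hypothetical file
        image = vision.Image(content=f.read())

    response = client.document_text_detection(image=image)

    # The response carries the full text plus per-block bounding boxes,
    # which is what allows text to be mapped back to image regions later.
    print(response.full_text_annotation.text)
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            print(block.bounding_box)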

Natural Language Processing and Named Entity Recognition

The DLCS has a named entity recognition service which uses IIIF, spaCy (spacy.io), and W3C Web Annotations to tag regions of images with people, places, dates, organisations, and other classes of entity.
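
The basic spaCy pattern the service builds on looks like this (a minimal sketch using an off-the-shelf English model; the input sentence is an invented example):

    # Minimal sketch of named entity recognition with spaCy,
    # using an off-the-shelf model rather than a custom-trained one.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("J. Northcote, R.A. exhibited at Somerset House in 1769.")

    for ent in doc.ents:
        # Each entity has a text span, a label (PERSON, ORG, DATE, ...)
        # and character offsets that can be mapped back to OCR regions.
        print(ent.text, ent.label_, ent.start_char, ent.end_char)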

We evaluated this service using off-the-shelf neural network models untrained on the Royal Academy corpus, and found that the overall quality of the tags produced was not acceptable, both in terms of artists correctly identified and in terms of non-artists falsely identified.

A typical catalogue page might contain entries such as "J. Northcote, R.A.", while other pages within the same volume might list the same artist as "Northcote, James, R.A.".

We had to identify the artist names on each page, but also identify when different occurrences of a name within the catalogue were references to the same artist, given the different forms in which an artist's name might appear.

To improve the results, we:

  • Wrote code that parsed known sources of artist data.
  • Generated variant forms of artist names, so that the system correctly identified that J. Northcote, R.A. and Northcote, James, R.A. were the same person, and identified which James Northcote (the painter who lived from 1746 to 1831: https://en.wikipedia.org/wiki/James_Northcote) this was.
  • Wrote code to handle the kerning and segmentation issues with historic text, by normalising and/or ignoring whitespace.
  • Wrote code to filter artists by date, to ensure that only the relevant artists for a given catalogue year were in the "pool" for tagging.
  • Used the Aho-Corasick algorithm to do fast pattern matching of the text against the known list of artist names (a sketch of this approach follows this list).
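
A minimal sketch of the matching approach is below, combining simplified variant generation with Aho-Corasick matching via the pyahocorasick library; the variant rules and the canonical artist identifier are illustrative assumptions, not the production rule set.

    # Sketch: generate variant name forms, then find all of them in the
    # OCR text in a single pass with the Aho-Corasick algorithm.
    import ahocorasick

    def variants(forename, surname, honorific="R.A."):
        """A few illustrative variant forms of an artist's name."""
        return {
            f"{forename[0]}. {surname}, {honorific}",  # "J. Northcote, R.A."
            f"{surname}, {forename}, {honorific}",     # "Northcote, James, R.A."
            f"{forename} {surname}",                   # "James Northcote"
        }

    def normalise(text):
        """Collapse the erroneous whitespace introduced by uneven kerning."""
        return " ".join(text.lower().split())

    automaton = ahocorasick.Automaton()
    for form in variants("James", "Northcote"):
        # Every variant maps back to one canonical artist identity.
        automaton.add_word(normalise(form), "james-northcote-1746-1831")
    automaton.make_automaton()

    page_text = normalise("26  J.  Northcote,  R.A.  Portrait of a gentleman")
    for end, artist_id in automaton.iter(page_text):
        print(artist_id, "matched, ending at offset", end)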

This code was implemented as an enhanced version of an existing DLCS service, so we did not have to write an entirely new software stack from scratch, and were able to take advantage of existing integrations with OCR services and annotation servers (for storing the output as annotations on IIIF content).
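
Stored as a W3C Web Annotation, a tagged exhibitor looks roughly like the sketch below (expressed as a Python dict; all URIs and coordinates are invented for illustration):

    # Sketch of a W3C Web Annotation tagging an image region with an
    # exhibitor. All URIs and coordinates are invented examples.
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        "body": {
            "type": "TextualBody",
            "value": "James Northcote",
            "purpose": "tagging",
        },
        "target": {
            "source": "https://example.org/iiif/catalogue-1769/canvas/p1",
            "selector": {
                "type": "FragmentSelector",
                "value": "xywh=120,340,480,60",  # region of the page image
            },
        },
    }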

IIIF Viewing components: Canvas Panel and the 'PMC' Viewer

In order to provide the results of the tagging process alongside the IIIF Image API images, Digirati built a bespoke IIIF Presentation API viewer for the Chronicle250 site.

Digirati have built CanvasPanel, a lightweight IIIF Presentation API canvas-viewing component that supports annotation display; it has been used on projects for the Victoria and Albert Museum, such as their Ocean Liners exhibition.

For the Chronicle250 project, we took CanvasPanel and added additional support for:

  • IIIF Content Search API (see the example query after this list)
  • Multi-page documents with navigation
  • Highlighting/linking annotations
  • Search queries passed in from the Chronicle250 Index of Exhibitors.
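
A content search request from the viewer is a simple HTTP query against the search service. A minimal sketch follows; the endpoint URL is a hypothetical example.

    # Sketch of a IIIF Content Search API (v1) query, as issued by the
    # viewer. The search service URL is a hypothetical example.
    import requests

    resp = requests.get(
        "https://example.org/iiif/catalogue-1769/search",
        params={"q": "Northcote"},
    )
    results = resp.json()

    # The response is an AnnotationList; each resource targets a canvas
    # region ("on": "...#xywh=..."), which the viewer turns into highlights.
    for anno in results.get("resources", []):
        print(anno.get("on"))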

The PMC viewer can be found on GitHub at: https://github.com/digirati-co-uk/pmc-viewer

Search and Indexing

The full DLCS provides an IIIF Content Search service, Mathmos, which integrates with the DLCS message bus and indexes both full text (provided by OCR) and annotations (provided by machine-generated tags).

However, for the Chronicle250 project, the vision was not to rely on the DLCS for delivery of textual content or services to the viewer. The DLCS text pipeline could be shut down after processing, leaving just the Chronicle250 website/application and the DLCS IIIF Image API and IIIF Presentation API services running as active services. In addition, the IIIF Content Search service on the DLCS provides basic, generic search services which would not fulfil the full requirements of the Chronicle250 site.

Instead, for Chronicle250 we built a bespoke Elasticsearch-based index, which provided both the IIIF Content Search service for the PMC Viewer and the general search and indexing services on the main Chronicle250 site.
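
A query against such an index might look like the sketch below, using the official Elasticsearch Python client; the index name, field names, and document structure shown are illustrative assumptions, not the production schema.

    # Sketch of querying a bespoke Elasticsearch index for exhibitor text.
    # Index and field names are hypothetical examples.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    result = es.search(
        index="chronicle250",  # hypothetical index name
        body={
            "query": {
                "bool": {
                    "must": [{"match": {"text": "Northcote"}}],
                    "filter": [{"term": {"year": 1769}}],  # hypothetical field
                }
            }
        },
    )
    for hit in result["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("canvas"))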

Overall Results

The machine identification of exhibitors across the corpus was extremely successful, given the relatively short time spent on bespoke software development and R&D.

We were able to successfully identify 318,690 exhibitor entries across the catalogues. An upper bound for the maximum possible number, assuming each exhibitor exhibited only once in each catalogue, would be 513,068; however, since exhibitors commonly exhibited more than once in a given year, the actual total is certainly lower, so the 318,690 identifications represent at least 62% (318,690 / 513,068) of the true total. Very few tags were produced for the post-1990 catalogues, because we lacked any Exhibitor data for those years.

Using the techniques described in this article offered a very good return on time invested, versus the time it would have taken to manually tag 400,000 - 500,000 names in the corpus. Combining these techniques with the services provided by the DLCS made a relatively resource- and data-heavy project possible in a relatively short timescale.

To measure the output, the Paul Mellon Centre produced statistics in Google Data Studio, showing the accuracy and distribution of tags across the entire corpus.

Credits

Digirati

Adam Meszaros, Senior Frontend Consultant. Full stack developer. Chronicle250.com; Site indexes and IIIF Content search services; PMC Viewer; Integration.

Stephen Fraser, Front End Technical Lead. AnnotationStudio; CanvasPanel; PMC Viewer.

Matt McGrattan, Head of Digital Library Services. DLCS text pipeline; Natural language processing and tagging; Digirati Product Owner.

Adam Christie, Senior Engineer. DLCS Infrastructure; DevOps.

Ville Vartiainen, Senior UX Consultant. Digirati User Experience.

Ian Farquhar, Head of Project Delivery. Project Management.

Paul Mellon Centre

Tom Scutt, PMC Product Owner and Digital Editor.

Mark Hallett, Sarah Victoria Turner, Jessica Feather, Scholarly Editors.

Baillie Card, Publishing Editor.

Maisoon Rehani, Picture Editor.

Tom Powell, Sean Ketteringham and James Finch, Researchers.

Thérèse Saba, Copyeditor.

Jan Worrall, Indexer.

Strick and Williams

Charlotte Strick, Design.

Claire Williams Martinez, Design.

User Experience

http://www.unaffiliatedworks.com/
