Harvesting illustrations with nearby text

Many Wellcome works are printed books with illustrations, figures, tables and so on that are identified by OCR during digitisation.

Many of those will have nearby text that is likely to be a caption describing the image.

The IIIF representation offers a quick way of getting the pixels for the region of the page occupied by the image, and can also be used to find nearby text.
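
For instance, if OCR reports an illustration at a known pixel region of the page, a IIIF Image API URL for just that region can be built directly. This is a minimal sketch assuming an Image API 2.x service; the service URL and coordinates are made-up placeholders:

```js
// A minimal sketch, assuming a IIIF Image API 2.x service for the page image.
// The service URL and region coordinates are hypothetical placeholders.
const imageService = "https://iiif.example.org/image/b12345678_0042";
const region = { x: 350, y: 1200, w: 1400, h: 900 }; // OCR-reported bounds of the illustration

// IIIF Image API URL pattern:
// {service}/{region}/{size}/{rotation}/{quality}.{format}
function regionUrl(service, { x, y, w, h }, size = "full") {
  return `${service}/${x},${y},${w},${h}/${size}/0/default.jpg`;
}

console.log(regionUrl(imageService, region));
// https://iiif.example.org/image/b12345678_0042/350,1200,1400,900/full/0/default.jpg
console.log(regionUrl(imageService, region, "!512,512"));
// the same region, scaled to fit within 512x512 (handy for a machine-learning pipeline)
```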

Example:

Helpers:

https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L161

https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L172

(apologies for my 90s-style JavaScript).

A more sophisticated approach could learn what text is likely to be a caption, but these techniques are still enough to get the pixels of the image (at whatever scale a machine-learning task might need, via the IIIF parameters) and the text lines that are candidate captions, as in the sketch below.
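
As a rough illustration of the "nearby text" part (not what annodump.js actually does), here is one way to pick candidate caption lines, assuming the OCR text lines have been loaded as objects with pixel bounding boxes in the same coordinate space as the illustration:

```js
// A rough sketch of picking candidate caption lines. Assumes OCR text lines
// are available as { text, x, y, w, h } objects, with coordinates on the same
// page image as the illustration. The heuristic (lines directly below the
// image, within a small vertical gap) is illustrative only.

// Vertical gap between the bottom of the illustration and the top of a line
// (Infinity if the line sits above the illustration).
function gapBelow(image, line) {
  return line.y >= image.y + image.h ? line.y - (image.y + image.h) : Infinity;
}

// Does the line overlap the illustration horizontally?
function overlapsHorizontally(image, line) {
  return line.x < image.x + image.w && line.x + line.w > image.x;
}

// Candidate captions: lines directly below the illustration, within maxGap
// pixels of its bottom edge, ordered top-to-bottom.
function candidateCaptions(image, lines, maxGap = 150) {
  return lines
    .filter(line => overlapsHorizontally(image, line))
    .filter(line => gapBelow(image, line) <= maxGap)
    .sort((a, b) => a.y - b.y)
    .map(line => line.text);
}
```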
