Harvesting illustrations with nearby text

Many Wellcome works are printed books with illustrations, figures, tables and so on that are identified by OCR during digitisation.

Many of those illustrations have text nearby that is likely to be a caption describing the image.

The IIIF representation offers a quick way of getting the pixels for the region of the page occupied by the image, and could also be a way of finding nearby text.
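For illustration, a region of a page can be fetched by building a IIIF Image API URL of the form `{base}/{region}/{size}/{rotation}/{quality}.{format}`. A minimal sketch (the service base URL and the x/y/w/h values, which would come from the OCR annotation, are placeholders):

```javascript
// Sketch: build a IIIF Image API URL for the page region occupied by
// an illustration. serviceBase and the x/y/w/h values are assumptions;
// in practice they come from the manifest and the OCR annotation.
function iiifRegionUrl(serviceBase, x, y, w, h, maxWidth) {
  // IIIF Image API syntax: {base}/{region}/{size}/{rotation}/{quality}.{format}
  const region = [x, y, w, h].join(',');
  // '!w,h' asks the server for a best-fit size; 'full' returns the
  // region at its native resolution.
  const size = maxWidth ? '!' + maxWidth + ',' + maxWidth : 'full';
  return serviceBase + '/' + region + '/' + size + '/0/default.jpg';
}
```

Requesting the same region at different sizes is just a matter of changing the size parameter, which is useful when downstream processing wants thumbnails rather than full-resolution crops.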

Example:

Helpers:

https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L161

https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L172

(apologies for my 90s-style JavaScript).

A more sophisticated approach could learn which text is likely to be a caption, but these techniques still let you get the pixels of the image (at whatever scale a machine-learning task needs, via the IIIF parameters) together with the text lines that are candidate captions.
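One simple heuristic for picking candidate captions is to take text lines that sit just below the image and overlap it horizontally. A sketch, assuming bounding boxes of the shape `{x, y, w, h}` (an assumption about how your annotation dump is structured):

```javascript
// Sketch: given an image bounding box and OCR text lines (each with
// its own box), keep lines whose top edge sits within maxGap pixels
// below the image and which overlap it horizontally. The {x, y, w, h}
// box shape is an assumption, not the Wellcome annotation format.
function captionCandidates(imageBox, textLines, maxGap) {
  const imageBottom = imageBox.y + imageBox.h;
  return textLines.filter(function (line) {
    const gap = line.y - imageBottom;
    const overlapsHorizontally =
      line.x < imageBox.x + imageBox.w && line.x + line.w > imageBox.x;
    return gap >= 0 && gap <= maxGap && overlapsHorizontally;
  });
}
```

A learned model could replace this filter later, scoring lines by position, typography, and wording instead of a fixed pixel threshold.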
