@drjwbaker
Last active August 29, 2015 14:11
###Curious Images: Image Processing, Pattern Recognition, Artistic Use and a celebration of the British Library 1 Million images collection, British Library, 18 December 2014

Live notes, so an incomplete, partial record of what actually happened.

Tags: bldigital

Programme: https://www.eventbrite.co.uk/e/curious-images-tickets-14438270255

My asides in []


###The BL 1 Million Images

Ben O'Steen, Introduction and the Story of the Mechanical Curator

Trying to do things that are small, quick, inspirational BUT deliberately not finished, imperfect.

Massive disparity between what we have digitised and our physical holdings.

Proxy questions. Faces in Microsoft Books collections. Images from them striking...

Having a URL for each image is crucial.

The stats: 20 million hits a month. 150k tags. 25k tagged maps on Wikipedia.

Book metadata and community tags now on figshare http://dx.doi.org/10.6084/m9.figshare.1269249

Harvey (Cardiff), Lost Visions

AHRC funded

http://lostvisions.weebly.com/

http://lost-visions.cf.ac.uk/

One group looking at women in trousers; another Indian Mutiny; a third Shakespeare.

Creating metadata linkages to external sources: for example, illustration and the original drawing.

Storing 25TB of data is expensive, processing it is expensive (and hard!)

The only thing we knew about the image is the page it is from, but even then the page doesn't correlate to the printed page number (cover is always page 1)

Crowdsourcing but also reusing Flickr tags.

But also machine learning - the only proper way of churning through 1 million images.

Feedback loop between what computers are good at and what people are good at - get humans to train the machine, get the computer to guess, get the people to check the work of the machine.
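A minimal sketch of one round of that loop. The names `review_round`, `guess` and `check`, and the confidence threshold, are hypothetical and not from the talk. It also assumes that only low-confidence guesses go to a person, which is one possible reading of "get the people to check the work of the machine".

```python
def review_round(items, guess, check, threshold=0.8):
    """Run one round of the human/machine feedback loop.

    guess(item) -> (label, confidence) is the machine's attempt.
    check(item, label) -> label is a person confirming or correcting it.
    Returns the confidently accepted labels plus the human-checked
    examples, which can be fed back in as new training data.
    """
    accepted = {}
    new_training = []
    for item in items:
        label, confidence = guess(item)
        if confidence >= threshold:
            # The machine is confident enough: keep its guess.
            accepted[item] = label
        else:
            # Otherwise a person checks the guess and the result is kept
            # as a training example for the next round.
            new_training.append((item, check(item, label)))
    return accepted, new_training
```

Each round should leave fewer images for people to look at, provided the checked examples really are used to retrain the machine.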

#bldigital @lost_visions on Machine Learning: The Bad pic.twitter.com/a131ec4vUu

— James Baker (@j_w_baker) December 18, 2014

Peter Balman, The Value of Public Domain

ViS

How people have used BL public domain material.

Responded to a challenge set by the Technology Strategy Board.

1 million image release. Stories of use fantastic, but maybe these stories are the tip of the iceberg? Generating info...

  • Things like TinEye find both matches and where images have been changed... Only a modest index, but gets us a URL and a date
  • How do we locate the user? Things like whois.net get us an address.
  • dmoz gives us human edited categorisation of the website the image is on.
  • NLP to pull out the concepts that the website the image is on pertains to; positive/negative et al.
  • dbpedia for more info about the URL https://dbpedia-spotlight.github.io/demo/
  • DBpediaSpotlight can find entities.
  • Page HTML often contains social information that can be engineered into tags.

This might take a while if you have lots of images to process...

Aim to trial in January.

There is a video explaining Peter Balman's Visibility project here: http://t.co/HtcLWmBII9 #bldigital

— Stella Wisdom (@miss_wisdom) December 18, 2014

###Making Art and Image Processing the 1 Million

David Normal, Crazyology and the Mechanical Curator

Searched. Printed. Painted.

Centre-piece of the Burning Man Festival.

Amazing to hear from artist David Normal from @burningman about his reuse of @britishlibrary images #BLdigital pic.twitter.com/hQeNWOJMII

— mariekeguy (@mariekeguy) December 18, 2014

Fascinating to hear @davidnormal talking about his artworks http://t.co/MZmuKbKPIM #BLdigital

— Stella Wisdom (@miss_wisdom) December 18, 2014

#bldigital The crazy world of David Normal pic.twitter.com/WlxLu2TQhR

— James Baker (@j_w_baker) December 18, 2014

Vogue idea of beauty seen by a computer can translate onto the beauty of a cookie. Recognising icons of faces in things that don't have facial form.

Photophilia - an embedding of icons within icons within icons.

#BLdigital if you are interested in the faces in unexpected places phenomenon check out @FacesPics

— Stella Wisdom (@miss_wisdom) December 18, 2014

Bosch's use of collage - a composite that is somehow a portrait of the soul. Inspiration also from Max Ernst.

Mechanical Creator rather than Mechanical Curator.

Process of taking an image, identifying patterns, replacing elements with things with similar patterns.

Love the postcards at #BLdigital event created by http://t.co/oRc5e9Am31 - such great reuse of public domain content pic.twitter.com/heYXEPs4L5

— mariekeguy (@mariekeguy) December 18, 2014

Q&A

How do people build on what you've done?

Mario Klingemann - The Joy of Order: Using computational techniques to find images and to make art

@quasimondo - Code Artist

https://github.com/quasimondo

Joy of Finding is the starting point.

#BLdigital Hearing from artist Mario Klingeman who is an 'Obsessive Compulsive Orderer' - a lot in common with us librarians!!

— mariekeguy (@mariekeguy) December 18, 2014

But it is hard to find genuinely new things. But then #bldigital came along and the 1 million images...

At the beginning of https://www.flickr.com/photos/britishlibrary you'd find things with only one view. That being you.

But there is no order to the collection. I thought, how can I put order to this?

Condensing each single image into 127 numbers. Then look across the numbers to find similar things. Built tools to wade through the collection of numbers.
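A minimal sketch of the "look across the numbers" step, assuming each image has already been reduced to a fixed-length list of numbers. How the 127 numbers were computed was not described, so the short toy vectors, the image ids and the choice of Euclidean distance are all illustrative assumptions.

```python
import math


def nearest(query_id, features, k=3):
    """Return the ids of the k images whose numbers are closest to the query's.

    features maps an image id to its list of numbers (127 per image in the
    talk; any fixed length works here).
    """
    query = features[query_id]
    distances = [
        (math.dist(query, vector), image_id)
        for image_id, vector in features.items()
        if image_id != query_id
    ]
    # Smallest distance first, so the most similar images come first.
    return [image_id for _, image_id in sorted(distances)[:k]]
```

Comparing every image against every other gets slow at 1 million images, so a real tool would probably need some kind of index. That part is not sketched here.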

Just using folders in Windows Explorer. Manually sorting into training folders.

Using R to train a classifier using hand sorted folders. Random Forests technique. Then feed it unclassified images and results pretty good.
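A sketch of the step before training: turning hand-sorted folders into labelled examples, with the folder name as the label. The talk used R and Random Forests; this is Python, the function name `labelled_examples` is hypothetical, and the classifier itself is not shown.

```python
from pathlib import Path


def labelled_examples(training_root):
    """Collect (image_path, label) pairs from hand-sorted training folders.

    Assumes a layout like training_root/maps/..., training_root/portraits/...
    where each sub-folder name is the label for the files inside it.
    """
    examples = []
    for folder in sorted(Path(training_root).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files next to the label folders
        for image_path in sorted(folder.iterdir()):
            if image_path.is_file():
                examples.append((image_path, folder.name))
    return examples
```

Because the label comes straight from the folder, a file dropped into the wrong folder becomes a wrong training example.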

But if you put the wrong things in the training set it creates problems later.

t-SNE Clustering. Usually used in data viz. Clusters similar images. Visualises the similar images. Nearest neighbours connected.

Image size already an interesting factor in what the image contains.

Tremendous demo from @quasimundo on e.g. clustering of @internetarchive book subjects #bldigital

— Giles Bergel (@GilesBergel) December 18, 2014

Now watching @quasimondo clustering type-ornaments. Recalls classic work of analytical bibliographers #bldigital

— Giles Bergel (@GilesBergel) December 18, 2014

Found 20k maps using this technique. But whenever I find a new topic I like I follow the thread - crystal, geometric forms, skulls, images that tell a story.

Recurring themes of the open door.

Cleaning the images, removing the text, and presenting them. We can reappreciate the objects just for them being there and for them being beautiful. They once took time.

Another #bldigital insight: @quasimondo repeatedly asserts the value of beauty. These images are beautiful, and deserve to be appreciated.

— Roberta Wedge (@RobertaWedge) December 18, 2014

Written in Adobe AIR... But the algorithms can be imported to other systems. Some of the image detection not open source. Possibly quite innovative.

github.com/quasimondo


###Using Computer Vision to Aid the Discovery of Images

Chung (Oxford), Re-presentations of Art Collections

Clustering exact copies - as woodblocks often very similar copies.

More exact the copy, more likely to be from the same printers.

Then look at how a particular block changes over time; can use this to date ballads that are undated.

http://balladsblog.bodleian.ox.ac.uk/blog/1069

There is a video of @BodleianBallads ImageMatch tool at http://t.co/Gm7hNH5KxX #BLdigital

— Stella Wisdom (@miss_wisdom) December 18, 2014

Paper on Oxford Visual Geometry’s work Re-presenting @BodleianBallads - http://t.co/1x79c8c7Ox 2/2” #bldigital

— Giles Bergel (@GilesBergel) December 18, 2014

Crowley (Oxford), In Search of Art

http://www.robots.ox.ac.uk/~vgg/

Search Google for a theme, learn features, apply classifier, apply to collection, return set.

Used in YourPaintings.

Elliot Crowley shows cute puppies to demo image matching. Serious paper at http://t.co/v8IkHIWUjy #BLdigital

— Stella Wisdom (@miss_wisdom) December 18, 2014

Will be publicly available in the new year.

Happiness is photographically represented by people jumping on beaches. Research from Elliot Crowley at #BLdigital. (I paraphrase.)

— Roberta Wedge (@RobertaWedge) December 18, 2014

###Analysing Digitised Handwritten Manuscripts

Vidal, Transcriptorium Project

EU project.

www.transcriptorium.eu

Recognising handwriting. Using Bentham collection.

Still need transcription, but working to reduce effort and to measure the effort that is reduced.

Stats make the level of correct auto-transcription seem worse than it is. The reason is that errors tend to cluster on particular pages or sections of the manuscripts, or at the beginnings of lines (because the language model has nothing to predict from).
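One hedged way to see whether errors cluster at line starts is to count mismatches by word position within each line. This is not the project's own metric: it assumes the reference and the automatic output are already split into the same lines, and it compares words position by position rather than doing a proper edit-distance alignment. The function name is hypothetical.

```python
from collections import Counter


def errors_by_position(reference_lines, output_lines):
    """Return the share of wrong words at each word position within a line.

    Position 0 is the first word of a line. A missing output word counts
    as an error.
    """
    errors, totals = Counter(), Counter()
    for reference, output in zip(reference_lines, output_lines):
        reference_words, output_words = reference.split(), output.split()
        for position, word in enumerate(reference_words):
            totals[position] += 1
            if position >= len(output_words) or output_words[position] != word:
                errors[position] += 1
    return {position: errors[position] / totals[position] for position in sorted(totals)}
```

A high rate at position 0 and low rates elsewhere would match the pattern described: a single overall error rate hides where the errors actually are.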

Most errors in @transcriptorium output at start of lines: the system needs linguistic context of prior words. Same with humans? #bldigital

— Giles Bergel (@GilesBergel) December 18, 2014

Schmidt (Queensland), Text and Image Linking Tool (TILT)

No machine learning.

  • Why bother?
  • How it works
  • The Future

On why, libraries full of documents. But we need to bring this to a modern audience who can search and prod them - not just stare at them. Text often very hard to read - gaps, stains, vertical text.

How? What level to link at - word, line, page? Word level is common but it is tedious and complex (especially as markup). Hitherto a manual process. TILT is automated and doesn't care what the content is. Needs a transcription already; the idea is just to link image and text. Doesn't use embedded markup - uses a GeoJSON overlay of polygons. Normalise the text (black and white), find lines, find shapes. Assigns text to image by just trying to minimise leftover stuff.
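A sketch of what one entry in a GeoJSON overlay of polygons could look like: a polygon around a word on the page image, pointing at a span of the transcription. The property names `offset` and `length` and the use of pixel coordinates are assumptions for illustration, not TILT's actual schema.

```python
import json


def word_region(points, offset, length):
    """Build a GeoJSON Feature linking an image region to a span of text.

    points: (x, y) corners of the shape on the page image, in pixels.
    offset, length: where the matching word sits in the transcription.
    """
    ring = [list(point) for point in points]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # GeoJSON polygon rings must be closed
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"offset": offset, "length": length},
    }


feature = word_region([(10, 20), (60, 20), (60, 40), (10, 40)], offset=0, length=5)
print(json.dumps(feature, indent=2))
```

Keeping the regions in a separate overlay like this leaves the transcription itself free of markup.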

#BLdigital Hearing from Desmond Scmidt about Queensland uni text to image linking (tilt) - big challenges e.g. Implementing Sexy polygons!

— mariekeguy (@mariekeguy) December 18, 2014

Demo: http://www.ecdosis.net/tilt/test/post


###Dealing with Images in the Cultural Heritage Sector

Conrad Taylor, Processing Cultural Heritage Images for Reuse in Print

Invention of half-toning in 1880s - simulates continuous tone with dots.

Prior to that, complex engraving and etching techniques - skill and artistry.
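A minimal sketch of the dots idea, using ordered dithering with a 4x4 Bayer matrix. This is a digital relative of print half-toning rather than the photographic screen process itself (print screens vary dot size, this only switches fixed cells on or off), and it was not shown in the talk.

```python
# 4x4 Bayer matrix: the order in which cells of a tile turn to ink as the tone darkens.
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def halftone(darkness):
    """Turn a grid of darkness values (0.0 white to 1.0 black) into ink dots.

    Returns a grid of booleans, True where a dot of ink is placed.
    """
    dots = []
    for y, row in enumerate(darkness):
        dots_row = []
        for x, value in enumerate(row):
            # Each cell in the repeating tile has its own threshold.
            threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) / 16
            dots_row.append(value > threshold)
        dots.append(dots_row)
    return dots


def render(dots):
    """Draw the dots as text: '#' for ink, '.' for bare paper."""
    return "\n".join("".join("#" if dot else "." for dot in row) for row in dots)


print(render(halftone([[0.5] * 8 for _ in range(4)])))
```

A flat mid-grey comes out with half the cells inked, so pure black and white still read as grey from a distance.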

Conrad Taylor intriguing on complexities of digital imaging and reprinting halftone printed images #bldigital

— Giles Bergel (@GilesBergel) December 18, 2014

Scanning an analogue mark creates many problems [not least because we don't optimise for images] - getting a perfect representation of it is, of course, impossible.

Tim Weyrich, Problem-Aware Digitisation of Cultural Heritage Artefacts

Pet peeve around digital surrogates: arbitrary trust in engineers representing object for various future uses. Mismatch between how job is approached and what people expect.

Few projects design themselves around (humanities) questions and problems. This is where bespoke CS acquisition design comes in.

Two case studies:

  • Reassembling the Theran Wall Paintings
    • settlement destroyed by volcano, preserved by ash, wall paintings destroyed, world's biggest jigsaw puzzle.
    • (initial sense of conflict between solving complex real world problems and writing papers...)
    • strategies for reassembling not unlike how hobbyist jigsaw puzzle solvers proceed. When doing it by hand it must feel like a match - how to replicate this computationally? Use size, width, erosion, fading, pigments, all sorts...
    • knowing what you are expecting to scan can drive how it is scanned to fit the purpose of what is going to be required.
    • then 6 years of research on how to find matches... In short: able to find about as many as conservators found, with humans and machines better at different matches.

#bldigital Humans and machines find both similar and different things. pic.twitter.com/2BoNtNCb5x

— James Baker (@j_w_baker) December 18, 2014
  • Great Parchment Book
    • Domesday type book that describes property relation in Derry (1639)
    • Damaged by fire in 1786 - parchment does not burn but shrivels, and it became delicate. Want the text.
    • So much complex (fun!) stuff just wasn't required, so went for the simple solution.
    • Photograph it from many angles, create a 3d object... But flattening techniques don't help. Better to have a virtual camera travelling over the deformed surface of the parchment (like Google Maps does the globe), locally flattening the parchment so it can be read.
    • Minimise processing, stay as close to the original as possible.
    • But of course many applications do need flattening... So with complex tagging gradually flatten the image (removing shading along the way)

Designing a system for digitisation is surprisingly complex. Off-the-shelf does not solve all problems. Optimisation usually needed. This is dependent both on the objects AND on the researchers and their questions. So important to develop modular solutions.

#bldigital Tim Weyrich on the technology used for the Great Parchment Project See @SucceedEU http://t.co/Jj2qKP7GUZ pic.twitter.com/j90vuS3aaV

— Rossitza Atanassova (@RossiAtanassova) December 18, 2014

#bldigital Tim Weyrich's recommendations for problem-aware digitisation of Culture Heritage @SucceedEU pic.twitter.com/fJFdsJ8zb4

— Rossitza Atanassova (@RossiAtanassova) December 18, 2014

###A perspective of Image Analysis from Health and Science

Han (MMU), Large-scale data processing and analysis on images

Climate data from NASA, Facebook, et cetera - data tsunami!

Applications to life sciences - pattern recognition and annotation of, for example, gene expression patterns.

Manual annotation costly and time consuming.

Recognising patterns that machines do not (such as sliced mouse embryo...)

Parallelised across high performance cluster (Issues with migration to the cloud).

Detecting eye disease, crop disease ... real world problems!

Rose-Sandler (Missouri Botanical Garden), Art of Life Project

Centre for biodiversity informatics.

Much content served through the Internet Archive, DPLA, Europeana.

45 million pages of text

http://biodiversitylibrary.org/

Building algorithms to find images, volunteers classifying images, then push to description platforms for metadata, then bring it back, share more widely, et cetera (ideal workflow)

@j_w_baker @MLBrook BHL is a staple for people in my field :) excellent project!

— Ross Mounce (@rmounce) December 18, 2014

Macaw classifying tool. Volunteers could put them into broad groupings - including false positives. Necessary to get a sense of whether it is worth further description (eg, unlikely to want to describe book plates)

Been using Flickr since 2011, and more coming soon on Zooniverse (through AHRC project). Also on Commons.

Yes...Monsters Are Real from BHL http://t.co/LIaeo3xclr #bldigital

— Theo Kuechel (@TheoKL) December 18, 2014

Borrowed NLF DigitalKoot game idea for fun correcting of text. Working with MetadataGames (Tiltfactor).


###Closing discussion

Lots of different ways of approaching the same problem (such as using Google as the training set for historic stuff)

Being available always better than not. But interest in Conrad's point that the quality often doesn't do justice to the medium. Can we enhance these images by incorporating knowledge about the medium?

#bldigital presentation on @BL_Labs & reuse of @britishlibrary 1 million images on Flickr for the European Commission http://t.co/5tU0w0K072

— Rossitza Atanassova (@RossiAtanassova) December 18, 2014

###Adam Farquhar, Closing Comments

Announcements:

  • data.bl.uk coming!
  • annual awards coming!

Some admin...

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Exceptions: embeds to and from external sources, and direct quotations from speakers

@whelks-chance

This is great. I'd like to add that the Lost Visions project is run at Cardiff university, and is an AHRC funded project.

Also - @Lost_Visions https://twitter.com/Lost_Visions

Cheers,
Ian Harvey

@drjwbaker

As the caveat at the top says, notes are always incomplete! But sure, well aware and updated as such.
