James Baker, University of Sussex
At Slideshare.
- Open your terminal.
- In the terminal you have a flashing cursor at a command prompt. Type
ssh -X ID@IP
where IP is the IP address of your server. Enter the password when prompted: the password for today is XXX. Note: I set these servers up in advance (including all the data you need for the demo!) and they'll stay up for a month after the event. For instructions on how to set one up yourself as a Virtual Machine (which is easier to do but slower to run) see my tutorial.
- Here we have over 200 small images from illustrated books in the British Library (dx.doi.org/10.21250/db17) from two years (1800 and 1805) plus a selection from a few other years. What we are going to do is ask Pastec to find similar images.
- First type
cd pastec/build
and press enter. This moves you from your default directory on the cloud server to the pastec/build directory. Here type
./pastec visualWordsORB.dat
and hit enter. This runs Pastec using the ORB feature detector (we will come back to what this does later). Pastec will print a few messages, one of them 'Reading the visual words file' (to which we shall return), and eventually say 'Ready to accept queries'. This means that Pastec is running. At this point, leave this shell window as it is and open another one.
- In this new shell, do
ssh -X james@IP
again and go through the login process.
- Next type
i=0; find pictures -type f | while read -r image; do curl -X PUT --data-binary "@$image" "http://localhost:4212/index/images/$i"; i=$((i+1)); done
and press enter. This looks for all the image files in the directory pictures (including its subdirectories) and puts them into Pastec one by one. You will see messages on the screen. Let it run until the end, which is the line
{"image_id":252,"nb_features_extracted":2000,"type":"IMAGE_ADDED"}
- Next do
i=0; find pictures -type f | while read -r image; do echo "$i,$image"; i=$((i+1)); done > mapping.csv
and hit enter. This is very similar to the previous command, but instead of putting the image files into Pastec it creates a spreadsheet (mapping.csv) that lists all the image files in the pictures directory and its subdirectories, each preceded by a unique number. As the script looks at each image in turn, it assigns an id (starting at 0) in the same order the images were entered into Pastec. Type
tail mapping.csv
to print the last 10 lines into your terminal. On each line you can see a number, a comma (which is the field separator, like the boundary between cells in Excel or similar) and the path of an image file. You can see here how, handily, the files are named by the book they are in and its date of publication (as well as some sort of ID number).
- Now we are ready to match images. Type
cd
and hit enter to return to your home directory, then type
find pictures -type f -name '*.jpg' | parallel --bar -u -j 8 'curl -s -X POST --data-binary @{} http://localhost:4212/index/searcher' > pastec_matches.txt
and hit enter. The shell will appear to hang (ignore any warnings that come up!). If you leave this open and switch to your previous shell you will see the action happening: lots of mentions of ids, some ranking and scoring, and mention of 'visual words'. What you've just done is run a search for the first image in the directory pictures (including its subdirectories) against all the other images in that directory, then for the second image against all other images, et cetera. At the end of the command you've asked the machine to save the output as the file pastec_matches.txt.
- When the process finishes (signalled by the window you ran the command in returning to the command prompt) do
head pastec_matches.txt
to see the first ten lines of the output. The second line should say something like:
{"bounding_rects":[{"height":122,"width":691,"x":98,"y":31},{"height":113,"width":690,"x":98,"y":31}],"image_ids":[0,97],"scores":[598.0,32.0],"tags":["",""],"type":"SEARCH_RESULTS"}
This line indicates that ids 0 and 97 are a match. Note that they match even though their heights and widths are slightly different (measured here in pixels).
- You'll have noticed, however, that this file also contains lines where there is only one id and so no matches. As a final act, run
ruby mash_matches.rb pastec_matches.txt mapping.csv
to make a new file, unique_matches.json, that contains only the matches. Finally, copy this file to your machine so you can work with it more easily. Achieve this by opening a new shell (on your own machine, not the server), typing
pwd
and hitting enter, copying the output (something like /Users/jb677) and doing
scp james@IP:unique_matches.json ADDRESS
where ADDRESS is the output you just copied and (as before) IP is the IP address of your server.
- Open this file in a text editor (like Notepad), Excel, Word, or whatever you can get it open and readable in. You will see places in the file where image paths are next to each other. There are 15 in total.
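If you want to poke at the raw output yourself before (or instead of) opening unique_matches.json, the ids in pastec_matches.txt can be tied back to file names with a couple of small shell helpers. What follows is a minimal sketch, not part of the workshop scripts and not a description of what mash_matches.rb does. The function names matched_lines and path_for_id are made up for illustration, and the sketch assumes the one-result-per-line JSON layout and the id,path layout of mapping.csv described in the steps above.

```shell
# Hypothetical helpers: the names are illustrative, not from Pastec.

# Print only the result lines whose "image_ids" array holds two or more ids
# (assumes each search result sits on a single line, as in the sample output).
matched_lines() {
  grep -E '"image_ids":\[[0-9]+,' "$1"
}

# Print the image path recorded for a given id in an "id,path" CSV file.
# Everything after the first comma is treated as the path, so a path that
# itself contains commas is still printed whole.
path_for_id() {
  awk -F, -v id="$1" '$1 == id { print substr($0, length($1) + 2); exit }' "$2"
}
```

Once these are pasted into your shell on the server, running matched_lines pastec_matches.txt lists the candidate matches, and path_for_id 97 mapping.csv prints the file that Pastec knows as id 97.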
Spend 10 minutes in groups of four (so pairs of pairs!) considering what you think is going on in the black box. To do this I'd like you to come up with answers to the following questions. Be prepared to briefly report back your answers:
- Thinking about the kinds of images humans consider to be similar, what similarities between images do you think Pastec can match?
- Thinking about the kinds of images humans consider to be similar, what similarities between images do you think Pastec is unable to match?
- Having observed the inputs and outputs of Pastec, how do you think it is matching images?
- Thinking about your own work, are there applications of Pastec you can think of? (this can be speculative based on what you think it can and can't do!)
- Pastec documentation: http://pastec.io/doc/oss/
- OpenCV
- Visual Words: https://en.wikipedia.org/wiki/Visual_Word
- ORB feature detector: http://scikit-image.org/docs/dev/auto_examples/plot_orb.html
Baumann, Ryan. ‘Finding Near-Matches in the Rijksmuseum with Pastec’. Ryan Baumann - /Etc, 3 November 2015. https://ryanfb.github.io/etc/2015/11/03/finding_near-matches_in_the_rijksmuseum_with_pastec.html.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Exceptions: embeds to and from external sources, and direct quotations from speakers