## 2017-01-20_Pastec-Edinburgh-handout.md

      
Pastec Workshop, University of Edinburgh, 8 February 2017


Instructor

James Baker, University of Sussex

Slides

At Slideshare.

Using Pastec


Open your terminal.
In the terminal you have a flashing cursor at a command prompt. Type ssh -X ID@IP where IP is the IP address of your server. Enter the password when prompted: the password for today is XXX. Note: I set these servers up in advance (including all the data you need for the demo!) and they will stay up for a month after the event. For instructions on how to set one up yourself as a Virtual Machine (which is easier to do but slower to run) see my tutorial.
Here we have over 200 small images from illustrated books in the British Library (dx.doi.org/10.21250/db17), taken from two years (1800 and 1805) plus a selection from a few other years. What we are going to do is ask Pastec to find similar images.
First type cd pastec/build and press enter. This moves you from your default directory on the cloud server to the pastec/build directory. Here type ./pastec visualWordsORB.dat and hit enter. This runs Pastec using the ORB feature detector (we will come back to what this does later). Pastec will print some messages, one of them Reading the visual words file (to which we shall return), and eventually say Ready to accept queries. This means that Pastec is running. At this point, leave this shell window as it is and open another one.
In this new shell, do ssh -X james@IP again and go through the login process.
Next type i=0; find pictures -type f | while read image; do curl -X PUT --data-binary @$image http://localhost:4212/index/images/$i; i=$((i+1)); done and press enter. What this does is look for all the image files in the directory pictures (including its subdirectories) and one by one put those images into Pastec. You will see messages on the screen. Let it run until the end (at the line {"image_id":252,"nb_features_extracted":2000,"type":"IMAGE_ADDED"}).
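The one-line loop above is dense, so here is a minimal sketch of the same loop spread over several lines and run as a dry run. The demo folder, the file names one.jpg and two.jpg, and the output file commands.txt are made up for illustration, and the sort step is an addition so that the order is predictable (if you sort here, use the same order when you build mapping.csv). The echo in front of curl means nothing is sent to Pastec.

```shell
#!/bin/sh
# Hypothetical demo folder so the sketch runs without the workshop data.
demo=$(mktemp -d)
mkdir -p "$demo/pictures"
touch "$demo/pictures/one.jpg" "$demo/pictures/two.jpg"
cd "$demo"

i=0
# Walk every file under pictures, one path per line.
# sort is not in the handout command; it is added here for a predictable order.
find pictures -type f | sort | while IFS= read -r image; do
  # Dry run: echo prints the command. Delete "echo" to send the image to Pastec.
  echo curl -X PUT --data-binary "@$image" "http://localhost:4212/index/images/$i"
  # The next image gets the next id.
  i=$((i+1))
done > commands.txt

# Show the commands that would have been run.
cat commands.txt
```

Each printed line is one upload: the image path after the @ sign is the file to send, and the number at the end of the URL is the id Pastec will store it under.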
Next do i=0; find pictures -type f | while read image; do echo "$i,$image"; i=$((i+1)); done > mapping.csv and hit enter. This is very similar to the command above, but instead of putting the image files into Pastec you have created a spreadsheet (mapping.csv) that lists every image file in the pictures directory and its subdirectories, each preceded by a unique number. As the script looks at each image in turn, it assigns an id (starting at 0) in the same order the images were entered into Pastec. Type tail mapping.csv to print the last 10 lines to your terminal. On each line you can see a number, a comma (which is the field separator, like the boundary between cells in Excel or similar) and the path of an image file. You can see here how, handily, the files are named by the book they are in and its date of publication (as well as some sort of ID number).
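The ids in mapping.csv only line up with the ids in Pastec if find lists the files in the same order both times. One way to make that explicit, sketched here under the assumption that no file path contains a line break, is to write the file list once and number it afterwards. The demo folder, the file names and filelist.txt are hypothetical stand-ins for the workshop data.

```shell
#!/bin/sh
# Hypothetical demo: a throwaway folder stands in for the workshop's pictures directory.
demo=$(mktemp -d)
mkdir -p "$demo/pictures/1800" "$demo/pictures/1805"
touch "$demo/pictures/1800/a.jpg" "$demo/pictures/1800/b.jpg" "$demo/pictures/1805/c.jpg"
cd "$demo"

# List the files once, in sorted order, so every later step sees the same sequence.
find pictures -type f | sort > filelist.txt

# Number the lines from 0 and write id,path pairs, as the handout's loop does.
awk '{ print (NR - 1) "," $0 }' filelist.txt > mapping.csv

cat mapping.csv
```

With the three demo files this prints 0,pictures/1800/a.jpg, then 1,pictures/1800/b.jpg, then 2,pictures/1805/c.jpg. The same filelist.txt could also feed the indexing loop, so that both steps share one order.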
Now we are ready to match images. Type cd and hit enter to return to your base directory, then type find pictures -type f -name '*.jpg' | parallel --bar -u -j 8 'curl -s -X POST --data-binary @{} http://localhost:4212/index/searcher' > pastec_matches.txt and hit enter. The shell will appear to hang (ignore any warnings that come up!). If you leave this open and switch to your previous shell you will see the action happening. You will see lots of mentions of ids, some ranking and scoring going on, and mention of 'visual words'. What you've just done is run a search for the first image in the directory pictures (including its subdirectories) against all the other images in the directory pictures (including its subdirectories), then for the second image against all other images, et cetera. At the end of the command you've asked the machine to save the output as the file pastec_matches.txt.
When the process finishes (signalled by the window you ran the command in returning to the command prompt) do head pastec_matches.txt to see the first ten lines of the output. The second line should say something like: {"bounding_rects":[{"height":122,"width":691,"x":98,"y":31},{"height":113,"width":690,"x":98,"y":31}],"image_ids":[0,97],"scores":[598.0,32.0],"tags":["",""],"type":"SEARCH_RESULTS"}. This line indicates that ids 0 and 97 are a match. Note that they match even though their heights and widths (measured here in pixels) are slightly different.
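To read a result line without counting ids by eye, here is a minimal sketch that uses only sed, tr and grep. It assumes the compact one-line format shown above; a dedicated JSON tool would be more robust. The variable names line, ids and count are invented for this example.

```shell
#!/bin/sh
# Sample result line copied from the handout.
line='{"bounding_rects":[{"height":122,"width":691,"x":98,"y":31},{"height":113,"width":690,"x":98,"y":31}],"image_ids":[0,97],"scores":[598.0,32.0],"tags":["",""],"type":"SEARCH_RESULTS"}'

# Pull out the contents of the image_ids array (everything between its brackets).
ids=$(printf '%s\n' "$line" | sed -n 's/.*"image_ids":\[\([^]]*\)].*/\1/p')

# Count the ids by turning commas into line breaks and counting non-empty lines.
count=$(printf '%s\n' "$ids" | tr ',' '\n' | grep -c .)

echo "ids: $ids"
echo "count: $count"

# More than one id means the search returned at least two images.
if [ "$count" -gt 1 ]; then
  echo "match"
else
  echo "no match"
fi
```

For the sample line this reports the ids 0,97, a count of 2, and match. The first id is presumably the query image itself, which would explain why its score (598.0) is so much higher than the second (32.0).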
You'll have noticed, however, that this file also contains lines where there is only one id and so no matches. As a final act, run ruby mash_matches.rb pastec_matches.txt mapping.csv to make a new file, unique_matches.json, that contains only the matches. Finally, move this file to your own machine so you can work with it more easily. Achieve this by opening a new shell on your own machine (not logged in to the server), typing pwd and hitting enter, copying the output (something like /Users/jb677) and doing scp james@IP:unique_matches.json ADDRESS where ADDRESS is the output you just copied and (as before) IP is the IP address of your server.
Open this in a text editor (like Notepad), Excel, Word, or whatever you can get it open and readable in. You will see places in the file where image paths are next to each other. There are 15 in total.
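If you want to see how ids in pastec_matches.txt relate to file paths in mapping.csv, the sketch below does a rough version of that lookup with sed and awk. It is not a reproduction of mash_matches.rb, whose internals are not shown in this handout. The sample paths, the trimmed-down result lines and the output file matched_paths.txt are all invented for illustration, and the sketch assumes that no file path contains a comma.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the workshop files.
work=$(mktemp -d)
cd "$work"

# id,path pairs in the format of mapping.csv (paths invented for this sketch).
cat > mapping.csv <<'EOF'
0,pictures/1800/book_a_001.jpg
1,pictures/1800/book_a_002.jpg
2,pictures/1805/book_b_001.jpg
EOF

# Trimmed-down result lines: only the image_ids field matters here.
cat > pastec_matches.txt <<'EOF'
{"image_ids":[0,2],"type":"SEARCH_RESULTS"}
{"image_ids":[1],"type":"SEARCH_RESULTS"}
{"image_ids":[2,0],"type":"SEARCH_RESULTS"}
EOF

# Keep only id lists that contain a comma (two or more ids),
# then swap each id for its path from mapping.csv.
sed -n 's/.*"image_ids":\[\([^]]*,[^]]*\)].*/\1/p' pastec_matches.txt |
awk -F, '
  NR == FNR { path[$1] = $2; next }   # first input: load mapping.csv
  {                                   # second input: one id list per line
    out = path[$1]
    for (i = 2; i <= NF; i++) out = out " <-> " path[$i]
    print out
  }
' mapping.csv - > matched_paths.txt

cat matched_paths.txt
```

With the sample data the line with a single id is dropped and two lines remain, one for each direction of the same pair (book_a_001 with book_b_001, then the reverse). Removing such duplicates is a further step that this sketch leaves out.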


Discussion

Spend 10 minutes in groups of four (so pairs of pairs!) considering what you think is going on in the black box. To do this I'd like you to come up with answers to the following questions. Be prepared to briefly report back your answers:

Thinking about the kinds of images humans consider to be similar, what similarities between images do you think Pastec can match?
Thinking about the kinds of images humans consider to be similar, what similarities between images do you think Pastec is unable to match?
Having observed the inputs and outputs of Pastec, how do you think it is matching images?
Thinking about your own work, are there applications of Pastec you can think of? (this can be speculative based on what you think it can and can't do!)


Revealing the black box!


Pastec documentation: http://pastec.io/doc/oss/
OpenCV
Visual Words: https://en.wikipedia.org/wiki/Visual_Word
ORB feature detector: http://scikit-image.org/docs/dev/auto_examples/plot_orb.html


References

Baumann, Ryan. ‘Finding Near-Matches in the Rijksmuseum with Pastec’. Ryan Baumann - /Etc, 3 November 2015. https://ryanfb.github.io/etc/2015/11/03/finding_near-matches_in_the_rijksmuseum_with_pastec.html.

Some admin...


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Exceptions: embeds to and from external sources, and direct quotations from speakers