drjwbaker/pastec-tutorial.md

## pastec-tutorial.md

      
Getting Pastec up and running

Pastec is an open source index and search engine for image recognition. This is how I got it working with lots of help from the hard work of Ryan Baumann, Shawn Graham and Matthew Lincoln.
Installation

Either install Ubuntu 14.04.5 as your operating system, or get a virtual machine from osboxes and start it with VirtualBox. If you use a VM, make sure it is connected to the network (Settings > Network).
Install Pastec by following the documentation. Be sure to download and unzip visualWordsORB.dat into the build subdirectory of Pastec.
To make some of the later steps work, you'll also need to install GNU Parallel:
sudo apt-get install parallel
and Ruby:
sudo apt-get install ruby-full
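Before moving on, it can be worth confirming that the tools used later are actually on your PATH. This is only a minimal sketch and not part of the original workflow: the `missing_tools` function name is made up for illustration, and it assumes you are using bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every command that is NOT on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v exits non-zero when the command cannot be found.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The later steps use curl, parallel and ruby; no output means all three were found.
missing_tools curl parallel ruby
```

If a name is printed, install that package before carrying on.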
Preparing your data

If the filenames of your image files contain spaces or underscores, life is easier if you remove them. First make a copy of all the files. To rename them, cd to the directory in the Terminal and run something like this rename file script (exchange 's/ /-/g' for 's/_/-/g' to replace underscores rather than spaces with dashes).
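The linked rename script is not reproduced here. As a rough stand-in, the sketch below does the same kind of renaming in plain bash, replacing every space in a filename with a dash. The function name `dashify_names` is hypothetical, and it only touches files directly inside the directory you give it, so run it on your copy and once per directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace spaces with dashes in the names of files in one directory.
dashify_names() {
  local dir="$1" path name new
  for path in "$dir"/*; do
    [ -f "$path" ] || continue          # skip subdirectories and an unmatched glob
    name="${path##*/}"                  # filename without the directory part
    new="${name// /-}"                  # use ${name//_/-} to target underscores instead
    [ "$name" = "$new" ] && continue    # nothing to rename
    mv -n -- "$path" "$dir/$new"        # -n: never overwrite an existing file
  done
}

# Usage: dashify_names Pictures
```

Because `mv -n` refuses to overwrite, two files whose names collide after renaming are left alone rather than one replacing the other.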
Run Pastec over your data

This bit is adapted only slightly from Ryan's excellent post.
First run Pastec in a Terminal (make sure you cd to your pastec/build directory to do this):
./pastec visualWordsORB.dat
Leave that Terminal running and open a fresh Terminal.
To send your image files to Pastec, cd to the parent of the directory your images are in (or, if they are spread across multiple subdirectories, the parent of the directory that holds those subdirectories): in my case, this means running cd .. from inside Pictures. To work through all the images in 'Pictures' and all its subdirectories, run:
i=0; find Pictures -type f | while IFS= read -r image; do curl -X PUT --data-binary "@$image" "http://localhost:4212/index/images/$i"; i=$((i+1)); done
Next, to make an index that maps each id to its file (so, for example, foobar.jpg and every other file under Pictures is listed against its id 0, 1, 2, ... n), run:
i=0; find Pictures -type f | while IFS= read -r image; do echo "$i,$image"; i=$((i+1)); done > mapping.csv
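Because Pastec only works with numeric ids, mapping.csv is what lets you get back to a filename. As an illustration (the `image_for_id` helper is a made-up name, not something Pastec or Ryan's script provides), a lookup could be sketched like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path stored against a given id in mapping.csv.
# Usage: image_for_id ID [MAPPING_FILE]
image_for_id() {
  local id="$1" mapping="${2:-mapping.csv}"
  # Each line is "id,path"; strip the id and the first comma only,
  # so that paths which themselves contain commas survive intact.
  awk -F, -v id="$id" '$1 == id { sub(/^[^,]*,/, ""); print; exit }' "$mapping"
}
```

Running `image_for_id 42` in the directory that holds mapping.csv prints the path recorded for id 42, and prints nothing if that id is not in the file.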
Next to find matches, run:
find Pictures -type f -name '*.jpg' | parallel --bar -u -j 8 'curl -s -X POST --data-binary @{} http://localhost:4212/index/searcher' > pastec_matches.txt
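Note that the indexing and mapping commands above pick up every file under Pictures, while the search only sends files whose names end in .jpg. If that difference matters for your data, one way to see which files fall outside the search is sketched below. `non_jpg_files` is a hypothetical helper name, and the match on '*.jpg' is case-sensitive, just as in the search command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files under a directory whose names do not end in .jpg.
non_jpg_files() {
  find "$1" -type f ! -name '*.jpg'
}

# Usage: non_jpg_files Pictures
```

No output means every indexed file is also being searched.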
Finally, to filter the pastec_matches.txt output down to only the data representing matches between more than one image, get Ryan Baumann's mash_matches.rb script from his blog, save it to the same directory as pastec_matches.txt, and then run from the Terminal:
ruby mash_matches.rb pastec_matches.txt mapping.csv
You then have a JSON file that lists only the matches between similar images found by Pastec. I've put a test sample here, based on a sample of images from the British Library Microsoft Books collection, so you can see the utility of the output.

Some admin...


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Exceptions: embeds to and from external sources, and direct quotations from speakers