Skip to content

Instantly share code, notes, and snippets.

@hanslovsky
Created April 2, 2015 21:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save hanslovsky/5c69a626e968ec6d2a40 to your computer and use it in GitHub Desktop.
Save hanslovsky/5c69a626e968ec6d2a40 to your computer and use it in GitHub Desktop.
Good practices for selecting pairs of images by index in Spark

Assume you have an indexed Spark RDD of N images and you would like to sparsely select pairs, i.e. the number of selected pairs is much smaller than all possible pairs N^2. One way to do it, is to create a cartesian of the RDD with itself and then filter the pairs by some criterion, e.g. limit the difference between pairwise indices:

selection = images
    .cartesian( images )
    .filter( /* some criterion */ )
    ;

However, this needs to create an intermediate RDD with N^2 elements and on the way their requires a lot of cross-talk between the worker nodes. It turned out, that with big N, this approach is slow and - more importantly - adding more workers even slows down the process.

To avoid the use of cartesian, I suggest a different approach, starting from an RDD of indices mapping to images { i : image }:

  • Create a look-up table of pairwise indices on the master and broadcast it. pairs[][] : pairs[i] = [ ..., k, ... ]
  • Map the indexed image RDD such that each index maps to an image and a list of indices. { i : image } -> { i : ( image, [ ..., k, ... ] ) }
  • FlatMap the result and use the secondary index as primary index. { i : ( image, [ ..., k, ... ] ) } -> { k : ( i, image ) }
  • Join the result with the original image RDD. { k : ( i, image ) } -> { k : ( ( i, image ), image ) } This will give you the same result and it turned out that it does scale with the number of workers, i.e., contrary to the cartesian approach, execution will get faster with more workers. In the case of generating (almost) all pairs, using the cartesian might be the better approach.

In pseudo code, this will look like:

List< List< Integer > > indexPairs = // generate index pairs 
T broadcast = sc.broadCast( indexPairs )
selectedPairs = images
    .map( /* append the appropriate list of indexes to each key, using broadcast as look-up */ )
    .flatMapToPair( /* iterate over each element of [ ...,k... ] and return a list containing for all k: k -> ( i, image ) */ )
    .join( images )

See https://github.com/saalfeldlab/java-spark-workshop/blob/master/src/main/java/org/janelia/workshop/spark/practices/PairwiseImageSelection.java for an example.

I want to thank (Jeremy Freeman)[https://github.com/freeman-lab] for the advice with regards to this. Without his help, I would not have been able to come up with this solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment