Assume you have an indexed Spark RDD
of N
images and you would like to sparsely select pairs, i.e. the number of selected
pairs is much smaller than all possible pairs N^2
. One way to do it, is to create a cartesian
of the RDD
with itself
and then filter the pairs by some criterion, e.g. limit the difference between pairwise indices:
selection = images
.cartesian( images )
.filter( /* some criterion */ )
;
However, this needs to create an intermediate RDD with N^2
elements and on the way their requires a lot of cross-talk between
the worker nodes. It turned out, that with big N
, this approach is slow and - more importantly - adding more workers even
slows down the process.
To avoid the use of cartesian
, I suggest a different approach, starting from an RDD
of indices mapping to images
{ i : image }
:
- Create a look-up table of pairwise indices on the master and broadcast it.
pairs[][] : pairs[i] = [ ..., k, ... ]
- Map the indexed image
RDD
such that each index maps to an image and a list of indices.{ i : image }
->{ i : ( image, [ ..., k, ... ] ) }
- FlatMap the result and use the secondary index as primary index.
{ i : ( image, [ ..., k, ... ] ) }
->{ k : ( i, image ) }
- Join the result with the original image
RDD
.{ k : ( i, image ) }
->{ k : ( ( i, image ), image ) }
This will give you the same result and it turned out that it does scale with the number of workers, i.e., contrary to thecartesian
approach, execution will get faster with more workers. In the case of generating (almost) all pairs, using thecartesian
might be the better approach.
In pseudo code
, this will look like:
List< List< Integer > > indexPairs = // generate index pairs
T broadcast = sc.broadCast( indexPairs )
selectedPairs = images
.map( /* append the appropriate list of indexes to each key, using broadcast as look-up */ )
.flatMapToPair( /* iterate over each element of [ ...,k... ] and return a list containing for all k: k -> ( i, image ) */ )
.join( images )
See https://github.com/saalfeldlab/java-spark-workshop/blob/master/src/main/java/org/janelia/workshop/spark/practices/PairwiseImageSelection.java for an example.
I want to thank (Jeremy Freeman)[https://github.com/freeman-lab] for the advice with regards to this. Without his help, I would not have been able to come up with this solution.