@reynoldsm88
Created July 12, 2021 18:40
Agglomerative clustering in Spark
// sourced from http://users.eecs.northwestern.edu/~cji970/pub/cjinBigDataService2015.pdf
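// Read the sub-graph ids; numGraphs is assumed to be initialized earlier (not shown in this excerpt).
JavaRDD<String> subGraphIdRDD = sc.textFile(idFileLoc, numGraphs);
// Build a local MST per sub-graph and emit its edges keyed by an Integer id.
JavaPairRDD<Integer, Edge> subMSTs = subGraphIdRDD.flatMapToPair(new LocalMST(filesLoc, numSplits));
numGraphs = numSplits * numSplits / 2;
// Ceiling division by K: the number of partitions left after grouping K sub-graphs at a time.
numGraphs = (numGraphs + (K - 1)) / K;
JavaPairRDD<Integer, Iterable<Edge>> mstToBeMerged =
        subMSTs.combineByKey(new CreateCombiner(), new Merger(), new KruskalReducer(numPoints), numGraphs);
// Repeat the K-way merge until a single partition (the final MST) remains.
while (numGraphs > 1) {
    numGraphs = (numGraphs + (K - 1)) / K;
    mstToBeMerged = mstToBeMerged.mapToPair(new SetPartitionId(K))
            .reduceByKey(new KruskalReducer(numPoints), numGraphs);
}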
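
The snippet above does not show what KruskalReducer does. Going by its name, it presumably merges two partial MSTs by running Kruskal's algorithm over the union of their edges. The following is a minimal, Spark-free sketch of that idea, not the gist's or the paper's implementation. The class KruskalMergeSketch, its Edge type (fields u, v, weight) and the merge method are hypothetical names introduced only for illustration, and points are assumed to be indexed from 0 to numPoints - 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KruskalMergeSketch {

    // Hypothetical stand-in for the gist's Edge type (its real fields are not shown).
    public static final class Edge {
        public final int u;
        public final int v;
        public final double weight;

        public Edge(int u, int v, double weight) {
            this.u = u;
            this.v = v;
            this.weight = weight;
        }
    }

    // Union-find lookup with path halving.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Merge two edge lists into one spanning forest with Kruskal's algorithm.
    public static List<Edge> merge(Iterable<Edge> a, Iterable<Edge> b, int numPoints) {
        List<Edge> candidates = new ArrayList<>();
        a.forEach(candidates::add);
        b.forEach(candidates::add);
        // Kruskal considers edges in order of increasing weight.
        candidates.sort(Comparator.comparingDouble((Edge e) -> e.weight));

        int[] parent = new int[numPoints];
        for (int i = 0; i < numPoints; i++) {
            parent[i] = i;
        }

        List<Edge> result = new ArrayList<>();
        for (Edge e : candidates) {
            int rootU = find(parent, e.u);
            int rootV = find(parent, e.v);
            if (rootU != rootV) {
                // The edge joins two components, so it belongs to the merged MST.
                parent[rootU] = rootV;
                result.add(e);
            }
        }
        return result;
    }
}
```

In a Spark job, logic like this would sit inside a Function2<Iterable<Edge>, Iterable<Edge>, Iterable<Edge>> so that it can be passed to reduceByKey, as the snippet does with KruskalReducer.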