Skip to content

Instantly share code, notes, and snippets.

@ankurdave
Created April 26, 2016 06:09
Show Gist options
  • Save ankurdave/c80a0a9eec95c1af577b3aae19a0aaa8 to your computer and use it in GitHub Desktop.
Save ankurdave/c80a0a9eec95c1af577b3aae19a0aaa8 to your computer and use it in GitHub Desktop.
Work around the 2 billion element limit for Spark's groupByKey using nested groupBys
val rdd = sc.parallelize((0 until 1000).map(x => (1, x)) ++ List((2,1), (2,2)))
// rdd: org.apache.spark.rdd.RDD[(Int, Int)]
rdd.collect
// res1: Array[(Int, Int)] = Array((1,0), (1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7), (1,8), (1,9), (1,10), (1,11), (1,12), (1,13), (1,14), (1,15), (1,16), (1,17), (1,18), (1,19), (1,20), (1,21), (1,22), (1,23), (1,24), (1,25), (1,26), (1,27), (1,28), (1,29), (1,30), (1,31), (1,32), (1,33), (1,34), (1,35), (1,36), (1,37), (1,38), (1,39), (1,40), (1,41), (1,42), (1,43), (1,44), (1,45), (1,46), (1,47), (1,48), (1,49), (1,50), (1,51), (1,52), (1,53), (1,54), (1,55), (1,56), (1,57), (1,58), (1,59), (1,60), (1,61), (1,62), (1,63), (1,64), (1,65), (1,66), (1,67), (1,68), (1,69), (1,70), (1,71), (1,72), (1,73), (1,74), (1,75), (1,76), (1,77), (1,78), (1,79), (1,80), (1,81), (1,82), (1,83), (1,84), (1,85), (1,86), (1,87), (1,88), (1,89), (1,90), (1,91), (1,92), (1,93), (1,94), (1,95), (1,96),...
val nestedGroups = rdd.groupBy(kv => (kv._1, kv._2 % 10)).groupBy(_._1._1).map(_._2.map(_._2))
// nestedGroups: org.apache.spark.rdd.RDD[Iterable[Iterable[(Int, Int)]]]
nestedGroups.collect
// res6: Array[Iterable[Iterable[(Int, Int)]]] = Array(List(CompactBuffer((1,6), (1,16), (1,26), (1,36), (1,46), (1,56), (1,66), (1,76), (1,86), (1,96), (1,106), (1,116), (1,126), (1,136), (1,146), (1,156), (1,166), (1,176), (1,186), (1,196), (1,206), (1,216), (1,226), (1,236), (1,246), (1,256), (1,266), (1,276), (1,286), (1,296), (1,306), (1,316), (1,326), (1,336), (1,346), (1,356), (1,366), (1,376), (1,386), (1,396), (1,406), (1,416), (1,426), (1,436), (1,446), (1,456), (1,466), (1,476), (1,486), (1,496), (1,506), (1,516), (1,526), (1,536), (1,546), (1,556), (1,566), (1,576), (1,586), (1,596), (1,606), (1,616), (1,626), (1,636), (1,646), (1,656), (1,666), (1,676), (1,686), (1,696), (1,706), (1,716), (1,726), (1,736), (1,746), (1,756), (1,766), (1,776), (1,786), (1,796), (1,806), (1,816),...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment