Spark RDD ScratchPad
// ============================================================
// Generate a test key-value pair RDD and compare partitioners
// Note: spark.sql.shuffle.partitions only affects DataFrame/SQL shuffles;
// RDD shuffle parallelism is set by the numPartitions arguments used below
spark.conf.set("spark.sql.shuffle.partitions",2)
val num = Seq((2000,10),(2001,20),(2000,20),(2002,30),(2003,30),(2004,50),(2004,100),(2004,250),(2005,250),(2005,25),
(2006,150),(2006,225),(2007,250),(2007,125),(2008,250),(2009,25),(2010,250),(2010,125))
val rdd = sc.parallelize(num)
// Pass numPartitions directly so the partitioner is retained;
// a trailing repartition() would shuffle again and drop it (partitioner would return None)
val prdd = rdd.reduceByKey(_ + _, 2)
val srdd = rdd.sortByKey(ascending = true, numPartitions = 2)
// reduceByKey hash-partitions by key => Some(HashPartitioner)
prdd.partitioner
// sortByKey range-partitions by key => Some(RangePartitioner)
srdd.partitioner
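// A minimal sketch (assuming the prdd/srdd defined above) that makes the two placement
// schemes visible: mapPartitionsWithIndex tags every record with its partition id,
// so hash vs. range placement of the keys can be compared on the driver.
def showPlacement(name: String, pairRdd: org.apache.spark.rdd.RDD[(Int, Int)]): Unit = {
  val placed = pairRdd
    .mapPartitionsWithIndex((idx, iter) => iter.map(kv => (idx, kv)))
    .collect()
  println(s"--- $name ---")
  placed.groupBy(_._1).toSeq.sortBy(_._1).foreach { case (idx, recs) =>
    println(s"partition $idx: " + recs.map(_._2).mkString(", "))
  }
}
showPlacement("hash partitioned (reduceByKey)", prdd)
showPlacement("range partitioned (sortByKey)", srdd)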
// Save hash partitioned data to file
prdd.saveAsTextFile("file:///Users/abe/Personal/Apache Spark/data/hashp")
// Save range partitioned data to file
srdd.saveAsTextFile("file:///Users/abe/Personal/Apache Spark/data/rangep")
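// Quick sanity check (assuming the save paths above exist after the jobs finish):
// saveAsTextFile writes one part-NNNNN file per partition; read them back as plain strings.
val hashBack = sc.textFile("file:///Users/abe/Personal/Apache Spark/data/hashp")
val rangeBack = sc.textFile("file:///Users/abe/Personal/Apache Spark/data/rangep")
println(s"hash output:  ${hashBack.getNumPartitions} input splits, ${hashBack.count()} lines")
println(s"range output: ${rangeBack.getNumPartitions} input splits, ${rangeBack.count()} lines")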
// ============================================================
// Generate a sequence of numbers
val rdd = sc.parallelize(Seq.range(0,100))
rdd.foreach(println)
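// Note: foreach(println) runs on the executors, so in cluster mode the output does not
// reach the driver console. A common alternative (sketch) is to collect first, or to
// inspect per-partition contents with glom().
rdd.collect().foreach(println)
rdd.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx holds ${part.length} elements")
}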