Skip to content

Instantly share code, notes, and snippets.

@tecmaverick
Last active December 20, 2022 01:24
Show Gist options
  • Save tecmaverick/1e23a1130c8c326ccbf641d402f28f11 to your computer and use it in GitHub Desktop.
Save tecmaverick/1e23a1130c8c326ccbf641d402f28f11 to your computer and use it in GitHub Desktop.
//Set local checkpoint directory. This is unreliable incase of driver restarts
sc.setCheckpointDir("file:///tmp/sparkcheckpoints")
//View the checkpoint dir
sc.getCheckpointDir.get
val rdd = sc.parallelize(Seq.range(0,100))
val filteredRdd = rdd.filter(x=> x>50).map(x=> x * 2)
// View the lineage
filteredRdd.toDebugString
// The lineage BEFORE checkpointing
// (8) MapPartitionsRDD[9] at map at <console>:23 []
// | MapPartitionsRDD[8] at filter at <console>:23 []
// | ParallelCollectionRDD[5] at parallelize at <console>:23 []
//Checkpoint the RDD
filteredRdd.checkpoint
// The lineage AFTER checkpointing
// (8) MapPartitionsRDD[9] at map at <console>:23 []
// | ReliableCheckpointRDD[10] at count at <console>:24 []
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment