Skip to content

Instantly share code, notes, and snippets.

@manjotmona
Last active February 22, 2018 05:13
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save manjotmona/9fcadd714b8c7fda2773965a9ab07452 to your computer and use it in GitHub Desktop.
Save manjotmona/9fcadd714b8c7fda2773965a9ab07452 to your computer and use it in GitHub Desktop.
Question 1:
scala> val list = List("Hello,World")
list: List[String] = List(Hello,World)
scala> val rdd = sc.parallelize(list)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:26
scala> rdd.collect
res3: Array[String] = Array(Hello,World)
Question 2:
scala> val rdd2 = spark.sparkContext.parallelize(List(1,2,3,4))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:23
scala> val rdd1 = sc.parallelize(List(1,1,2,3,4,4)).map(x => ("a", x)).distinct.sortBy(_._2).values
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[18] at values at <console>:24
scala> rdd1.coalesce(rdd2.getNumPartitions).zip(rdd2)
res4: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[20] at zip at <console>:29
scala> res4.collect
res5: Array[(Int, Int)] = Array((1,1), (2,2), (3,3), (4,4))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment