Last active Jun 17, 2017
Spark RDD partition size after reduction of the data size
An RDD inherits its partition count from its parent RDD, even when a transformation drastically shrinks the data.
(c) 2017
Author: Mehmet Suzen (suzen at acm dot org)
// Generate 1 million Gaussian random numbers
import scala.util.Random
val ngauss = (1 to 1e6.toInt).map(_ => Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count              // 1000000
ngauss_rdd.partitions.size    // 4 (default parallelism on this machine)
// Keep only the upper tail (> 4 sigma): roughly 32 values expected out of 1e6
val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
ngauss_rdd2.count             // 35
ngauss_rdd2.partitions.size   // still 4: filter preserves the parent's partitioning
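The retained partitioning means a heavy filter can leave many near-empty partitions. A minimal sketch of shrinking the partition count afterwards with `coalesce` (assuming the same spark-shell session as above, where `sc` and `ngauss_rdd2` are in scope; `repartition` is the shuffle-based alternative):

```scala
// Assumes a spark-shell session: ngauss_rdd2 is the filtered RDD from above.
// coalesce(1) merges the 4 sparse partitions into one without a full shuffle;
// use repartition(n) instead when rebalancing across the cluster is wanted.
val ngauss_rdd3 = ngauss_rdd2.coalesce(1)
ngauss_rdd3.count            // still 35: the data itself is unchanged
ngauss_rdd3.partitions.size  // 1
```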