Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
Spark RDD partition size after reduction of the data size
/*
RDD partition size retains from the parent
(c) 2017
GPLv3
Author: Mehmet Suzen (suzen at acm dot org)
*/
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(x=>Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count // 1 million
ngauss_rdd.partitions.size // 4
val ngauss_rdd2 = ngauss_rdd.filter(x=>x > 4.0)
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // 4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment