Spark RDD partition count after a reduction in data size
/*
   An RDD produced by a narrow transformation retains the
   partition count of its parent
   (c) 2017
   GPLv3
   Author: Mehmet Suzen (suzen at acm dot org)
*/
// Generate 1 million Gaussian random numbers with a fixed seed
import scala.util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(_ => Random.nextGaussian)
// Distribute the local collection as an RDD
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count           // 1 million
ngauss_rdd.partitions.size // 4
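// Aside (not part of the original run): parallelize splits the data into
// sc.defaultParallelism slices by default, which presumably gave the 4
// above (e.g. a local[4] master). The slice count can be set explicitly;
// ngauss_rdd8 is an illustrative name:
val ngauss_rdd8 = sc.parallelize(ngauss, 8)
ngauss_rdd8.partitions.size // 8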
// Filter down to the far right tail: only 35 of the million values
// survive, yet the filtered RDD keeps all 4 parent partitions
val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
ngauss_rdd2.count           // 35
ngauss_rdd2.partitions.size // 4
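// Follow-up sketch (not in the original gist): because filter is a narrow
// transformation it never repartitions, so the 35 survivors sit thinly
// spread over the 4 retained partitions. glom exposes the per-partition
// counts, and coalesce shrinks the partitioning to match the reduced data;
// ngauss_rdd3 is an illustrative name:
ngauss_rdd2.glom().map(_.length).collect() // four small counts summing to 35
val ngauss_rdd3 = ngauss_rdd2.coalesce(1)
ngauss_rdd3.partitions.size // 1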