@msuzen
Last active June 17, 2017 19:19
Spark RDD partition size after reduction of the data size
/*
The RDD partition count is retained from the parent RDD, even after the data shrinks
(c) 2017
GPLv3
Author: Mehmet Suzen (suzen at acm dot org)
*/
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(_ => Random.nextGaussian) // 1 million draws from N(0, 1)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count // 1 million
ngauss_rdd.partitions.size // 4 (sc.defaultParallelism on this machine)
val ngauss_rdd2 = ngauss_rdd.filter(_ > 4.0) // keep only the > 4-sigma tail
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // still 4: filter preserves the parent's partitioning
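Because `filter` preserves the parent's partitioning, a drastically reduced RDD ends up spread over many near-empty partitions. A minimal sketch of how one might compact them, assuming the same spark-shell session (so `sc` and `ngauss_rdd2` are in scope); `coalesce(n)` is the standard RDD method for reducing the partition count without a full shuffle:

```scala
// Merge the 4 sparse partitions into 1; coalesce avoids a shuffle,
// unlike repartition, which always reshuffles the data.
val ngauss_rdd3 = ngauss_rdd2.coalesce(1)
ngauss_rdd3.partitions.size // 1
ngauss_rdd3.count // still 35; only the layout changed, not the data
```

Use `repartition(n)` instead when increasing the partition count or when an even redistribution across executors matters more than avoiding the shuffle.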