@pavlov99
Created January 4, 2017 05:59
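// NOTE: pad the thresholds with Double.MinValue and Double.MaxValue so every value falls into some bucket.
// Bucket boundaries must be sorted and must not contain duplicates, so 0.0 is listed only once.
val thresholds: Array[Double] =
  Array(Double.MinValue) ++ (0 until 50 by 10).map(_.toDouble) ++ Array(Double.MaxValue)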
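// Convert the DataFrame to an RDD[Double] and count the values in each bucket.
// Assumes a spark-shell style session where `sc` and the SQL implicits (`$"..."`, `toDF`) are in scope.
val _tmpHist = df
  .select($"column" cast "double")
  .na.drop() // rows that are null after the cast would make getDouble throw
  .rdd.map(r => r.getDouble(0))
  .histogram(thresholds)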
// The resulting DataFrame has one row per bucket: its `from` and `to` bounds and the count in `value`.
val histogram = sc
  .parallelize((thresholds, thresholds.tail, _tmpHist).zipped.toList)
  .toDF("from", "to", "value")
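// A possible way to inspect the result (a sketch, assuming the session above with `histogram` defined).
// Each bucket is half-open, [from, to), except the last one, which also includes its upper bound.
histogram.orderBy("from").show(false)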