Skip to content

Instantly share code, notes, and snippets.

@zero323
Created July 15, 2015 17:41
Show Gist options
  • Save zero323/93dc3add9d6eabb5fd3b to your computer and use it in GitHub Desktop.
Save zero323/93dc3add9d6eabb5fd3b to your computer and use it in GitHub Desktop.
from numpy import floor
def quantile(rdd, p, sample=None, seed=0):
"""Compute a percentile of order p ∈ [0, 1]
:rdd a numeric rdd
:p percentile (between 0 and 1)
:sample fraction of and rdd to use. If not provided we use a whole dataset
:seed random number generator seed to be used with sample
"""
assert 0 <= p <= 1
assert sample is None or 0 < sample <= 1
rdd = rdd if sample is None else rdd.sample(False, sample, seed)
rddSortedWithIndex = (rdd.
sortBy(lambda x: x).
zipWithIndex().
map(lambda (x, i): (i, x)).
cache())
n = rddSortedWithIndex.count()
h = (n - 1) * p + 1
rddX = rddSortedWithIndex.lookup(int(floor(h)))[0]
rddXPlusOne = rddSortedWithIndex.lookup(int(floor(h)) + 1)[0]
return rddX + (h - floor(h)) * (rddXPlusOne - rddX)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment