@drdee
Created July 29, 2014 15:58
PySpark countApproxDistinct
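def error(estimate, size):
    # Relative error of the estimate against the true distinct count.
    return abs(estimate - size) / float(size)

def uniform():
    # 100,000 values cycling through 0..99, so exactly 100 distinct values.
    for x in range(100000):
        yield x % 100

# sc is the SparkContext provided by the PySpark shell.
rdd = sc.parallelize([x for x in uniform()])
# Call the Scala RDD's countApproxDistinct(p, sp) through Py4J.
assert(error(rdd._jrdd.rdd().countApproxDistinct(4, 0), 100) < 0.4)
assert(error(rdd._jrdd.rdd().countApproxDistinct(8, 0), 100) < 0.1)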
https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87
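The looser bound for precision 4 and the tighter bound for precision 8 can be sanity-checked without Spark. The sketch below assumes the estimator behaves like a standard HyperLogLog with 2^p registers, whose commonly quoted relative standard error is roughly 1.04 / sqrt(2^p). That constant and the helper name `expected_relative_error` are assumptions for illustration, not something stated in the gist.

```python
import math

def expected_relative_error(p):
    # Assumed HyperLogLog rule of thumb: about 1.04 / sqrt(m) with m = 2**p registers.
    return 1.04 / math.sqrt(2 ** p)

# (precision, threshold) pairs taken from the assertions in the gist.
checks = [(4, 0.4), (8, 0.1)]

for p, threshold in checks:
    expected = expected_relative_error(p)
    # The asserted threshold should leave headroom above the typical error.
    print("p=%d expected~%.3f threshold=%.1f" % (p, expected, threshold))
```

Under that assumption, both thresholds sit at about 1.5 times the typical error, which leaves some headroom but not a guarantee that every run passes.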