@codegordi
Created October 21, 2013 20:50
Python function to calculate a cumulative relative frequency distribution over fixed bins from -0.99 to 1.00 in steps of 0.01 (for contexts where numpy/scipy/etc. are not available, e.g. in Pig before v0.12). Originally designed to work as a User Defined Function for Pig on Hadoop.
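def cumRelFreqDistn(tups):
    # create 200 bin edges from -0.99 to 1.00 in increments of 0.01
    # (dividing integer steps by 100.0 avoids the rounding drift of i*0.01)
    bins = [i / 100.0 for i in range(-99, 101)]
    # build cumulative frequency: for each edge, count the values <= that edge
    cumfreq = [0] * len(bins)
    for tup in tups:
        value = tup[0]  # each input tuple carries the value in its first field
        for i in range(len(bins)):
            if value <= bins[i]:
                cumfreq[i] += 1
    # normalize by the largest cumulative count, i.e. the last bin
    # (values above 1.00 are never counted)
    total = max(cumfreq)
    if total == 0:
        # empty input, or every value above 1.00: avoid dividing by zero
        return [(edge, 0.0) for edge in bins]
    cumrelfreq = [float(count) / total for count in cumfreq]
    # list() so that a list is returned under both Python 2 and Python 3
    return list(zip(bins, cumrelfreq))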
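
The function above compares every value against every bin edge. As a rough sketch of the same calculation, the version below sorts the values once and uses the standard-library `bisect` module to count how many fall at or below each edge. This is not the gist author's method. The name `cum_rel_freq` and its `lo`, `hi` and `scale` parameters are hypothetical and only serve the illustration, and it assumes `bisect` is importable in the target environment. The sample data is made up to show the shape of the result.

```python
from bisect import bisect_right

def cum_rel_freq(values, lo=-99, hi=100, scale=100.0):
    """Hypothetical helper: cumulative relative frequency over the
    bin edges lo/scale, (lo+1)/scale, ..., hi/scale."""
    edges = [i / scale for i in range(lo, hi + 1)]
    data = sorted(values)
    # bisect_right(data, edge) is the number of values <= edge
    counts = [bisect_right(data, edge) for edge in edges]
    # normalize by the count at the top edge, matching the
    # last-bin normalization used by the function above
    total = counts[-1]
    if total == 0:
        return [(edge, 0.0) for edge in edges]
    return [(edge, float(c) / total) for edge, c in zip(edges, counts)]

# Input shaped like the tuples the original function receives:
# one value in the first field of each tuple (illustrative data only).
sample = [(-0.5,), (0.0,), (0.25,), (0.25,)]
crfd = cum_rel_freq([t[0] for t in sample])
```

Each element of `crfd` is a `(bin_edge, cumulative_relative_frequency)` pair, so `crfd[99]` is the entry for the edge 0.0 and `crfd[-1]` is the entry for the edge 1.0.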