@ConstantineLignos
Created February 4, 2012 17:40
Compute the probability mass assigned to the most frequent word types in the Brown corpus
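from collections import Counter
import nltk

# Fraction of word types (most frequent first) to treat as the "top" set.
TOP_PERCENT = .01

def prob_mass_top(counts, n):
    """Return the share of all tokens covered by the n most frequent types."""
    return sum(count for word, count in counts.most_common(n)) / float(sum(counts.values()))

# Requires the Brown corpus data, e.g. via nltk.download('brown').
count = Counter(word.lower() for word in nltk.corpus.brown.words())
print("Top %d%% of types account for %2.1f%% of tokens" %
      (TOP_PERCENT * 100, prob_mass_top(count, int(len(count) * TOP_PERCENT)) * 100))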
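The same calculation can be tried without downloading the Brown corpus. The following is a minimal sketch on a toy token list. The name `top_type_mass`, the sample sentence, and the `max(1, ...)` guard (which keeps at least one type when the vocabulary is tiny) are assumptions added for illustration and are not part of the script above.

```python
from collections import Counter


def top_type_mass(tokens, top_fraction):
    """Share of tokens covered by the most frequent `top_fraction` of types."""
    counts = Counter(tokens)
    # Assumption: always keep at least one type so small vocabularies work.
    n_types = max(1, int(len(counts) * top_fraction))
    top_total = sum(count for _, count in counts.most_common(n_types))
    return top_total / float(sum(counts.values()))


# Hypothetical toy corpus: 11 tokens, 6 distinct types.
toy_tokens = "the cat saw the dog and the dog saw the bird".split()
mass = top_type_mass(toy_tokens, 0.2)
print("Top 20%% of types account for %.1f%% of tokens" % (mass * 100))
```

Here 20% of 6 types rounds down to a single type, "the", which occurs 4 times out of 11 tokens, so the script prints 36.4%.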