@davidfauth
Created January 21, 2014 15:51
Python utility (Pig UDF) that tokenizes text and returns the top five bigrams, ranked by likelihood ratio
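import nltk

# outputSchema is supplied by Pig's Python UDF support (for streaming CPython UDFs: from pig_util import outputSchema).
@outputSchema("top_five:bag{t:(bigram:chararray)}")
def top5_bigrams(textDescription):
    # Split the text into sentences, then each sentence into word tokens.
    sentences = nltk.tokenize.sent_tokenize(textDescription)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_documents(tokens)
    # Rank bigrams by likelihood ratio and keep the five highest-scoring ones.
    top_5 = finder.nbest(bgm.likelihood_ratio, 5)
    # Return one single-field tuple per bigram to match the declared bag schema.
    return [("%s %s" % (s[0], s[1]),) for s in top_5]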
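To see what the ranking step does without installing NLTK or running Pig, here is a minimal standard-library sketch of the same idea. It is not NLTK's implementation: the names tokenize, llr, and top_bigrams are hypothetical, the regex tokenizer is a crude stand-in for sent_tokenize and word_tokenize, and the contingency-table marginals are taken from bigram counts, so scores and tie order may differ from what finder.nbest returns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive stand-in for NLTK's tokenizers: split into sentences, then lowercase words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"[A-Za-z0-9']+", s.lower()) for s in sentences]


def llr(n11, n1x, nx1, n):
    """Log-likelihood ratio (G^2) of a 2x2 bigram contingency table.

    n11: count of the bigram, n1x: bigrams starting with its first word,
    nx1: bigrams ending with its second word, n: total bigrams.
    """
    observed = [n11, n1x - n11, nx1 - n11, n - n1x - nx1 + n11]
    expected = [
        n1x * nx1 / n,
        n1x * (n - nx1) / n,
        (n - n1x) * nx1 / n,
        (n - n1x) * (n - nx1) / n,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def top_bigrams(text, k=5):
    """Return the k highest-scoring bigrams as single-field tuples, like the UDF above."""
    bigrams = Counter()
    for words in tokenize(text):
        # Pair adjacent words within a sentence only, never across sentence boundaries.
        bigrams.update(zip(words, words[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ranked = sorted(
        bigrams,
        key=lambda b: (-llr(bigrams[b], first[b[0]], second[b[1]], n), b),
    )
    return [("%s %s" % b,) for b in ranked[:k]]


sample = "New York is big. I love New York. New York has pizza. The city is big."
print(top_bigrams(sample))
```

Each result is a one-element tuple because the declared schema is a bag of tuples with a single chararray field; returning bare strings would not match that schema.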