@dyerrington
Created April 14, 2015 06:07
Find bi-grams, filter on frequency, return PMI
import nltk
from nltk.collocations import BigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
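# change this to read in your data; the genesis corpus reader is used here only
# to tokenize the text file at the given absolute path
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('/var/www/htdocs/rapstats/data/albums/wutang_all.txt'))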
# only keep bigrams that appear 2+ times
finder.apply_freq_filter(2)
# print the 10 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))
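
For readers who want to see what the frequency filter and PMI ranking amount to, here is a minimal pure-Python sketch of the same idea. It is not NLTK's code: the function name `top_pmi_bigrams`, its parameters, and the sample `tokens` are made up for illustration, and it assumes the standard definition PMI = log2(count(w1, w2) * N / (count(w1) * count(w2))), with N the total number of tokens. Tie-breaking and other details may differ from what `finder.nbest` returns.

```python
import math
from collections import Counter


def top_pmi_bigrams(words, min_freq=2, n=10):
    """Return up to n (bigram, pmi) pairs, highest PMI first (illustrative helper)."""
    words = list(words)
    total = len(words)                            # N: total number of tokens
    word_counts = Counter(words)                  # unigram counts
    bigram_counts = Counter(zip(words, words[1:]))  # adjacent-pair counts

    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:                      # frequency filter
            continue
        pmi = math.log2(count * total / (word_counts[w1] * word_counts[w2]))
        scored.append(((w1, w2), pmi))

    # highest PMI first; ties broken alphabetically so the order is deterministic
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]


# hypothetical sample tokens, standing in for the words read from your data file
tokens = "wu tang clan aint nothing to mess with wu tang clan".split()
for bigram, score in top_pmi_bigrams(tokens, min_freq=2, n=10):
    print(bigram, round(score, 3))
```

PMI tends to reward rare pairs whose words almost never appear apart, which is why a frequency filter is applied first: without it, pairs seen only once would dominate the top of the list.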