Constantine Lignos ConstantineLignos

ConstantineLignos / occupy_zipf.py
Created February 4, 2012 17:40
Compute the probability mass assigned to the most frequent tokens using the Brown corpus
from collections import Counter
import nltk
TOP_PERCENT = .01
def prob_mass_top(counts, n):
    """Return the share of all tokens covered by the n most frequent types."""
    return sum(count for word, count in counts.most_common(n)) / float(sum(counts.values()))
count = Counter(word.lower() for word in nltk.corpus.brown.words())
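The preview above stops before the function is called. As a minimal sketch of how the pieces might fit together, the block below applies the same computation to a tiny hand-made token list instead of the Brown corpus, so it runs without any NLTK data. The names `toy_tokens`, `toy_counts`, `top_percent`, and `n_top`, the 0.2 cutoff, and the rule for turning a percentage into a number of word types are all assumptions made for illustration and are not taken from the gist.

```python
from collections import Counter


def prob_mass_top(counts, n):
    """Return the share of all tokens covered by the n most frequent types."""
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(n)) / float(total)


# Hypothetical toy data standing in for nltk.corpus.brown.words().
toy_tokens = "the cat saw the dog and the bird saw the cat".split()
toy_counts = Counter(token.lower() for token in toy_tokens)

# Assumed cutoff for the toy example: the top 20% of word types,
# rounded down but never fewer than one type.
top_percent = 0.2
n_top = max(1, int(len(toy_counts) * top_percent))

mass = prob_mass_top(toy_counts, n_top)
print("Top %d type(s) cover %.1f%% of tokens" % (n_top, 100 * mass))
```

With real corpus counts, the same call would take the `Counter` built from the Brown corpus in place of `toy_counts`.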
ConstantineLignos / cmudict_triphones.py
Created September 15, 2011 16:42
Getting triphones from CMUDict pronunciations in NLTK
import re
from collections import defaultdict
import nltk
from nltk.corpus import cmudict
def clean_pron(pron):
    """Remove stress from pronunciations."""
    return re.sub(r"\d", "", pron)
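The preview ends before any triphones are extracted. One way the remaining step could look is sketched below, assuming the pronunciation is passed as a single space-separated string of phones. NLTK's `cmudict.entries()` yields each pronunciation as a list of phone strings, so such a list would first need joining with `" ".join(...)`. The `triphones` helper and the example pronunciation are hypothetical and are not the gist's own code.

```python
import re


def clean_pron(pron):
    """Remove stress digits from a space-separated pronunciation string."""
    return re.sub(r"\d", "", pron)


def triphones(pron):
    """Return every run of three consecutive phones as a tuple.

    Hypothetical helper: a sliding window of width 3 over the
    stress-stripped phone sequence.
    """
    phones = clean_pron(pron).split()
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


# Illustrative pronunciation in ARPAbet style (assumed, not read from CMUDict).
example = "K AE1 T AH0 L AO2 G"
for tri in triphones(example):
    print(" ".join(tri))
```

A pronunciation with fewer than three phones yields an empty list, since no full window fits.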