@denten
Last active August 29, 2015 14:00
Iteration bug with ConcatenatedCorpusView in NLTK
import nltk
wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
cdf = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
wordlist = cdf['VN'].keys()
# Bug 1: iterating over wsj in the for loop raises strange exceptions.
# Solution: wsj is a custom NLTK data type, ConcatenatedCorpusView, not a list.
# Convert it to a native Python list for more reliable iteration.
# My guess is that ConcatenatedCorpusView chokes on empty tuples (in this case
# at index 17).
wsj_list = list(wsj)
# Bug 2: looking up a repeated word's position (e.g. with list.index())
# returns the index of its first occurrence only.
# Solution: use enumerate() so each item carries its own index.
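for ndx, (word, tag) in enumerate(wsj_list):
    if word in wordlist and tag == 'VN':
        # Print the match together with the word before it; max() keeps the
        # slice start from going negative when the match is at index 0.
        print(wsj_list[max(ndx - 1, 0):ndx + 1])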
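
Bug 2 can be reproduced without NLTK. The sketch below assumes the original lookup used something like `list.index()`, which the gist does not actually show. The toy `tagged` list and both helper functions are hypothetical and exist only to contrast the two approaches.

```python
# Toy tagged-word list; the words and tags are made up for illustration.
tagged = [("he", "PRO"), ("has", "V"), ("seen", "VN"),
          ("and", "CNJ"), ("been", "VN"), ("seen", "VN")]


def first_index_hits(pairs, target_tag):
    """Assumed buggy approach: list.index() returns the first equal item."""
    return [pairs.index(pair) for pair in pairs if pair[1] == target_tag]


def enumerate_hits(pairs, target_tag):
    """Fixed approach: enumerate() yields the real position of each item."""
    return [ndx for ndx, (word, tag) in enumerate(pairs) if tag == target_tag]


buggy = first_index_hits(tagged, "VN")
fixed = enumerate_hits(tagged, "VN")
print(buggy)  # [2, 4, 2]  (the second "seen" is reported at index 2)
print(fixed)  # [2, 4, 5]  (each occurrence gets its own index)
```

With `enumerate()` the index comes from the iteration itself rather than from a search, so duplicates are never collapsed and no extra scan of the list is needed per match.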