Fixed Treebank Tokenizer for NLTK

tokenize.doctest
.. Copyright (C) 2001-2012 NLTK Project
.. For license information, see LICENSE.TXT
 
>>> from nltk.tokenize import *
 
Regression Tests: Treebank Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
Some test strings.
 
>>> s1 = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."
>>> print word_tokenize(s1)
['On', 'a', '$', '50', ',', '000', 'mortgage', 'of', '30', 'years', 'at', '8', 'percent', ',', 'the', 'monthly', 'payment', 'would', 'be', '$', '366.88', '.']
>>> s2 = "\"We beat some pretty good teams to get here,\" Slocum said."
>>> print word_tokenize(s2)
['``', 'We', 'beat', 'some', 'pretty', 'good', 'teams', 'to', 'get', 'here', ',', "''", 'Slocum', 'said', '.']
>>> s3 = "Well, we couldn't have this predictable, cliche-ridden, \"Touched by an Angel\" (a show creator John Masius worked on) wanna-be if she didn't."
>>> print word_tokenize(s3)
['Well', ',', 'we', 'could', "n't", 'have', 'this', 'predictable', ',', 'cliche-ridden', ',', '``', 'Touched', 'by', 'an', 'Angel', "''", '(', 'a', 'show', 'creator', 'John', 'Masius', 'worked', 'on', ')', 'wanna-be', 'if', 'she', 'did', "n't", '.']
>>> s4 = "I cannot cannot work under these conditions!"
>>> print word_tokenize(s4)
['I', 'can', 'not', 'can', 'not', 'work', 'under', 'these', 'conditions', '!']
 
Regression Tests: Regexp Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
Some additional test strings.
 
>>> s = ("Good muffins cost $3.88\nin New York. Please buy me\n"
... "two of them.\n\nThanks.")
>>> s2 = ("Alas, it has not rained today. When, do you think, "
... "will it rain again?")
>>> s3 = ("<p>Although this is <b>not</b> the case here, we must "
... "not relax our vigilance!</p>")
 
>>> print regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=False)
[', ', '. ', ', ', ', ', '?']
>>> print regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=True)
['Alas', 'it has not rained today', 'When', 'do you think',
'will it rain again']
 
Make sure that grouping parentheses don't confuse the tokenizer:
 
>>> print regexp_tokenize(s3, r'</?(b|p)>', gaps=False)
['<p>', '<b>', '</b>', '</p>']
>>> print regexp_tokenize(s3, r'</?(b|p)>', gaps=True)
['Although this is ', 'not',
' the case here, we must not relax our vigilance!']
 
Make sure that named groups don't confuse the tokenizer:
 
>>> print regexp_tokenize(s3, r'</?(?P<named>b|p)>', gaps=False)
['<p>', '<b>', '</b>', '</p>']
>>> print regexp_tokenize(s3, r'</?(?P<named>b|p)>', gaps=True)
['Although this is ', 'not',
' the case here, we must not relax our vigilance!']
 
Make sure that nested groups don't confuse the tokenizer:
 
>>> print regexp_tokenize(s2, r'(h|r|l)a(s|(i|n0))', gaps=False)
['las', 'has', 'rai', 'rai']
>>> print regexp_tokenize(s2, r'(h|r|l)a(s|(i|n0))', gaps=True)
['A', ', it ', ' not ', 'ned today. When, do you think, will it ',
'n again?']
 
The tokenizer should reject any patterns with backreferences:
 
>>> print regexp_tokenize(s2, r'(.)\1')
Traceback (most recent call last):
...
ValueError: Regular expressions with back-references are
not supported: '(.)\\1'
>>> print regexp_tokenize(s2, r'(?P<foo>)(?P=foo)')
Traceback (most recent call last):
...
ValueError: Regular expressions with back-references are
not supported: '(?P<foo>)(?P=foo)'
 
A simple sentence tokenizer '\.(\s+|$)'
 
>>> print regexp_tokenize(s, pattern=r'\.(\s+|$)', gaps=True)
['Good muffins cost $3.88\nin New York',
'Please buy me\ntwo of them', 'Thanks']
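The group-handling behavior exercised above can be sketched as follows. This is an illustrative reimplementation under my own assumptions, not NLTK's actual `RegexpTokenizer` source:

```python
import re

def simple_regexp_tokenize(text, pattern, gaps=False):
    """Tokenize with a user-supplied regexp, ignoring its capturing groups."""
    # Rewrite capturing groups, both plain "(...)" and named
    # "(?P<name>...)", as non-capturing "(?:...)" so they cannot leak
    # into re.split()/re.findall() output. (A production version would
    # also have to skip escaped parens like "\(".)
    neutral = re.sub(r"\((\?P<[^>]+>)?(?![?:])", "(?:", pattern)
    if gaps:
        return [tok for tok in re.split(neutral, text) if tok]
    return re.findall(neutral, text)

s3 = ("<p>Although this is <b>not</b> the case here, we must "
      "not relax our vigilance!</p>")
print(simple_regexp_tokenize(s3, r'</?(?P<named>b|p)>'))
# -> ['<p>', '<b>', '</b>', '</p>']
```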
treebank2-heilman.py
r"""
 
Penn Treebank Tokenizer
 
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank.
This implementation is a port of the tokenizer sed script written by Robert MacIntyre
and available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.
This is the method that is invoked by ``word_tokenize()``. It assumes that the
text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
 
This tokenizer performs the following steps:
 
- split standard contractions, e.g. ``don't`` -> ``do n't`` and ``they'll`` -> ``they 'll``
- treat most punctuation characters as separate tokens
- split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of line
 
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
 
"""
 
import re
from api import *
 
 
class TreebankWordTokenizer(TokenizerI):
    # List of contractions adapted from Robert MacIntyre's tokenizer.
    CONTRACTIONS2 = [re.compile(r"\b(can)(not)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(d)('ye)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(gim)(me)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(gon)(na)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(got)(ta)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(lem)(me)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(mor)('n)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(wan)(na) ", flags=re.IGNORECASE)]
    CONTRACTIONS3 = [re.compile(r" ('t)(is)\b", flags=re.IGNORECASE),
                     re.compile(r" ('t)(was)\b", flags=re.IGNORECASE)]
    CONTRACTIONS4 = [re.compile(r"\b(whad)(dd)(ya)\b", flags=re.IGNORECASE),
                     re.compile(r"\b(wha)(t)(cha)\b", flags=re.IGNORECASE)]
 
    def tokenize(self, text):
        # starting quotes
        text = re.sub(r'^\"', r'``', text)
        text = re.sub(r'(``)', r' \1 ', text)
        text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)

        # punctuation
        text = re.sub(r'\.\.\.', r' ... ', text)
        text = re.sub(r'[,;:@#$%&]', r' \g<0> ', text)
        text = re.sub(r'([^\.])(\.)([\]\)}>"\']*)\s*$', r'\1 \2\3 ', text)
        text = re.sub(r'[?!]', r' \g<0> ', text)

        text = re.sub(r"([^'])' ", r"\1 ' ", text)

        # parens, brackets, etc.
        text = re.sub(r'[\]\[\(\)\{\}\<\>]', r' \g<0> ', text)
        text = re.sub(r'--', r' -- ', text)

        # add extra space to make things easier
        text = " " + text + " "

        # ending quotes
        text = re.sub(r'"', " '' ", text)
        text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)

        text = re.sub(r"([^' ])('[sS]|'[mM]|'[dD]|') ", r"\1 \2 ", text)
        text = re.sub(r"([^' ])('ll|'re|'ve|n't|) ", r"\1 \2 ", text)
        text = re.sub(r"([^' ])('LL|'RE|'VE|N'T|) ", r"\1 \2 ", text)

        for regexp in self.CONTRACTIONS2:
            text = regexp.sub(r' \1 \2 ', text)
        for regexp in self.CONTRACTIONS3:
            text = regexp.sub(r' \1 \2 ', text)

        # We are not using CONTRACTIONS4 since
        # they are also commented out in the SED scripts
        # for regexp in self.CONTRACTIONS4:
        #     text = regexp.sub(r' \1 \2 \3 ', text)

        text = re.sub(" +", " ", text)
        text = text.strip()

        # add space at end to match up with MacIntyre's output (for debugging)
        if text != "":
            text += " "

        return text.split()
 
# if __name__ == "__main__":
#     import sys
#     t = TreebankWordTokenizer()
#     for line in sys.stdin:
#         line = line.strip()
#         print t.tokenize(line)

In this version, CONTRACTIONS2 requires tokens to be space-separated, whereas \b should be used instead (as in the existing implementation). As a result, it breaks in cases like these:

"This cannot cannot be right!" --> "This can not cannot be right !"
"This cannot work." --> "This cannot work ."
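The difference is easy to reproduce with plain `re`; the patterns below are an illustrative comparison, not the gist's code itself:

```python
import re

space_pat = re.compile(r" (can)(not) ", re.IGNORECASE)   # space-delimited
word_pat = re.compile(r"\b(can)(not)\b", re.IGNORECASE)  # word boundaries

s = "This cannot cannot be right !"

# The space-delimited pattern consumes the trailing space of its first
# match, so the immediately following "cannot" is never split:
print(" ".join(space_pat.sub(r" \1 \2 ", s).split()))
# -> This can not cannot be right !

# The \b-delimited pattern splits every occurrence:
print(" ".join(word_pat.sub(r" \1 \2 ", s).split()))
# -> This can not can not be right !
```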

I fixed the gist to solve this issue.

Also, the flags argument to re.sub requires Python 2.7. The existing version of treebank.py works with earlier versions. Do we have any tests?
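The portable workaround, roughly: `re.sub()` only accepts a `flags` argument from Python 2.7 on, but flags can always be baked into a compiled pattern:

```python
import re

# Instead of re.sub(pattern, repl, text, flags=re.IGNORECASE), which
# needs Python 2.7+, pre-compile the pattern with the flag:
pattern = re.compile(r"\b(can)(not)\b", re.IGNORECASE)
print(pattern.sub(r"\1 \2", "He CANNOT go."))
# -> He CAN NOT go.
```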

Removed the 'flags' argument to re.sub. Working on additional unit tests now. Should these go into test/tokenize.doctest?

Great, thanks. Yes, please put them in tokenize.doctest. Also, please see tokenize/regexp.py for an example of doctests inside docstrings for the purpose of user documentation.

So, I tested this script against the official Penn Treebank sed script on a sample of 100,000 sentences from the NYT section of Gigaword. The latest version above gets exactly the same results on this sample as the sed script, so I am pretty confident that this version is as close to official Treebank tokenization as possible. I added some simple doctests to the docstring. I am also attaching a new version of tokenize.doctest here, which contains some unit tests for treebank tokenization.

Let me know if I can check these in.
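For reference, the shape of that differential test, as a generic sketch with toy stand-in tokenizers (in the actual run, one side was `TreebankWordTokenizer` and the other the output of the official `tokenizer.sed`):

```python
import re

def diff_tokenizations(tok_a, tok_b, sentences):
    """Return (sentence, tokens_a, tokens_b) for every disagreement."""
    return [(s, tok_a(s), tok_b(s))
            for s in sentences
            if tok_a(s) != tok_b(s)]

# Toy stand-ins: whitespace splitting vs. a punctuation-aware regexp.
naive = str.split
punct_aware = lambda s: re.findall(r"\w+|[^\w\s]", s)

for s, a, b in diff_tokenizations(naive, punct_aware,
                                  ["Hello world", "Thanks."]):
    print("%r: %s vs %s" % (s, a, b))
# -> 'Thanks.': ['Thanks.'] vs ['Thanks', '.']
```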

Thanks. Please go ahead, but first remember we can't use the "flags" named argument to re.compile as this requires Python 2.7 and we are still supporting 2.5.

I've removed the "flags" arguments, reinstated the class docstring, added a missing line of code to the docstring:
TreebankWordTokenizer().tokenize(s)
I've also acknowledged the author.
https://github.com/nltk/nltk/commit/e660a6827d44e242b4114abba7fc30fef6b0125c
