@anandkunal
Created March 10, 2010 06:25
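import re

# Characters stripped outright (the original alternation listed ':' and ';' twice).
pattern_punctuation = re.compile(r'[!.?;,:&_{}\[\]~\-$()]')
pattern_numbers = re.compile(r'[0-9]')
# Despite the names, pattern_plurals matches a trailing 's (singular possessive)
# and pattern_possesives matches a trailing s' (plural possessive).
pattern_plurals = re.compile(r"'s$")
pattern_possesives = re.compile(r"s'$")
pattern_quotations = re.compile(r"['\"]")
pattern_word = re.compile(r'[a-z]+')


def tokenize_word(word):
    """Lowercase and clean a word; return the leading run of letters, or None."""
    word = word.lower()
    word = pattern_punctuation.sub('', word)
    word = pattern_numbers.sub('', word)
    word = pattern_plurals.sub('', word)
    word = pattern_possesives.sub('', word)
    word = pattern_quotations.sub('', word)
    # match() yields a Match object (or None); return the matched text instead.
    match = pattern_word.match(word)
    return match.group(0) if match else None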
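
The snippet cleans one word at a time. As a rough sketch of how the same steps might be applied to a whole sentence, the self-contained example below splits text on whitespace and drops anything left with no letters. The names `normalize` and `tokenize_text` and the underscore-prefixed constants are hypothetical and not part of the gist, and whitespace splitting is an assumption about how callers produce words.

```python
import re

# Hypothetical names; the patterns mirror the cleanup steps in the gist.
_PUNCTUATION = re.compile(r'[!.?;,:&_{}\[\]~\-$()]')
_DIGITS = re.compile(r'[0-9]')
_TRAILING_APOSTROPHE_S = re.compile(r"'s$")
_TRAILING_S_APOSTROPHE = re.compile(r"s'$")
_QUOTES = re.compile(r"['\"]")
_LEADING_LETTERS = re.compile(r'[a-z]+')


def normalize(word):
    """Return the cleaned lowercase word, or None if no letters remain."""
    word = word.lower()
    for pattern in (_PUNCTUATION, _DIGITS, _TRAILING_APOSTROPHE_S,
                    _TRAILING_S_APOSTROPHE, _QUOTES):
        word = pattern.sub('', word)
    match = _LEADING_LETTERS.match(word)
    return match.group(0) if match else None


def tokenize_text(text):
    """Split on whitespace (an assumption) and keep words that survive cleanup."""
    tokens = (normalize(raw) for raw in text.split())
    return [token for token in tokens if token is not None]


print(tokenize_text("The dog's 3 bones, \"buried\" in 2010!"))
# ['the', 'dog', 'bones', 'buried', 'in']
```

Because the final step matches only a leading run of `a-z`, anything after the first non-ASCII letter is cut off, so this approach suits plain English text only.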