@frank-leap
Last active August 29, 2015 14:23
Simple word tokenizer that returns a list of non-empty words in lowercase
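import re

# Assumption: split_regex is not defined in the original snippet;
# r'\W+' (one or more non-word characters) is used here as the delimiter.
split_regex = r'\W+'

def simpleWordTokenizer(string):
    """ A simple (list-comprehension) implementation of input string tokenization
    Args:
        string (str): input string
    Returns:
        list: a list of lowercase tokens with empty strings removed
    """
    # Lowercase the input, split on the delimiter pattern, then drop empty strings
    return [x for x in re.split(split_regex, string.lower()) if x]

starWarsDarkSide = 'Only at the end do you realize the power of the Dark Side.'
print(simpleWordTokenizer(starWarsDarkSide))  # should give ['only', 'at', 'the', 'end', 'do', 'you', 'realize', 'the', 'power', ...]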
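The `if x` filter in the tokenizer matters because `re.split` returns an empty string whenever the input starts or ends with a delimiter. The sketch below illustrates this; the `SPLIT_REGEX` value and the `tokenize` helper with its `keep_empty` flag are assumptions made for illustration and are not part of the gist.

```python
import re

# Assumption: the gist does not show its split pattern; r'\W+' is a stand-in.
SPLIT_REGEX = r'\W+'

def tokenize(text, keep_empty=False):
    """Hypothetical helper: lowercase and split, optionally keeping empty strings."""
    parts = re.split(SPLIT_REGEX, text.lower())
    return parts if keep_empty else [p for p in parts if p]

sample = ' Hello, world! '
raw = tokenize(sample, keep_empty=True)   # leading/trailing delimiters leave '' at both ends
clean = tokenize(sample)                  # same split with the empty strings filtered out

print(raw)    # ['', 'hello', 'world', '']
print(clean)  # ['hello', 'world']
```

Without the filter, the empty strings would be counted as tokens in any later word-frequency step.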