@lichengunc
Created July 19, 2019 01:11
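import re


def tokenize(sent, token_to_ix=None):
    """Lowercase `sent`, strip punctuation, and split it into word tokens."""
    # Delete punctuation first, then turn '-' and '/' into word separators.
    words = re.sub(r"([.,'!?\"()*#:;])",
                   '',
                   sent.lower()
                   ).replace('-', ' ').replace('/', ' ').split()
    if token_to_ix:
        # Replace out-of-vocabulary words with the 'UNK' placeholder.
        return [wd if wd in token_to_ix else 'UNK' for wd in words]
    else:
        # No vocabulary given (None or empty): return the raw tokens.
        return words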
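
A short usage sketch, added for illustration only. The vocabulary `token_to_ix` below is a hypothetical toy mapping, and reserving index 0 for 'UNK' is an assumption, not something the gist specifies. The function is repeated so the block runs on its own.

```python
import re


def tokenize(sent, token_to_ix=None):
    # Same logic as the function above, repeated to keep this block self-contained.
    words = re.sub(r"([.,'!?\"()*#:;])", '', sent.lower()
                   ).replace('-', ' ').replace('/', ' ').split()
    if token_to_ix:
        return [wd if wd in token_to_ix else 'UNK' for wd in words]
    return words


# Hypothetical toy vocabulary; index 0 for 'UNK' is an assumption.
token_to_ix = {'UNK': 0, 'the': 1, 'left': 2, 'most': 3, 'dog': 4}

sentence = "The left-most dog, isn't it?"

# Without a vocabulary: plain lowercase tokens with punctuation removed.
raw = tokenize(sentence)

# With a vocabulary: words missing from it become 'UNK'.
mapped = tokenize(sentence, token_to_ix)

# Every mapped token is now a key of token_to_ix, so the lookup cannot fail.
ids = [token_to_ix[wd] for wd in mapped]

print(raw)     # ['the', 'left', 'most', 'dog', 'isnt', 'it']
print(mapped)  # ['the', 'left', 'most', 'dog', 'UNK', 'UNK']
print(ids)     # [1, 2, 3, 4, 0, 0]
```

Note that apostrophes are deleted rather than replaced by a space, so "isn't" becomes the single token "isnt", while hyphens and slashes split a word in two.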