@Sirsirious
Last active February 4, 2020 18:51
The function that performs the tokenization
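def _tokenize(self):
    work_sentence = self.raw
    # Pad each punctuation mark with spaces so it splits off as its own token.
    for punctuation in self._punctuations:
        work_sentence = work_sentence.replace(punctuation, " " + punctuation + " ")
    # Replace every token boundary with the single delimiter token.
    for delimiter in self._token_boundaries:
        work_sentence = work_sentence.replace(delimiter, self._delimiter_token)
    # Split on the delimiter token and drop empty or whitespace-only pieces.
    self.tokens = [x.strip() for x in work_sentence.split(self._delimiter_token) if x.strip()]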
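
The method above depends on attributes that the gist does not show (self.raw, self._punctuations, self._token_boundaries, self._delimiter_token). A minimal sketch of how it might be wired into a class follows. The class name Sentence and every attribute value here are assumptions made for illustration, not the author's actual definitions.

```python
class Sentence:
    # Hypothetical values: the gist does not show how these are defined.
    _punctuations = [".", ",", "!", "?", ";", ":"]
    _token_boundaries = [" ", "-"]
    _delimiter_token = "<SPLIT>"

    def __init__(self, raw):
        self.raw = raw
        self.tokens = []
        self._tokenize()

    def _tokenize(self):
        work_sentence = self.raw
        # Pad each punctuation mark with spaces so it becomes its own token.
        for punctuation in self._punctuations:
            work_sentence = work_sentence.replace(punctuation, " " + punctuation + " ")
        # Replace every token boundary with the single delimiter token.
        for delimiter in self._token_boundaries:
            work_sentence = work_sentence.replace(delimiter, self._delimiter_token)
        # Split on the delimiter token and drop empty or whitespace-only pieces.
        self.tokens = [
            x.strip()
            for x in work_sentence.split(self._delimiter_token)
            if x.strip()
        ]


sentence = Sentence("Hello, world!")
print(sentence.tokens)  # ['Hello', ',', 'world', '!']
```

The delimiter token must not contain any character listed as a punctuation mark or token boundary, otherwise the second loop would break the delimiter itself apart. With the assumed values, a hyphenated input such as "well-known fact." yields ['well', 'known', 'fact', '.'].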