@pmbaumgartner
Created January 10, 2022 15:49
Clean in spaCy Tokenizer
import spacy
from spacy.tokenizer import Tokenizer


class CTLTokenizer(Tokenizer):
    # Subclass the tokenizer so text is cleaned before tokenization.
    # Pattern via https://stackoverflow.com/a/58718664
    def __call__(self, string) -> spacy.tokens.Doc:
        # Clean the raw text, then hand it off to the standard tokenizer.
        string = self.clean_string(string)
        doc = super().__call__(string)
        return doc

    def clean_string(self, string: str) -> str:
        """String cleaning function. You can call this to clean a string
        without tokenizing, e.g.

            nlp.tokenizer.clean_string('Some example sentence')
        """
        # Add a trailing period if the string doesn't already end with one.
        if not string.endswith("."):
            string = string + "."
        return string
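The gist doesn't show how the class is attached to a pipeline. A minimal sketch, assuming a standard spaCy v3 setup (the blank English pipeline below is only for illustration), is to rebuild the tokenizer with the pipeline's existing settings and swap it in place of the default:

import spacy

nlp = spacy.blank("en")

# Reuse the default tokenizer's rules and patterns so only the
# string-cleaning behavior changes.
nlp.tokenizer = CTLTokenizer(
    nlp.vocab,
    rules=nlp.tokenizer.rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)

doc = nlp("Some example sentence")
assert doc.text.endswith(".")  # the cleaning step appended the period

After the swap, nlp.tokenizer.clean_string(...) is also available on its own, as noted in the docstring.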