@ivyleavedtoadflax
Last active November 9, 2018 13:28
Custom date tokenizer
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex


def _custom_tokenizer(nlp, regex=(r"[-/,.\n\s]",)):
    """Custom tokenizer to split date formats like 05-05-2015
    and 05/05/2015.
    """
    # Use the default prefixes and suffixes
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # Add our own rules to the end of the default infix patterns
    infix_re = compile_infix_regex(tuple(list(nlp.Defaults.infixes) + list(regex)))
    tokenizer = Tokenizer(
        nlp.vocab,
        nlp.Defaults.tokenizer_exceptions,
        prefix_search=prefix_re.search,
        infix_finditer=infix_re.finditer,
        suffix_search=suffix_re.search,
        token_match=None,
    )
    return tokenizer
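To see which characters the extra infix pattern targets, here is a rough sketch that uses only the standard `re` module. It is not spaCy's tokenizer: the helper name `split_on_infixes` is hypothetical, and it only approximates infix splitting by emitting the text between matches and each matched separator as its own piece. The real tokenizer also applies the default prefixes, suffixes, and tokenizer exceptions.

```python
import re

# The same character class that the gist appends to the default infixes.
INFIX_PATTERN = re.compile(r"[-/,.\n\s]")


def split_on_infixes(text, pattern=INFIX_PATTERN):
    """Approximate infix splitting (hypothetical helper, not a spaCy API).

    Each match of the pattern becomes its own piece, and the text
    between matches is kept as separate pieces.
    """
    pieces = []
    start = 0
    for match in pattern.finditer(text):
        if match.start() > start:
            # Text before the separator
            pieces.append(text[start:match.start()])
        # The separator itself
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        # Trailing text after the last separator
        pieces.append(text[start:])
    return pieces


print(split_on_infixes("05-05-2015"))  # ['05', '-', '05', '-', '2015']
print(split_on_infixes("05/05/2015"))  # ['05', '/', '05', '/', '2015']
```

In spaCy itself, the tokenizer returned by the function above would normally be put to use by assigning it to `nlp.tokenizer` before processing text.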