@Sirsirious
Created February 4, 2020 18:13
Regex for sentence boundary and for punctuation escaping.
DEFAULT_SENTENCE_BOUNDARIES = [r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)', r'\.{2,}', r'\!+', r'\:+', r'\?+']
r"""
Breaking it down:
(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$) -> matches any period that is preceded by a character other than a period and followed by a character that is neither a digit nor a period, or by the end of the string.
This keeps the algorithm from splitting sentences at decimal numbers or inside ellipses.
\.{2,} -> captures ellipses (two or more consecutive periods).
\!+ -> captures a series of exclamation points.
\:+ -> captures a series of colons.
\?+ -> captures a series of question marks.
"""
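A minimal sketch of how these boundary patterns might be used to split text into sentences. The helper name `split_sentences` and the idea of joining the patterns into one alternation are assumptions made for illustration; the gist itself does not show how the list is consumed.

```python
import re

# Same patterns as DEFAULT_SENTENCE_BOUNDARIES, repeated here so the
# example is self-contained.
SENTENCE_BOUNDARIES = [
    r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)',
    r'\.{2,}', r'\!+', r'\:+', r'\?+',
]

# Assumption: the patterns are combined into a single alternation.
BOUNDARY_RE = re.compile('|'.join(SENTENCE_BOUNDARIES))


def split_sentences(text):
    """Hypothetical helper: cut `text` right after each boundary match."""
    sentences = []
    start = 0
    for match in BOUNDARY_RE.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Pi is about 3.14. Really? Wait... yes."))
# ['Pi is about 3.14.', 'Really?', 'Wait...', 'yes.']
```

Note that mixed runs such as `?!` would produce two separate matches with this naive loop, so a real splitter would probably need to merge adjacent boundaries.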
DEFAULT_PUNCTUATIONS = [r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)', r'\.{2,}',
                        r'\!+', r'\:+', r'\?+', r'\,+', r'\(|\)|\[|\]|\{|\}|\<|\>']
r"""
The only differences from DEFAULT_SENTENCE_BOUNDARIES are:
\,+ -> captures a series of commas.
\(|\)|\[|\]|\{|\}|\<|\> -> captures any single parenthesis, square bracket, curly brace or angle bracket (the pipe sign | means 'or').
"""
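One way the punctuation patterns could be applied is to pad every match with spaces so that punctuation becomes its own token. This is only a sketch: the helper name `tokenize` and the pad-then-split strategy are assumptions, not something the gist specifies.

```python
import re

# Same patterns as DEFAULT_PUNCTUATIONS, repeated here so the example is
# self-contained.
PUNCTUATIONS = [
    r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)',
    r'\.{2,}', r'\!+', r'\:+', r'\?+', r'\,+',
    r'\(|\)|\[|\]|\{|\}|\<|\>',
]

PUNCT_RE = re.compile('|'.join(PUNCTUATIONS))


def tokenize(text):
    """Hypothetical helper: isolate punctuation, then split on whitespace."""
    # Surround each punctuation match with spaces; str.split() then drops
    # the extra whitespace this introduces.
    padded = PUNCT_RE.sub(lambda m: ' ' + m.group(0) + ' ', text)
    return padded.split()


print(tokenize("Costs rose 2.5% (roughly), right?"))
# ['Costs', 'rose', '2.5%', '(', 'roughly', ')', ',', 'right', '?']
```

The decimal point in `2.5%` is left alone because the lookahead in the first pattern rejects a period followed by a digit.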
# Another commonly used regex is \s|\t|\n|\r -> it matches any whitespace character (\s), tab (\t), newline (\n) or carriage return (\r). Since \s already includes the other three, \s alone is equivalent.
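As a quick check on the whitespace pattern, the sketch below compares the longer alternation with plain `\s` on a sample string. The variable names are illustrative only.

```python
import re

text = "one\ttwo\nthree\rfour five"

# The explicit alternation and plain \s split at the same places,
# because \s already matches tab, newline and carriage return.
verbose = re.split(r'\s|\t|\n|\r', text)
simple = re.split(r'\s', text)
print(verbose == simple)  # True

# Adding + collapses runs of mixed whitespace into a single split point.
collapsed = re.split(r'\s+', "a \t\n b")
print(collapsed)  # ['a', 'b']
```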