Created
February 4, 2020 18:13
Regex for sentence boundary and for punctuation escaping.
DEFAULT_SENTENCE_BOUNDARIES = [r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)', r'\.{2,}', r'\!+', r'\:+', r'\?+']
r"""
Breaking it down:
(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$) -> matches any period that is preceded by a character other than a period and is not followed by a digit or another period.
    This keeps the algorithm from splitting sentences at decimal numbers or ellipses.
\.{2,} -> captures ellipses (two or more consecutive periods).
\!+ -> captures a series of exclamation points.
\:+ -> captures a series of colons.
\?+ -> captures a series of question marks.
"""
DEFAULT_PUNCTUATIONS = [r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)', r'\.{2,}',
                        r'\!+', r'\:+', r'\?+', r'\,+', r'\(|\)|\[|\]|\{|\}|\<|\>']
r"""
The only differences here are:
\,+ -> captures a series of commas.
\(|\)|\[|\]|\{|\}|\<|\> -> captures any single bracket: parenthesis, square bracket, curly brace, or angle bracket (the pipe | sign means 'or').
"""
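# Another commonly used regex is \s|\t|\n|\r -> it matches any whitespace char (\s), tab char (\t), newline char (\n), or carriage return char (\r).
# In Python's re module, \s already matches tabs, newlines, and carriage returns, so the extra alternatives are redundant but harmless.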
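
The gist lists the patterns but does not show how they are applied. The following is a minimal sketch of one possible use, not the author's implementation: it joins the sentence-boundary patterns into a single alternation, cuts the text right after each match, and then splits each sentence on the whitespace regex. The names `SENTENCE_BOUNDARIES`, `WHITESPACE`, `split_sentences`, and `tokenize` are hypothetical and chosen only for this illustration.

```python
import re

# Same patterns as DEFAULT_SENTENCE_BOUNDARIES above, written as raw strings.
SENTENCE_BOUNDARIES = [
    r'(?<=[0-9]|[^0-9.])(\.)(?=[^0-9.]|[^0-9.]|[\s]|$)',
    r'\.{2,}', r'\!+', r'\:+', r'\?+',
]
# The whitespace regex mentioned in the last comment of the gist.
WHITESPACE = r'\s|\t|\n|\r'


def split_sentences(text, boundaries=SENTENCE_BOUNDARIES):
    """Split text right after every boundary match (hypothetical helper)."""
    pattern = re.compile('|'.join(boundaries))
    sentences = []
    start = 0
    for match in pattern.finditer(text):
        # Keep the boundary punctuation attached to the sentence it closes.
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Whatever follows the last boundary is treated as a final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Split on single whitespace characters and drop empty pieces."""
    return [token for token in re.split(WHITESPACE, sentence) if token]


if __name__ == '__main__':
    for sentence in split_sentences("Pi is about 3.14. Is that right? Yes!"):
        print(tokenize(sentence))
```

Because the first pattern refuses a period that is followed by a digit, the decimal point in `3.14` is left alone while the period after it still closes the sentence. Colons are also treated as boundaries, as in the original list, so a text such as `Note: see below.` would be cut after the colon.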