@akashjobanputra
Created June 4, 2018 13:27
Snippet for normalising whitespace and newlines. Source: http://textacy.readthedocs.io/en/latest/_modules/textacy/preprocess.html#normalize_whitespace
import re

# One or more consecutive line breaks (\r\n, \n, or vertical tab \v).
LINEBREAK_REGEX = re.compile(r'((\r\n)|[\n\v])+')
# A whitespace run whose first character is not a newline.
NONBREAKING_SPACE_REGEX = re.compile(r'(?!\n)\s+')


def normalize_whitespace(text):
    """
    Given ``text`` str, replace one or more spacings with a single space, and one
    or more linebreaks with a single newline. Also strip leading/trailing whitespace.
    """
    return NONBREAKING_SPACE_REGEX.sub(' ', LINEBREAK_REGEX.sub(r'\n', text)).strip()
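
A quick usage sketch of the snippet above, plus one edge case worth knowing about. The lookahead `(?!\n)` only checks the first character of a match, so a whitespace run that starts with a space can still swallow a newline that follows it. The `normalize_whitespace_strict` variant and its `NON_NEWLINE_SPACE_REGEX` pattern are hypothetical names introduced here for illustration; they are not part of the original snippet, just one possible way to keep such newlines.

```python
import re

LINEBREAK_REGEX = re.compile(r'((\r\n)|[\n\v])+')
NONBREAKING_SPACE_REGEX = re.compile(r'(?!\n)\s+')
# Assumption, not from the original snippet: match whitespace that is never a newline.
NON_NEWLINE_SPACE_REGEX = re.compile(r'[^\S\n]+')


def normalize_whitespace(text):
    """Same logic as the snippet above, repeated so this example runs on its own."""
    return NONBREAKING_SPACE_REGEX.sub(' ', LINEBREAK_REGEX.sub(r'\n', text)).strip()


def normalize_whitespace_strict(text):
    """Hypothetical variant: collapse spacing without ever consuming a newline."""
    return NON_NEWLINE_SPACE_REGEX.sub(' ', LINEBREAK_REGEX.sub(r'\n', text)).strip()


# Typical input: runs of spaces, a tab, and several line breaks in a row.
print(repr(normalize_whitespace("  Hello   world\r\n\r\n\nfoo\tbar  ")))
# 'Hello world\nfoo bar'

# Edge case: the run " \n " starts with a space, so the newline is consumed too.
print(repr(normalize_whitespace("a \n b")))         # 'a b'

# The variant keeps the newline, though the spaces around it remain.
print(repr(normalize_whitespace_strict("a \n b")))  # 'a \n b'
```

If spaces hugging a newline also need to go, the variant could be followed by a pass that strips each line separately; that is left out here to stay close to the original.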