@wpm
Created December 20, 2017 16:58
Segment a spaCy document into "paragraphs", treating any whitespace token that contains more than one newline as a paragraph delimiter.
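def paragraphs(document):
    """Yield successive spans of a spaCy Doc, split at blank-line whitespace tokens."""
    start = 0
    for token in document:
        # A whitespace token with two or more newlines marks a paragraph break.
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            # The delimiter token itself becomes the first token of the next span.
            start = token.i
    yield document[start:]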
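A minimal usage sketch, assuming spaCy is installed. It uses `spacy.blank("en")` so that only the tokenizer runs and no trained model is needed. The sample text and the names `nlp`, `sample`, and `texts` are illustrative and not part of the original gist. The function is repeated here so the example runs on its own.

```python
import spacy


def paragraphs(document):
    """Yield successive spans of a spaCy Doc, split at blank-line whitespace tokens."""
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]


# Blank English pipeline: tokenizer only, no model download required.
nlp = spacy.blank("en")

# Illustrative text: a single newline stays inside a paragraph,
# while two or more consecutive newlines separate paragraphs.
sample = "First paragraph.\n\nSecond paragraph,\nstill going.\n\n\nThird paragraph."
doc = nlp(sample)

# Every span after the first starts with the delimiter whitespace token,
# so strip the text before displaying it.
texts = [span.text.strip() for span in paragraphs(doc)]
for text in texts:
    print(repr(text))
```

Because `start` is set to the delimiter token's own index, each yielded span after the first begins with that whitespace token. Call `.strip()` on `span.text`, as above, when only the paragraph content is wanted.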