wpm / spacy_paragraph_segmenter.py
Created December 20, 2017 16:58
Segment a spaCy document into "paragraphs", treating any whitespace token that contains more than one newline as a paragraph delimiter.
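def paragraphs(document):
    # Yield successive spans of a spaCy Doc, split at blank-line whitespace tokens.
    start = 0
    for token in document:
        # A whitespace token with two or more newlines marks a paragraph break.
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            # The delimiter token itself becomes the first token of the next span.
            start = token.i
    # Emit whatever follows the last delimiter.
    yield document[start:]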
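
A minimal usage sketch, assuming spaCy is installed. It uses `spacy.blank("en")` so that no trained model has to be downloaded; the sample text and the `texts` variable are made up for illustration and are not part of the original gist. Because each delimiter token stays at the head of the following span, the sketch strips surrounding whitespace before using the paragraph text.

```python
import spacy


def paragraphs(document):
    # Same generator as above, repeated so this example runs on its own.
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]


# A blank English pipeline provides the tokenizer only (assumption: this is
# enough here, since the generator relies on nothing but token attributes).
nlp = spacy.blank("en")

# Illustrative text: a single newline inside the first paragraph,
# then a blank line before the second paragraph.
doc = nlp("One line.\nSame paragraph.\n\nNext paragraph.")

# Each item yielded by paragraphs() is a spaCy Span; strip the leading
# delimiter whitespace carried over into the later spans.
texts = [span.text.strip() for span in paragraphs(doc)]

for text in texts:
    print(repr(text))
```

The single `\n` does not trigger a split because the check requires more than one newline in the whitespace token, so only the blank line separates the two paragraphs.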