@jmwenda
Created February 11, 2016 15:52
import nltk
text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital computer or the gears of a cycle transmission as he does at the top of a mountain or in the petals of a flower. To think otherwise is to demean the Buddha...which is to demean oneself."""
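# Groups are non-capturing (?:...): with capturing groups, NLTK 3's
# regexp_tokenize returns the group contents rather than the whole tokens.
sentence_re = r'''(?x)
# abbreviations, e.g. U.S.A. (with optional last period)
[A-Z](?:\.[A-Z])+\.?
# words with optional internal hyphens
| \w+(?:-\w+)*
# currency and percentages, e.g. $12.40, 82%
| \$?\d+(?:\.\d+)?%?
# ellipsis
| \.\.\.
# these are separate tokens; the hyphen goes last so it is a literal, not a range
| [][.,;"'?():_`-]
'''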
toks = nltk.regexp_tokenize(text, sentence_re)
print(toks)
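
The pattern can also be checked without NLTK. The sketch below is not part of the original gist: it assumes the tokenizer's behaviour is close to a plain `re.findall` over the text, and the names `pattern` and `sample` are made up for illustration. It uses non-capturing groups and a trailing hyphen in the punctuation class, then shows why `:-_` in the middle of a class is risky.

```python
import re

# Illustrative pattern: non-capturing groups, hyphen last in the class.
pattern = r'''(?x)
    [A-Z](?:\.[A-Z])+\.?      # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*              # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?        # currency and percentages
  | \.\.\.                    # ellipsis
  | [][.,;"'?():_`-]          # punctuation as separate tokens
'''

sample = "To think otherwise is to demean the Buddha...which is to demean oneself."

# findall returns whole matches here because no group captures.
tokens = re.findall(pattern, sample)
print(tokens)

# ':-_' inside a class is a range from ':' to '_', which covers
# '@' and every uppercase ASCII letter, not just three characters.
range_hits = re.findall(r'[:-_]', 'aZ@')
print(range_hits)
```

The ellipsis comes out as a single `...` token and the final period as its own token, while the range check matches `Z` and `@` even though neither was meant to be punctuation.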