@neilernst
Last active May 22, 2017 21:13
# author N Ernst
# thanks to https://stackoverflow.com/questions/4576077/python-split-text-on-sentences?rq=1
import nltk.data
import argparse
parser = argparse.ArgumentParser()
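# nargs="?" makes the positional optional so the documented default actually applies
parser.add_argument("length", help="recommended sentence length in words [default 5]", type=int, nargs="?", default=5)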
parser.add_argument("filename", help="what text file to parse")
args = parser.parse_args()
# Requires the Punkt tokenizer data; fetch it once with nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open(args.filename) as fp:
    data = fp.read()
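sentences = tokenizer.tokenize(data)
# Keep sentences with more words than the threshold; split() handles any whitespace, including newlines
gt = [s for s in sentences if len(s.split()) > args.length]
# Guard against an empty file to avoid dividing by zero
ratio = len(gt) / len(sentences) if sentences else 0.0
print("Sentences: {}, longer than {}: {}, ratio: {:.2f}".format(len(sentences), args.length, len(gt), ratio))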
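The counting step in the script can also be tried without NLTK or a text file. The sketch below is not part of the original gist: `long_sentence_ratio` and `naive_split` are hypothetical helper names, and the regex splitter is only a rough stand-in for the Punkt tokenizer (it will mis-split abbreviations such as "e.g.").

```python
import re


def long_sentence_ratio(sentences, min_words):
    """Return (count, ratio) of sentences with more than min_words words."""
    longer = [s for s in sentences if len(s.split()) > min_words]
    # An empty input yields a ratio of 0.0 instead of a ZeroDivisionError
    ratio = len(longer) / len(sentences) if sentences else 0.0
    return len(longer), ratio


def naive_split(text):
    # Hypothetical stand-in for Punkt: break after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Short one. This sentence has quite a few more words in it. Tiny!"
count, ratio = long_sentence_ratio(naive_split(sample), 5)
print("Sentences longer than 5 words: {}, ratio: {:.2f}".format(count, ratio))
```

Keeping the count in a function that takes a list of sentences means the same logic should work with the output of `tokenizer.tokenize(data)` from the script above.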