@christopherkullenberg
Last active October 20, 2016 18:37
from subprocess import Popen, PIPE
from nltk.tokenize import sent_tokenize  # needs NLTK's Punkt sentence tokenizer data installed

aFile = '/home/christopher/Desktop/Introduction to Computation and Programming Using Python, Revised - Guttag, John V..pdf'


def pdftoText(filename):
    '''
    Input: path to a PDF file
    Output: the text extracted by the pdftotext command-line tool, as a string.
    '''
    # "-" tells pdftotext to write the extracted text to stdout instead of a file.
    p = Popen(['pdftotext', filename, '-'], shell=False, stdout=PIPE, stderr=PIPE)
    content, err = p.communicate()
    # Keep error messages out of the returned text and fail loudly instead.
    if p.returncode != 0:
        raise RuntimeError(err.decode('utf-8', errors='replace'))
    return content.decode('utf-8')


def sentencetokenizer(text):
    '''Split text into a list of sentences.'''
    return sent_tokenize(text)


for sentence in sentencetokenizer(pdftoText(aFile)):
    print("******************")
    print(sentence)
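
Text from pdftotext is hard-wrapped at the PDF's line breaks and uses form feed characters between pages, so sentences can reach the tokenizer split across lines or with words broken by end-of-line hyphens. The following is a minimal sketch of a cleanup step that could run between pdftoText and sentencetokenizer. It is not part of the original script: the name normalize_pdf_text is hypothetical, and the rules assume that blank lines mark paragraph breaks and that a hyphen at the end of a line is always a line-break hyphen. That second assumption wrongly joins real compounds such as "well-known" when they happen to wrap.

```python
import re


def normalize_pdf_text(text):
    """Hypothetical helper: tidy raw pdftotext output before sentence splitting."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    # Rejoin words hyphenated across a line break ("implemen-\ntation").
    # Assumption: every end-of-line hyphen is a line-break hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines (hard wraps) into spaces; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs left over from the PDF layout.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

With this in place, the loop could iterate over sentencetokenizer(normalize_pdf_text(pdftoText(aFile))) instead of the raw output.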