@zulzeen
Last active March 23, 2018 13:36
Convert PDF to text
# adapted from https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167
# Converts a PDF and returns its text content as a string.
# Uses the pdfminer.six library; works with Python 3.6.
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
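

def convert(fname, pages=None):
    """Return the text of the PDF at `fname` as a single string.

    `pages` is an optional iterable of zero-based page indices;
    when it is empty or None, every page is converted.
    """
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    with StringIO() as output:
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(manager, converter)
        with open(fname, 'rb') as infile:
            for page in PDFPage.get_pages(infile, pagenums):
                interpreter.process_page(page)
        converter.close()
        # Read the buffer before the StringIO context closes it.
        text = output.getvalue()
    return text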
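
Because `PDFPage.get_pages` matches its page numbers against zero-based indices, callers who think in printed page numbers may want a small translation step before calling `convert`. The helper below is a minimal sketch of one way to do that. `parse_page_spec` and the `"1-3,5"` spec syntax are hypothetical names and conventions introduced here for illustration; they are not part of the gist or of pdfminer.six.

```python
def parse_page_spec(spec):
    """Turn a 1-based page spec such as "1-3,5" into sorted zero-based indices.

    Hypothetical helper: the spec syntax is an assumption made for this sketch.
    """
    indices = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue  # tolerate empty pieces such as a trailing comma
        if '-' in part:
            first, last = part.split('-', 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError('invalid page range: %r' % part)
        # Shift from 1-based page numbers to zero-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)


# Possible usage with the function above ('report.pdf' is a placeholder path):
# text = convert('report.pdf', pages=parse_page_spec('1-3,5'))
```

An empty spec yields an empty list, which `convert` treats the same as `pages=None`, so every page is converted.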