@AndreiD
Created August 14, 2014 15:43
Python PDF to Text or HTML HOW TO:
pip install pdfminer
For plain text output, replace HTMLConverter with TextConverter.
# Assumes Python 3 with the pdfminer.six fork (pip install pdfminer.six).
from io import BytesIO
from urllib.request import urlopen

from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_html(url):
    """Download the PDF at url and return its contents as an HTML string."""
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()  # the converter writes encoded bytes when codec is set
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    scrape = urlopen(url).read()
    fp = BytesIO(scrape)  # pdfminer needs a seekable binary stream
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means every page
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()  # writes the closing HTML footer into retstr
    textstr = retstr.getvalue().decode(codec)
    retstr.close()
    return textstr
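
The plain-text variant mentioned at the top could look roughly like the sketch below. It is not part of the original gist: it assumes Python 3 with pdfminer.six installed, and the names `convert_pdf_to_text`, `parse_page_ranges`, `pdf_bytes` and `page_spec` are made up for illustration. The helper also shows one way to fill `pagenos`, which pdfminer treats as a set of zero-based page indexes (an empty set selects every page).

```python
from io import BytesIO, StringIO


def parse_page_ranges(spec):
    """Turn a 1-based spec such as "1-3,5" into the zero-based set used for pagenos."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        # pdfminer counts pages from 0, so shift the 1-based numbers down by one
        pages.update(range(start - 1, end))
    return pages


def convert_pdf_to_text(pdf_bytes, page_spec=""):
    """Return the text of the selected pages of an in-memory PDF (hypothetical helper)."""
    # Imported here so parse_page_ranges can be tried without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    out = StringIO()  # text stream that collects the extracted text
    device = TextConverter(rsrcmgr, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with BytesIO(pdf_bytes) as fp:
        for page in PDFPage.get_pages(fp, parse_page_ranges(page_spec)):
            interpreter.process_page(page)
    device.close()
    return out.getvalue()
```

A call such as convert_pdf_to_text(urlopen(url).read(), "1-3,5") would then extract pages 1 to 3 and page 5. Leaving page_spec empty processes the whole document.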