@jmcarp
Last active March 30, 2023 03:07
Extract text from PDF document using PDFMiner
"""
Extract PDF text using PDFMiner. Adapted from
http://stackoverflow.com/questions/5725278/python-help-using-pdfminer-as-a-library
"""
# Targets Python 3 with pdfminer.six.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def pdf_to_text(pdfname):
    """Return the text of the PDF file at path `pdfname` as one string."""
    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Extract text, one page at a time
    with open(pdfname, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

    # Get text from StringIO
    text = sio.getvalue()

    # Cleanup
    device.close()
    sio.close()

    return text
@moulya-somasundara

How do you extract a URL that appears in a PDF? For example, the URL in the left-hand panel of the PDF version of a LinkedIn profile, using PDFMiner?
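One possible approach, sketched here rather than taken from the gist: a URL in a PDF can be visible text, a clickable link annotation, or both. The helper names `find_urls` and `link_uris` below are hypothetical. `find_urls` runs a deliberately simple regular expression over text such as the output of `pdf_to_text`, so it misses URLs that wrap across lines. `link_uris` assumes pdfminer.six is installed and that the links are stored as standard link annotations with a URI action, which may not hold for every PDF.

```python
import re

# Simplified pattern (an assumption, not a full URL grammar): anything that
# starts with http://, https://, or www. and runs to whitespace or a bracket.
URL_RE = re.compile(r'(?:https?://|www\.)[^\s<>"\')]+')


def find_urls(text):
    """Return URL-like strings found in already-extracted text."""
    # Strip trailing punctuation that usually belongs to the sentence.
    return [match.rstrip('.,;:') for match in URL_RE.findall(text)]


def link_uris(pdfname):
    """Return the URI targets of link annotations in the PDF at `pdfname`."""
    # Imported here so find_urls works without pdfminer installed.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1

    uris = []
    with open(pdfname, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            # page.annots may be missing or an indirect reference.
            for annot in resolve1(page.annots) or []:
                annot = resolve1(annot)
                if not isinstance(annot, dict):
                    continue
                action = resolve1(annot.get('A'))
                if isinstance(action, dict) and 'URI' in action:
                    uri = resolve1(action['URI'])
                    if isinstance(uri, bytes):
                        uri = uri.decode('utf-8', 'replace')
                    uris.append(uri)
    return uris
```

Something like `find_urls(pdf_to_text('profile.pdf'))` would cover URLs printed on the page, while `link_uris('profile.pdf')` would cover links whose target is not shown in the visible text (`profile.pdf` is a placeholder file name).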
