A sample code which uses pdfminer module to extract text from pdf files
# pdfTextMiner.py
# Python 2.7.6
# For Python 3.x use pdfminer3k module
# This link has useful information on components of the program
# https://euske.github.io/pdfminer/programming.html
# http://denis.papathanasiou.org/posts/2010.08.04.post.html
''' Important classes to remember
PDFParser - fetches data from the PDF file
PDFDocument - stores the data parsed by PDFParser
PDFPageInterpreter - processes page contents from PDFDocument
PDFDevice - translates processed information from PDFPageInterpreter into whatever you need
PDFResourceManager - stores shared resources such as fonts or images used by both PDFPageInterpreter and PDFDevice
LAParams - holds the layout analysis parameters; with them, the layout analyzer returns an LTPage object for each page in the PDF document
PDFPageAggregator - a PDFDevice that aggregates each page into LT layout objects
'''
import os
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Import this to raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
''' This is what we are trying to do:
1) Transfer information from the PDF file to a PDFDocument object. This is done using a parser (steps 2 to 4)
2) Open the PDF file
3) Parse the file using a PDFParser object
4) Assign the parsed content to a PDFDocument object
5) Now the information in this PDFDocument object has to be processed. For this we need
PDFPageInterpreter, PDFDevice and PDFResourceManager
6) Finally process the file page by page
'''
base_path = "C:/some_folder"
# os.path.join inserts the path separator itself
my_file = os.path.join(base_path, "test_pdf.pdf")
log_file = os.path.join(base_path, "pdf_log.txt")
password = ""
extracted_text = ""
# Open and read the pdf file in binary mode
fp = open(my_file, "rb")
# Create parser object to parse the pdf content
parser = PDFParser(fp)
# Store the parsed content in PDFDocument object
document = PDFDocument(parser, password)
# Check if document is extractable, if not abort
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create PDFResourceManager object that stores shared resources such as fonts or images
rsrcmgr = PDFResourceManager()
# Set parameters for layout analysis
laparams = LAParams()
# Create a PDFDevice object which translates interpreted information into the desired format
# The device needs to be connected to the resource manager to store shared resources
# device = PDFDevice(rsrcmgr)
# Use PDFPageAggregator as the device to get LT object elements
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create interpreter object to process page content from PDFDocument
# Interpreter needs to be connected to resource manager for shared resources and device
interpreter = PDFPageInterpreter(rsrcmgr, device)
# OK, now that we have everything needed to process a PDF document, let's process it page by page
for page in PDFPage.create_pages(document):
    # The interpreter processes the page stored in the PDFDocument object
    interpreter.process_page(page)
    # The device renders the layout from the interpreter
    layout = device.get_result()
    # Out of the many LT objects within layout, we are interested in LTTextBox and LTTextLine
    for lt_obj in layout:
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            extracted_text += lt_obj.get_text()
# Close the pdf file
fp.close()
# print (extracted_text.encode("utf-8"))
# Open the log in binary mode ("wb") because encode() returns bytes
with open(log_file, "wb") as my_log:
    my_log.write(extracted_text.encode("utf-8"))
print("Done !!")

@lanbufan lanbufan commented Aug 5, 2017

Thanks. A small correction worked for me: use "wb" instead of "w":
with open(log_file, "wb") as my_log:

In any case, this is the best short example I have found. pdfminer3k parses and reads PDF text that PyPDF2 was not able to read.

@balway balway commented Dec 10, 2017

Thanks, using "wb" also worked for me. Neat code. Thanks, vinovator.

@jeanpancho jeanpancho commented Jan 22, 2018

Thanks, but when I tried to run it in Python it says "ImportError: No module named pdfparser". What's wrong with it?

@michel117 michel117 commented Feb 14, 2018

@jeanpancho you might try that stackoverflow post
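
For an ImportError like the one quoted above, a quick diagnostic is to ask Python where (or whether) it can find each pdfminer module. The sketch below uses only the standard library; `describe_module` is a hypothetical helper name, and the likely causes it hints at (the package is not installed for this interpreter, or a local file named `pdfminer.py` shadows the real package) are assumptions, not a confirmed diagnosis of the error in this thread.

```python
import importlib.util


def describe_module(name):
    """Report where `name` would be loaded from, without importing the module itself.

    Note: for a dotted name, Python imports the parent package as a side effect.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # The parent is missing, or it is a plain module rather than a package
        # (for example a local pdfminer.py shadowing the installed package).
        return "missing (parent package not importable)"
    if spec is None:
        return "missing"
    return "found at {}".format(spec.origin)


# The module names used by the imports in the gist.
for name in ("pdfminer", "pdfminer.pdfparser", "pdfminer.pdfdocument", "pdfminer.pdfpage"):
    print(name, "->", describe_module(name))
```

If `pdfminer` is reported at a path inside your own project folder rather than in site-packages, renaming that local file is worth trying.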

@Tino93 Tino93 commented Aug 28, 2018

Hey,

what does LTPage stand for?

Thanks!

@rrsayao rrsayao commented Feb 27, 2019

Anyone else getting this error when trying to write to my_log?

TypeError: write() argument must be str, not bytes

Edit: As mentioned above, changing "w" to "wb" solved my problem.
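
The TypeError comes from Python 3, where a file opened in text mode ("w") accepts only `str`, while `encode("utf-8")` returns `bytes`. The sketch below reproduces the error with the standard library alone and shows two ways around it. The sample string and the temporary directory are stand-ins for illustration, not part of the gist.

```python
import os
import tempfile

# Stand-in for the text extracted from a PDF (includes non-ASCII characters).
extracted_text = "caf\u00e9 \u4e2d\u6587\n"

log_file = os.path.join(tempfile.mkdtemp(), "pdf_log.txt")

# Python 3: a text-mode file rejects bytes, which is the error quoted above.
try:
    with open(log_file, "w") as my_log:
        my_log.write(extracted_text.encode("utf-8"))
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)
print("text mode + bytes:", text_mode_error)

# Fix 1: open in binary mode and write the encoded bytes.
with open(log_file, "wb") as my_log:
    my_log.write(extracted_text.encode("utf-8"))

# Fix 2: stay in text mode, pass the str, and let open() do the encoding.
with open(log_file, "w", encoding="utf-8") as my_log:
    my_log.write(extracted_text)

# Read the file back as UTF-8 text to confirm the content survived.
with open(log_file, encoding="utf-8") as check:
    round_trip = check.read()
```

On Python 2.7, which the gist targets, `str` and bytes are the same type, so the original "w" did not raise.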

@lycanthropes lycanthropes commented Jun 30, 2019

This is too old; it no longer works.

@ZigmaZigmax ZigmaZigmax commented Jul 19, 2019

> This is too old; it no longer works.

The same code works with pdfminer.six.
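
A minimal Python 3 sketch of the gist's steps wrapped in a function, assuming pdfminer.six is installed. The function name `extract_pdf_text` is hypothetical; the classes and calls are the ones the gist already uses. The imports sit inside the function only so that the snippet can be loaded on a machine without pdfminer; in a real script they would normally go at the top. The extractability check from the gist is left out here for brevity.

```python
def extract_pdf_text(pdf_path, password=""):
    """Return the text of all LTTextBox/LTTextLine objects in a PDF as one string."""
    # Imported lazily (an illustrative choice, see the note above).
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    # Shared resources, the aggregating device, and the interpreter that feeds it.
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    chunks = []
    # The with-block closes the file even if parsing raises.
    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp), password)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            for lt_obj in device.get_result():
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    chunks.append(lt_obj.get_text())
    # Joining once avoids repeated string concatenation on large documents.
    return "".join(chunks)
```

Calling `extract_pdf_text("test_pdf.pdf")` returns a `str`, which can then be written with `open(log_file, "w", encoding="utf-8")`.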

@xo28122000 xo28122000 commented Aug 10, 2019

Thank you so much for this simple implementation. Do you know how we can get more features (location, font, size, etc.) of the text?

@StephenRUK StephenRUK commented Oct 4, 2019

> Thank you so much for this simple implementation. Do you know how we can get more features (location, font, size, etc.) of the text?

The LTChar class contains the location, font size and (internal) font name for each character. You may find the LTChar objects by iterating through the children of each container recursively. At least that's what I found to work best so far. See https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L228

Sample code for finding all characters with their locations and font information:

from pdfminer.layout import LTChar, LTContainer

def find_characters(container):
    """Returns list of dicts containing (char, box, fontname, fontsize)"""
    chars = []
    for child in container:
        if isinstance(child, LTChar):
            char = {
                'char': child.get_text(),
                'box': child.bbox,
                'fontname': child.fontname,
                'fontsize': child.size
            }
            chars.append(char)
        elif isinstance(child, LTContainer):
            # Recurse only into containers; lines, rectangles and images are not iterable
            chars += find_characters(child)
    return chars
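
Once you have that list of dicts, a common next step is to see which fonts dominate a page, for example to tell body text from headings. The sketch below is one way to do that with the standard library; `summarize_fonts` is a hypothetical helper, and the sample entries are made-up data in the same shape as the dicts above, not output from a real PDF.

```python
from collections import Counter


def summarize_fonts(chars):
    """Count non-whitespace characters per (fontname, rounded fontsize) pair."""
    counts = Counter()
    for entry in chars:
        if entry['char'].strip():  # ignore spaces and newlines
            counts[(entry['fontname'], round(entry['fontsize'], 1))] += 1
    # Most frequent font first.
    return counts.most_common()


# Made-up entries shaped like the dicts built by find_characters.
sample = [
    {'char': 'T', 'box': (72.0, 700.0, 84.0, 718.0), 'fontname': 'Times-Bold', 'fontsize': 18.0},
    {'char': 'a', 'box': (72.0, 680.0, 78.0, 692.0), 'fontname': 'Helvetica', 'fontsize': 12.0},
    {'char': ' ', 'box': (78.0, 680.0, 81.0, 692.0), 'fontname': 'Helvetica', 'fontsize': 12.0},
    {'char': 'b', 'box': (81.0, 680.0, 87.0, 692.0), 'fontname': 'Helvetica', 'fontsize': 12.0},
    {'char': 'c', 'box': (87.0, 680.0, 93.0, 692.0), 'fontname': 'Helvetica', 'fontsize': 12.0},
]

font_summary = summarize_fonts(sample)
print(font_summary)
```

With real data you would pass the result of `find_characters(layout)` instead of `sample`.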

@szufisher szufisher commented Oct 15, 2019

It worked on my Python 3.7.3, Windows 10 computer (no pdfminer3k needed), and it can even handle Chinese directly without any extra setup.

Many thanks.
