somada141/iterate_pdf_content_python.md

## iterate_pdf_content_python.md

Introduction

This note shows how to retrieve the decoded text content of a PDF file one page at a time, using Python and the pdfminer package.

The code below was written against pdfminer 20140328, the latest release at the time of writing. Because the package appears to be poorly maintained and prone to API-breaking changes, you may have to install this exact version for the code to work. Note also that the code targets Python 2, since it imports `cStringIO`.

Code

The following generator opens the PDF file at the given filename and yields the text content of each page in turn:
# -*- coding: utf-8 -*-

from __future__ import unicode_literals

import cStringIO

import pdfminer
import pdfminer.pdfinterp
import pdfminer.converter
import pdfminer.layout
import pdfminer.pdfpage

def generate_pdf_page_text(filename):
    """Generator that yields the text content of a PDF in a page-by-page fashion"""

    # create a `pdfminer.pdfinterp.PDFResourceManager` object
    resource_manager = pdfminer.pdfinterp.PDFResourceManager()

    # create a `cStringIO.StringIO` object which will be used by the
    # `TextConverter` (see below) to store the decoded text content
    # of the PDF pages
    str_return = cStringIO.StringIO()

    # create a new `pdfminer.converter.TextConverter` that will process
    # and decode the content of the PDF pages
    device = pdfminer.converter.TextConverter(
        resource_manager,
        str_return,
        codec="utf-8",
        laparams=pdfminer.layout.LAParams()
    )

    # create a new `pdfminer.pdfinterp.PDFPageInterpreter` which will perform the
    # actual processing via the `device` object
    interpreter = pdfminer.pdfinterp.PDFPageInterpreter(
        resource_manager, device)

    # open the given PDF file in binary mode; the `with` block ensures the
    # file is closed once the generator is exhausted or closed
    with open(filename, "rb") as fid:
        # get a PDF-page generator on the opened PDF file
        generator_pages = pdfminer.pdfpage.PDFPage.get_pages(fid)

        # iterate through all PDF pages
        for page in generator_pages:
            # use the `interpreter` to extract the actual text content of the
            # PDF page
            interpreter.process_page(page)
            # retrieve the content from the `cStringIO.StringIO` object
            text = str_return.getvalue()
            # truncate the `cStringIO.StringIO` file object because the
            # `interpreter` appends the new content to the previous one.
            # This way, every iteration will only yield the current page
            # instead of gradually appending the new content.
            str_return.truncate(0)

            # yield the content of the current page
            yield text
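
The buffer reset is the subtle step in the generator, so it may help to see it in isolation. The sketch below is not pdfminer code: a plain `write` call stands in for `interpreter.process_page(page)`, `io.StringIO` stands in for `cStringIO.StringIO`, and the function name `yield_chunks_via_shared_buffer` is made up for illustration. With `io.StringIO`, `truncate(0)` does not move the stream position, which is why the sketch rewinds with `seek(0)` first.

```python
import io


def yield_chunks_via_shared_buffer(chunks):
    """Yield each chunk after passing it through one reused buffer."""
    buffer = io.StringIO()
    for chunk in chunks:
        # stand-in for `interpreter.process_page(page)`, which appends
        # the decoded page text to the shared buffer
        buffer.write(chunk)
        # read everything currently held in the buffer
        text = buffer.getvalue()
        # reset the buffer: rewind first, then drop the old content, so
        # the next iteration starts from an empty buffer
        buffer.seek(0)
        buffer.truncate(0)
        yield text


pages = list(yield_chunks_via_shared_buffer([u"page one", u"page two"]))
```

Without the reset, the second yielded value would also contain the text of the first page.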

The specifics of the above code are explained in its inline comments.
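
Because `generate_pdf_page_text` is a generator, its output can be passed to any code that accepts an iterable of page texts. As a rough illustration, the hypothetical helper below (not part of the original note) reports which pages contain a search term. It is exercised here with a plain list so that it runs without pdfminer or a PDF file.

```python
def find_pages_containing(page_texts, needle):
    """Return the 1-based numbers of the pages whose text contains `needle`."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text
    ]


# with the generator from this note, the call would look like:
#     find_pages_containing(generate_pdf_page_text("document.pdf"), "invoice")
# where "document.pdf" is a placeholder filename
sample_pages = ["first page", "an invoice here", "last page"]
matches = find_pages_containing(sample_pages, "invoice")
```

Page numbers start at 1 here to match how PDF viewers usually count pages; this is a design choice of the sketch, not something pdfminer prescribes.
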
References


https://pypi.python.org/pypi/pdfminer/
http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text
http://stackoverflow.com/questions/4330812/how-do-i-clear-a-stringio-object