This note shows how to iterate over the pages of a PDF file and retrieve the decoded text content of each one, using Python and the pdfminer package.
At the time of writing, the latest version of pdfminer was used, i.e., 20140328. Since the package seems poorly maintained and prone to API-breaking changes, one may have to explicitly install this version for the following code to work.
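Pinning that version with pip would look like `pip install pdfminer==20140328`. To guard against running the code against a different release, one could check the installed version up front. The following is only a minimal sketch: the helper names `installed_version` and `matches_pin` are made up for illustration and are not part of pdfminer, and it assumes the package was installed under the distribution name `pdfminer`.

```python
# Illustrative helpers (hypothetical names, not part of pdfminer).
REQUIRED_VERSION = "20140328"  # the version this note was written against


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        # Available in the standard library on newer interpreters.
        from importlib.metadata import version, PackageNotFoundError
    except ImportError:
        # Fallback for older interpreters that ship setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(found, required=REQUIRED_VERSION):
    """True only if the version found is exactly the pinned one."""
    return found == required


if __name__ == "__main__":
    found = installed_version("pdfminer")
    if not matches_pin(found):
        print("Expected pdfminer %s but found %s; try: pip install pdfminer==%s"
              % (REQUIRED_VERSION, found, REQUIRED_VERSION))
```

The check is deliberately an exact string comparison rather than a "greater or equal" test, since a newer release is precisely what might break the code.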
The following generator iterates over the PDF file with the given filename and yields the text content of each page in turn: