rufuspollock/pdf2xxx.md

## pdf2xxx.md

Additions wanted - please just fork and add.

Tutorials


Parsing PDFs by Thomas Levine
Get Started With Scraping – Extracting Simple Tables from PDF Documents

Generic (PDF -> text)


PDFMiner - a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner lets you obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML), and an extensible PDF parser that can be used for purposes other than text analysis.

Pure Python.


pdftohtml - a utility that converts PDF files into HTML and XML formats. Based on xpdf.
pdftoxml - command line utility to convert PDF to XML built on poppler.
docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast.
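
As a rough illustration of how one of the command-line converters above can be driven from a script, here is a minimal Python sketch that calls pdftohtml and captures its XML output. It assumes the Poppler build of pdftohtml is installed and on the PATH, and that it accepts the `-xml`, `-i` (skip images), `-stdout`, `-f` and `-l` (first and last page) options; other builds may differ. The function names and `report.pdf` are made up for this example.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, first_page=None, last_page=None):
    """Return the argument list for a pdftohtml run that writes XML to stdout."""
    # Assumption: Poppler's pdftohtml, which accepts these flags.
    cmd = ["pdftohtml", "-xml", "-i", "-stdout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd.append(pdf_path)
    return cmd


def pdf_to_xml(pdf_path, first_page=None, last_page=None):
    """Run pdftohtml and return its XML output as a string."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml was not found on PATH")
    result = subprocess.run(
        build_pdftohtml_command(pdf_path, first_page, last_page),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout and stderr instead of printing
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real PDF.
    print(pdf_to_xml("report.pdf", first_page=1, last_page=1)[:500])
```

Keeping the command construction separate from the call makes it easy to check the arguments without having the tool installed.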

Tables from PDF


http://tabula.nerdpower.org/ - open-source, designed specifically for tabular data, but looks like a bit of a pain to set up
http://pdftoxml.sourceforge.net/ - one of the better options for tables, but I have not used it for a while
http://pdftohtml.sourceforge.net/ - Linux only, as far as I can tell
pdf.js - you probably want a fork like pdf2json or node-pdfreader that integrates this better with node. I have not tried this on tables though ...
Using scraperwiki + pdftoxml - see this recent tutorial: Get Started With Scraping – Extracting Simple Tables from PDF Documents
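
Several of the tools above emit XML in which every text fragment carries its page coordinates, and a simple table can often be rebuilt by grouping fragments that share a vertical position. The sketch below is one way to do that in Python, not the method of any particular tool listed here. It assumes XML in the style produced by `pdftohtml -xml`, that is, `<page>` elements containing `<text>` elements with integer `top` and `left` attributes. The sample data, the function name and the `tolerance` parameter are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed pdftohtml -xml layout.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="300" width="40" height="12" font="0">Value</text>
<text top="101" left="50" width="40" height="12" font="0"><b>Name</b></text>
<text top="120" left="50" width="40" height="12" font="0">alpha</text>
<text top="120" left="300" width="40" height="12" font="0">1</text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text, tolerance=2):
    """Return, for each page, a list of rows; each row is a list of cell strings."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        cells = []
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> tags.
            text = "".join(el.itertext()).strip()
            if text:
                cells.append((int(el.get("top")), int(el.get("left")), text))
        cells.sort()  # top to bottom, then left to right

        rows = []
        row_top = None
        for top, left, text in cells:
            # Start a new row when the fragment sits clearly below the current one.
            if row_top is None or top - row_top > tolerance:
                rows.append([])
                row_top = top
            rows[-1].append((left, text))

        # Order each row's cells by horizontal position and drop the coordinates.
        pages.append([[text for _, text in sorted(row)] for row in rows])
    return pages


if __name__ == "__main__":
    for row in rows_from_xml(SAMPLE)[0]:
        print(row)
```

The `tolerance` value absorbs small vertical offsets between fragments on the same line. Real tables with wrapped cells or merged columns will need more than this.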