Additions wanted - please just fork and add.
- Parsing PDFs by Thomas Levine
- Get Started With Scraping – Extracting Simple Tables from PDF Documents
- PDFMiner - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
  - Pure Python
- pdftohtml - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf.
- pdftoxml - command-line utility to convert PDF to XML, built on poppler.
- docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
- pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
- pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast.
- http://tabula.nerdpower.org/ - open-source, designed specifically for tabular data, but looks like a bit of a pain to set up
- http://pdftoxml.sourceforge.net/ - one of the better options for tables, but I have not used it for a while
- http://pdftohtml.sourceforge.net/ - Linux only, as far as I can tell
- pdf.js - you probably want a fork like pdf2json or node-pdfreader that integrates this better with Node. I have not tried this on tables, though ...
- Using scraperwiki + pdftoxml - see this recent tutorial: Get Started With Scraping – Extracting Simple Tables from PDF Documents
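
Several of the tools above emit XML in which each piece of text carries its position on the page. As a rough illustration of how a simple table can be recovered from that, here is a minimal sketch using only the Python standard library. It assumes XML shaped like the output of pdftohtml's `-xml` mode (`<page>` elements containing `<text top=... left=...>` elements); the sample data, the `rows_from_xml` helper and the `tolerance` value are made up for illustration and are not part of any of the tools listed.

```python
import xml.etree.ElementTree as ET

# Sample in the shape assumed for `pdftohtml -xml` output.
# The values are invented for this example.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="40" height="12" font="0">Count</text>
<text top="121" left="200" width="10" height="12" font="0">3</text>
<text top="120" left="50" width="40" height="12" font="0"><b>Apples</b></text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text, tolerance=3):
    """Return one table per page, each a list of rows of cell strings.

    Text elements whose `top` lies within `tolerance` units of the first
    element of a row are treated as belonging to that row (hypothetical
    heuristic; real documents may need a different threshold).
    """
    root = ET.fromstring(xml_text)
    tables = []
    for page in root.iter("page"):
        cells = []
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b>, <i>, etc.
            text = "".join(el.itertext()).strip()
            if text:
                cells.append((int(el.get("top")), int(el.get("left")), text))
        cells.sort()  # top to bottom, then left to right

        rows = []
        row_top = None
        for top, left, text in cells:
            if row_top is None or top - row_top > tolerance:
                rows.append([])  # start a new row
                row_top = top
            rows[-1].append((left, text))

        # Order the cells of each row by their horizontal position.
        tables.append([[text for _, text in sorted(row)] for row in rows])
    return tables


if __name__ == "__main__":
    for table in rows_from_xml(SAMPLE_XML):
        for row in table:
            print(row)
```

Grouping by `top` alone only works for simple tables with one line of text per cell; multi-line cells or merged columns would need column boundaries derived from `left` as well.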