@rufuspollock
Last active November 15, 2016 15:58
PDF 2 XXX. Tools, libraries and tutorials for converting PDFs to something more machine usable

Additions wanted - please just fork and add.

Tutorials

Generic (PDF -> text)

  • PDFMiner - a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. It can report the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML), and an extensible PDF parser that can be used for purposes other than text analysis.
    • Pure Python
  • pdftohtml - a utility that converts PDF files into HTML and XML formats. Based on xpdf.
  • pdftoxml - command-line utility, built on poppler, that converts PDF to XML.
  • docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
  • pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
  • pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast.
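
Several of the tools above emit XML that carries the position of each piece of text. As a rough sketch of what "machine usable" can look like in practice, the Python below builds a `pdftohtml -xml` command line and reads text boxes out of the resulting XML. The helper names (`pdftohtml_xml_command`, `convert`, `text_boxes`) and the sample XML are made up for illustration. The sample only mimics the general shape of pdftohtml's XML output (`page` elements containing `text` elements with `top` and `left` attributes), so check the attributes against what your installed version actually writes.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdftohtml_xml_command(pdf_path, out_base):
    """Build the argv for pdftohtml in XML mode (hypothetical helper)."""
    # -xml: emit XML instead of HTML; -i: ignore images.
    return ["pdftohtml", "-xml", "-i", pdf_path, out_base]


def convert(pdf_path, out_base):
    """Run pdftohtml; assumes the binary is installed and on PATH."""
    subprocess.run(pdftohtml_xml_command(pdf_path, out_base), check=True)


def text_boxes(xml_text):
    """Return (page, top, left, text) tuples from pdftohtml-style XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() flattens inline markup such as <b> or <i>.
            content = "".join(node.itertext()).strip()
            boxes.append(
                (page_no, int(node.get("top")), int(node.get("left")), content)
            )
    return boxes


# Hand-written sample in the assumed output shape, not taken from a real PDF.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="120" height="16" font="0"><b>Country</b></text>
<text top="100" left="300" width="60" height="16" font="0">GDP</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for box in text_boxes(SAMPLE_XML):
        print(box)
```

Text boxes that share a `top` value usually sit on the same visual line, so grouping by `top` and sorting by `left` is a common first step when trying to recover rows and columns.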

Tables from PDF
