# Extract data from a PDF with poppler
=begin
[REFERENCE](https://linux.die.net/man/1/pdftotext)
$ brew install poppler
> `pdftotext -h`
pdftotext version 0.48.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-r <fp> : resolution, in DPI (default is 72)
-x <int> : x-coordinate of the crop area top left corner
-y <int> : y-coordinate of the crop area top left corner
-W <int> : width of crop area in pixels (default is 0)
-H <int> : height of crop area in pixels (default is 0)
-layout : maintain original physical layout
-fixed <fp> : assume fixed-pitch (or tabular) text
-raw : keep strings in content stream order
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-listenc : list available encodings
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-bbox : output bounding box for each word and page size to html. Sets -htmlmeta
-bbox-layout : like -bbox but with extra layout bounding box data. Sets -htmlmeta
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-q : don't print any messages or errors
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
=end
require 'shellwords'

# Writes <name>.txt next to the PDF; append '-' as the last argument to print the text to stdout instead.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end
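# A minimal sketch of the "print results inline" variant mentioned above, using
# only flags listed in the pdftotext help output (-f, -l, -layout) plus a final
# '-' so the text goes to stdout. The names pdftotext_stdout_command and
# extract_text_inline are hypothetical helpers, not part of poppler or of the
# original gist; treat this as one possible shape rather than the intended one.
require 'shellwords'

# Builds the command string only, so it can be inspected without poppler installed.
def pdftotext_stdout_command(pdf_path, first_page: nil, last_page: nil, layout: false)
  args = ['pdftotext']
  args += ['-f', first_page.to_s] if first_page
  args += ['-l', last_page.to_s] if last_page
  args << '-layout' if layout
  args << Shellwords.escape(pdf_path)
  args << '-' # '-' as the text-file argument sends the output to stdout
  args.join(' ')
end

# Runs the command and returns the extracted text (requires pdftotext on PATH).
def extract_text_inline(pdf_path, **options)
  `#{pdftotext_stdout_command(pdf_path, **options)}`
end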
# Convert a PDF to HTML with pdftohtml.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end
=begin
> `pdfimages -h`
pdfimages version 0.48.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
-f <int> : first page to convert
-l <int> : last page to convert
-png : change the default output format to PNG
-tiff : change the default output format to TIFF
-j : write JPEG images as JPEG files
-jp2 : write JPEG2000 images as JP2 files
-jbig2 : write JBIG2 images as JBIG2 files
-ccitt : write CCITT images as CCITT files
-all : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt
-list : print list of images instead of saving
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-p : include page numbers in output file names
-q : don't print any messages or errors
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
=end
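# Extract embedded images as PNG files; output_path is the <image-root> (file name prefix) that pdfimages requires.
def extract_to_img(pdf_path, output_path)
  command = ['pdfimages', '-png', Shellwords.escape(pdf_path), Shellwords.escape(output_path)].join(' ')
  `#{command}`
end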
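# A possible companion to extract_to_img: the -list flag from the pdfimages
# help output prints a list of the images instead of saving them, which can be
# handy for checking a PDF before extracting. This sketch assumes that no
# <image-root> is needed when -list is given. The names pdfimages_list_command
# and list_images are hypothetical helpers, not part of poppler or of the
# original gist.
require 'shellwords'

# Builds the command string only, so it can be inspected without poppler installed.
def pdfimages_list_command(pdf_path, first_page: nil, last_page: nil)
  args = ['pdfimages', '-list']
  args += ['-f', first_page.to_s] if first_page
  args += ['-l', last_page.to_s] if last_page
  args << Shellwords.escape(pdf_path)
  args.join(' ')
end

# Runs the command and returns the listing as a string (requires pdfimages on PATH).
def list_images(pdf_path, **page_range)
  `#{pdfimages_list_command(pdf_path, **page_range)}`
end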