jdraths/poppler_service.rb

## poppler_service.rb
=begin
[REFERENCE](https://linux.die.net/man/1/pdftotext)
$ brew install poppler
> `pdftotext -h`
pdftotext version 0.48.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -r <fp>              : resolution, in DPI (default is 72)
  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
  -layout              : maintain original physical layout
  -fixed <fp>          : assume fixed-pitch (or tabular) text
  -raw                 : keep strings in content stream order
  -htmlmeta            : generate a simple HTML file, including the meta information
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
  -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -q                   : don't print any messages or errors
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information
=end

require 'shellwords' # provides Shellwords.escape, used by every method below

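# Converts the PDF to text. With no <text-file> argument, pdftotext writes a
# .txt file named after the PDF, so the returned stdout is normally empty.
def extract_to_text(pdf_path)
  # Add '-' as the last argument to print results inline.
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end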
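
# A minimal sketch of the inline variant hinted at in the comment above,
# assuming pdftotext is on the PATH. The name extract_text_inline and the
# binary: keyword are hypothetical, and Open3.capture3 is swapped in for
# backticks so that the exit status and stderr can be checked.
require 'open3'

def extract_text_inline(pdf_path, binary: 'pdftotext')
  # Passing '-' as <text-file> makes pdftotext write the text to stdout.
  # Arguments are passed as separate strings, so no shell escaping is needed.
  stdout, stderr, status = Open3.capture3(binary, pdf_path, '-')
  raise "#{binary} failed: #{stderr.strip}" unless status.success?

  stdout
end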

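# Converts the PDF to HTML. pdftohtml writes its output to files named after
# the PDF; the return value is whatever the command prints to stdout.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end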

=begin
> `pdfimages -h`
pdfimages version 0.48.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -png           : change the default output format to PNG
  -tiff          : change the default output format to TIFF
  -j             : write JPEG images as JPEG files
  -jp2           : write JPEG2000 images as JP2 files
  -jbig2         : write JBIG2 images as JBIG2 files
  -ccitt         : write CCITT images as CCITT files
  -all           : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt
  -list          : print list of images instead of saving
  -opw <string>  : owner password (for encrypted files)
  -upw <string>  : user password (for encrypted files)
  -p             : include page numbers in output file names
  -q             : don't print any messages or errors
  -v             : print copyright and version info
  -h             : print usage information
  -help          : print usage information
  --help         : print usage information
  -?             : print usage information
=end

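# Extracts the embedded images as PNG files named <output_path>-NNN.png.
def extract_to_img(pdf_path, output_path)
  # pdfimages requires <image-root>; without it the command only prints usage.
  command = ['pdfimages', '-png', Shellwords.escape(pdf_path), Shellwords.escape(output_path)].join(' ')
  `#{command}`
end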
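
# A hedged sketch of the -list flag from the usage text above: report the
# images instead of saving them. list_images and its binary: keyword are
# hypothetical names, and Open3 (not used by the original code) runs the
# command without a shell. Assumes pdfimages is on the PATH.
require 'open3'

def list_images(pdf_path, binary: 'pdfimages')
  # With -list, pdfimages only prints the image table, so no <image-root> is passed.
  stdout, status = Open3.capture2(binary, '-list', pdf_path)
  raise "#{binary} exited with status #{status.exitstatus}" unless status.success?

  # One array entry per line of the printed table, header lines included.
  stdout.lines.map(&:chomp)
end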