Ruby: extract plain text from a PDF by wrapping the pdftotext shell command.
# frozen_string_literal: true
# Primary responsibility is extracting text from a PDF or confirming whether
# text is available in the PDF.
# Security note: This simple wrapper assumes that the PDF filename you give it has been
# chosen by an internal method, such as a tempfile name. Do not pass unsafe, user-supplied file names
# into this class.
# Copyright 2017 Rietta Inc. BSD Licensed.
class PdfTextExtractor
  attr_accessor :pdf_file

  def initialize(pdf_file:)
    unless command?('pdftotext')
      raise 'pdftotext is not installed, but is required.'
    end

    @pdf_file = pdf_file
  end

  # Determine if a command is available on the current Unix system.
  def command?(command)
    system("which #{command} > /dev/null 2>&1")
  end

  # Extracted text, memoized after the first call.
  def text
    @text ||= `pdftotext '#{@pdf_file}' -`.strip
  end

  # True when pdftotext produced any non-whitespace output.
  def text?
    text != ''
  end

  def as_json(_opts = {})
    {
      filename: @pdf_file,
      text: text
    }
  end
end
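
The security note above exists because the filename is interpolated into a shell string, so a name containing a single quote could break out of the quoting. One possible way to sidestep that, sketched here under assumptions, is to skip the shell and pass the path as its own argument through Ruby's standard Open3 library. The module name SafePdfText and its methods argv_for and extract are hypothetical and not part of the original class. The command keyword is also an assumption, added only so the helper can be exercised without pdftotext installed.

```ruby
# frozen_string_literal: true

require 'open3'

# Hypothetical helper, not part of the original gist: runs the extraction
# command with the file path passed as a separate argv entry (no shell).
module SafePdfText
  module_function

  # Build the argument vector. File.expand_path makes the path absolute,
  # so a relative name starting with "-" is not mistaken for an option.
  def argv_for(pdf_file, command: 'pdftotext')
    [command, File.expand_path(pdf_file), '-']
  end

  # Run the command without a shell and return its stripped stdout.
  # Raises if the command exits with a non-zero status.
  def extract(pdf_file, command: 'pdftotext')
    stdout, stderr, status = Open3.capture3(*argv_for(pdf_file, command: command))
    raise "#{command} failed: #{stderr.strip}" unless status.success?

    stdout.strip
  end
end
```

With pdftotext on the PATH, SafePdfText.extract('report.pdf') would be expected to behave like the text method above, except that quotes and other shell metacharacters in the filename are passed through literally and a failing command raises instead of silently returning an empty string.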