@vtypal
Last active August 29, 2015 14:20
require 'rake'
require 'logger'
require 'pdf-reader'
# Usage
# [1] Create a new directory and copy pdf2txt.rb into it.
# [2] Put your PDF files in the same directory.
# For every PDF file in that directory, this script creates a text file containing the text extracted from the PDF.
# File paths and log paths
BASE_DIR = File.expand_path(File.dirname(__FILE__))
LOG_DIR = BASE_DIR
# Logger
logger = Logger.new(LOG_DIR + '/pdf2text.log', 'daily')
logger.datetime_format = "%H:%M:%S"
FileList[File.join(BASE_DIR, '*.pdf')].each do |tpdf|
  begin
    reader = PDF::Reader.new(tpdf)
    logger.info("Reading #{tpdf}")
  rescue StandardError => e
    logger.error("Cannot read #{tpdf}: #{e.message}")
    next # skip this file; without a reader there is nothing to extract
  end
  # Optional debugging output:
  # puts reader.pdf_version
  # puts reader.info[:Producer]
  # puts reader.metadata
  # puts reader.page_count
  # print reader.pages[11].text, "\n" # extract a single page
  # Output file: same directory and base name as the PDF, with a .txt extension
  ttxt = File.join(File.dirname(tpdf), File.basename(tpdf, '.*') + '.txt')
  begin
    File.open(ttxt, 'w') do |f|
      # f << "=== #{File.basename(BASE_DIR)} ===\n\n"
      reader.pages.each do |page|
        f << page.text
      end
    end
    logger.info("#{ttxt} created")
  rescue StandardError => e
    logger.error("Cannot create #{ttxt}: #{e.message}")
  end
end
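
The output-path step in the loop above can be tried on its own, without pdf-reader installed. This is only a minimal sketch: the helper name `txt_path_for` and the sample paths are hypothetical and do not appear in the original script.

```ruby
# Hypothetical helper (not part of the original script): derives the
# .txt path that the loop writes to for a given PDF path.
def txt_path_for(pdf_path)
  # File.basename(path, ".*") strips only the last extension, whatever it is.
  File.join(File.dirname(pdf_path), File.basename(pdf_path, ".*") + ".txt")
end

puts txt_path_for("/tmp/docs/report.pdf")
puts txt_path_for("/tmp/docs/a.b.pdf")
```

Because only the final extension is stripped, a name such as `a.b.pdf` keeps its inner dot and maps to `a.b.txt` in the same directory.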