@jeremybmerrill
Created September 24, 2015 20:09
ocr a pdf
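#!/usr/bin/env ruby
# OCR each PDF given on the command line, writing <basename>.ocr.pdf.
#
# Dependencies:
#   brew install ghostscript imagemagick   # yikes; used by the convert fallback
#   brew install tesseract --HEAD          # needs >=3.04
#   pdftk                                  # merges the per-page PDFs
begin
  require 'pdfshaver'
rescue LoadError
  # pdfshaver is optional; fall back to ImageMagick's convert below.
end

ARGV.each do |pdf|
  puts pdf
  # Strip only a trailing ".pdf" extension, not every occurrence in the path.
  pdf_basename = pdf.sub(/\.pdf\z/i, '')

  if defined?(PDFShaver)
    document = PDFShaver::Document.new(pdf)
    document.pages.each { |page| page.render("./#{pdf_basename}-#{page.number}.png") }
  else
    # Passing arguments separately avoids shell quoting problems in file names.
    system('convert', '-monochrome', '-density', '300x300', pdf, '-depth', '8', "#{pdf_basename}.png")
  end

  # Pick up both numbered pages (<basename>-N.png) and a single unnumbered page (<basename>.png).
  pngs = Dir["#{pdf_basename}-*.png"] + Dir["#{pdf_basename}.png"]
  pngs.each do |png|
    puts png
    # tesseract appends ".pdf" to the output base, producing <png>.pdf.
    system('tesseract', png, png, 'pdf')
  end

  # Merge the per-page PDFs in numeric page order; an unnumbered page sorts first.
  page_pdfs = pngs.map { |png| "#{png}.pdf" }.select { |f| File.exist?(f) }
  page_pdfs = page_pdfs.sort_by { |f| f[/-(\d+)\.png\.pdf\z/, 1].to_i }
  puts page_pdfs.inspect
  system('pdftk', *page_pdfs, 'cat', 'output', "#{pdf_basename}.ocr.pdf")
end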