@dshorthouse
Last active March 24, 2024 21:01
OCR an image-based PDF in Ruby
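require 'tempfile'
require 'parallel'
require 'rtesseract'
require 'mini_magick'

source = "/MyDirectory/my.pdf"
pdf = MiniMagick::Image.open(source)

# OCR each page in its own thread. Parallel.map returns results in input
# order, so the page texts come back sorted without a shared hash.
doc = Parallel.map(pdf.pages, in_threads: 8) do |page|
  tmpfile = Tempfile.new(['', '.tif'])
  begin
    # Rasterize the page to a 300 dpi TIFF with the alpha channel removed.
    MiniMagick::Tool::Convert.new do |convert|
      convert.density(300)
      convert << page.path
      convert.alpha("off")
      convert << tmpfile.path
    end
    RTesseract.new(tmpfile.path).to_s
  ensure
    # Always remove the temporary image, even if conversion or OCR fails.
    tmpfile.close
    tmpfile.unlink
  end
end

# doc is now an array of page texts in page order.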
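The script above OCRs pages concurrently and needs the text back in page order. As a minimal sketch of that pattern using only Ruby's standard library (no `parallel` gem), the helper below runs a block over each page on a small thread pool and stores each result at its page index. The name `map_pages_in_order` and the block standing in for the OCR step are hypothetical and not part of the original gist.

```ruby
# Hypothetical helper: run `job` on every page using `threads` worker threads
# and return the results in the same order as `pages`.
def map_pages_in_order(pages, threads: 8, &job)
  results = Array.new(pages.size)
  queue = Queue.new
  pages.each_with_index { |page, idx| queue << [page, idx] }
  queue.close # once drained, pop returns nil and the workers stop

  workers = Array.new(threads) do
    Thread.new do
      while (item = queue.pop)
        page, idx = item
        # Each worker writes only to its own slot, so order is preserved.
        results[idx] = job.call(page)
      end
    end
  end
  workers.each(&:join)
  results
end

# Stand-in for the OCR step: any per-page work can go in the block.
texts = map_pages_in_order(%w[page-a page-b page-c], threads: 2) { |page| page.upcase }
```

In the real script the block would do the TIFF conversion and the `RTesseract` call for one page. Threads are a reasonable fit here because the heavy work happens in external processes rather than in Ruby itself.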