@sixtyfive
Last active November 17, 2019 17:01
Script that uses Tesseract, Poppler and ImageMagick utilities to OCR an image-only PDF and make it searchable
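#!/usr/bin/env ruby

# OCR an image-only PDF and write a searchable copy next to it.
# Requires pdfseparate and pdfunite (Poppler), convert (ImageMagick) and tesseract.

main_lang = ARGV[0]
input_pdf = ARGV[1]
temp_dir = 'temp'

if main_lang && input_pdf
  `mkdir -p #{temp_dir}`

  print "Splitting PDF into separate pages... "
  `pdfseparate "#{input_pdf}" #{temp_dir}/page_%d.pdf`

  print "\nConverting to image, then running OCR: page "
  bs = 0 # number of backspaces needed to erase the previously printed page number
  # Sort numerically so the progress counter counts upwards.
  pages = Dir["#{temp_dir}/*.pdf"].sort_by { |path| path[/page_(\d+)/, 1].to_i }
  pages.each do |pdf_in|
    prefix, number = pdf_in.match(/(page_)(\d+)/).captures

    # Overwrite the previous page number in place.
    print "\b" * bs + "#{number} "
    bs = number.size + 1

    # Render the page to a 300 dpi, 8-bit TIFF without an alpha channel.
    tiff = pdf_in.sub(/\.pdf\z/, '.tiff')
    `convert -density 300 #{pdf_in} -depth 8 -strip -background white -alpha off #{tiff} 2>/dev/null`
    `rm #{pdf_in}`

    # Zero-pad the page number so the shell glob below lists pages in order.
    outname = "#{prefix}#{'%04i' % number.to_i}"
    `tesseract #{tiff} #{File.join(temp_dir, outname)} -l #{main_lang} pdf 2>/dev/null`
    `rm #{tiff}`
  end
  puts

  # Merge the per-page OCR results, then clean up.
  # Only a trailing ".pdf" is stripped from the input name.
  `pdfunite #{temp_dir}/page_????.pdf "#{input_pdf.sub(/\.pdf\z/i, '')} (searchable).pdf"`
  `rm -f #{temp_dir}/page_????.pdf`
  `rmdir #{temp_dir}`
else
  puts "Usage: ocrpdf.rb <main language (CAREFUL! FORMAT IS UNUSUAL; SEE BELOW!)> <input PDF>\n\n"
  puts `tesseract --list-langs`.split("\n").join(" ")
end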
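
The script interpolates file names straight into backtick shell commands, so an input name containing a double quote or `$` can break them. One possible way around that, sketched here as an assumption rather than as part of the original script, is to build each command as an argument array and hand it to `Kernel#system`, which skips the shell when given more than one argument. The helper names `searchable_name`, `page_basename` and `tesseract_command` are hypothetical and only illustrate the naming scheme the script already uses.

```ruby
# Hypothetical helpers, not part of the original gist.

# Output name used by the script: strip a trailing ".pdf", append " (searchable).pdf".
def searchable_name(input_pdf)
  "#{input_pdf.sub(/\.pdf\z/i, '')} (searchable).pdf"
end

# Zero-padded per-page base name, e.g. "page_0007", so that a shell glob
# such as page_????.pdf lists the pages in order.
def page_basename(number, prefix = 'page_')
  format('%s%04d', prefix, number.to_i)
end

# Tesseract invocation as an argument array instead of an interpolated string.
def tesseract_command(tiff, out_base, lang)
  ['tesseract', tiff, out_base, '-l', lang, 'pdf']
end

# Example use (commented out so that loading this file runs nothing):
#   cmd = tesseract_command('temp/page_7.tiff', File.join('temp', page_basename('7')), 'deu')
#   system(*cmd, err: File::NULL) # no shell involved; stderr discarded like 2>/dev/null
```

Passing an array means the file names never go through shell quoting at all, at the cost of replacing the `2>/dev/null` redirections with the `err:` option shown in the comment.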