@ngmaloney
Last active February 10, 2022 13:42
Naive solution for finding the map page in a PDF by word count: the page with the fewest OCR'd words is assumed to be the map.
require 'rmagick'
require 'rtesseract'
require 'tmpdir'

## Utility for extracting the map page from a survey_map PDF
class Utils::MapExtract
  attr_reader :pdf_file, :images

  def initialize(pdf_file)
    @pdf_file = pdf_file
    @images = []
  end

  # Returns the PNG data (a binary String) of the page with the fewest words
  def process!
    # Render each PDF page to a temporary PNG and count its words via OCR
    Magick::Image.read(pdf_file).each_with_index do |page, idx|
      tmp = File.join(Dir.tmpdir, "#{basename}_#{idx}.png")
      page.write(tmp)
      images << { file: tmp, size: word_count_on_page(tmp) }
    end
    # binread avoids tagging the PNG bytes with a text encoding
    File.binread(map[:file])
  ensure
    # Remove the temporary PNGs even if rendering or OCR raised
    cleanup
  end

  # The page entry with the lowest word count
  def map
    @map ||= images.min_by { |i| i[:size] }
  end

  def cleanup
    images.each { |img| File.delete(img[:file]) if File.exist?(img[:file]) }
  end

  def basename
    File.basename(pdf_file, '.pdf')
  end

  # Number of whitespace-separated tokens Tesseract recognizes on the page image
  def word_count_on_page(img)
    RTesseract.new(img).to_s.split.size
  end
end
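
The selection heuristic can be tried without ImageMagick or Tesseract installed. The following is a minimal sketch, assuming the OCR step has already produced one string of text per page. The helper names `word_count` and `map_page_index` and the sample page texts are hypothetical and are not part of the gist.

```ruby
# Hypothetical helper: counts whitespace-separated tokens, as the class does
# with the OCR output of each page.
def word_count(text)
  text.split.size
end

# Hypothetical helper: returns the index of the page with the fewest words,
# or nil when there are no pages.
def map_page_index(page_texts)
  page_texts.each_index.min_by { |idx| word_count(page_texts[idx]) }
end

# Made-up stand-ins for OCR output; a map page tends to carry only a few labels.
pages = [
  "This survey was prepared for the owner of record and describes the parcel",
  "N 45 E 120.5 Lot 7",
  "Notes: all bearings are referenced to the state plane coordinate system"
]

puts map_page_index(pages)
```

With these sample strings the second page (index 1) has the fewest words, so it is the one picked. The heuristic would likely pick the wrong page if the PDF contains a blank page or a cover sheet with very little text, which is part of why the approach is described as naive.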