@vinzenzweber
Last active June 22, 2019 20:47
Extract text from an image called filename.jpeg using Tesseract inside a Docker container. The image contains German text, hence `-l deu`. The text regions (zones) are defined in `filename.uzn`. The image and the .uzn file must share the same base name, or Tesseract will not find the zones.
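# Write the same zone file next to every scanned page so Tesseract can pick it up.
# Each UZN line reads: left top width height label (coordinates in pixels).
UZN_CONTENT = <<~UZN
  2385 1059 273 201 Energie
  2634 1080 372 210 Kohlenhydrate
  2961 1062 294 234 Fett
  3204 1080 321 186 Eiweiss
  2030 1500 800 960 Zutaten
  2823 1497 975 1023 Zubereitung
UZN

Dir['./books/*/*/*.jpeg'].each do |file_path|
  next if File.directory?(file_path)

  # Swap only the trailing extension so the .uzn shares the image's base name.
  uzn_path = file_path.sub(/\.jpeg\z/, '.uzn')
  puts "Writing #{uzn_path}"
  File.write(uzn_path, UZN_CONTENT)
end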
2385 1059 273 201 Textblock 1
2634 1080 372 210 Textblock 2
2961 1062 294 234 Textblock 3
3204 1080 321 186 Textblock 4
2030 1500 800 960 Textblock 5
2823 1497 975 1023 Textblock 6
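To sanity-check a zone file before running OCR, it can help to read it back into structured data. The following is a minimal sketch, assuming each line holds four integers (left, top, width, height) followed by a free-text label. `Zone` and `parse_uzn` are hypothetical names introduced here, not part of Tesseract.

```ruby
# Hypothetical helper for inspecting a .uzn file; not part of Tesseract.
# Assumed line layout: left top width height label
Zone = Struct.new(:left, :top, :width, :height, :label)

def parse_uzn(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    # Split into at most five fields so labels containing spaces stay intact.
    left, top, width, height, label = line.split(/\s+/, 5)
    Zone.new(Integer(left), Integer(top), Integer(width), Integer(height), label)
  end
end

zones = parse_uzn(<<~UZN)
  2385 1059 273 201 Textblock 1
  2030 1500 800 960 Textblock 5
UZN

zones.each do |z|
  puts "#{z.label}: #{z.width}x#{z.height} at (#{z.left}, #{z.top})"
end
```

`Integer(...)` raises on a malformed coordinate, so a typo in the zone file surfaces immediately rather than silently producing a wrong region.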
#!/bin/sh
docker run -ti --rm -v "$(pwd):/data" tesseractshadow/tesseract4re /bin/bash -c "tesseract /data/filename.jpeg /data/filename -l deu --psm 4 --oem 3 hocr txt"
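The command above handles a single file named filename.jpeg. To run it over every page the Ruby script wrote a .uzn file for, something like the following could work. It is a sketch under assumptions: `build_ocr_command` is a hypothetical helper, each image's own directory is mounted at /data, and `-ti` is dropped because a batch run has no terminal attached. The Tesseract arguments are copied unchanged from the shell script.

```ruby
require 'shellwords'

# Hypothetical helper: builds the docker argv for one image.
# Mounts the image's directory at /data, mirroring the shell script above.
def build_ocr_command(image_path, lang: 'deu')
  dir   = File.expand_path(File.dirname(image_path))
  image = File.basename(image_path)
  base  = File.basename(image_path, File.extname(image_path))

  # Same tesseract invocation as the shell script, shell-escaped for bash -c.
  inner = ['tesseract', "/data/#{image}", "/data/#{base}",
           '-l', lang, '--psm', '4', '--oem', '3', 'hocr', 'txt'].shelljoin

  ['docker', 'run', '--rm', '-v', "#{dir}:/data",
   'tesseractshadow/tesseract4re', '/bin/bash', '-c', inner]
end

Dir['./books/*/*/*.jpeg'].each do |image_path|
  # system returns false or nil on failure; report it and keep going.
  system(*build_ocr_command(image_path)) || warn("OCR failed for #{image_path}")
end
```

Passing the command to `system` as an array avoids a second layer of shell quoting on the host; only the inner string handed to `bash -c` needs escaping.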