Last active
June 22, 2019 20:47
Extract text from an image named filename.jpeg using Tesseract inside a Docker container. The image contains German text, hence `-l deu`. The text regions (zones) are defined in `filename.uzn`. The image and the .uzn file must share the same base name for this to work!
# Write a .uzn zone file next to every JPEG so that tesseract can pick it up.
# One zone per line: four numbers followed by a label.
UZN_CONTENT = <<~UZN
  2385 1059 273 201 Energie
  2634 1080 372 210 Kohlenhydrate
  2961 1062 294 234 Fett
  3204 1080 321 186 Eiweiss
  2030 1500 800 960 Zutaten
  2823 1497 975 1023 Zubereitung
UZN

Dir['./books/*/*/*.jpeg'].each do |file_path|
  next if File.directory?(file_path)
  # Same base name as the image, with the extension swapped to .uzn
  uzn_path = file_path.sub(/\.jpeg\z/, '.uzn')
  puts "Writing #{uzn_path}"
  File.write(uzn_path, UZN_CONTENT)
end
2385 1059 273 201 Textblock 1
2634 1080 372 210 Textblock 2
2961 1062 294 234 Textblock 3
3204 1080 321 186 Textblock 4
2030 1500 800 960 Textblock 5
2823 1497 975 1023 Textblock 6
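To sanity-check a zone file before running OCR, it can help to parse it back into records. The following is a minimal sketch, not part of the original gist: `Zone` and `parse_uzn` are hypothetical names, and the field order (left, top, width, height, label) is an assumption about how the four numbers on each line are meant to be read.

```ruby
# Hypothetical helper: parse .uzn text into zone records.
# Assumption: each line is "left top width height label".
Zone = Struct.new(:left, :top, :width, :height, :label)

def parse_uzn(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    # Split into at most 5 fields so a label containing spaces stays intact.
    left, top, width, height, label = line.split(/\s+/, 5)
    Zone.new(Integer(left), Integer(top), Integer(width), Integer(height), label.to_s)
  end
end

zones = parse_uzn("2385 1059 273 201 Energie\n2030 1500 800 960 Zutaten\n")
zones.each { |z| puts "#{z.label}: #{z.width}x#{z.height} at (#{z.left}, #{z.top})" }
```

`Integer(...)` raises on anything that is not a whole number, so a malformed line fails loudly instead of silently producing a wrong zone.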
#!/bin/sh
docker run -ti --rm -v "$(pwd):/data" tesseractshadow/tesseract4re /bin/bash -c "tesseract /data/filename.jpeg /data/filename -l deu --psm 4 --oem 3 hocr txt"
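The shell script handles a single hard-coded file. To run the same command for many images, the invocation could be built in Ruby. This is a rough sketch under assumptions: `tesseract_command` is a hypothetical helper, it reuses the image name and tesseract options from the script above, and it drops `-ti` because a batch run is assumed to need no interactive terminal.

```ruby
require 'shellwords'

# Hypothetical helper: build the docker invocation for one image inside `dir`.
# `dir` is mounted as /data, matching the shell script above.
def tesseract_command(dir, image_name, lang: 'deu')
  base = File.basename(image_name, File.extname(image_name))
  # Command executed inside the container; output goes to /data/<base>.hocr and .txt
  inner = ['tesseract', "/data/#{image_name}", "/data/#{base}",
           '-l', lang, '--psm', '4', '--oem', '3', 'hocr', 'txt'].shelljoin
  ['docker', 'run', '--rm', '-v', "#{File.expand_path(dir)}:/data",
   'tesseractshadow/tesseract4re', '/bin/bash', '-c', inner]
end

cmd = tesseract_command('.', 'filename.jpeg')
puts cmd.shelljoin
# To actually run it (requires Docker): system(*cmd)
```

Passing the command to `system` as an array avoids a second round of shell quoting on the host; only the inner command string is interpreted by bash inside the container.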