-- specify input and output directories
set infile_directory to "/Users/doug/Desktop/inputs/"
set outfile_directory to "/Users/doug/Desktop/outputs/"
-- get the basenames of each input file
tell application "System Events"
  set infile_list to files of folder infile_directory
end tell
-- process each input file
repeat with infile in infile_list
  set infile_name to name of infile
  set infile to POSIX file (infile_directory & infile_name)
  set outfile to POSIX file (outfile_directory & infile_name)
  run_ocr(infile, outfile)
end repeat
-- main function: run ocr on an infile and save results to an outfile
on run_ocr(infile, outfile)
  -- identify path to ABBYY FineReader
  set appFile to POSIX file "/Applications/FineReader.app"
  -- set FineReader parameters
  using terms from application "FineReader"
    set langList to {English, Latin}
    set saveType to single file
  end using terms from
  using terms from application "FineReader"
    set toFile to outfile
    set retainLayoutWordLayout to as editable copy
    set keepPageNumberHeadersAndFootersBoolean to yes
    set keepLineBreaksAndHyphenationBoolean to yes
    set keepPageBreaksBoolean to yes
    set increasePaperSizeToFitContentBoolean to yes
    set keepImageBoolean to yes
    set imageOptionsImageQualityEnum to high quality
    set keepTextAndBackgroundColorsBoolean to yes
    set highlightUncertainSymbolsBoolean to yes
    set keepPageNumbersBoolean to yes
  end using terms from
  WaitWhileBusy()
  tell application "FineReader"
    set hasdoc to has document
    if hasdoc then
      close document
    end if
  end tell
  WaitWhileBusy()
  tell application "FineReader"
    set auto_read to auto read new pages false
  end tell
  tell application "Finder"
    open infile using appFile
  end tell
  delay 5
  WaitWhileBusy()
  -- the end of line character below is created by pressing OPTION+ENTER
  tell application "FineReader"
    export to html toFile ¬
      ocr languages enum langList ¬
      saving type saveType ¬
      keep line breaks and hyphenation keepLineBreaksAndHyphenationBoolean ¬
      keep page numbers headers and footers keepPageNumberHeadersAndFootersBoolean ¬
      keep pictures keepImageBoolean ¬
      image quality imageOptionsImageQualityEnum ¬
      keep text and background colors keepTextAndBackgroundColorsBoolean
  end tell
  WaitWhileBusy()
  -- close the current file
  tell application "FineReader"
    auto read new pages auto_read
    close document
  end tell
end run_ocr
-- close ABBYY
tell application "FineReader"
  quit
end tell
-- helpers to wait for thread to open up
on WaitWhileBusy()
  repeat while IsMainApplicationBusy()
  end repeat
end WaitWhileBusy
on IsMainApplicationBusy()
  tell application "FineReader"
    set resultBoolean to is busy
  end tell
  return resultBoolean
end IsMainApplicationBusy
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# pip install pyobjc
# pip install py-applescript
import applescript, os, glob, sys
scpt = applescript.AppleScript('''
-- main function: run ocr on an infile and save results to an outfile
on run_ocr(infile, outfile)
  set infile to POSIX file infile
  set outfile to POSIX file outfile
  -- identify path to ABBYY FineReader
  set appFile to POSIX file "/Applications/FineReader.app"
  -- set FineReader parameters
  using terms from application "FineReader"
    set langList to {English, Latin}
    set saveType to single file
  end using terms from
  using terms from application "FineReader"
    set toFile to outfile
    set retainLayoutWordLayout to as editable copy
    set keepPageNumberHeadersAndFootersBoolean to yes
    set keepLineBreaksAndHyphenationBoolean to yes
    set keepPageBreaksBoolean to yes
    set increasePaperSizeToFitContentBoolean to yes
    set keepImageBoolean to yes
    set imageOptionsImageQualityEnum to high quality
    set keepTextAndBackgroundColorsBoolean to yes
    set highlightUncertainSymbolsBoolean to yes
    set keepPageNumbersBoolean to yes
  end using terms from
  WaitWhileBusy()
  tell application "FineReader"
    set hasdoc to has document
    if hasdoc then
      close document
    end if
  end tell
  WaitWhileBusy()
  tell application "FineReader"
    set auto_read to auto read new pages false
  end tell
  tell application "Finder"
    open infile using appFile
  end tell
  delay 5
  WaitWhileBusy()
  -- the end of line character below is created by pressing OPTION+ENTER
  tell application "FineReader"
    export to html toFile ¬
      ocr languages enum langList ¬
      saving type saveType ¬
      keep line breaks and hyphenation keepLineBreaksAndHyphenationBoolean ¬
      keep page numbers headers and footers keepPageNumberHeadersAndFootersBoolean ¬
      keep pictures keepImageBoolean ¬
      image quality imageOptionsImageQualityEnum ¬
      keep text and background colors keepTextAndBackgroundColorsBoolean
  end tell
  WaitWhileBusy()
  -- close the current file
  tell application "FineReader"
    auto read new pages auto_read
    close document
  end tell
end run_ocr
-- close ABBYY
tell application "FineReader"
  quit
end tell
-- helpers to wait for thread to open up
on WaitWhileBusy()
  repeat while IsMainApplicationBusy()
  end repeat
end WaitWhileBusy
on IsMainApplicationBusy()
  tell application "FineReader"
    set resultBoolean to is busy
  end tell
  return resultBoolean
end IsMainApplicationBusy
''')
infiles = glob.glob('inputs/*')
for infile in infiles:
    infile = os.path.abspath(infile)
    outfile = os.path.abspath('outputs/' + os.path.basename(infile))
    print(' * processing', infile)
    scpt.call('run_ocr', infile, outfile)
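The `WaitWhileBusy` helper in the scripts above polls `is busy` in a tight loop with no pause and no upper bound, so a hung FineReader would stall the batch forever. One possible alternative is to do the polling from Python with a sleep interval and a timeout. This is only a sketch: `wait_until` and its parameters are hypothetical names, not part of py-applescript or FineReader.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass.

    Returns True if the predicate became truthy, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between polls so the loop does not spin the CPU.
        time.sleep(interval)
```

Assuming the `scpt` object from the Python script above, a call might look like `wait_until(lambda: not scpt.call('IsMainApplicationBusy'), timeout=120)`, with a `False` result treated as a reason to skip or retry the file.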
@chriscjcj interesting! Could it be that your ABBYY app is installed somewhere else? Does /Applications/FineReader.app exist on your machine?
I would also add, if you're not opposed to it, you may find tesseract a little easier to work with. Using the AppleScript approach above is kind of fun if you must use ABBYY, but otherwise I'd likely go with something that's intended to be automated on macOS, like tesseract...
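For comparison, here is a rough sketch of what the tesseract route could look like from Python. It assumes the `tesseract` command-line tool is installed and on the `PATH`, that the English and Latin language data (`eng`, `lat`) are available, and that the inputs are images such as PNG or TIFF. The function names `tesseract_command` and `run_tesseract` are made up for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(infile, out_dir, lang="eng+lat", output_format="hocr"):
    """Build the argument list for: tesseract <image> <output base> -l <lang> <format>."""
    # Tesseract appends the extension itself, so pass the output path without one.
    out_base = Path(out_dir) / Path(infile).stem
    return ["tesseract", str(infile), str(out_base), "-l", lang, output_format]


def run_tesseract(infile, out_dir, **kwargs):
    """Run tesseract on one image; raises CalledProcessError if it fails."""
    subprocess.run(tesseract_command(infile, out_dir, **kwargs), check=True)
```

The `hocr` output format is used here because it is the closest match to the HTML export in the FineReader script; `pdf` or `txt` can be passed as `output_format` instead.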
@duhaime First of all, thank you so much for the reply.
Here's what I'm trying to do... I have a Synology NAS and a Brother ADS-1700W document scanner. This scanner can scan directly to a network share on the NAS. Of course, it doesn't do OCR. I was hoping to implement an automated (or at least very easy) process by which I could point an OCR program at a folder of scanned documents, and have it dump the OCR'ed documents into another directory.
I do have ABBYY FineReader, and it is located at the path you already specified in the AppleScript. I did read that AppleScript can be finicky about how paths are expressed (POSIX paths, etc.), so I tried many different ways to write the path, but none was successful.
Thank you very much for letting me know about tesseract-ocr. I was unaware this existed. I will dive into it this weekend and see if it offers the functionality to get me where I'm trying to go.
I'm grateful for your advice and consultation. Thanks again! :-)
EDIT: I just noticed that tesseract-ocr can run as a Docker container. Synology supports running docker containers. Perhaps I could run this directly on the NAS. That would be fantastic. Getting there will be interesting. While I'm fairly nerdy, I'm not a coder by trade. We'll see if I have what it takes to pull it off. ;-)
EDIT2: I found this tutorial. It's six years old, but might get me there. I'll report back.
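The folder-to-folder workflow described above (point an OCR step at a directory of scans and write results to another directory) mostly comes down to deciding which inputs still need processing. A minimal sketch follows, assuming the output keeps the input's base name with a new extension. `pending_jobs` and the `.pdf` default are illustrative choices, not something the scanner or the NAS provides.

```python
from pathlib import Path


def pending_jobs(in_dir, out_dir, suffix=".pdf"):
    """Return (infile, outfile) pairs for inputs that have no output yet."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    jobs = []
    for infile in sorted(in_dir.iterdir()):
        # Skip subdirectories and hidden files such as .DS_Store.
        if not infile.is_file() or infile.name.startswith("."):
            continue
        outfile = out_dir / (infile.stem + suffix)
        if not outfile.exists():
            jobs.append((infile, outfile))
    return jobs
```

Running this on a schedule and passing each pair to whichever OCR step is in use means files that were already converted are left alone on later runs.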
I have attempted to use this script and line 60 generates an error:
I thought it had to do with setting the volume paths to network shares, but I seem to get the error regardless of the path.
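When the error appears regardless of the path, it can help to rule out missing paths before AppleScript is involved at all. The check below is a small sketch using only the standard library; `missing_paths` is a hypothetical helper name, and the paths in the usage note are the ones hard-coded in the scripts above.

```python
import os


def missing_paths(paths):
    """Return the subset of `paths` that do not exist on this machine."""
    return [p for p in paths if not os.path.exists(p)]
```

For example, `missing_paths(["/Applications/FineReader.app", "inputs", "outputs"])` returning an empty list would suggest the problem lies in how the paths are passed to FineReader rather than in the paths themselves.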