@aortbals
Created February 19, 2020 21:10
OCR all files in a folder using Tesseract, ignoring existing files.
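#! /usr/bin/env bash

### ocr
#
# OCR all files in a folder using Tesseract, ignoring existing files.
#

## Functions

# Print the usage message to stderr and exit with a non-zero status.
usage() {
  echo "Usage: ocr <source-directory> <destination-directory>" >&2
  exit 1
}

# Fail early if the tesseract binary is not on PATH.
if ! [ -x "$(command -v tesseract)" ]; then
  echo -e 'Tesseract is required to use this script.\n\nFor more information, visit: https://github.com/tesseract-ocr/tesseract' >&2
  exit 1
fi

## Arguments

if (( $# != 2 )); then
  usage
fi

source="$1"
dest="$2"

## Main

mkdir -p "$dest"

# nullglob: patterns with no match expand to nothing.
# nocaseglob: also match upper-case extensions such as .JPG or .PNG.
shopt -s nullglob
shopt -s nocaseglob

for f in "$source"/*.{png,jpg,jpeg}; do
  filename=$(basename "$f")
  # Tesseract appends .txt to the output base, so photo.jpg becomes photo.jpg.txt.
  # Skip images whose output file already exists.
  if [ ! -f "$dest/$filename.txt" ]; then
    echo "PROCESSING $f"
    tesseract "$f" "$dest/$filename"
  fi
done

shopt -u nocaseglob
shopt -u nullglob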
@norahvii

If I run `bash ocr.bash` I get `Usage: ocr <source-directory> <destination-directory>`.
What do I do if my goal is to convert all the files in a folder but retain their original names?
Something like `tesseract *.jpg *.txt` (pseudocode).
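
For context, the usage message appears whenever the script is called with anything other than two arguments, so `bash ocr.bash <source-directory> <destination-directory>` is the expected invocation. The pseudocode in the question could be sketched roughly as below, assuming the goal is to write `NAME.txt` next to each `NAME.jpg` in the same folder. `ocr_in_place` is a hypothetical helper name, not part of the gist, and the sketch relies on the `tesseract` CLI appending `.txt` to the output base it is given.

```shell
#!/usr/bin/env bash
# Sketch only: ocr_in_place is a hypothetical helper, not part of the gist.
# The body runs in a subshell (parentheses instead of braces) so the shopt
# changes do not leak into the calling shell.
ocr_in_place() (
  dir="${1:-.}"                 # folder to scan; defaults to the current one
  shopt -s nullglob nocaseglob  # no match -> no iterations; match .JPG too
  for f in "$dir"/*.{png,jpg,jpeg}; do
    base="${f%.*}"              # strip the extension: scan01.jpg -> scan01
    if [ ! -f "$base.txt" ]; then
      echo "PROCESSING $f"
      tesseract "$f" "$base"    # tesseract writes "$base.txt"
    fi
  done
)
```

Calling `ocr_in_place ~/scans` (an example path) would then process that folder in place. One caveat of this naming choice: if `a.png` and `a.jpg` both exist, they map to the same `a.txt`, and only the first one encountered is processed.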
