@tarxvf
Last active June 9, 2016 15:02
This takes a directory of work items (in this case, PDFs in need of OCR) and processes them, skipping items that have already been processed. This particular variant reads the items from a list file, which allows quick distribution of the processing over multiple nodes (a poor man's cluster). pypdfocr worked great on this one. I've written this construct a few bazillio…
#!/bin/bash
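DIR=/tmp    # directory holding the PDFs (no trailing slash)
cd "$DIR" || exit 1
# Variant without a list file: for FILE in "$DIR"/*.pdf; do
# Read one path per line from the list file given as $1, so that each node
# can work through its own list.
while IFS= read -r FILE; do
    # Pull the numeric base name out of a name such as "1234.pdf".
    if [[ $FILE =~ ([0-9]+)\.pdf$ ]]; then
        SHORTNAME=${BASH_REMATCH[1]}
    else
        echo "Skipping $FILE: not a numbered .pdf" >&2
        continue
    fi
    OCR_NAME="$DIR/${SHORTNAME}_ocr.pdf"
    echo "OCR_NAME=$OCR_NAME"
    if [ -f "$OCR_NAME" ]; then
        echo "File $OCR_NAME exists."
    else
        echo "Processing $OCR_NAME file..."
        # DO NEW WORK HERE
        # e.g. pypdfocr "$FILE"  (assumption: writes <name>_ocr.pdf beside the input)
    fi
done < "$1"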
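
The script expects a list file as its first argument, but the gist does not show how that list is produced. A minimal sketch of one way to do it, assuming a simple round-robin split is acceptable: the function name `make_worklists`, its arguments, and the `worklist.N` naming are hypothetical and not part of the original.

```shell
#!/bin/bash
# Hypothetical helper (not from the original gist): split the PDFs found in a
# directory into one work list per node, assigning files round-robin.
# Usage: make_worklists DIR NODES PREFIX   ->  writes PREFIX.0 .. PREFIX.(NODES-1)
make_worklists() (
    # The body runs in a subshell, so these variables stay local.
    dir=$1
    nodes=$2
    prefix=$3

    # Start every list empty so each node has a file to read, even if it
    # ends up with no work.
    n=0
    while [ "$n" -lt "$nodes" ]; do
        : > "$prefix.$n"
        n=$((n + 1))
    done

    i=0
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        case $f in
            *_ocr.pdf) continue ;;       # OCR output, not a work item
        esac
        printf '%s\n' "$f" >> "$prefix.$((i % nodes))"
        i=$((i + 1))
    done
)
```

With three nodes, `make_worklists /tmp 3 /tmp/worklist` would write `/tmp/worklist.0` through `/tmp/worklist.2`, and each node would then run the script above against its own list, for example `./ocr_batch.sh /tmp/worklist.0` (the script name is assumed). Because already-processed items are skipped, rerunning a node over the same list is harmless.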