Quick-'n'-dirty rename of scanned PDFs based on OCR of content (requires Tesseract, ImageMagick)
#!/bin/bash -e
set -o pipefail
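# Region of the page that contains the date, in pixels of the page as
# rasterized at 300 DPI (see -density below); adjust for your scans.
REGION_X=1920
REGION_Y=2210
REGION_WIDTH=320
REGION_HEIGHT=100
# printf format for the new file name; %s is replaced by the date as YYYY-MM-DD.
RENAME_PATTERN='foo-%s.pdf'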
for file in "$@"; do
    # Crop the region, OCR it, and normalize a slash-separated M/D/Y date to YYYY-MM-DD.
    # "declare" masks the pipeline's exit status, so a failed match does not trip -e.
    declare date=$(
        convert -crop "${REGION_WIDTH}x${REGION_HEIGHT}+${REGION_X}+${REGION_Y}" \
            -density 300 -black-threshold 0.3 "$file" png:- |
        tesseract stdin stdout 2>/dev/null |
        grep -oP '\d[\s\d]*/[\d\s]+/[\d\s]+' |
        tr -d ' ' |
        awk -F/ '$3 < 100{$3 += 2000}{print $3 "-" $1 "-" $2}'
    )
    # Accept the result only if it is non-empty and date -d (GNU date) can parse it.
    if [ -n "$date" ] && date -d "$date" >/dev/null; then
        rename=$(printf "$RENAME_PATTERN" "$date")
        if [ -e "$rename" ]; then
            echo "Error: $rename would be overwritten; skipping" >&2
        else
            mv -vn "$file" "$rename"
        fi
    else
        echo "Error detecting date of $file; skipping" >&2
    fi
done
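
The date-normalization stage of the pipeline can be tried on its own, without Tesseract or ImageMagick, by feeding it text that looks like noisy OCR output. The sketch below wraps the same grep/tr/awk stages in a function; the name `extract_date` and the sample input are made up for illustration and are not part of the script above. It assumes GNU grep built with PCRE support (`-P`) and dates written month/day/year.

```shell
# Hypothetical helper: reads OCR text on stdin, prints dates as YYYY-MM-DD.
extract_date() {
    # Find digit groups separated by slashes, tolerating stray spaces from OCR.
    grep -oP '\d[\s\d]*/[\d\s]+/[\d\s]+' |
    # Remove the stray spaces.
    tr -d ' ' |
    # Expand two-digit years to 20xx and reorder M/D/Y to Y-M-D.
    awk -F/ '$3 < 100{$3 += 2000}{print $3 "-" $1 "-" $2}'
}

# Sample input with OCR-style spacing (made up for this example).
printf 'Date: 1 2/ 03 /19\n' | extract_date   # prints 2019-12-03
```

Running a few strings like this through the function is a quick way to check whether a different date layout on your documents needs a different pattern before touching any files.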