@coolreader18
Last active November 20, 2018 03:30
A script to OCR a PDF file
#!/usr/bin/env sh
set -e
# Validate arguments: an existing input PDF and an output path are required.
if [ ! -f "$1" ]; then
    echo "Input file doesn't exist" >&2
    exit 1
fi
if [ -z "$2" ]; then
    echo "Must provide output file" >&2
    exit 1
fi
input="$(realpath "$1")"
output="$(realpath "$2")"
shift 2
tmpdir="$(mktemp -d)"
cd "$tmpdir"
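# Remove the temporary directory on any exit, including failures under set -e.
cleanup() {
    cd /
    rm -rf "$tmpdir"
}
trap cleanup EXIT
trap 'exit 130' INT TERM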
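# Render each page of the PDF to a PNG named img-<page>.png.
pdftoppm "$input" img -png
# List the page images in page order; pdftoppm zero-pads the page numbers.
find . -name 'img-*.png' | sort >fileslist
# tesseract writes to <outputbase>.pdf, so point out.pdf at the real destination.
ln -s "$output" out.pdf
touch out.pdf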
# Remaining arguments are passed through to tesseract as options.
tesseract fileslist out "$@" pdf
cleanup
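
The least obvious step in the script is the symlink: tesseract is given the output base `out`, so it writes `out.pdf` in the temporary directory, and the symlink redirects that write to the path the user asked for. The sketch below isolates that trick using only standard utilities. The directory layout, the file names, and the `printf` standing in for the tool are all hypothetical, and it assumes the tool opens its output path in the ordinary way, which follows symlinks.

```sh
#!/usr/bin/env sh
set -e

# Hypothetical layout, for illustration only.
workdir="$(mktemp -d)"
target="$workdir/final/result.txt"
mkdir -p "$workdir/final" "$workdir/scratch"

cd "$workdir/scratch"
# A symlink with a fixed local name that points at the real destination.
ln -s "$target" out.txt
# Stand-in for a tool that always writes to ./out.txt.
printf 'hello\n' >out.txt

# The data landed in the real destination, not in the scratch directory.
content="$(cat "$target")"
echo "$content"

# Clean up the temporary tree.
cd /
rm -rf "$workdir"
```

The same pattern lets the script above delete its whole temporary directory afterwards without touching the finished PDF, since only the symlink lives there.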