@davidbjourno
Last active March 9, 2022 16:08
Batch OCR PDFs

In the directory containing the PDFs:

  1. Replace additional . in filenames with _ (zsh only; load zmv first with autoload -U zmv): zmv '(*.*)(.*)' '${1//./_}$2'
  2. mkdir jpg && mkdir txt
  3. mogrify -format jpg -density 200 -quality 100 -alpha off -path jpg/ -verbose *.pdf
  4. for file in jpg/*.jpg; do tesseract "$file" "${file%%.*}"; done
  5. mv jpg/*.txt txt/
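
The steps above can also be collected into one script. What follows is only a rough sketch, assuming a POSIX shell rather than zsh, so the zmv rename is swapped for a small helper built on tr. The function names sanitize_name and batch_ocr are hypothetical and not part of the original steps. The sketch also assumes ImageMagick (mogrify) and Tesseract are installed and on the PATH.

```shell
#!/bin/sh
# Sketch only: a POSIX-shell take on steps 1-5. Function names are hypothetical.

# Step 1 helper: replace every dot except the last with an underscore,
# e.g. a.b.c.pdf becomes a_b_c.pdf. Names with no dot are left unchanged.
sanitize_name() (
  name=$1
  case $name in
    *.*)
      stem=${name%.*}   # everything before the last dot
      ext=${name##*.}   # everything after the last dot
      printf '%s.%s\n' "$(printf '%s' "$stem" | tr '.' '_')" "$ext"
      ;;
    *)
      printf '%s\n' "$name"
      ;;
  esac
)

# Steps 1-5, run against the PDFs in the current directory.
batch_ocr() (
  # Step 1: rename PDFs whose names contain extra dots.
  for pdf in *.pdf; do
    [ -e "$pdf" ] || continue
    clean=$(sanitize_name "$pdf")
    [ "$pdf" = "$clean" ] || mv -- "$pdf" "$clean"
  done

  # Re-read the file list after renaming; stop if there is nothing to do.
  set -- *.pdf
  [ -e "$1" ] || { echo "no PDF files found" >&2; exit 1; }

  # Step 2: output directories.
  mkdir -p jpg txt

  # Step 3: render the PDFs to JPEGs in jpg/.
  mogrify -format jpg -density 200 -quality 100 -alpha off -path jpg/ -verbose "$@"

  # Steps 4 and 5 combined: tesseract appends .txt to the output base,
  # so pointing the base at txt/ avoids the separate mv.
  for img in jpg/*.jpg; do
    [ -e "$img" ] || continue
    base=${img##*/}
    tesseract "$img" "txt/${base%.jpg}"
  done
)
```

To use it, source the file (or paste the functions into a shell), cd into the PDFs directory, and run batch_ocr. Writing Tesseract's output straight into txt/ is a design choice of this sketch; the original steps write next to the JPEGs and move the text files afterwards.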