@mnyrop
Created July 9, 2020 01:33
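#!/usr/bin/env bash
# For every PDF in each volume directory: copy it into ./data,
# rasterize its pages to PNG, then OCR each page with tesseract.
shopt -s nullglob  # a glob that matches nothing expands to nothing, so the loop is skipped

for volume in ./www.nyu.edu/calabash/vol*; do
  volume_name=$(basename "$volume")
  for pdf in "$volume"/*.pdf; do
    doc_name=$(basename "$pdf" .pdf)
    pdf_dir="./data/${volume_name}/pdf"
    png_dir="./data/${volume_name}/png/${doc_name}"
    ocr_dir="./data/${volume_name}/ocr/${doc_name}"
    echo "converting $doc_name"
    mkdir -p "$pdf_dir" "$png_dir" "$ocr_dir"
    # copy pdf
    cp "$pdf" "${pdf_dir}/${doc_name}.pdf"
    # convert pdf to grayscale pngs, one per page (001.png, 002.png, ...),
    # each scaled to fit a 2550x3300 pixel canvas
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -g2550x3300 -dUseCropBox -dPDFFitPage -sOutputFile="$png_dir/%03d.png" "$pdf"
    # extract ocr from pngs; tesseract appends .txt to the output base name
    for png in "$png_dir"/*.png; do
      ocr="$ocr_dir/$(basename "$png" .png)"
      tesseract "$png" "$ocr"
    done
  done
done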