@braytac
Forked from poppingtonic/extract.sh
Created November 2, 2018 22:57
Convert a PDF to TIFF using gs, then convert the TIFFs to text using tesseract.
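#!/bin/bash
# Usage: extract.sh path/to/file.pdf
# $1 is the path of the PDF to process.

# Strip the directory part to get the bare file name.
fname=$(basename "$1")
# Make sure the working and destination directories exist.
mkdir -p output extracted_tz_parliament

# Skip PDFs whose text has already been extracted.
if [ ! -e "extracted_tz_parliament/$fname.txt" ]
then
    echo "Working on $fname..."
    # Start from an empty result file so text from an interrupted run is not carried over.
    rm -f output/result.txt
    # Convert the PDF to one Group 4 TIFF per page: scan_1.tif, scan_2.tif, ...
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=scan_%d.tif "$1"
    i=1
    # Process pages in order until the next numbered TIFF does not exist.
    while [ -e "scan_$i.tif" ]
    do
        # tesseract writes the recognized text to scan_$i.txt
        tesseract "scan_$i.tif" "scan_$i"
        # Append the page's text to the result.txt file.
        cat "scan_$i.txt" >> output/result.txt
        rm "scan_$i.txt" "scan_$i.tif"
        i=$(( i + 1 ))
    done
    # Move the combined text into place only after a conversion has actually run.
    mv output/result.txt "extracted_tz_parliament/$fname.txt"
else
    echo "File $fname.txt exists, moving on."
fi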
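
The script handles one PDF per invocation and skips any file whose text already exists. A minimal sketch of how a whole directory could be fed to it follows. The helper name `pending_pdfs` and the `pdfs` input directory are assumptions for illustration and are not part of the original gist; the `<name>.pdf.txt` naming mirrors what the script writes.

```shell
#!/bin/bash
# pending_pdfs is a hypothetical helper, not part of the original script.
# Print every PDF in directory $1 that has no matching <name>.txt in directory $2.
pending_pdfs() {
    local pdf_dir="$1"
    local out_dir="$2"
    local pdf
    for pdf in "$pdf_dir"/*.pdf
    do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$pdf" ] || continue
        if [ ! -e "$out_dir/$(basename "$pdf").txt" ]
        then
            printf '%s\n' "$pdf"
        fi
    done
}

# Example use (assumes the script above is saved as ./extract.sh):
#   pending_pdfs pdfs extracted_tz_parliament | while IFS= read -r pdf
#   do
#       ./extract.sh "$pdf"
#   done
```

Listing the pending files separately from converting them makes it possible to check what would be processed before starting a long OCR run.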