@poppingtonic
Last active November 2, 2018 22:57
Convert a PDF to TIFF pages using gs, then convert the TIFF pages to text using tesseract
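# $1 is the first argument: the path to the PDF to convert
# assumes the output/ and extracted_tz_parliament/ directories already exist
# keep only the file name, dropping any leading directories
fname=$(basename "$1")
if [ ! -e "extracted_tz_parliament/$fname.txt" ]
then
    echo "Working on $fname..."
    # remove any result.txt left behind by an interrupted run
    rm -f output/result.txt
    # convert the pdf to a group of tiffs, one per page (scan_1.tif, scan_2.tif, ...)
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=scan_%d.tif "$1"
    i=1
    # OCR the pages in order, stopping at the first page number with no tiff
    while [ -e "scan_$i.tif" ]
    do
        # tesseract writes the recognized text to scan_$i.txt
        tesseract "scan_$i.tif" "scan_$i"
        # add the text to the result.txt file
        cat "scan_$i.txt" >> output/result.txt
        rm "scan_$i.txt" "scan_$i.tif"
        i=$(( i + 1 ))
    done
    # move the result only when this run produced it
    mv output/result.txt "extracted_tz_parliament/$fname.txt"
else
    echo "File $fname.txt exists, moving on."
fi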
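
The script above handles one PDF per invocation, so a whole directory needs a small driver loop. The following is a minimal sketch, not part of the original gist: the function name `ocr_all` and the script file name `./pdf2txt.sh` are assumptions chosen for illustration, since the source does not say what the script is saved as.

```shell
# Hypothetical helper: run the OCR script once for every PDF in a directory.
# $1 is the directory to scan; $2 is the path to the script above
# (defaults to ./pdf2txt.sh, an assumed name not given in the source).
ocr_all() {
    dir=$1
    script=${2:-./pdf2txt.sh}
    for pdf in "$dir"/*.pdf
    do
        # when no PDF matches, the glob stays literal; skip that case
        [ -e "$pdf" ] || continue
        sh "$script" "$pdf"
    done
}
```

Calling `ocr_all path/to/pdfs` would then process each file in turn. Because the script skips any PDF whose text file already exists, rerunning the loop after an interruption only does the remaining work.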