@Fingel
Created November 22, 2013 05:43
Script for converting images to text (OCR) for the Noisebridge Archivists group. Requires tesseract (Ubuntu packages: tesseract-ocr and tesseract-ocr-eng).
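Because the script sends tesseract's output to /dev/null, a missing tesseract install would go unnoticed and simply produce no text files. A pre-flight check along the following lines could report that early. This is only a sketch: `require_cmd` is a hypothetical helper name and is not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeeds only if the
# named command can be found on PATH.
require_cmd() {
    command -v "$1" > /dev/null 2>&1
}

# Warn on stderr when tesseract is missing; the package names are the ones
# listed in the gist description.
if ! require_cmd tesseract; then
    echo "tesseract not found; on Ubuntu install tesseract-ocr and tesseract-ocr-eng" >&2
fi
```

In the script itself, the same check could sit just before the call to `convert` and exit with a non-zero status instead of only warning.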
#!/bin/bash
#Converts images to text using tesseract (package tesseract-ocr & tesseract-ocr-eng)
function usage
{
    echo "Usage: img2txt -i <input directory> -o <output directory> [-c | --concat]"
}
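function concat
{
    o=$1
    # Start from an empty file so that re-runs do not duplicate text
    : > "$o/concatenated.txt"
    for f in "$o"/*
    do
        # Skip the target file itself and anything that is not a regular file
        [ -f "$f" ] && [ "$f" != "$o/concatenated.txt" ] || continue
        cat "$f" >> "$o/concatenated.txt"
    done
}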
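function convert
{
    i=$1
    o=$2
    # -p: do not fail if the output directory already exists
    mkdir -p "$o"
    echo "Converting and placing files into $o"
    # Quote every expansion so that paths containing spaces work
    for f in "$i"/*
    do
        test -f "$f" || continue
        echo "processing file $f... $o/${f##*/}.txt"
        # tesseract appends .txt to the output base name itself
        tesseract "$f" "$o/${f##*/}" &> /dev/null
    done
    echo "done."
}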
inputdir=
outputdir=
# Named concat to match the variable set by -c | --concat below
concat=
while [ "$1" != "" ]; do
    case $1 in
        -i | --input )  shift
                        inputdir=$1
                        ;;
        -o | --output ) shift
                        outputdir=$1
                        ;;
        -c | --concat ) concat=1
                        ;;
        -h | --help )   usage
                        exit
                        ;;
        * )             usage
                        exit 1
                        ;;
    esac
    shift
done
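# Both directories are required; without them the loop would scan /*
if [ -z "$inputdir" ] || [ -z "$outputdir" ]; then
    usage
    exit 1
fi
convert "$inputdir" "$outputdir"
# Quoted so the test does not error out when --concat was not given
if [ "$concat" = 1 ]; then
    concat "$outputdir"
fi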
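The script names each output file by stripping the directory part of the input path with `${f##*/}` and passing the result to tesseract as the output base, to which tesseract appends `.txt`. The original extension is therefore kept, so `page1.png` becomes `page1.png.txt`. The sketch below isolates that naming rule. `output_name` and the sample paths are hypothetical and only illustrate the expansion.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): prints the path of the
# text file that the script's tesseract call would write for one input image.
output_name() {
    local input=$1 outdir=$2
    # ${input##*/} removes the longest prefix ending in "/", leaving the file name
    printf '%s/%s.txt\n' "$outdir" "${input##*/}"
}

output_name "scans/page 01.png" "text"
```

To drop the image extension as well, `${input##*/}` could be wrapped in a second expansion such as `${name%.*}`, at the cost of collisions when two images differ only by extension.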