@kmwallio
Last active December 18, 2015 07:19
Converts a graphical PDF to text using Tesseract OCR. Needs ImageMagick, Tesseract, and possibly xpdf (which provides pdftoppm) installed (use Homebrew). Usage: gpdf2text.pl input.pdf output.txt
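#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(cp);
use File::Path qw(remove_tree);

my $file_name = $ARGV[0];
my $out_file  = (defined $ARGV[1] && $ARGV[1] ne '') ? $ARGV[1] : 'text.txt';
my $tmp_dir   = './gpdf-tmp';

if (!defined $file_name || $file_name eq '') {
    print "\n\n" . 'Usage:' . "\n";
    print "\t$0 [in-file] [out-file]\n\n";
} else {
    print "Converting: " . $file_name . "\n";
    mkdir($tmp_dir) or die "Cannot create $tmp_dir: $!\n";
    # Copy under the base name so an input given with a directory path still works.
    my $pdf = basename($file_name);
    cp($file_name, "$tmp_dir/$pdf") or die "Cannot copy $file_name: $!\n";
    chdir($tmp_dir) or die "Cannot enter $tmp_dir: $!\n";
    # Render pages 1-100 at 300 dpi as ocr_pdf-<page>.ppm.
    # List-form system() avoids shell quoting problems with odd file names.
    system('pdftoppm', '-f', 1, '-l', 100, '-r', 300, $pdf, 'ocr_pdf') == 0
        or warn "pdftoppm did not exit cleanly\n";
    opendir(my $dh, './') or die "Cannot read $tmp_dir: $!\n";
    # readdir returns entries in no guaranteed order, so sort to keep pages in sequence.
    my @pages = sort grep { m/\.ppm$/i } readdir($dh);
    closedir($dh);
    for my $file (@pages) {
        (my $base = $file) =~ s/\.ppm$//i;
        my $image = "$base.png"; # Change to TIF if needed.
        system('convert', $file, $image);
        system('tesseract', $image, $base, '-l', 'eng'); # Change eng to desired installed language
        # Append this page's text to the combined output.
        open(my $in, '<', "$base.txt") or next;
        open(my $out, '>>', 'text.txt') or die "Cannot write text.txt: $!\n";
        print $out $_ while <$in>;
        close($out);
        close($in);
    }
    chdir('..');
    cp("$tmp_dir/text.txt", $out_file) or warn "No text was produced for $file_name\n";
    remove_tree($tmp_dir);
}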
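
The order in which the page images are processed decides the order of the text in the output file. As a rough sketch (not part of the original script), the pages could be ordered by the page number embedded in each file name, which would also cope with names that are not zero-padded. The helper name sort_pages and the ocr_pdf-<n>.ppm naming pattern are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: order page image names by the page number that
# appears just before the file extension (e.g. "ocr_pdf-12.ppm" -> 12).
sub sort_pages {
    my @names = @_;
    # Extract each page number once, sort numerically, then unwrap the names.
    # Names without a number sort first; ties fall back to string order.
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ (/(\d+)\.[^.]+$/ ? $1 : 0), $_ ] } @names;
}

# Example: a plain string sort would place page 10 before page 2.
print join("\n", sort_pages(qw(ocr_pdf-10.ppm ocr_pdf-2.ppm ocr_pdf-1.ppm))), "\n";
```

In the script, the list returned by readdir could be passed through such a helper before the convert and tesseract steps run.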