@tpaskhalis
Last active October 16, 2021 19:53
Batch conversion of pdf files to text

The procedure for automatic conversion of pdf into txt files in shell has been previously described in detail here.

In this gist I will focus on writing a bash script that uses the find command-line program. It allows a much sleeker implementation with less code (essentially a one-liner), while being robust to file and folder names that contain whitespace or other non-standard characters (more on the issues of word splitting in bash here).

Here's the original script:

#!/bin/bash
FILES=~/pdfs/*.pdf
for f in $FILES
do
 echo "Processing $f file..."
 pdftotext -enc UTF-8 $f
done

Problems start when the path contains characters that the shell treats specially, e.g. whitespace:

FILES=~/pdfs/party\ manifestos/*.pdf
for f in $FILES
do
  echo "Processing $f file..."
  pdftotext -enc UTF-8 $f
done

Running this script will result in:

./convertpdf.sh

Processing /home/tom/pdfs/party file...
I/O Error: Couldn't open file '/home/tom/pdfs/party': No such file or directory.
Processing manifestos/*.pdf file...
I/O Error: Couldn't open file 'manifestos/*.pdf': No such file or directory.

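For reasons described here, this problem cannot be solved by putting $FILES in double quotes as "$FILES": quoting suppresses word splitting, but it also suppresses glob expansion, so the loop would run once with the literal, unexpanded pattern.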
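The difference between the quoted and unquoted forms can be checked without any PDFs at all. The following is a minimal sketch, assuming a throwaway folder created with mktemp; the file names a.pdf and b.pdf and the variables quoted, unquoted and quoted_value are made up for illustration and are not part of the original script.

```shell
#!/bin/bash
# Throwaway demo folder; the directory and file names are made up for illustration.
demo=$(mktemp -d)
mkdir "$demo/party manifestos"
touch "$demo/party manifestos/a.pdf" "$demo/party manifestos/b.pdf"

FILES="$demo/party manifestos/*.pdf"

# Quoted: no word splitting, but no glob expansion either,
# so the body runs exactly once and $f is the literal pattern.
quoted=0
for f in "$FILES"; do
  quoted=$((quoted + 1))
  quoted_value=$f
done

# Unquoted: the value is split at the space first, and only then is each
# piece treated as a glob, so neither piece is the intended path.
unquoted=0
for f in $FILES; do
  unquoted=$((unquoted + 1))
done

echo "quoted loop ran $quoted time(s) with: $quoted_value"
echo "unquoted loop ran $unquoted time(s)"
rm -r "$demo"
```

Neither loop ever sees the two real file names, which is why the rest of this gist hands the directory walk over to find.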

A more robust way to batch process multiple pdf files through pdftotext is to let find locate them (more on correctly using find here) and hand each path to pdftotext as a single argument. Here is how to do it in two lines of code:

FOLDER=~/pdfs/party\ manifestos/
find "$FOLDER" -name '*.pdf' -exec pdftotext -enc UTF-8 {} \;
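To see that -exec passes each path as one intact argument, the command can be tried with a stand-in for pdftotext. This is only a sketch: printf replaces pdftotext so that it runs without any PDF tools installed, and the folder and the file name "manifesto 2019.pdf" are hypothetical.

```shell
#!/bin/bash
# Throwaway demo folder; all names below are hypothetical.
demo=$(mktemp -d)
mkdir "$demo/party manifestos"
touch "$demo/party manifestos/manifesto 2019.pdf" "$demo/party manifestos/notes.txt"

# printf stands in for pdftotext: it prints every argument it receives
# in square brackets, one per line. {} is replaced by the path that find
# matched, and \; marks the end of the command given to -exec.
found=$(find "$demo/party manifestos" -name '*.pdf' -exec printf '[%s]\n' {} \;)

echo "$found"
rm -r "$demo"
```

A single bracketed line means the path reached the command unsplit, despite the spaces in both the folder and the file name. The .txt file is filtered out by -name '*.pdf'.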

Or, preserving the output of echo inside the loop:

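FOLDER=~/pdfs/party\ manifestos/
# -print0 and read -d '' separate names with a NUL byte; IFS= and -r keep
# leading/trailing whitespace and backslashes in file names intact.
find "$FOLDER" -name '*.pdf' -print0 | while IFS= read -r -d '' i
do
  echo "Processing $i"
  pdftotext -enc UTF-8 "$i"
done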
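A variation that keeps the progress message without a pipe is to let find start a small shell and pass the matched paths to it as positional parameters. The sketch below is one possible arrangement, not part of the original gist; the function name convert_all is hypothetical.

```shell
#!/bin/bash
# convert_all DIR: hypothetical helper that converts every pdf under DIR.
convert_all() {
  # {} + makes find append as many matched paths as fit to one sh call.
  # The first word after the script ("sh") becomes $0, so the paths start
  # at $1 and "$@" yields each of them as one intact word.
  find "$1" -name '*.pdf' -exec sh -c '
    for f in "$@"; do
      echo "Processing $f"
      pdftotext -enc UTF-8 "$f"
    done
  ' sh {} +
}
```

It would be called as convert_all ~/pdfs/party\ manifestos/. Because the paths never pass through a text stream, there is no delimiter to worry about, at the cost of a slightly denser command line.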