tpaskhalis/convertpdf.md

## convertpdf.md

      
The procedure for automatically converting PDF files into text files in the shell has been previously described in detail here.
In this gist I will focus on writing a bash script that uses the find command-line program. It allows a much sleeker implementation with less code (essentially a one-liner), while being robust to file and folder names that contain whitespace or other non-standard characters (more on the issues of word splitting in bash here).
Here's the original script:
```shell
#!/bin/bash
FILES=~/pdfs/*.pdf
for f in $FILES
do
  echo "Processing $f file..."
  pdftotext -enc UTF-8 $f
done
```

Problems start when the path contains characters other than alphanumerics and underscores, for example whitespace:
```shell
FILES=~/pdfs/party\ manifestos/*.pdf
for f in $FILES
do
  echo "Processing $f file..."
  pdftotext -enc UTF-8 $f
done
```

Running this script will result in:
```
./convertpdf.sh

Processing /home/tom/pdfs/party file...
I/O Error: Couldn't open file '/home/tom/pdfs/party': No such file or directory.
Processing manifestos/*.pdf file...
I/O Error: Couldn't open file 'manifestos/*.pdf': No such file or directory.
```


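For reasons described here, this problem cannot be solved by putting $FILES in double quotes as "$FILES". Quoting does stop the shell from splitting the value at the space, but it also stops the expansion of *.pdf, so the loop runs exactly once with the literal pattern as its only item.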
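The difference between the unquoted and the quoted variable can be checked without any real PDFs. The following is only a small sketch: the temporary directory, the empty a.pdf and b.pdf files, and the counter names are all made up for the demonstration, and it assumes there is no folder called manifestos in the current working directory.

```shell
#!/bin/bash
# Hypothetical demo: count loop iterations for unquoted vs quoted $FILES.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/party manifestos"
touch "$demo_dir/party manifestos/a.pdf" "$demo_dir/party manifestos/b.pdf"

# The glob is stored as plain text; it is not expanded at assignment time.
FILES="$demo_dir/party manifestos/*.pdf"

# Unquoted: the value is split at the space into two words,
# and neither word matches an existing file.
unquoted=0
for f in $FILES; do
  unquoted=$((unquoted + 1))
done

# Quoted: no splitting, but no glob expansion either,
# so the loop sees one literal string ending in *.pdf.
quoted=0
for f in "$FILES"; do
  quoted=$((quoted + 1))
  literal=$f
done

echo "unquoted=$unquoted quoted=$quoted"
echo "quoted item: $literal"
rm -r "$demo_dir"
```

Under those assumptions the script reports two iterations for the unquoted loop and one for the quoted loop, although the folder holds two PDF files. Neither loop ever sees an actual file name.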
The correct and more robust way to batch process multiple pdf files through pdftotext is to use find (more on correctly using find here) and its output. Here is how to do it in two lines of code:
```shell
FOLDER=~/pdfs/party\ manifestos/
find "$FOLDER" -name '*.pdf' -exec pdftotext -enc UTF-8 {} \;
```
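Before converting a large folder it can be useful to preview what find would run. One possible way, sketched here, is to put echo in front of pdftotext so that each command is printed instead of executed. The function name preview_conversions is made up for this example, and the extra -type f test (which skips directories that happen to end in .pdf) is my addition rather than part of the command above.

```shell
#!/bin/bash
# preview_conversions is a hypothetical helper name.
# It prints the pdftotext command for every PDF under the given folder
# without running it, because echo takes the place of the real command.
preview_conversions() {
  local folder=$1
  find "$folder" -type f -name '*.pdf' -exec echo pdftotext -enc UTF-8 {} \;
}
```

Calling preview_conversions ~/pdfs/"party manifestos" prints one line per file. Paths that contain spaces show up as a single path on that line, because find hands each name to -exec as one argument and the shell never splits it.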

Or, preserving the output of echo inside the loop:
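```shell
FOLDER=~/pdfs/party\ manifestos/
# -print0 and read -d '' pass the names NUL-delimited; IFS= and -r keep
# leading/trailing whitespace and backslashes in file names intact.
find "$FOLDER" -name '*.pdf' -print0 | while IFS= read -r -d '' i
do
  echo "Processing $i"
  pdftotext -enc UTF-8 "$i"
done
```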
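If the loop should also report how many conversions failed, a slightly more defensive variant might look like the sketch below. Everything in it is an assumption on my part rather than part of the original script: convert_all is a made-up function name, the converter command is passed in as arguments so that the loop can be tried out without pdftotext, and it relies on bash features (process substitution, read -d '') and on find supporting -print0, which GNU and BSD find do.

```shell
#!/bin/bash
# convert_all is a hypothetical helper: run a converter command on every PDF
# under a folder and report how many conversions succeeded.
# Arguments: the folder, followed by the converter command and its options.
convert_all() {
  local folder=$1
  shift
  local total=0 failed=0 pdf
  # Process substitution keeps the loop in the current shell, so the counters
  # are still set after the loop (a pipeline would run it in a subshell).
  while IFS= read -r -d '' pdf; do
    total=$((total + 1))
    echo "Processing $pdf"
    # stdin is redirected so the converter cannot swallow the file list.
    if ! "$@" "$pdf" < /dev/null; then
      echo "Failed: $pdf" >&2
      failed=$((failed + 1))
    fi
  done < <(find "$folder" -type f -name '*.pdf' -print0)
  echo "Converted $((total - failed)) of $total files"
  # Return success only if every conversion succeeded.
  [ "$failed" -eq 0 ]
}
```

It could be called as convert_all ~/pdfs/"party manifestos" pdftotext -enc UTF-8. The return status is non-zero if any file failed, which makes the function usable in larger scripts.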