The procedure for automatically converting PDF files to text files in the shell was previously described in detail here.
In this gist I focus on writing a bash script that uses the find command-line program. It allows a much sleeker implementation with less code (essentially a one-liner), while being robust to file and folder names that contain whitespace or other non-standard characters (more on word splitting in bash here).
Here's the original script:
#!/bin/bash
FILES=~/pdfs/*.pdf
for f in $FILES
do
echo "Processing $f file..."
pdftotext -enc UTF-8 $f
done
Problems start when the path contains characters that the shell treats specially, e.g. whitespace:
FILES=~/pdfs/party\ manifestos/*.pdf
for f in $FILES
do
echo "Processing $f file..."
pdftotext -enc UTF-8 $f
done
Running this script will result in:
./convertpdf.sh
Processing /home/tom/pdfs/party file...
I/O Error: Couldn't open file '/home/tom/pdfs/party': No such file or directory.
Processing manifestos/*.pdf file...
I/O Error: Couldn't open file 'manifestos/*.pdf': No such file or directory.
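For reasons described here, this problem cannot be solved by putting $FILES in double quotes as "$FILES": quoting stops the word splitting, but it also suppresses glob expansion, so the loop would run once over the literal pattern instead of once per file.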
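The effect of quoting can be checked without any PDFs at hand. The following is a minimal sketch, assuming bash and a throwaway directory created with mktemp (the folder and file names are hypothetical). It counts loop iterations for the quoted variable and, for comparison, for a glob written directly in the for statement with only the directory part quoted:

```shell
#!/bin/bash
# Hypothetical demo folder, created here so the sketch is self-contained.
demo=$(mktemp -d)
mkdir -p "$demo/party manifestos"
touch "$demo/party manifestos/a.pdf" "$demo/party manifestos/b c.pdf"

# Quoted variable: no word splitting, but no glob expansion either,
# so the loop body runs once with the literal pattern.
FILES="$demo/party manifestos/*.pdf"
quoted_count=0
for f in "$FILES"; do
    quoted_count=$((quoted_count + 1))
done

# Glob written in the for statement: the directory is quoted, the
# wildcard is not, and each matching file becomes exactly one word.
glob_count=0
for f in "$demo/party manifestos"/*.pdf; do
    glob_count=$((glob_count + 1))
    echo "Would process: $f"
done

echo "quoted: $quoted_count iteration(s), glob: $glob_count iteration(s)"
rm -rf "$demo"
```

The second loop is only an option when all files sit in a single folder; it does not descend into subfolders.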
The correct and more robust way to batch process multiple PDF files through pdftotext is to use find (more on correctly using find here) and its output. Here is how to do it in two lines of code:
FOLDER=~/pdfs/party\ manifestos/
find "$FOLDER" -name '*.pdf' -exec pdftotext -enc UTF-8 {} \;
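Before converting anything, it can help to see which files find would hand to -exec. The sketch below is a dry run under stated assumptions: the directory tree is a hypothetical one built with mktemp, printf stands in for pdftotext, and the -type f and -iname tests (both supported by GNU and BSD find) are additions that are not part of the command above. It also shows that find descends into subfolders, which the original glob did not do.

```shell
#!/bin/bash
# Hypothetical demo tree with a nested folder and mixed-case extensions.
demo=$(mktemp -d)
mkdir -p "$demo/party manifestos/2010 election"
touch "$demo/party manifestos/labour party.pdf" \
      "$demo/party manifestos/2010 election/greens.PDF" \
      "$demo/party manifestos/notes.txt"

# -type f  : regular files only (skips a directory that happens to end in .pdf)
# -iname   : case-insensitive match, so .PDF is picked up as well
# -exec    : each path is passed as one argument, spaces included;
#            printf is a stand-in for the real pdftotext call
matches=$(find "$demo" -type f -iname '*.pdf' \
    -exec printf 'Would convert: %s\n' {} \;)
echo "$matches"

match_count=$(printf '%s\n' "$matches" | grep -c 'Would convert')
rm -rf "$demo"
```

Once the list looks right, the printf stand-in can be swapped back for pdftotext -enc UTF-8 {}.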
Or, to keep the progress message printed by echo inside the loop:
FOLDER=~/pdfs/party\ manifestos/
find "$FOLDER" -name '*.pdf' | while IFS= read -r i
do
echo "Processing $i"
pdftotext -enc UTF-8 "$i"
done
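A loop that reads the output of find line by line still breaks on file names that contain a newline. One possible variant, sketched here under the assumption of bash and a find that supports -print0 (GNU and BSD find both do), separates the names with NUL bytes instead. The demo folder, the file names and the processed counter are hypothetical, and the pdftotext call is commented out so the sketch runs without any real PDFs.

```shell
#!/bin/bash
# Hypothetical demo folder; one file name deliberately contains a newline.
demo=$(mktemp -d)
mkdir -p "$demo/party manifestos"
touch "$demo/party manifestos/plain.pdf" \
      "$demo/party manifestos/two"$'\n'"lines.pdf"

processed=0
# IFS=   keeps leading and trailing whitespace in the name
# -r     keeps backslashes literal
# -d ''  makes read stop at the NUL byte emitted by -print0
while IFS= read -r -d '' i
do
    echo "Processing $i"
    # pdftotext -enc UTF-8 "$i"    # the real conversion step goes here
    processed=$((processed + 1))
done < <(find "$demo" -name '*.pdf' -print0)

echo "Processed $processed file(s)"
rm -rf "$demo"
```

Feeding the loop through process substitution rather than a pipe keeps it in the current shell, so the counter is still set after the loop ends.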