Someone who attended my PDF text extraction session at NICAR 2020 asked how to tell image PDFs from text PDFs when you have thousands of files in a mixture of both formats. The end goal is to run OCR software only on the image PDFs.
This is how I would approach the problem using command-line tools.
- You’re working on a Mac or Linux machine where you have access to some common command-line utilities such as `find` and `sed`.
- This should also work under the Windows Subsystem for Linux on Windows.
- You have `pdftotext` installed, which we used in the NICAR session.
- Run `pdftotext` on all the PDFs (with the help of `find`) to try to extract text.
- Inspect the size of the text file extracted from a known image PDF to determine a good size threshold for the text files. Image PDFs have no text to extract (without OCR, that is), so `pdftotext` should create text files that are very small, only a few bytes.
- Use `find` again to identify the text files that are really small.
- Replace `.txt` with `.pdf` to get the original filename.
For each PDF file, run `pdftotext` on it and save the output to a .txt file:
You can leverage the `find` command with its `-exec` option to do this.
find . -iname '*.pdf' -exec pdftotext {} \;
Let’s break this command down:
- `find`: The name of the command.
- `.`: Search starting in the current folder.
- `-iname '*.pdf'`: Match files ending in `.pdf` or `.PDF` (or `.pDF`, etc., for that matter).
- `-exec pdftotext {} \;`: For each matching file, run `pdftotext` on that file. The `{}` is a placeholder that gets replaced with the matching filename, and the `\;` marks the end of the command to run.
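Before running this against thousands of files, it can help to preview what `find` will execute. Here is a minimal sketch of a dry run that puts `echo` in front of the command. The scratch directory and the file names in it are made up for illustration.

```shell
# Scratch directory with hypothetical file names, so no real files are touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/report.pdf" "$demo_dir/SCAN 01.PDF" "$demo_dir/notes.txt"

# Dry run: echo prints the command line find would build for each match,
# with {} already replaced by the matching path, instead of running it.
dry_run=$(find "$demo_dir" -iname '*.pdf' -exec echo pdftotext {} \;)
printf '%s\n' "$dry_run"
```

The printed lines show a filename with spaces unquoted, but `find` passes each path to the command as a single argument, so spaces in filenames should not need extra escaping here.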
This post has more info on using `find` with `-exec`: https://linuxaria.com/howto/linux-shell-how-to-use-the-exec-option-in-find-with-examples.
This will create .txt files with the text contents for all PDF files in the current directory (and subdirectories). Each .txt file has the same name as its PDF, but ends in .txt instead of .pdf.
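With a large batch, some PDFs might not produce a text file at all (I'm assuming a damaged or password-protected file could cause this). A rough sketch for listing PDFs that have no matching .txt next to them follows. It assumes the output name is the PDF name with its last extension swapped for .txt, and the file names are hypothetical.

```shell
# Scratch directory: a.pdf has extracted text next to it, b.pdf does not.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.pdf" "$demo_dir/a.txt" "$demo_dir/b.pdf"

# For each PDF, strip the last extension and check for a .txt sibling.
missing=$(find "$demo_dir" -iname '*.pdf' -exec sh -c '
  # $1 is the path that find matched.
  [ -e "${1%.*}.txt" ] || printf "%s\n" "$1"
' sh {} \;)
printf '%s\n' "$missing"
```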
Then look at the text output for a file that you know to be a text PDF and one that you know to be an image PDF.
For example, this is a text PDF:
ls -lh Public\ Health\ Spending\ Brief_2019\ \(1\).txt
Let’s break down this command:
- `ls`: The command name. This just lists a file or files in a directory.
- `-l`: Show additional information like file size and timestamp.
- `-h`: Print numbers in human-readable form. This is particularly important to be able to differentiate between file sizes that are bytes vs. kilobytes vs. gigabytes.
The output:
-rw-r--r-- 1 ghing 1248616752 11K Mar 23 12:39 Public Health Spending Brief_2019 (1).txt
This is an image PDF:
ls -lh Screen\ Shot\ 2020-03-23\ at\ 12.29.57\ PM.png.txt
The output:
-rw-r--r-- 1 ghing 1248616752 1B Mar 23 12:39 Screen Shot 2020-03-23 at 12.29.57 PM.png.txt
You’ll notice that the extracted text for the text PDF is much larger (11K) than the one for the image PDF (1B).
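Checking two files by hand works for picking a threshold, but with thousands of files it may be easier to see all the sizes at once. This is a small sketch that prints the byte count of every .txt file, smallest first. The scratch files and their contents are stand-ins, not real extraction output.

```shell
# Scratch files standing in for extracted text (contents are made up).
demo_dir=$(mktemp -d)
printf '\f' > "$demo_dir/scan.txt"
printf 'Spending rose in 2019.\n\f' > "$demo_dir/brief.txt"

# Print "<bytes> <path>" for every .txt file, sorted numerically by size.
sizes=$(find "$demo_dir" -iname '*.txt' -exec wc -c {} \; | sort -n)
printf '%s\n' "$sizes"
```

If the sizes fall into two clear groups, the gap between them is a reasonable place to put the cutoff.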
So, we can once again use `find` to identify all text files (extracted using pdftotext) that are smaller than a certain size. You might have to tweak the size parameter. I’m kind of arbitrarily searching for files smaller than two bytes:
find . -iname '*.txt' -size -2c
Let’s break down that command:
- `find`: The name of the command.
- `.`: Search starting in the current folder.
- `-iname '*.txt'`: Find files that end in `.txt` or `.TXT`. `-iname` means case-insensitive. `-name` does the same thing but is case-sensitive.
- `-size -2c`: In addition to the name matching, matching files must be smaller than 2 bytes. The `c` specifies that the unit is bytes, which is kind of counterintuitive. See https://www.ostechnix.com/find-files-bigger-smaller-x-size-linux/ for more on the unit codes.
The output is just the image PDF:
./Screen Shot 2020-03-23 at 12.29.57 PM.png.txt
So, if you run this over a whole directory, you’ll get a list of only the text files whose PDFs are likely to contain only scanned images.
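One caveat with a fixed byte threshold: as far as I can tell, `pdftotext` writes a form feed character at the end of each page by default, so a multi-page image PDF may produce a text file of several bytes and slip past a 2-byte cutoff. A possible alternative, sketched here with made-up scratch files, is to flag text files that contain nothing but whitespace.

```shell
# scan.txt stands in for a 3-page image PDF: three form feeds, no text.
demo_dir=$(mktemp -d)
printf '\f\f\f' > "$demo_dir/scan.txt"
printf 'Spending rose in 2019.\n\f' > "$demo_dir/brief.txt"

# -exec doubles as a test: it is true when grep exits 0, i.e. when the file
# contains at least one non-whitespace character. Negating it with ! keeps
# only the files that are empty or whitespace-only.
likely_image=$(find "$demo_dir" -iname '*.txt' ! -exec grep -q '[^[:space:]]' {} \; -print)
printf '%s\n' "$likely_image"
```

This trades the size tweak for a different assumption, namely that an image PDF never yields any visible characters, so it is worth spot-checking against a few known files.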
Swap out `.txt` for `.pdf` and you’ll have a list of the PDF files.
We can actually pipe the previous `find` command through `sed` in order to replace .txt with .pdf:
find . -iname '*.txt' -size -2c | sed 's/\.txt$/.pdf/'
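Note this will be a little wonky if some of your files end in .PDF instead of .pdf, because the rewritten name will always end in lowercase .pdf and won’t match the original on a case-sensitive filesystem. There are a number of ways you can work around this, but that's beyond the scope right now.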
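If you do run into uppercase extensions, one possible workaround (a sketch, not the only way) is to skip `sed` and check which spelling of the PDF name actually exists on disk. The list of spellings to try and the scratch file names are my own assumptions.

```shell
# Scratch directory: an uppercase .PDF and its (empty) extracted text.
demo_dir=$(mktemp -d)
touch "$demo_dir/scan.PDF" "$demo_dir/scan.txt"

# For each small .txt file, print the first PDF spelling that exists.
originals=$(find "$demo_dir" -iname '*.txt' -size -2c -exec sh -c '
  base=${1%.*}                 # drop the .txt extension
  for ext in pdf PDF; do       # hypothetical list of spellings to try
    if [ -e "$base.$ext" ]; then
      printf "%s\n" "$base.$ext"
      break
    fi
  done
' sh {} \;)
printf '%s\n' "$originals"
```

Mixed-case extensions such as .Pdf would need to be added to the list by hand.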