Identifying which PDFs are images vs. text

Someone in my PDF text extraction session at NICAR 2020 asked how to tell image PDFs from text PDFs when you have thousands of files in a mixture of formats, with the end goal of only running OCR software on the image PDFs.

This is how I would approach the problem using command-line tools.

Assumptions

  • You’re working on a Mac or Linux machine where you have access to common command-line utilities such as find and sed.
    • This should also work under the Windows Subsystem for Linux (WSL) on Windows.
  • You have pdftotext installed, which we used in the NICAR session.

Overview

  • Run pdftotext on all the PDFs (with the help of find) to try to extract text.
  • Inspect the file sizes of the extracted text for a known image PDF and a known text PDF to determine a good size threshold. Common sense tells us image PDFs have no text to extract (without OCR, that is), so pdftotext should produce text files that are very small, only a few bytes.
  • Use find again to identify the text files that are really small.
  • Replace .txt with .pdf to get the original filenames (a combined sketch of these steps follows this list).
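
If you just want the punchline, the whole workflow boils down to the two commands below; the rest of this gist breaks down each piece, and the two-byte threshold in the second command is a starting point you may need to adjust for your own files.

find . -iname '*.pdf' -exec pdftotext {} \;

find . -iname '*.txt' -size -2c | sed 's/\.txt$/.pdf/'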

Run pdftotext on all files

For each PDF file, we want to run pdftotext on it and save the output to a .txt file.

You can leverage the find command with its -exec option to do this:

find . -iname '*.pdf' -exec pdftotext {} \;

Let’s break this command down:

  • find: The name of the command.
  • .: Search starting in the current folder (and its subfolders).
  • -iname '*.pdf': Match files ending in .pdf or .PDF (or .pDF, etc., for that matter).
  • -exec pdftotext {} \;: For each matching file, run pdftotext on that file. The {} is a placeholder that gets replaced with the matching filename.

This post has more info on using find with -exec: https://linuxaria.com/howto/linux-shell-how-to-use-the-exec-option-in-find-with-examples.

This will create a .txt file, containing the extracted text, for every PDF in the current directory (and subdirectories). Each .txt file has the same name as its PDF but ends in .txt instead of .pdf.
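
To sanity-check that the text files were actually created, you can reuse find to list them (head just limits the output to the first ten matches):

find . -iname '*.txt' | head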

Inspect the sizes of a known image and text PDF

Then look at the text output for a file that you know to be a text PDF and one that you know to be an image PDF.

For example, this is a text PDF:

ls -lh Public\ Health\ Spending\ Brief_2019\ \(1\).txt

Let’s break down this command:

  • ls: The command name. This just lists a file or files in a directory.
  • -l: Show additional information like file size and timestamp.
  • -h: Print numbers in human-readable forms. This is particularly important to be able to differentiate between file sizes that are bytes vs. kilobytes vs. gigabytes.

The output:

-rw-r--r--  1 ghing  1248616752    11K Mar 23 12:39 Public Health Spending Brief_2019 (1).txt

This is an image PDF:

ls -lh Screen\ Shot\ 2020-03-23\ at\ 12.29.57\ PM.png.txt

The output:

-rw-r--r--  1 ghing  1248616752     1B Mar 23 12:39 Screen Shot 2020-03-23 at 12.29.57 PM.png.txt

You’ll notice that the extracted text for the text PDF is much larger (11K) than the one for the image PDF (1B).
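
Rather than comparing files one at a time, one rough way to eyeball a good threshold is to sort all of the extracted text files by size. This sketch assumes the .txt files are in the current directory (not subdirectories); ls -S sorts largest first, so the likely image PDFs end up at the bottom of the listing:

ls -lhS *.txt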

Use find to find the really small text files

So, we can once again use find, this time to identify all of the text files (extracted using pdftotext) that are smaller than a certain size. You might have to tweak the size parameter. I’m kind of arbitrarily searching for files smaller than two bytes:

find . -iname '*.txt' -size -2c

Let’s break down that command:

  • find: The name of the command.
  • .: Search starting in the current folder.
  • -iname '*.txt': Find files that end in .txt or .TXT. -iname means case-insensitive. -name does the same thing but is case sensitive.
  • -size -2c: In addition to the name matching, matching files must be smaller than 2 bytes. The c specifies that the unit is bytes, which is kind of counterintuitive. See https://www.ostechnix.com/find-files-bigger-smaller-x-size-linux/ for more on the unit codes.

The output is just the text file that corresponds to the image PDF:

./Screen Shot 2020-03-23 at 12.29.57 PM.png.txt

So, running this over a whole directory of mixed PDFs gives you a list of just the files that are likely to contain only scanned images.
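
If two bytes turns out to be too strict for your files (pdftotext normally inserts a page-break character for each page, so a long scanned document can produce more than a byte or two of output), you can loosen the threshold. For example, this matches extracted text files smaller than 1,024 bytes and counts how many there are:

find . -iname '*.txt' -size -1024c | wc -l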

Determine the image PDF files from the text file names

Swap out .txt for .pdf and you’ll have a list of the PDF files.

We can actually pipe the previous find command through sed in order to replace .txt with .pdf:

find . -iname '*.txt' -size -2c | sed 's/\.txt$/.pdf/'

Note this will be a little wonky if some of your files end in .PDF instead of .pdf. There are a number of ways you can work around this, but that's beyond the scope right now.
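
If your end goal is to run OCR on just these files, here's a rough sketch of how you could feed the list to an OCR tool. The ocrmypdf command and the -ocr.pdf output name are just stand-ins for whatever tool and naming scheme you actually use; the while/read loop is there so filenames with spaces are handled correctly:

find . -iname '*.txt' -size -2c | sed 's/\.txt$/.pdf/' | \
  while IFS= read -r pdf; do
    ocrmypdf "$pdf" "${pdf%.pdf}-ocr.pdf"
  done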
