Someone who attended my PDF text extraction session at NICAR 2020 asked how to tell image PDFs from text PDFs when you have thousands of files in a mix of both formats. The end goal is to run OCR software only on the image PDFs.
This is how I would approach the problem using command-line tools.
- You’re working on a Mac or Linux machine where you have access to some common command-line utilities such as `find` and `sed`
- This should also work under the Windows Subsystem for Linux (WSL) on Windows
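If you're not sure whether those utilities are available on your machine, a quick check like the one below can help. This is only a minimal sketch: the `missing` variable and the messages it prints are my own illustration, and it assumes a POSIX-style shell such as bash or zsh.

```shell
# Report whether each utility assumed above is on the PATH.
# "missing" is a hypothetical flag name: 0 means everything was found.
missing=0
for tool in find sed; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
    missing=1
  fi
done
```

If either tool is reported as missing, install it with your system's package manager before continuing.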