Identifying which PDFs are images vs. text
Someone who attended my PDF text extraction session at NICAR 2020 asked how to tell image PDFs from text PDFs when you have thousands of files in a mix of both formats. The end goal is to run OCR software only on the image PDFs.
This is how I would approach the problem using command-line tools.
- You’re working on a Mac or Linux machine where you have access to some common command-line utilities such as
- This should also work on Windows under the Windows Subsystem for Linux (WSL)