Congratulations! Your Freedom of Information request finally yielded a big brown envelope in the mail. You are the proud owner of a juicy leak. You've managed to scrape all the PDFs from that stone-age open government portal. Now all you have to do is report.
In the course of my work on the Overview document mining software I've seen just about every problem that journalists can have with a document-driven story. These are the stories of unreadable formats, heaps of paper, and late nights reading.
When you're the proud owner of a brand new document dump, the next steps depend on what you've got and what you're trying to do. You might have one page or one million pages. You could be starting with a tall stack of paper or a CSV file or anything in between. Maybe you already know exactly what you're looking for, or maybe that anonymous tip was so non-specific you don't know where to start. This post is organized as a sort o