Jonathan Stray jstray

You got the documents. Now what?

[omg documents.png]

Congratulations! Your Freedom of Information request finally yielded a big brown envelope in the mail. You are the lucky recipient of a juicy leak. You've managed to scrape all the PDFs from that stone-age government portal. Now all you have to do is the reporting.

Would that it were so easy. Your next steps depend on what you've got and what you're trying to do. You might have one page or one million pages. You could be starting with a tall stack of paper or a CSV file or anything in between. Maybe you already know exactly what you're looking for, or maybe that anonymous tip was maddeningly non-specific. In the course of my work on the Overview document-mining software I've seen just about every problem that a journalist can have with a document-driven story. These are the tales of unreadable formats, heaps of paper, and late nights reading. This post is organized as a sort of flowchart, a series of questions you can ask

In the course of my work on the Overview document-mining software I've seen just about every problem that journalists can have with a document-driven story. These are the stories of unreadable formats, heaps of paper, and late nights reading.

When you're the proud owner of a brand-new document dump, the next steps depend on what you've got and what you're trying to do. You might have one page or one million pages. You could be starting with a tall stack of paper or a CSV file or anything in between. Maybe you already know exactly what you're looking for, or maybe that anonymous tip was maddeningly non-specific. This post is organized as a sort of flowchart, a seri