if !database: wget + grep
The Federal Aviation Administration is posting PDFs of the Section 333 exemptions that it grants, i.e. the exemptions for operators who want to fly drones commercially before the FAA finishes its rulemaking. A journalist wanted to look for exemptions granted to operators in a given U.S. state. But the FAA doesn't appear to have an easy-to-read data file to use and doesn't otherwise list exemptions by location of operator.
However, since their exemptions page is just one giant HTML table for listing the PDFs, we can just use wget to fetch all the PDFs, run pdftotext on each file, and then grep to our hearts' delight.
- If you trust the process described below and/or want to use up my bandwidth...here's a 1.6GB zip file of the PDFs and extracted text, as downloaded on Sept. 30, 2015.
- Update: Or just peruse a repo of the PDFs and their text extracts here.
- Or, you could just use the data provided and curated by The Verge on their Github repo, which they collect in partnership with Drone Center at Bard College.
- Here's a story they wrote about Section 333
But you should try doing this yourself. Maybe you don't need total control over how the documents are collected and filtered (though you should if you want to do any in-depth research while having an easy-to-update mirror of the documents), but I find that being able to interactively and speedily do a full-text search across a document set usually spawns new ideas and discoveries beyond what you had intended to find. The steps below are repeatable using free software for a *nix system (I'm on OS X 10.10 and brew installed wget, xpdf, and ack) and the FAA's site is as robust and good as any to practice data-mining on.
1. Wget the page
You could write a web scraper and carefully parse the links. Or you could just notice that everything you need is in one HTML file and has a
The following wget command will get every file with a pdf extension and store it into a relative directory named faa_333_pdfs. I've used the most verbose versions of the flags in the example below, with
--level being the most important options; check out the wget manual for more information and options.
wget --accept=pdf --recursive --level=1 \
     --no-directories \
     --directory-prefix=faa_333_pdfs \
     https://www.faa.gov/uas/legislative_programs/section_333/333_authorizations/
Warning: You will end up downloading 1,800+ PDFs weighing in at a total of 1.6+ gigabytes
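With that many files, a few downloads may come back truncated or as HTML error pages saved under a .pdf name. Here is a minimal sketch of a sanity check, assuming only that a valid PDF begins with the bytes %PDF; the function name list_bad_pdfs is my own and not part of any tool:

```shell
# Hypothetical helper: print every *.pdf under a directory (default:
# faa_333_pdfs) whose first four bytes are not the "%PDF" signature.
list_bad_pdfs() {
  find "${1:-faa_333_pdfs}" -type f -name '*.pdf' -exec sh -c '
    for f do
      # tr strips any NUL bytes so the comparison stays a plain string test
      [ "$(head -c 4 "$f" | tr -d "\0")" = "%PDF" ] || printf "%s\n" "$f"
    done' sh {} +
}
```

Running list_bad_pdfs faa_333_pdfs after the download should print nothing if every file at least looks like a PDF; anything it does print is a candidate for re-fetching.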
2. Extract texts from PDFs
If you have pdftotext (via xpdf) installed on your system, then you can just batch convert everything to text...here's the command in bash (which is sloppy, you could do it in Python or at least use find...I didn't check to see if any of the filenames were funky)...the following loop runs pdftotext for each filename, then creates a corresponding text file with .txt at the end:
find . -name '*.pdf' -print0 | while IFS= read -r -d '' fname; do
  echo "$fname"   # print the name for reference
  pdftotext -layout "$fname" "$fname.txt"
done
A few notes
- A plain "for fname in *.pdf; do" loop works in this case, but it's better to be safe with the more verbose find command. I used the StackOverflow answer here. I honestly am terrible at being mindful of the edge cases. Government IT systems seem to be even more picky/limited with filename conventions, so it's rare to download a file with non-alphanumeric characters, never mind one with newlines or null characters. But better safe than sorry.
- pdftotext threw up some errors, e.g. "Syntax Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array", and I didn't look to see what the deal was. You can (and should) re-run pdftotext yourself once you have the PDFs downloaded; it only takes a few minutes on a modern laptop.
- The -layout option outputs the text in a form similar to how it is physically laid out in the PDF (screenshot below). This may or may not be what you really need when doing a full-text, multi-line search.
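Since pdftotext complained about some files, it may be worth checking which PDFs ended up without usable text. This is a rough sketch that assumes the PDF-name-plus-.txt naming from the loop above; list_unconverted is a made-up name:

```shell
# Hypothetical helper: print every *.pdf under a directory (default: the
# current one) that has no matching non-empty "<name>.pdf.txt" next to it.
list_unconverted() {
  find "${1:-.}" -type f -name '*.pdf' -exec sh -c '
    for f do
      # -s is true only when the file exists and is larger than zero bytes
      [ -s "$f.txt" ] || printf "%s\n" "$f"
    done' sh {} +
}
```

Anything printed by list_unconverted is a file to re-run pdftotext on, or to open by hand. Note that this only catches missing or zero-byte text files, not text that was extracted badly.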
3. Text search
Then you can just grep or whatever...probably easiest to use a text/project editor like Sublime/Atom and do a project-folder search over the text files, for more interactivity...or even just use the OS X Finder's normal search. I'm using ack in the example below, which is basically like grep but with more colors:
Obviously, turning this into an accessible data table is not something I recommend doing from the command-line. But maybe you just need to do a quick look-see, in which case, grep is absolutely the way to go. Though I prefer ag, which supports multi-line PCRE regex searching (as does ack).
And of course, check out the cleaned and sorted data from The Verge / Drone Center at Bard College, if only as a reference point.
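For the original question (which exemptions mention a given state), a plain grep over the extracted text might look like the sketch below. It assumes the .txt files produced in step 2; files_mentioning is a hypothetical name, and a bare state name will of course also match mentions that have nothing to do with the operator's location:

```shell
# Hypothetical helper: list the text files under a directory (default: the
# current one) that contain a term, matched case-insensitively.
# usage: files_mentioning TERM [DIR]
files_mentioning() {
  # -F: treat TERM as a literal string, -i: ignore case,
  # -l: print only the names of matching files
  find "${2:-.}" -type f -name '*.txt' -exec grep -F -i -l -- "$1" {} +
}
```

For example, files_mentioning "Texas" faa_333_pdfs prints one filename per matching document, which you can then open or pipe into wc -l for a rough count.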
4. Zip and upload to S3 or what have you
You don't need to care about this but I do because I frequently forget how to zip and send files to my cloud storage (via AWS CLI):
cd ..   # assuming you were in the downloaded-files subdirectory
zip -r faa_333_pdfs.zip faa_333_pdfs
aws s3 cp faa_333_pdfs.zip s3://mah-s3-bucket/faa_333_pdfs.zip --acl public-read
And of course, one big static HTML page is just perfect for wget's --accept option, which I think is a more ideal tool for this situation compared to the (admittedly useful) DownThemAll plugin. And running wget on a public government site to mirror pages or grab documents they've published is generally permissible -- though I've run into a few federal websites that will block wget unless you change the default user agent. This list of interesting datasets for computational journalists contains a few examples of government one-page file lists suitable for wgetting:
- California public employee salaries
- California school immunization rates
- California school AP, SAT, SAT results
- NYPD stop-and-frisk database
- Various pages on the Florida Department of Corrections site (which is invitingly titled, Data Mining on the Florida Department of Corrections Website)
- Sunlight Foundation's data files on Congressional office expenditures
- Congressional lobbyist database
- The UK police data archive
- NYC taxi trip data (warning: don't do this unless you have 100GB+ of hard drive space)
However, most of these multi-part databases require a significant amount of scripting post-download to assemble together -- here's my Bash and R gist for an older version of the NYPD stop and frisk dataset -- so using wget to avoid writing a scraper is probably not going to save you much time if you want to do anything besides grep for strings in delimited text files.
That data is crying out for an API!
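On the user-agent point above, here is one way the wget call could be wrapped so it is reusable for other one-page file lists. Treat it as a sketch: fetch_pdfs, the WGET override, and the user-agent string are all my own inventions for illustration, while --timestamping, --wait, and --user-agent are standard wget options:

```shell
# Hypothetical wrapper: fetch every PDF linked from one page.
# usage: fetch_pdfs URL [DIR]
# Set WGET=echo to print the command instead of running it (a dry run).
fetch_pdfs() {
  "${WGET:-wget}" \
    --accept=pdf --recursive --level=1 \
    --no-directories --directory-prefix="${2:-pdfs}" \
    --timestamping --wait=1 \
    --user-agent="Mozilla/5.0 (compatible; example-research-fetch)" \
    "$1"
}
```

The --timestamping flag skips files that have not changed since the last run, which helps when you want to keep a mirror up to date, and --wait=1 pauses a second between requests so the download is a little gentler on the server.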