@bmount
Created October 17, 2012 00:05
Data before APIs before Applications

This is a little feedback for people implementing open government (geo-) data projects — the good guys, like Bronwyn Agrios and Jay Nath. I had a little interaction today with some of you, related to a pedestrian safety dataset and interactive map:

@jay_nath @ajturner @bronwynagrios @shannonspanhake It takes more time for a programmer to figure out that API than to make a better map...

— Brian Mount (@brian_mount) October 15, 2012

The reason I was bummed enough to whine is that I had done a rough map some time ago with similar data for bikes and was interested to see how the pedestrian information compared. The dataset for bike crashes covered the same period as the Department of Public Health map (2005-2009), and was part of a SWITRS (the California Statewide Integrated Traffic Records System) disclosure that people were kicking around. I got a copy from some bike safety advocates, then saw and used a cleaned-up version from another project around the same time. It happens that the government interface to SWITRS is the worst possible interface to government data that has ever existed or could ever possibly exist. It is as if they created a checklist of every possible mistake, then made all of them:

  1. Start with potentially life-saving data
  2. Hide it behind a superfluous registration system
  3. In the user interface, allow submission of forms where broad data queries are possible
  4. In fact, make #3 the default
  5. When a user submits the default query, create a pop-up that says she can only query single counties
  6. When she queries a county over multiple years, create a pop-up that says she can only query one year at a time
  7. Once the user has completed steps 1-6, send her an email with an expiring link to a pdf of the tables she queried

It is actually worse than no data at all, and I had not taken another look at SWITRS since. This is why I was excited when I saw a raw dataset for the same period, covering the same area and potentially illuminating a similar set of street hazards. So when I saw the Department of Public Health link citing the SWITRS source I really wanted to see, for example, how things were going on Ocean Ave, a street I had noticed looked disproportionately dangerous in the bike data set. For further context: I am interested in curved, busy roads and places with no natural interruptions to rapid car movement. A very important aspect of crashes on that kind of street is the high proportion of them that police blame on the pedestrian, which is an available field in the SWITRS reports. A high rate of crashes blamed on pedestrians probably means people walking are forced to cross streets in the path of cars, which, combined with high speeds and limited visibility due to street geometries, creates a condition I think needs to be studied.

As noted in the DPH methodology report, individual crashes are rare enough that you have to find some meaningful category other than "the geographic location of a particular corner." I was thinking something like: crashes where a pedestrian was found "at fault" on streets of width X with a radius of curvature less than Y. The DPH map was a nice excuse to revisit the question, so when I saw that it had data I started clicking around. I spent about 20 minutes looking for any kind of downloadable data (i.e., just the original file provided by CHP to DPH). When I realized that a download was probably not available, I looked at the requests and responses to the web map, trying to get example query strings that I could use to pull down the data. The first three I checked were these: 1 2 3
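The radius-of-curvature filter is straightforward to approximate from street centerline points: the circumradius of three consecutive vertices (R = abc / 4A) gives a local curvature estimate. A minimal sketch, with made-up planar coordinates (the function name and the threshold idea are mine, not anything from the DPH report):

```python
import math

def circumradius(p1, p2, p3):
    """Radius of the circle through three planar points (local curvature estimate)."""
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # Twice the signed triangle area, via the 2D cross product
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    area = abs(cross) / 2.0
    if area == 0:
        return float("inf")  # collinear points: a straight segment
    return a * b * c / (4.0 * area)

# Three points on the unit circle should give a radius of ~1:
r = circumradius((1, 0), (0, 1), (-1, 0))
```

Sliding that window along each street's vertices and taking the minimum radius would flag the curved blocks worth cross-referencing with the "pedestrian at fault" field.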

So, what was included in the JSON API? Things like: a text-formatted dataUrl of the ESRI logo, lots of stuff like "esriServerHTMLPopupTypeNone", street geometries, partial crash records in a custom non-standard format. Things not included: all non-severe accidents, all the police report data, all the conditions on the street at the time of the accidents (e.g. rain), investigator's finding of fault, whether a truck was involved, road surface, several others, all of which were almost certainly in whatever records the DPH started with. Then I saw the API description, which was basically the same as the above, with an explicit limitation of 2000 records, painfully many verbose query parameters, bounding boxes in a special variant of Mercator rather than simple unprojected coordinates, and that's when I just got turned off. (If I did not care a lot about the subject, I would not have tweeted about it.) My suggestions for future data sharing:
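The bounding-box complaint is concrete: the Web Mercator meters used by those endpoints (EPSG:3857, or ESRI's 102100 variant) have to be converted back to plain lat/lon before most tools can use them. A minimal sketch of the standard spherical-Mercator inverse (the sample coordinates are just an illustrative point near San Francisco):

```python
import math

R = 6378137.0  # spherical Web Mercator earth radius, in meters

def mercator_to_lonlat(x, y):
    """Convert EPSG:3857 / 102100 meters to unprojected WGS84 degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2.0 * math.atan(math.exp(y / R)) - math.pi / 2.0)
    return lon, lat

lon, lat = mercator_to_lonlat(-13627361.0, 4547675.0)  # roughly (-122.4, 37.8)
```

Nothing hard, but it is one more undocumented step between the map and the data.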

Data before APIs

There are actually very few cases where, as a programmer, you would rather have government data through an API than in a simple, tabular format like CSV, available for download over HTTP or FTP. Highly time-sensitive things like live transit data or emergency announcements, maybe yes; everything else, no.
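Part of the point is that the toolchain for flat files already exists everywhere. A sketch of what "the whole dataset, one request" looks like in practice; the field names and the commented-out URL are hypothetical, and the inline sample stands in for a real download:

```python
import csv
import io

# With a real published file this would be one line, e.g.:
#   from urllib.request import urlopen
#   rows = list(csv.DictReader(io.TextIOWrapper(urlopen("https://example.gov/crashes.csv"))))

# The same logic against an inline sample (made-up records):
sample = """case_id,date,street,ped_at_fault
123,2007-05-01,Ocean Ave,Y
124,2008-11-12,Ocean Ave,N
"""
rows = list(csv.DictReader(io.StringIO(sample)))

# Ad-hoc questions become one-liners, no query-parameter manual required:
ocean_ped_fault = [r for r in rows if r["street"] == "Ocean Ave" and r["ped_at_fault"] == "Y"]
```

Compare that with registering, paginating through 2000-record responses, and reassembling a custom JSON format.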

Shapefiles are fine, Microsoft Access databases can usually be turned into something useful, dBase/dbf files are OK. The older the data exchange format, the better.

If you must have an API, and creating one does not divert resources from, for example, expunging all Microsoft "smart quotes" (“—”) and cedillas (ç) from your databases, just make it crystal clear how to get full datasets, and make sure the full dataset is actually full, not just the subset needed by the application vendor to create their application. (That's the "before applications" part. The application tail wags the data dog so often that I would avoid applications altogether until the vast swamp of missing civic data is filled in a bit.) It is demoralizing and exhausting to be faced with an API learning curve, or to have to write a custom scraper, or to parse some pdf table, in order to access data that was well and simply structured to begin with.
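The smart-quotes aside is a real chore in practice. A minimal sketch of normalizing the usual Windows-1252 punctuation to plain ASCII before data leaves a database; the replacement table is illustrative, not exhaustive:

```python
# Map common Windows-1252 "smart" punctuation to ASCII equivalents.
SMART_PUNCT = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "--",  # en dash, em dash
}

def to_ascii_punct(text):
    """Replace curly quotes and dashes with plain ASCII characters."""
    return text.translate(str.maketrans(SMART_PUNCT))

clean = to_ascii_punct("\u201cOcean Ave\u201d \u2014 crash data")
```

Running something like this over text fields on export costs little and spares every downstream consumer the same cleanup.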

There is no next part! There is a quote attributed to Félix Houphouët-Boigny, dictator of Ivory Coast, who supposedly said: "In the Ivory Coast, there is no two, three, or four, there is only one, and that's me." Replace "Ivory Coast" with "open government initiative" and "me" with "data downloads"!

Happy Ending

Since I last looked at the data, some researchers at Berkeley have created a model site (with the exception of requiring registration), and distribute 260 megabytes of compressed textfiles of all CHP records for the 2000s. It's here. An example of a great way to provide data before APIs would be a prominent link to that site when discussing the dataset, and preferably your own copy in case the link goes bad.

TL;DR: You can get a full database of all California traffic accidents from 2000-2010 here; PostGIS dump here.
