@bethsebian
Last active April 18, 2016 01:12

"Police Data Tracker" is a website I built in mod 3 to aggregate and standardize police data from law enforcement agencies. My work in mod 3 was focused on developing a proof of concept, and for this reason I was not confronted with the entirety of the technical and strategic questions surrounding my work. Revisiting the project in mod 4, including defining and beginning to tackle my MVP, has proved much more difficult. Below, I outline the 3 most challenging issues I encountered and how I chose to resolve them.

### Defining data attributes for each data category

#### Problem

Disparities in each city's datasets complicate adding them to my database. Cities publish wildly different information. For example, consider the category of officer-involved shootings (OIS).

Los Angeles publishes 3 separate API endpoints to cover OIS: 1) data about the incident itself, 2) information about the officer(s) involved, and 3) information about the suspect(s) involved. This structure allows them to document incidents that involved multiple officers and/or residents. Their documentation is extensive:

Incident Info Fields: zip, approx_latitude, approx_longitude, state, incident_date, incident_number, reporting_district, city, incident_type, incident_location, handling_unit_name, handling_unit_id, geo_location, needs_recoding, longitude, latitude, human_address

Suspect Info Fields: zip, approx_latitude, wounded, suspect_age, approx_longitude, weapon_involved_category_desc, state, incident_date, on_parole, incident_number, reporting_district, city, incident_type, incident_location, weapon_involved_category, mental_health_concerns, of_involved_deputies, geo_location, needs_recoding, longitude, latitude, human_address, under_the_influence, deceased, criminal_history, suspect_race, on_probation

Officer Info Fields: approx_latitude, wounded, approx_longitude, state, incident_date, deputy_gender, city, incident_type, training, of_suspects, deputy_assigned_unit, years_of_service, zip, deputy_assigned_unit_name, involved_in_previous_shootings, deputy_race, weapon_involved_category_desc, district_attorney_action, incident_number, reporting_district, incident_location, weapon_involved_category, supsect_race, geo_location, needs_recoding, longitude, latitude, human_address, deputy_age
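Because all three of LA's feeds share an `incident_number` field, records from the separate endpoints can be stitched back together into one incident object that holds multiple officers and suspects. A minimal sketch in Python; the sample payloads below are hypothetical stand-ins for the real API responses:

```python
def join_ois_feeds(incidents, suspects, officers):
    """Group suspect and officer records under their parent incident,
    keyed on the incident_number field shared by all three feeds."""
    by_number = {}
    for inc in incidents:
        by_number[inc["incident_number"]] = {**inc, "suspects": [], "officers": []}
    for s in suspects:
        if s["incident_number"] in by_number:
            by_number[s["incident_number"]]["suspects"].append(s)
    for o in officers:
        if o["incident_number"] in by_number:
            by_number[o["incident_number"]]["officers"].append(o)
    return by_number

# One incident involving two officers (illustrative data)
incidents = [{"incident_number": "2016-001", "city": "Los Angeles"}]
officers = [
    {"incident_number": "2016-001", "deputy_race": "White"},
    {"incident_number": "2016-001", "deputy_race": "Hispanic"},
]
joined = join_ois_feeds(incidents, [], officers)
print(len(joined["2016-001"]["officers"]))  # 2
```

This is why the three-endpoint structure matters: a flat single-table feed would have to either duplicate the incident row per officer or drop officers beyond the first.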

Dallas, in comparison, has far fewer fields: suspect_weapon, suspect_deceased_injured_or_shoot_and_miss, geolocation, needs_recoding, longitude, latitude, human_address, location, officer_s, suspect_s, case, ag_forms, url, date, grand_jury_disposition. Their data also requires more parsing than LA's. For example, their officer_s field packs the officer's name, race, and gender into a single string: officer_s: "Hayden, Kevin B/M".
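A packed field like Dallas's `officer_s` can be split apart with a small parser. A sketch, assuming the `"Last, First RACE/GENDER"` pattern holds for every record (real data would likely need fallbacks for missing or differently formatted codes):

```python
import re

def parse_officer(raw):
    """Split a Dallas officer_s value like 'Hayden, Kevin B/M'
    into separate name, race code, and gender code fields."""
    match = re.match(r"^(.*?)\s+([A-Z]+)/([A-Z])$", raw.strip())
    if not match:
        # Pattern didn't hold; keep the name and leave the rest unknown
        return {"name": raw.strip(), "race": None, "gender": None}
    name, race, gender = match.groups()
    return {"name": name, "race": race, "gender": gender}

print(parse_officer("Hayden, Kevin B/M"))
# {'name': 'Hayden, Kevin', 'race': 'B', 'gender': 'M'}
```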

I had to decide which fields to include. On one hand, if I include data from comprehensive feeds like LA's, other cities will have lots of gaps in their data, and it will be hard for developers consuming my API to know which data is reliable. On the other hand, I hate to play to the lowest common denominator, both because of the norms it sets and because it means abandoning data that cities like LA have carefully cultivated.

#### Solution

  1. Select a middle ground for data columns: I chose to focus on race, which some cities publish and some don't, because it's one of the primary issues I want to bring transparency to and one I suspect my users will care about.

  2. Document best practices: I will include a best practices section to highlight exceptional data-collection practices. For example, in the case of officer-involved shootings, LA publishes whether officers had knowledge of mental health issues before arriving at the scene. LA is the only city to publish this info, so it seems a waste to build my database around it, but at the same time I'd like to encourage other cities to publish this info in the future. I hope that by bringing some transparency to these best practices, activists will have more details about what to ask for.

  3. Provide users a snapshot of what data is consistently available for each city. I envision a table on the documentation page that lists all cities, all attributes in the API, and a check mark if that city has data for that column.
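The steps above can be sketched together: normalize each city's records onto a shared set of columns, then derive the documentation-page coverage table from which columns each city actually fills. The field names and mappings here are illustrative assumptions, not the site's real schema:

```python
# A "middle ground" common schema (illustrative, not the real column list)
COMMON_FIELDS = ["incident_date", "city", "latitude", "longitude", "suspect_race"]

def normalize(record, field_map):
    """Map a city-specific record onto the common schema.
    field_map translates common field names to the city's own names;
    fields the city doesn't publish become None rather than vanishing."""
    return {f: record.get(field_map.get(f, f)) for f in COMMON_FIELDS}

def coverage(normalized_records):
    """For the documentation table: which common fields does this
    city consistently supply across all of its records?"""
    return {f: all(r[f] is not None for r in normalized_records)
            for f in COMMON_FIELDS}

# Dallas calls its date field "date", and buries race inside officer_s,
# so suspect_race has no direct source column
dallas_map = {"incident_date": "date"}
records = [normalize({"date": "2016-01-05", "city": "Dallas"}, dallas_map)]
print(coverage(records)["suspect_race"])  # False
```

Keeping the missing columns as explicit `None`s is what makes the check-mark table cheap to generate: coverage is just a scan over the normalized rows, not a second pass over each city's raw feed.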

### Selecting Cities To Target

I'd originally envisioned this site as a forum for sharing data on police agencies targeted by Dept of Justice consent decrees. Out of thousands of agencies across the country, this seemed a fitting way to scope my data and focus specifically on agencies that deserved more scrutiny. There are approximately 20 law enforcement agencies that meet these criteria.

#### Problem
After a few hours of research, I learned that there is very little machine-readable, online data for these agencies. I originally wanted to find at least one city for each category, and even that proved difficult. In other words, to track 8 categories of data for 20 cities, I hoped to identify 160 API endpoints, yet I could locate only 3-4. There were 2-3 more sets of data published in non-open formats such as PDFs or Excel docs, and these were not updated with any regularity.

#### Solution

There were two ways to tackle this problem:

  1. Put the tech side of my project to the side and double down on advocacy: submitting FOIA requests to the 20 agencies I'd targeted, and taking ownership of publishing their data as I received it. Having encountered a few other place-based police data initiatives (most notably North Carolina and Chicago), I knew that this could be effective, yet I also suspected that I would not be able to match the hours and legal expertise those initiatives brought to bear to get the data released.

  2. Expand the scope of my agencies to any with open data in the categories I've targeted (not just DOJ consent-decree agencies), flesh out the technical complexities using data that is currently open and publicly available, and then use that proof of concept to pitch my project to a larger advocacy organization that could tackle the FOIA requests and legal advocacy needed to open up the data of the agencies I most want to target (those accused of civil rights abuses by the DOJ).

I chose to go with option 2, not only because it was more viable, but also because my political strategy evolved as I researched my options. While cities with consent decrees do warrant special scrutiny, the goal of bringing police data to a broader audience is that it can be used to prevent situations that require intervention by the Justice Department in the first place.

Building this site around available data will help create success stories for cities already publishing data. This in turn could be a driver for cities that are not publishing data to start. It also seems that while option 2 could support option 1 (I could incorporate data released as a result of FOIA requests), option 1 locks me into a scope that limits the site's impact on police open data norms.
