@toddwschneider
Created December 1, 2015 12:28
How many NYC taxi trips are uniquely identifiable by census tracts and the hour of pickup time

40% of NYC Taxi Trips are Uniquely Identified by Pickup/Drop Off Census Tracts and Hour

In my recent post analyzing 1.1 billion NYC taxi and Uber trips, I included a section about privacy concerns which showed how precise latitude/longitude coordinates of taxi pickups and drop offs could potentially be used to reveal personal information about where people live, work, socialize, etc.

I wrote that if the Taxi & Limousine Commission wanted to avoid disclosing personal information, they would have to remove latitude/longitude from the dataset, perhaps replacing them with coarser census tract location data. Now it seems that census tracts might still be too precise.

I hadn't previously investigated how well census tracts uniquely identify pickups and drop offs, but it turns out that if you know the census tracts for both the pickup and the drop off, plus the pickup time truncated to the hour, then you can uniquely identify 40% of NYC taxi trips.

The ability to identify a trip depends on several factors, including geography and time of day: trips in dense areas during rush hour are less likely to be identifiable, while trips in remote areas at off-peak hours are more likely to be identifiable.
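As a rough illustration of the calculation, here is a minimal Python sketch that counts how many trips share each (pickup tract, drop off tract, pickup hour) combination and reports the share of trips whose combination occurs exactly once. The tuple layout, the function names, and the toy tract IDs are assumptions made for illustration; this is not the query used for the actual analysis.

```python
from collections import Counter
from datetime import datetime


def truncate_to_hour(ts):
    """Drop minutes and seconds, e.g. 08:40 becomes 08:00."""
    return ts.replace(minute=0, second=0, microsecond=0)


def unique_trip_fraction(trips):
    """trips: list of (pickup_tract, dropoff_tract, pickup_datetime) tuples.

    Returns the fraction of trips whose (pickup tract, drop off tract,
    truncated pickup hour) combination appears exactly once.
    """
    keys = [(p, d, truncate_to_hour(t)) for p, d, t in trips]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Toy data with made-up tract IDs (not real census tracts)
sample_trips = [
    ("A", "B", datetime(2015, 6, 1, 8, 5)),
    ("A", "B", datetime(2015, 6, 1, 8, 40)),  # same tracts and hour as above
    ("A", "C", datetime(2015, 6, 1, 8, 15)),
    ("D", "E", datetime(2015, 6, 1, 4, 30)),
    ("A", "B", datetime(2015, 6, 1, 9, 10)),
]

print(unique_trip_fraction(sample_trips))  # 3 of 5 trips are unique: 0.6
```

On the toy data, the two 8 AM trips from A to B hide each other, while the other three trips are each the only one in their group.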

Here's a map that shows the percentage of pickups in each tract that are uniquely identified by pickup tract, drop off tract, and pickup time truncated to the hour. Darker regions of the map are areas where more trips are uniquely identifiable:

[Map: percentage of uniquely identifiable trips by pickup census tract]

(click the map to view an interactive version including data by tract)

The area around Penn Station is the most anonymous place to hail a cab: only 14% of the trips starting there are uniquely identifiable. Conversely, over 90% of trips starting in many parts of the outer boroughs are identifiable. The percentage of identifiable trips by borough (note that most Queens trips are from the airports):

  1. Manhattan: 38%
  2. Queens: 53%
  3. Staten Island: 73%
  4. Brooklyn: 87%
  5. Bronx: 91%

About 35% of trips during the peak daytime hours are uniquely identifiable, while over 70% of trips between 4 AM and 6 AM are identifiable:

[Chart: percentage of uniquely identifiable trips by hour of day]
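An hour-of-day breakdown like this one can be sketched in the same way: flag each trip as unique or not based on its (pickup tract, drop off tract, truncated pickup hour) group, then aggregate the flags by hour of day across all dates. The function name and toy data below are hypothetical, and the sketch only approximates how the figures above might be produced.

```python
from collections import Counter, defaultdict
from datetime import datetime


def unique_fraction_by_hour_of_day(trips):
    """trips: list of (pickup_tract, dropoff_tract, pickup_datetime) tuples.

    Returns {hour_of_day: fraction of trips in that hour that are unique}.
    Uniqueness is judged per calendar hour; results are pooled by hour of day.
    """
    keys = [
        (p, d, t.replace(minute=0, second=0, microsecond=0))
        for p, d, t in trips
    ]
    counts = Counter(keys)
    totals = defaultdict(int)
    uniques = defaultdict(int)
    for key in keys:
        hour = key[2].hour
        totals[hour] += 1
        if counts[key] == 1:
            uniques[hour] += 1
    return {hour: uniques[hour] / totals[hour] for hour in sorted(totals)}


# Toy data with made-up tract IDs, spanning two days
sample_trips = [
    ("A", "B", datetime(2015, 6, 1, 8, 5)),
    ("A", "B", datetime(2015, 6, 1, 8, 40)),  # shares a group with the trip above
    ("A", "C", datetime(2015, 6, 2, 8, 15)),
    ("D", "E", datetime(2015, 6, 1, 4, 30)),
]

for hour, fraction in unique_fraction_by_hour_of_day(sample_trips).items():
    print(f"{hour:02d}:00  {fraction:.0%}")
```

Grouping by borough instead of hour of day would follow the same pattern, given a lookup from pickup tract to borough.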

Maybe 40% is a small enough number overall that it's fine to include pickup and drop off census tracts in public data. Census tracts would still be an improvement over latitude/longitude from a privacy perspective, since tracts don't identify exact homes, businesses, and other establishments, but nevertheless I was surprised by the high rate at which tracts uniquely identify trips in a city as dense as New York.

On some level I probably shouldn't be surprised: a famous paper by Latanya Sweeney showed that 87% of the U.S. population is uniquely identified by date of birth, gender, and ZIP code, though a more recent paper by Philippe Golle put the number at 63% of the population. Either way, we probably have a bias toward underestimating how easy it is to identify people from seemingly limited data, and any organization releasing public data should be cognizant of that.


ghost commented Dec 3, 2015

I think you should use either pickup OR drop off, but not both, since including both makes little sense. Typically an attacker is interested in figuring out either the destination or the source and already has information about the other.

Finally, an often overlooked point about the Sweeney et al. paper is that people change their ZIP codes when they move.
