Skip to content

Instantly share code, notes, and snippets.

@toddwschneider
toddwschneider / nyc_taxi_uniquely_identifiable.md
Created December 1, 2015 12:28
How many NYC taxi trips are uniquely identifiable by census tracts and the hour of pickup time

40% of NYC Taxi Trips are Uniquely Identified by Pickup/Drop Off Census Tracts and Hour

In my recent post analyzing 1.1 billion NYC taxi and Uber trips, I included a section about privacy concerns which showed how precise latitude/longitude coordinates of taxi pickups and drop offs could potentially be used to reveal personal information about where people live, work, socialize, etc.

I wrote that if the Taxi & Limousine Commission wanted to avoid disclosing personal information, they would have to remove latitude/longitude from the dataset, perhaps replacing them with coarser census tract location data. Now it seems like maybe census tracts are still too precise.

I hadn't previously investigated how well census tracts uniquely identify pickups and drop offs, but **it turns out that if you

library(maptools)
library(geosphere)
# load USA state-level spatial data
# download from http://gadm.org
# click the 'download' tab
# select county = 'united states', file format = 'R', click ok
# download 'level 1' for state-level data
load("USA_adm1.RData")
# you can make a text file of request times (in ms, one number per line) and import it here, or you can use a probability distribution to simulate request times (see below where setting req_durations_in_ms)
# rq = read.table("~/Downloads/request_times.txt", header=FALSE)$V1
# argument notes:
# parallel_router_count is only relevant if router_mode is set to "intelligent"
# choice_of_two, power_of_two, and unicorn_workers_per_dyno are only relevant if router_mode is set to "naive"
# you can only select one of choice_of_two, power_of_two, and unicorn_workers_per_dyno
run_simulation = function(router_mode = "naive",
reqs_per_minute = 9000,