toddwschneider

## nyc_taxi_uniquely_identifiable.md

Created December 1, 2015 12:28

How many NYC taxi trips are uniquely identifiable by census tracts and the hour of pickup time

40% of NYC Taxi Trips are Uniquely Identified by Pickup/Drop Off Census Tracts and Hour

In my recent post analyzing 1.1 billion NYC taxi and Uber trips, I included a section about privacy concerns which showed how precise latitude/longitude coordinates of taxi pickups and drop offs could potentially be used to reveal personal information about where people live, work, socialize, etc.
I wrote that if the Taxi & Limousine Commission wanted to avoid disclosing personal information, they would have to remove latitude/longitude from the dataset, perhaps replacing them with coarser census tract location data. It now appears that census tracts may still be too precise.
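
To make "uniquely identified" concrete, here is a minimal sketch of how such a fraction could be computed. It is not the query behind the 40% figure: the `trips` data frame, its column names, and its values are all made up for illustration, and it assumes "hour" means a specific calendar hour rather than hour of day.

```r
# Hypothetical toy data: one row per trip. Tract IDs and hours are invented
# for illustration and do not come from the TLC dataset.
trips <- data.frame(
  pickup_tract  = c("A", "A", "A", "B", "B", "C"),
  dropoff_tract = c("B", "B", "C", "A", "A", "C"),
  pickup_hour   = c("2015-06-01 08", "2015-06-01 08", "2015-06-01 08",
                    "2015-06-01 09", "2015-06-01 10", "2015-06-01 09"),
  stringsAsFactors = FALSE
)

# Build one key per trip from the three coarse attributes
key <- paste(trips$pickup_tract, trips$dropoff_tract, trips$pickup_hour,
             sep = "|")

# For each trip, count how many trips share its key
trips_per_key <- ave(rep(1, nrow(trips)), key, FUN = sum)

# A trip is uniquely identified when no other trip shares its key
unique_fraction <- mean(trips_per_key == 1)
print(unique_fraction)
```

In this toy data, only the first two trips share a key, so four of the six trips are uniquely identified. On the real data the same grouping would more plausibly be done in the database than in memory.
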
I hadn't previously investigated how well census tracts uniquely identify pickups and drop offs, but **it turns out that if you

  
## state_analysis.R
library(maptools)
library(geosphere)

# load USA state-level spatial data
# download from http://gadm.org
# click the 'download' tab
# select country = 'united states', file format = 'R', click ok
# download 'level 1' for state-level data
load("USA_adm1.RData")

## rg_dyno_sim.R
# you can make a text file of request times (in ms, one number per line) and import it here, or you can use a probability distribution to simulate request times (see below where setting req_durations_in_ms)
# rq = read.table("~/Downloads/request_times.txt", header=FALSE)$V1

# argument notes:
# parallel_router_count is only relevant if router_mode is set to "intelligent"
# choice_of_two, power_of_two, and unicorn_workers_per_dyno are only relevant if router_mode is set to "naive"
# you can only select one of choice_of_two, power_of_two, and unicorn_workers_per_dyno

run_simulation = function(router_mode = "naive",
                          reqs_per_minute = 9000,