Skip to content

Instantly share code, notes, and snippets.

toddwschneider

Block or report user

Report or block toddwschneider

Hide content and notifications from this user.

Learn more about blocking users

Contact Support about this user’s behavior.

Learn more about reporting abuse

Report abuse
View GitHub Profile
@toddwschneider
toddwschneider / nyc_taxi_uniquely_identifiable.md
Created Dec 1, 2015
How many NYC taxi trips are uniquely identifiable by census tracts and the hour of pickup time
View nyc_taxi_uniquely_identifiable.md

40% of NYC Taxi Trips are Uniquely Identified by Pickup/Drop Off Census Tracts and Hour

In my recent post analyzing 1.1 billion NYC taxi and Uber trips, I included a section about privacy concerns which showed how precise latitude/longitude coordinates of taxi pickups and drop offs could potentially be used to reveal personal information about where people live, work, socialize, etc.

I wrote that if the Taxi & Limousine Commission wanted to avoid disclosing personal information, they would have to remove latitude/longitude from the dataset, perhaps replacing them with coarser census tract location data. Now it seems like maybe census tracts are still too precise.

I hadn't previously investigated how well census tracts uniquely identify pickups and drop offs, but **it turns out that if you

View state_analysis.R
library(maptools)
library(geosphere)
# load USA state-level spatial data
# download from http://gadm.org
# click the 'download' tab
# select county = 'united states', file format = 'R', click ok
# download 'level 1' for state-level data
load("USA_adm1.RData")
View rg_dyno_sim.R
# you can make a text file of request times (in ms, one number per line) and import it here, or you can use a probability distribution to simulate request times (see below where setting req_durations_in_ms)
# rq = read.table("~/Downloads/request_times.txt", header=FALSE)$V1
# argument notes:
# parallel_router_count is only relevant if router_mode is set to "intelligent"
# choice_of_two, power_of_two, and unicorn_workers_per_dyno are only relevant if router_mode is set to "naive"
# you can only select one of choice_of_two, power_of_two, and unicorn_workers_per_dyno
run_simulation = function(router_mode = "naive",
reqs_per_minute = 9000,
You can’t perform that action at this time.