@sbma44
Last active September 7, 2019 15:25

How OpenAddresses Organizes the World's Address Data

Location search relies upon address data, which links latitude and longitude coordinates like 38.8976,-77.0365 to human-readable addresses like 1600 Pennsylvania Avenue. This is an important but underappreciated type of geodata, and one that existing open data projects like OpenStreetMap have had limited success at collecting.

Many governments publish address data, but they do so in different ways. The data can be distributed as a Shapefile, CSV, KML, ZIP or GeoJSON file, or via an API. It can be presented in a variety of different geographic projections and character encodings. The names of fields are not standardized: for example, longitude might be stored in a field called LONG, X, LNG or LON. The license terms associated with the data can vary. Sometimes the data points to specific houses and sometimes it is no more specific than property parcels. The data might come with useful additional fields, like ZIP code or city name. The government offering the data might be a city, a region or a national government. And of course each government that publishes address data does so on its own website, often in a hard-to-find place.
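
To make the field-name problem concrete, here is a minimal Python sketch of how a consumer might locate the coordinate columns in a file whose headers are not known in advance. The alias table and the helper names (FIELD_ALIASES, find_field, read_points) are hypothetical and are not taken from the OpenAddresses codebase. The longitude aliases come from the paragraph above; the latitude aliases are assumed by analogy.

```python
import csv
import io

# Hypothetical alias table. The longitude names come from the text above;
# the latitude names are assumed by analogy.
FIELD_ALIASES = {
    "lon": {"LONG", "X", "LNG", "LON"},
    "lat": {"LAT", "Y", "LATITUDE"},
}


def find_field(header, canonical):
    """Return the column in `header` that matches a known alias, or None."""
    aliases = FIELD_ALIASES[canonical]
    for name in header:
        if name.strip().upper() in aliases:
            return name
    return None


def read_points(text):
    """Yield (lon, lat) float pairs from CSV text with unknown column names."""
    reader = csv.DictReader(io.StringIO(text))
    lon_field = find_field(reader.fieldnames, "lon")
    lat_field = find_field(reader.fieldnames, "lat")
    if lon_field is None or lat_field is None:
        raise ValueError("could not identify coordinate columns")
    for row in reader:
        yield float(row[lon_field]), float(row[lat_field])


sample = "LNG,LAT,ADDR\n-77.0365,38.8976,1600 Pennsylvania Avenue\n"
points = list(read_points(sample))
```

Guessing from a list of aliases only goes so far, which is one reason it helps to record the correct field names for each source once, as metadata, rather than rediscovering them every time.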

The OpenAddresses project works to find and organize sources of address data, cataloguing the differences listed above so that people can easily use the data in a uniform way. Before OpenAddresses, a researcher trying to geolocate addresses in a given U.S. state might need to visit every one of the state’s counties’ websites, search them for address data, examine the files’ differences and load each into a shared database. Once she finished her project, that effort would probably be forgotten without others benefiting from it. OpenAddresses captures the information necessary to perform these steps -- the address source metadata -- in a standard way and makes it freely available to everyone. The project also uses this metadata to regularly fetch and process source address data into standardized CSV files that can be used immediately.

The licenses associated with each address data source vary considerably, but OpenAddresses makes a good-faith effort to classify the most common and relevant limitations. All the data catalogued by OpenAddresses is offered under some kind of open license allowing reuse, but the address data itself does not belong to OpenAddresses, and users of the data are obligated to respect the terms and conditions set by the rights holders. When OpenAddresses receives valid requests from rights holders asking that we remove their data from the project, we do so.

The metadata created by OpenAddresses contributors belongs to the project, so we are able to dedicate it to the public domain under Creative Commons CC0 terms. Anyone is free to take and use the metadata as they see fit.

The metadata is stored as JSON files in a GitHub repository. JSON is a widely used data format that is easily read both by humans and machines, and can be opened and manipulated by even the simplest text editing programs. GitHub is a collaborative website built around version-tracking software that makes it easy to preserve files’ histories and manage proposals to change them. GitHub occupies a prominent place in the culture of the global open source movement, which helps to make OpenAddresses discoverable and accessible to potential users. The idioms of collaboration on GitHub are known to tens of millions of users, lowering the barriers faced by potential OpenAddresses contributors.
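
As a rough illustration of what such a metadata file makes possible, the Python sketch below parses a small JSON document describing one source and uses it to rename a row's fields to standard names. Every key, value and URL in the JSON is an assumption made for this example and should not be read as the project's actual schema; load_source and standardize are likewise hypothetical helpers.

```python
import json

# Illustrative metadata for a single source. All keys and values are
# assumptions for this example, not the project's real schema.
source_json = """
{
  "coverage": {"country": "us", "state": "dc"},
  "data": "https://example.com/addresses.csv",
  "format": "csv",
  "license": "CC0",
  "fields": {"lon": "LNG", "lat": "LAT", "number": "HOUSE_NO", "street": "ST_NAME"}
}
"""

REQUIRED_KEYS = {"coverage", "data", "fields"}


def load_source(text):
    """Parse source metadata and check that the keys we rely on are present."""
    source = json.loads(text)
    missing = REQUIRED_KEYS - set(source)
    if missing:
        raise ValueError(f"metadata is missing keys: {sorted(missing)}")
    return source


def standardize(row, fields):
    """Rename a raw row's columns to the standard names given in `fields`."""
    return {standard: row[original] for standard, original in fields.items()}


source = load_source(source_json)
raw_row = {"LNG": "-77.0365", "LAT": "38.8976",
           "HOUSE_NO": "1600", "ST_NAME": "Pennsylvania Avenue"}
clean_row = standardize(raw_row, source["fields"])
```

Because the file is plain JSON, a proposed change to it shows up as an ordinary line-by-line difference that a reviewer can read without special tools.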

Anyone may contribute to OpenAddresses by submitting a GitHub “Pull Request” for an addition or revision to the project’s collection of metadata. Experienced project volunteers review these requests and either accept them, ask for revisions or explain why they cannot be accepted. Contributions are usually gratefully accepted, but sometimes they are rejected for reasons like unacceptable license terms or duplication of an existing source.

Behind the scenes, OpenAddresses runs software that watches the GitHub repository for Pull Requests of metadata. When it sees one, it attempts to fetch the data that the metadata points to, analyze it and map it. This same software periodically retrieves all data sources, making note of any that have gone offline, and composes downloadable archives of CSVs that are organized by geography and license terms.
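
The periodic retrieval step can be pictured with a short Python sketch. It is not the project's actual software: refresh, the shape of the source records and the fake_fetch stand-in are all hypothetical, and the fetch function is passed in as a parameter so that the example runs without network access.

```python
from collections import defaultdict


def refresh(sources, fetch):
    """Try to fetch every source and report what happened.

    Returns (archives, offline): `archives` maps a (region, license) pair
    to the names of sources fetched successfully, and `offline` lists the
    names of sources whose fetch raised OSError.
    """
    archives = defaultdict(list)
    offline = []
    for source in sources:
        try:
            fetch(source["data"])
        except OSError:
            # Note the failure and move on to the next source.
            offline.append(source["name"])
            continue
        archives[(source["region"], source["license"])].append(source["name"])
    return dict(archives), offline


# Hypothetical source records and a stand-in fetcher for demonstration.
sources = [
    {"name": "city-a", "region": "us/dc", "license": "CC0",
     "data": "https://example.com/a.csv"},
    {"name": "county-b", "region": "us/va", "license": "CC-BY",
     "data": "https://example.com/b.csv"},
    {"name": "city-c", "region": "us/dc", "license": "CC0",
     "data": "https://example.com/c.csv"},
]


def fake_fetch(url):
    """Pretend that one source has gone offline."""
    if url.endswith("b.csv"):
        raise OSError("unreachable")
    return b"..."


archives, offline = refresh(sources, fake_fetch)
```

Grouping by both region and license mirrors the idea that a user may want only the data for one area, or only the data whose terms suit their project.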

OpenAddresses represents a balance between making data usable and preserving its provenance. For instance, only nondestructive transformations, such as converting geographic coordinates into the standard EPSG:4326 (WGS 84) coordinate reference system, are applied to the data as it makes its way into CSV files. It is always possible to trace the data back to its source. However, this means that some operations must be performed by end users, notably deduplication in cases where data about an area is offered at multiple levels of government (e.g. by both a city and its containing state).
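
For readers wondering what that deduplication might involve, here is one simple approach sketched in Python, under stated assumptions: two records are treated as the same address when their house number and street name match after normalizing case and whitespace, and their coordinates agree once rounded to five decimal places. The function names, the record layout and the rounding precision are all choices made for this example, not something OpenAddresses prescribes.

```python
def dedupe_key(record, precision=5):
    """Build a comparison key from normalized text and rounded coordinates."""
    number = record["number"].strip().upper()
    # Collapse runs of whitespace and ignore case in the street name.
    street = " ".join(record["street"].upper().split())
    lon = round(float(record["lon"]), precision)
    lat = round(float(record["lat"]), precision)
    return (number, street, lon, lat)


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedupe_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical rows: the same address from a city and a state source,
# plus one unrelated address.
records = [
    {"number": "1600", "street": "Pennsylvania Avenue",
     "lon": "-77.0365", "lat": "38.8976", "source": "city"},
    {"number": "1600", "street": "PENNSYLVANIA  AVENUE",
     "lon": "-77.036501", "lat": "38.897602", "source": "state"},
    {"number": "1", "street": "First Street",
     "lon": "-77.0059", "lat": "38.8899", "source": "city"},
]
unique = deduplicate(records)
```

Rounding is a blunt instrument: two nearly identical points that fall on either side of a rounding boundary will not match, and abbreviations such as "Ave" are not handled at all. Which copy to keep when sources disagree is a further decision that this sketch leaves to the order of the input.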

OpenAddresses.io has collected over 488 million address points from countries around the world, and is used by a variety of researchers, governments and some of the world’s largest mapping companies. This is possible thanks to the forward-looking open data policies of the governments whose data the project catalogues and thanks to the dozens of volunteers who make the project a success. We believe OpenAddresses is both an important source of open geodata and an effective model for organizing authoritative but messy information across government institutions.
