Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
"Analysis Ready" Daylight OSM Distribution Available on AWS

What are the Daylight OpenStreetMap Parquet Files?

Listed on the registry of Open Data on AWS, the Daylight OpenStreetMap Parquet files contain the latest Daylight Map Distribution of OpenStreetMap in an analysis-ready format. This dataset is optimized for cloud-based queries with Amazon Athena, meaning anyone can access the entire dataset with SQL queries in the browser, without the need to download or access the files directly.

The Daylight Map Distribution of OpenStreetMap is always openly available for download in the standard OSM PBF format (find it at daylightmap.org). The parquet files, however, were first made available alongside Daylight release v1.9. They contain fully resolved geometries and additional metadata including areas, lengths, and quadkeys, not present in the PBF.

In total, the OSM features files contain nearly 1B features including 178M+ nodes, 816M+ ways, and 5M+ relations. These are all renderable features, meaning they have geometries that can be rendered on a map. The OSM features files therefore do not include untagged nodes or relations without basic geometries (such as turn restrictions). Untagged objects can also be found in this analysis-ready format in the OSM elements files, described in the comments.

Setting Up The daylight_osm_features Table

The following CREATE TABLE command will add the daylight_osm_features table to your AWS data (i.e. Glue) catalog. For best performance, make sure your Athena console is running in the same region as the s3://daylight-openstreetmap bucket: us-west-2.

CREATE EXTERNAL TABLE `daylight_osm_features`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double,
  `square_meters` double)
PARTITIONED BY ( 
  `release` string,
  `type` string)
STORED AS PARQUET
LOCATION
  's3://daylight-openstreetmap/parquet/osm_features/'

Next, load the partitions by running: MSCK REPAIR TABLE daylight_osm_features.

Remember to use WHERE release = to query only one dataset (such as v1.10 or v1.9). These are the release numbers of the Daylight Map Distribution. Omitting this WHERE clause will result in querying duplicate features.

Basic Examples

1. Total kilometers of different highway features in OSM?

SELECT tags [ 'highway' ] AS highway_tag,
    count(id) AS osm_features,
    count(distinct(changeset)) AS total_changesets,
	sum(linear_meters) / 1000 AS total_km
FROM daylight_osm_features
WHERE linear_meters > 0 AND tags['highway'] IS NOT NULL
  AND release='v1.9'
GROUP BY tags [ 'highway' ]
ORDER BY total_km DESC

2. Extract point geometries for all hospital features in OSM?

SELECT 
    tags['name'] as name,
    ST_CENTROID(ST_GEOMETRYFROMTEXT(wkt)) as center,
    tags['amenity'] as amenity
FROM daylight_osm_features
WHERE tags['amenity'] LIKE 'hospital'
  AND release='v1.9'
@jenningsanderson
Copy link
Author

jenningsanderson commented Dec 23, 2021

Extrapolating on the Hospital Features query, here is a comaprison of all amenity=hospital (red) and amenity=ice_cream (blue) features in OpenStreetMap. You can see that ice cream shops are far more prevalent on coasts around the world (shown almost outlining Western Europe), with high density of both features (shown in white) in major cities globally.
image
Data (c) OpenStreetMap Contributors

Albeit a silly example, this query takes ~25 seconds and generated 16MB of data, demonstrating both the power and simplicity (fewer than 5 lines) of this approach to OSM data analysis.

@jenningsanderson
Copy link
Author

jenningsanderson commented Feb 16, 2022

For those familiar with the OSM data model, The OSM Elements files include all of the untagged nodes as well as the refs and roles for relation members and nodes within ways. The OSM Elements table contains all of the information in the Daylight OSM PBF, in addition to resolved geometries (where possible), bounding box, length, area, and quadkey.

The OSM Elements Table can be created with this query:

CREATE EXTERNAL TABLE `daylight_osm_elements`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double, 
  `square_meters` double, 
  `node_refs` array<bigint>, 
  `node_roles` array<string>, 
  `way_refs` array<bigint>, 
  `way_roles` array<string>, 
  `rel_refs` array<bigint>, 
  `rel_roles` array<string>)
PARTITIONED BY ( 
  `release` string, 
  `type` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://daylight-openstreetmap/parquet/osm_elements'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'parquet.compression'='GZIP')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment