Skip to content

Instantly share code, notes, and snippets.

@jenningsanderson
Last active October 4, 2023 15:12
Show Gist options
  • Save jenningsanderson/3e42a99dcb8f760038ad8aa47ea38ce8 to your computer and use it in GitHub Desktop.
Save jenningsanderson/3e42a99dcb8f760038ad8aa47ea38ce8 to your computer and use it in GitHub Desktop.
"Analysis Ready" Daylight OSM Distribution Available on AWS

What are the Daylight OpenStreetMap Parquet Files?

Listed on the registry of Open Data on AWS, the Daylight OpenStreetMap Parquet files contain the latest Daylight Map Distribution of OpenStreetMap in an analysis-ready format. This dataset is optimized for cloud-based queries with Amazon Athena, meaning anyone can access the entire dataset with SQL queries in the browser, without the need to download or access the files directly.

The Daylight Map Distribution of OpenStreetMap is always openly available for download in the standard OSM PBF format (find it at daylightmap.org). The parquet files, however, were first made available alongside Daylight release v1.9. They contain fully resolved geometries and additional metadata including areas, lengths, and quadkeys, not present in the PBF.

In total, the OSM features files contain nearly 1B features including 178M+ nodes, 816M+ ways, and 5M+ relations. These are all renderable features, meaning they have geometries that can be rendered on a map. The OSM features files therefore do not include untagged nodes or relations without basic geometries (such as turn restrictions). Untagged objects can also be found in this analysis-ready format in the OSM elements files, described in the comments.

Setting Up The daylight_osm_features Table

The following CREATE TABLE command will add the daylight_osm_features table to your AWS data (i.e. Glue) catalog. For best performance, make sure your Athena console is running in the same region as the s3://daylight-openstreetmap bucket: us-west-2.

CREATE EXTERNAL TABLE `daylight_osm_features`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double,
  `square_meters` double)
PARTITIONED BY ( 
  `release` string,
  `type` string)
STORED AS PARQUET
LOCATION
  's3://daylight-openstreetmap/parquet/osm_features/'

Next, load the partitions by running: MSCK REPAIR TABLE daylight_osm_features.

Remember to use WHERE release = to query only one dataset (such as v1.10 or v1.9). These are the release numbers of the Daylight Map Distribution. Omitting this WHERE clause will result in querying duplicate features.

Basic Examples

1. Total kilometers of different highway features in OSM?

SELECT tags [ 'highway' ] AS highway_tag,
    count(id) AS osm_features,
    count(distinct(changeset)) AS total_changesets,
	sum(linear_meters) / 1000 AS total_km
FROM daylight_osm_features
WHERE linear_meters > 0 AND tags['highway'] IS NOT NULL
  AND release='v1.9'
GROUP BY tags [ 'highway' ]
ORDER BY total_km DESC

2. Extract point geometries for all hospital features in OSM?

SELECT 
    tags['name'] as name,
    ST_CENTROID(ST_GEOMETRYFROMTEXT(wkt)) as center,
    tags['amenity'] as amenity
FROM daylight_osm_features
WHERE tags['amenity'] LIKE 'hospital'
  AND release='v1.9'
@jenningsanderson
Copy link
Author

jenningsanderson commented Dec 23, 2021

Extrapolating on the Hospital Features query, here is a comaprison of all amenity=hospital (red) and amenity=ice_cream (blue) features in OpenStreetMap. You can see that ice cream shops are far more prevalent on coasts around the world (shown almost outlining Western Europe), with high density of both features (shown in white) in major cities globally.
image
Data (c) OpenStreetMap Contributors

Albeit a silly example, this query takes ~25 seconds and generated 16MB of data, demonstrating both the power and simplicity (fewer than 5 lines) of this approach to OSM data analysis.

@jenningsanderson
Copy link
Author

For those familiar with the OSM data model, The OSM Elements files include all of the untagged nodes as well as the refs and roles for relation members and nodes within ways. The OSM Elements table contains all of the information in the Daylight OSM PBF, in addition to resolved geometries (where possible), bounding box, length, area, and quadkey.

The OSM Elements Table can be created with this query:

CREATE EXTERNAL TABLE `daylight_osm_elements`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double, 
  `square_meters` double, 
  `node_refs` array<bigint>, 
  `node_roles` array<string>, 
  `way_refs` array<bigint>, 
  `way_roles` array<string>, 
  `rel_refs` array<bigint>, 
  `rel_roles` array<string>)
PARTITIONED BY ( 
  `release` string, 
  `type` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://daylight-openstreetmap/parquet/osm_elements'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'parquet.compression'='GZIP')

@jonah-dawg
Copy link

jonah-dawg commented Apr 21, 2023

Is there an IAM role or policy needed in addition to any of this?

I’m encountering a “user denied” error when running the MSCK command to load the partitions. In the past we’ve created tables over external OSM data so I doubt that’s it

@jenningsanderson
Copy link
Author

@jonah-dawg - interesting, maybe double check your Athena results destination / permissions your user has over these types of queries. My guess is that the error is coming from athena trying to write the results of the partition load query.

Yes, if you’re using an IAM role for this, there are a number of glue and Athena permissions on your account required to run queries / affect the data catalog, typically the error returned however lists exactly the permission required.

@jenningsanderson
Copy link
Author

Additionally, it might not be necessary to load all partitions. You can use the “alter table, add partition” syntax for release=v1.25 if you only want the latest data. Maybe see if that returns a more descriptive error with regard to permissions?

@jonah-dawg
Copy link

@jenningsanderson thanks for the recommendation, it did reveal some more explicit errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment