@nyurik
Last active February 15, 2022 06:53
Convenient OSM data

OpenStreetMap data is heavily normalized, making it very hard to process. Modeled on a relational database, it seems to have missed the second part of the "Normalize until it hurts; denormalize until it works" proverb.

Each node has an ID, and every way and relation uses an ID to reference that node. This means that every data consumer must keep an enormous cache of 8 billion node IDs and their corresponding lat,lng pairs while processing input data. In most cases, the node ID gets discarded right after parsing.
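
To make the cost concrete, here is a minimal sketch of what a consumer has to do today. The names (`node_cache`, `resolve_way`) and the sample coordinates are hypothetical and not taken from any OSM library; a real planet file would need billions of cache entries rather than three.

```python
# Hypothetical sketch: resolving a way's node references in the current OSM model.

# node ID -> (lat, lng). Every node must stay cached until all ways are read,
# because any later way may still reference it.
node_cache = {
    1: (52.5200, 13.4050),
    2: (52.5205, 13.4060),
    3: (52.5210, 13.4075),
}


def resolve_way(node_refs, cache):
    """Replace each node ID in a way with its cached (lat, lng) pair."""
    return [cache[ref] for ref in node_refs]


way_node_refs = [1, 2, 3]
geometry = resolve_way(way_node_refs, node_cache)
# After this step the node IDs are usually no longer needed.
```

With coordinates inlined into ways, both the cache and the lookup step would go away for this kind of consumer.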

I would like to propose a new, easy-to-process data structure for both bulk-download and streaming-update use cases.

Target audience

  • YES -- Data consumers who transform OSM data into something else, e.g. tiles, shapes, analytical reports, etc.
  • NO -- Apps that submit changes back to OSM, unless they also download individual objects in the original format with all IDs intact.

Data Specification

  • Split nodes into two types -- position node and content node:
    • A position node is a node that has no tags -- just a lat,lng coordinate pair. A position node's coordinates are inlined into ways and relations, and its ID is dropped from the output.
    • Content node is a regular node object with an ID, geo coordinate pair, and a list of tags (same as we have now).
  • A way has a list of geo points instead of a list of node IDs.
    • TBD: A way may have an optional list of content node IDs.
  • A relation has a list of values, where each value can be:
    • a lat,lng pair with an optional content node ID.
    • a way ID
    • a relation ID
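
One possible reading of this specification, sketched as Python dataclasses. All class and field names here are illustrative assumptions, not a finalized schema; member roles and the TBD list of content node IDs on ways are only hinted at through the optional `content_node_id` field.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Point:
    """An inlined coordinate pair (a former position node)."""
    lat: float
    lng: float
    # Assumption: set only when the point is also a content node.
    content_node_id: Optional[int] = None


@dataclass
class ContentNode:
    """A regular tagged node, same as in the current model."""
    id: int
    lat: float
    lng: float
    tags: Dict[str, str]


@dataclass
class Way:
    id: int
    points: List[Point]  # coordinates instead of node IDs
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class WayRef:
    way_id: int


@dataclass(frozen=True)
class RelationRef:
    relation_id: int


# A relation member is a point, a way ID, or a relation ID.
Member = Union[Point, WayRef, RelationRef]


@dataclass
class Relation:
    id: int
    members: List[Member]
    tags: Dict[str, str] = field(default_factory=dict)


road = Way(
    id=10,
    points=[Point(52.5200, 13.4050), Point(52.5205, 13.4060, content_node_id=2)],
    tags={"highway": "residential"},
)
route = Relation(
    id=20,
    members=[WayRef(10), Point(52.5210, 13.4075)],
    tags={"type": "route"},
)
```

In this sketch a consumer can build the geometry of `road` directly from `road.points`, without looking anything up by node ID.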

Streaming

  • If a single OSM node is moved, the change stream will include every object that contains that node.
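
A rough sketch of how a producer of such a stream might find the affected objects. It assumes the producer still has the original ID-based data and keeps a reverse index from node ID to referencing objects; the function names and sample data are hypothetical.

```python
from collections import defaultdict

# Producer-side data in the original OSM model (hypothetical sample values).
ways = {10: [1, 2, 3], 11: [3, 4]}            # way ID -> node IDs
relations = {20: [("node", 4), ("way", 10)]}  # relation ID -> (type, ID) members


def build_node_index(ways, relations):
    """Map each node ID to the set of objects that directly reference it."""
    index = defaultdict(set)
    for way_id, refs in ways.items():
        for ref in refs:
            index[ref].add(("way", way_id))
    for rel_id, members in relations.items():
        for kind, ref in members:
            # Only node members are inlined as lat,lng; ways stay referenced by ID.
            if kind == "node":
                index[ref].add(("relation", rel_id))
    return index


def objects_to_reemit(moved_node_id, index):
    """Objects whose inlined coordinates change when the given node moves."""
    return sorted(index.get(moved_node_id, ()))


node_index = build_node_index(ways, relations)
affected = objects_to_reemit(3, node_index)
```

Here moving node 3 re-emits ways 10 and 11. Relation 20 references way 10 only by ID, so under this reading it would not need to be re-emitted.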

Extras

  • Each PBF block should have uncompressed meta information (useful to skip unrelated info):
    • Number of each feature type contained in the block: counts of nodes, ways, relations, changesets (?).
    • Bounding box of all features in the block.
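
A minimal sketch of how a reader could use such per-block metadata to skip blocks without decompressing them. The `BlockMeta` and `BBox` types and the `should_decode` helper are assumptions made for illustration; the actual PBF header layout for this is not specified here, and bounding boxes that cross the antimeridian are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    min_lat: float
    min_lng: float
    max_lat: float
    max_lng: float

    def intersects(self, other):
        """True if the two boxes overlap or touch."""
        return (
            self.min_lat <= other.max_lat
            and other.min_lat <= self.max_lat
            and self.min_lng <= other.max_lng
            and other.min_lng <= self.max_lng
        )


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical uncompressed header stored in front of each block."""
    nodes: int
    ways: int
    relations: int
    bbox: BBox


def should_decode(meta, area, kind):
    """Decide from the header alone whether a block is worth decompressing.

    `kind` is one of "nodes", "ways", "relations".
    """
    return getattr(meta, kind) > 0 and meta.bbox.intersects(area)


berlin = BBox(52.3, 13.0, 52.7, 13.8)
blocks = [
    BlockMeta(nodes=8000, ways=0, relations=0, bbox=BBox(52.4, 13.1, 52.6, 13.5)),
    BlockMeta(nodes=0, ways=500, relations=0, bbox=BBox(48.0, 2.0, 49.0, 3.0)),
]
node_blocks = [i for i, m in enumerate(blocks) if should_decode(m, berlin, "nodes")]
```

A consumer interested only in ways inside the Berlin box would skip both sample blocks: the first has no ways, and the second lies outside the area.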

nyurik commented Feb 10, 2022

Thank you @mmd-osm !! It looks like @joto is leading the effort (?), but it also seems that the effort is far more involved than what I proposed -- I think he is trying to change the internal API and data storage model. If so, this would be far more complicated and take longer. I wonder if it would make sense to solve the "99%" problem and just provide an alternative data dump/streaming format, and once it's in place, work on changing the internals and/or API independently?

One issue @joto does mention is topology. But just as Jochen writes, the vast majority of the duplicate nodes at the same positions are errors. Plus it introduces ambiguity with updates -- if at first we strip node IDs from a way, and later a new feature is added with a node at an identical location, it would not be possible to determine if the new one is the same node or a different one. One solution here would be to force the OSM model to refuse duplicate nodes at the same location, forcing users to put separate nodes nearby instead. But this would create a slew of other issues: all editor tools would need to be aware of how a geo coordinate pair gets normalized into two 32-bit values, and ensure that they are different. Tricky, and the "obvious" solution is to treat this ultra-rare problem as non-existent... Not ideal, but it solves the problem for the other 99.999% of use cases.

P.S. A hacky workaround to handle dups just in the API: if multiple nodes share an identical location, treat them as separate, and nudge them all apart by the minimum distance in any direction.

joto commented Feb 11, 2022

Your proposed 99% solution has been in place for years with an extension of the OSM file formats where node locations are stored on the ways. See https://docs.osmcode.org/osmium/latest/osmium-add-locations-to-ways.html for a way to create a file like this. That format is just not that widely known, but some people use it.

nyurik commented Feb 11, 2022

Thanks! Has there been any work to maybe make the planet downloadable in this format? Having to run the osmium tool defeats the whole purpose, which is to spare each data consumer from doing this work.

joto commented Feb 11, 2022

@nyurik I don't know of anybody offering this as a reliable and sustained service. If and when the EWG decides they want to put money into this effort, I'd imagine it would be one of the first things we'd set up.

nyurik commented Feb 11, 2022

@joto thx. OSM might save some bandwidth costs by offering this as a download, but obviously the tech effort might cost far more than the savings. I'm still experimenting with it, and will look closer at that code and that data structure. I might want to re-implement it in Rust from a safety/security perspective, and to make it easier to integrate into other code as a simple-to-use lib. Plus it appears that the tool uses RAM for node resolution, which obviously wouldn't work on a laptop.

Has that flat structure been finalized, or is it still a work in progress?

mmd-osm commented Feb 11, 2022

I don't think locations-on-ways has changed since it was first described in this blog post: https://blog.jochentopf.com/2016-04-20-node-locations-on-ways.html

Implementation is part of libosmium, e.g.: https://github.com/osmcode/libosmium/blob/master/include/osmium/io/detail/pbf_output_format.hpp#L709

nyurik commented Feb 14, 2022

Per email discussion, I added an "Extras" block above:

  • Each PBF block should have uncompressed meta information (useful to skip unrelated info):
    • Number of each feature type contained in the block -- node, way, and relation counts.
    • Bounding box of all features in the block.

mmd-osm commented Feb 14, 2022

FYI: geographic indexing was part of the original PBF wiki page, but was dropped at one point: https://wiki.openstreetmap.org/w/index.php?title=PBF_Format&type=revision&diff=590371&oldid=589464

nyurik commented Feb 14, 2022

@mmd-osm thx! Do you know why it was removed? Also, that proposal doesn't mention how the indexing data was to be stored; it reads more like "we could store this as an extension, but we don't have a spec for that", if I read it correctly.

mmd-osm commented Feb 15, 2022

It was Scott Crosby who removed that section. Maybe the idea wasn’t mature enough to keep in a specification-like document. Only speculating here. There might have been some discussion on the mailing list back then.
