nyurik/denormalize_osm_data.md

## denormalize_osm_data.md

      
    Raw
  

              denormalize_osm_data.md
            
          
    OpenStreetMap data is heavily normalized, making it very hard to process.
Modeled on a relational database, it seems to have missed the second part of the
"Normalize until it hurts; denormalize until it works" proverb.
Each node has an ID, and every way and relation uses an ID to reference that node. This means that every data consumer must keep an enrmous cache of 8 billion node IDs and corresponding lat,lng pairs while processing input data. In most cases, node ID gets discarded right after parsing.
I would like to propose a new easy to process data strucutre, for both bulk downloads and streaming update use cases.
Target audience


YES -- Data consumers who transform OSM data into something else, i.e. tiles, shapes, analytical reports, etc.
NO -- Apps that submit changes back to OSM, unless they also download individual objects in the original format with all IDs intact.

Data Specification


Split nodes into two types -- position node and content node:

Position node is a node that has no tags -- just a lat,lng coordinate pair. Position node's coordinates are inlined into ways and relations. Their ID is essentially deleted from the output.
Content node is a regular node object with an ID, geo coordinate pair, and a list of tags (same as we have now).


A way has a list of geo points instead of a list of node IDs.

TBD: A way may have an optional list of content node IDs.


A relation has a list of values, where each value can be:

a lat,lng pair with an optional content node ID.
a way ID
a relation ID


Streaming


If a single OSM node is moved, the change stream will include all objects that include that node

Extras


Each PBF block should have uncompressed meta information (useful to skip unrelated info):

Number of each feature type contained in the block: counts of nodes, ways, relations, changesets (?).
Bounding box of all features in the block.