tmcw/memory-format.md

## memory-format.md

      
# A faster format in Placemark

Placemark uses GeoJSON, and JSON, everywhere. In the database, features are GeoJSON. When we're sending features into Mapbox GL, it's GeoJSON getting sent with postMessage.
GeoJSON, of course, has its issues:

- Sending GeoJSON features across to the WebWorker that cuts tiles for Mapbox GL JS is a major bottleneck for the core editing experience.
- GeoJSON in resident memory is bigger than it needs to be. GeoJSON's coordinate arrays, especially, are an issue - flat arrays would be much more compact.
- Mapbox GL JS itself has to cut GeoJSON into tiles, which requires some transformation - it creates another flat representation of geometry coordinates.
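
To make the memory point concrete, here is a minimal sketch of what "flat arrays" could mean. The `flattenRing` helper is hypothetical and is not part of Placemark or Mapbox GL JS. It packs a nested GeoJSON ring into one `Float64Array`, which stores exactly 8 bytes per number in a single contiguous buffer, whereas the nested form allocates a separate JavaScript array for every position.

```typescript
// A GeoJSON position is [longitude, latitude]; a ring is an array of positions.
// This sketch assumes 2D positions (no altitude).
type Position = [number, number];

// Hypothetical helper: pack nested positions into one flat typed array,
// laid out as [x0, y0, x1, y1, ...].
function flattenRing(ring: Position[]): Float64Array {
  const flat = new Float64Array(ring.length * 2);
  ring.forEach(([x, y], i) => {
    flat[i * 2] = x;
    flat[i * 2 + 1] = y;
  });
  return flat;
}

const ring: Position[] = [
  [0, 0],
  [10, 0],
  [10, 10],
  [0, 0],
];

const flat = flattenRing(ring);

// 4 positions * 2 numbers * 8 bytes = 64 bytes in a single buffer,
// versus 5 separate array objects in the nested representation.
const bytes = flat.byteLength;
```

The underlying `ArrayBuffer` can also be listed as a transferable in `postMessage`, which moves it to the worker instead of copying it, so a flat layout would help with the first issue as well as the second.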

There are many formats that aim to be better than GeoJSON. For example:

- Apache Arrow
- Parquet
- FlatGeobuf

Placemark already supports FlatGeobuf through its reference implementation, and it's pretty good, though it has a few limitations, such as currently not being able to store JSON values in properties. FlatGeobuf also has a spatial index meant to support range requests, which is not a useful feature for Placemark, and it has no existing way to update a dataset.
The GeoParquet format seems like it's further along than GeoArrow. It seems like Arrow, Parquet, and Feather are all starting to converge - the Feather V2 format is the same as the Arrow IPC file format, and Parquet is used as a serialization format for Arrow? GeoParquet also appears to be targeting a few somewhat odd goals - multiple geometry columns, spherical features, multiple projections - all of which are neat but not relevant to Placemark.
For Placemark, unfortunately, I think the goals and design of these formats don't match mine. They are:

- Almost exclusively read-oriented, without updates considered
- Often written with Python or C++ as an initial implementation target, making the JavaScript implementation worse as a result
- Focused on homogeneous data, which can be stored in columnar form. GeoJSON is an interesting combination of very homogeneous data - coordinate arrays - with very heterogeneous data - properties.
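
One way to picture that split is a hybrid layout: coordinates go into flat typed arrays, and properties stay as ordinary JSON values. The sketch below is an assumption for illustration only. `PackedLines`, `pack`, and `unpack` are hypothetical names, it is not the layout used by memory-geojson or FlatGeobuf, and it only handles 2D LineStrings.

```typescript
type Properties = Record<string, unknown>;

interface LineStringFeature {
  type: "Feature";
  properties: Properties;
  geometry: { type: "LineString"; coordinates: number[][] };
}

// Hypothetical hybrid container.
interface PackedLines {
  coords: Float64Array; // x0, y0, x1, y1, ... for all features
  offsets: Uint32Array; // feature i uses coords[offsets[i] .. offsets[i + 1])
  properties: Properties[]; // heterogeneous data stays as plain JSON values
}

function pack(features: LineStringFeature[]): PackedLines {
  // First pass: compute where each feature's coordinates start.
  const offsets = new Uint32Array(features.length + 1);
  features.forEach((f, i) => {
    offsets[i + 1] = offsets[i] + f.geometry.coordinates.length * 2;
  });
  // Second pass: copy every position into the shared flat array.
  const coords = new Float64Array(offsets[features.length]);
  features.forEach((f, i) => {
    let at = offsets[i];
    for (const position of f.geometry.coordinates) {
      coords[at++] = position[0];
      coords[at++] = position[1];
    }
  });
  return { coords, offsets, properties: features.map((f) => f.properties) };
}

// Decode a single feature back to GeoJSON, reading only its own range.
function unpack(packed: PackedLines, i: number): LineStringFeature {
  const coordinates: number[][] = [];
  for (let at = packed.offsets[i]; at < packed.offsets[i + 1]; at += 2) {
    coordinates.push([packed.coords[at], packed.coords[at + 1]]);
  }
  return {
    type: "Feature",
    properties: packed.properties[i],
    geometry: { type: "LineString", coordinates },
  };
}

const packed = pack([
  {
    type: "Feature",
    properties: { name: "a" },
    geometry: { type: "LineString", coordinates: [[0, 0], [1, 1]] },
  },
  {
    type: "Feature",
    properties: { lanes: 2, tags: ["x"] },
    geometry: { type: "LineString", coordinates: [[2, 2], [3, 3], [4, 4]] },
  },
]);
const second = unpack(packed, 1);
```

The properties are untouched by the packing step, so arbitrary nested JSON values survive, which is the part that purely columnar formats tend to handle awkwardly.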

Sizes (minzipped):

- apache-arrow: 49.4kB
- flatgeobuf: 14.2kB

## Issues

The goals of using a GeoJSON alternative format in Placemark would be:

- Reduce memory overhead
- Make the editing loop faster by reducing deserialization / transformation steps

I am wary of a few gotchas, in particular around "zero-copy" memory management. You'll probably need at least one copy of transformed data to display on the map, and you'll need to decode to GeoJSON to let people edit features as GeoJSON. But how many more copies? It's easy to support an efficient file format but then transform it into an inefficient in-memory format, which immediately negates the advantage.
So ideally, concretely, I'd want:

- To communicate with the server using this format
- To encode tiles relatively "directly" with the format, without a GeoJSON go-between
- And for there to be, most of the time, ways to only load the necessary geometries into memory
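
The "how many copies" question can be checked directly with typed arrays. This is a small sketch with made-up data, not Placemark code: `subarray()` hands out a view onto just the coordinates that are needed without copying them, `slice()` makes a real copy, and decoding to GeoJSON for editing is unavoidably another copy.

```typescript
// All coordinates for a dataset in one buffer: x0, y0, x1, y1, ...
const coords = new Float64Array([0, 0, 10, 0, 10, 10]);

// subarray() creates a view on the same ArrayBuffer: no coordinate data is copied.
const view = coords.subarray(2, 4);

// slice() allocates a new buffer and copies the values into it.
const copy = coords.slice(2, 4);

// Editing the original is visible through the view but not through the copy.
coords[2] = 99;

const viewSharesMemory = view.buffer === coords.buffer; // true
const viewSeesEdit = view[0] === 99; // true
const copySeesEdit = copy[0] === 99; // false: the copy still holds 10

// Decoding to GeoJSON for editing necessarily creates another copy.
const asGeoJSON = { type: "Point", coordinates: Array.from(view) };
```

Anything that calls `slice()`, `Array.from()`, or `JSON.parse()` along the way adds a copy, so it is worth counting those calls between the wire format and the map.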

And some of the issues with that:

- Placemark currently communicates updates via a combination of Server-Sent Events and "pull" operations. SSE is text-based and pulls are JSON-based. Using SSE for binary data is possible with hacks but is not very efficient. With WebSockets we could send binary data. Also, the "updates" have some other wrapper features, which raises a question: would I need a layer on top of something like Arrow to communicate updates?
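
As a rough idea of what such a wrapper layer might look like, here is a sketch of a binary update envelope. Everything in it is an assumption: the header fields, `encodeUpdate`, and `decodeUpdate` are hypothetical and do not describe Placemark's actual update protocol. The resulting `ArrayBuffer` is the kind of value that `WebSocket.send()` accepts as a binary message.

```typescript
// Hypothetical update envelope, not Placemark's actual wire format:
//   bytes 0-3    uint32   dataset version the update applies to
//   bytes 4-7    uint32   numeric feature id
//   bytes 8-11   uint32   number of coordinate values that follow
//   bytes 12-15  padding so the float64 payload starts on an 8-byte boundary
//   bytes 16+    float64  coordinate values (x0, y0, x1, y1, ...)
const HEADER_BYTES = 16;

interface CoordinateUpdate {
  version: number;
  featureId: number;
  coords: Float64Array;
}

function encodeUpdate(update: CoordinateUpdate): ArrayBuffer {
  const buffer = new ArrayBuffer(HEADER_BYTES + update.coords.byteLength);
  const header = new DataView(buffer);
  header.setUint32(0, update.version, true); // true = little-endian
  header.setUint32(4, update.featureId, true);
  header.setUint32(8, update.coords.length, true);
  // The payload is written in platform byte order, which is little-endian
  // on virtually all current hardware.
  new Float64Array(buffer, HEADER_BYTES).set(update.coords);
  return buffer;
}

function decodeUpdate(buffer: ArrayBuffer): CoordinateUpdate {
  const header = new DataView(buffer);
  const count = header.getUint32(8, true);
  return {
    version: header.getUint32(0, true),
    featureId: header.getUint32(4, true),
    // A view into the received buffer rather than a copy.
    coords: new Float64Array(buffer, HEADER_BYTES, count),
  };
}

const frame = encodeUpdate({
  version: 7,
  featureId: 42,
  coords: new Float64Array([1.5, 2.5]),
});
const decoded = decodeUpdate(frame);

// Size of the same frame if it had to be base64-encoded for a text-only
// channel such as SSE: base64 emits 4 characters per 3 bytes, with padding.
const base64Chars = 4 * Math.ceil(frame.byteLength / 3);
```

The base64 figure is one reason SSE is a poor fit for binary data: every frame grows by roughly a third before any other overhead.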

## memory-geojson

memory-geojson is my vehicle for exploring this idea so far. It checks a few boxes: it's JavaScript-first, has a tiny implementation, could be used "in-place," and has a viable path to supporting updates.
That said, it needs to justify its existence - the risk of it being an NIH project is high. On the other hand, it is sort of like MapShaper's in-place memory representation of features.

## Plan of attack

I don't think that memory-geojson, or any other approach listed, will be immediately viable in Placemark. But it may be demo-viable.
Anyway, I think research is the next step. Thoroughly understand MapShaper, read through GeoParquet, etc.