@tmcw
Created August 8, 2022 18:07

A faster format in Placemark

Placemark uses GeoJSON, and JSON, everywhere. In the database, features are GeoJSON. When we're sending features into Mapbox GL, it's GeoJSON getting sent with postMessage.

GeoJSON, of course, has its issues:

  • Sending GeoJSON features across to the WebWorker that cuts tiles for Mapbox GL JS is a major bottleneck for the core editing experience.
  • GeoJSON in resident memory is bigger than it needs to be. GeoJSON's coordinate arrays, especially, are an issue - flat arrays would be much more compact.
  • Mapbox GL JS itself has to cut GeoJSON into tiles, which requires some transformation - it creates another flat representation of geometry coordinates.
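
To make the "flat arrays" point concrete, here is a minimal sketch of packing nested GeoJSON positions into one typed array and unpacking them again. The helper names `flattenCoordinates` and `unflattenCoordinates` are made up for illustration; they are not part of Placemark or Mapbox GL JS, and the sketch assumes two-dimensional positions.

```typescript
// Hypothetical helpers for illustration only.
type Position = [number, number];

/** Pack nested [lon, lat] pairs into one flat array: [x0, y0, x1, y1, ...]. */
function flattenCoordinates(coords: Position[]): Float64Array {
  const flat = new Float64Array(coords.length * 2);
  for (let i = 0; i < coords.length; i++) {
    flat[i * 2] = coords[i][0];
    flat[i * 2 + 1] = coords[i][1];
  }
  return flat;
}

/** Rebuild the nested GeoJSON form when something needs it. */
function unflattenCoordinates(flat: Float64Array): Position[] {
  const coords: Position[] = [];
  for (let i = 0; i < flat.length; i += 2) {
    coords.push([flat[i], flat[i + 1]]);
  }
  return coords;
}

const exampleLine: Position[] = [
  [-122.4, 37.8],
  [-122.5, 37.9],
  [-122.6, 38.0],
];
const flatLine = flattenCoordinates(exampleLine);

// 3 positions * 2 numbers * 8 bytes = 48 bytes, in a single allocation
// instead of one outer array plus one small array per position.
console.log(flatLine.byteLength);
```

A typed array is backed by an `ArrayBuffer`, and that buffer can be listed as a transferable in `postMessage`, which moves it to the worker instead of cloning it.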

There are many formats that aim to be better than GeoJSON. For example:

  • Apache Arrow
  • Parquet
  • FlatGeobuf

Placemark already supports FlatGeobuf by using its reference implementation, and it's pretty good, though it has a few limitations, like currently not being able to store JSON values of properties. Also, FlatGeobuf has a spatial index meant to support range requests, which is not a useful feature for Placemark, and it has no existing way to update a dataset.

The GeoParquet format seems like it's further along than GeoArrow. It seems like between Arrow, Parquet, and Feather, all of these formats are starting to converge - the Feather v2 format is the same as the Arrow IPC file format, and Parquet is used as a serialization format for Arrow? GeoParquet also appears to be targeting a few somewhat odd goals - multiple geometry columns, spherical features, multiple projections, all of which are neat but not relevant to Placemark.

I am thinking, for Placemark, that unfortunately the goals and design of these formats don't match my goals. They are:

  • Almost exclusively read-oriented without updates considered
  • Often written with Python or C++ as an initial implementation target, making the JavaScript implementation worse as a result
  • Focused on homogeneous data, which can be stored in columnar form. GeoJSON is an interesting combination of very homogeneous data - coordinate arrays - with very heterogeneous data - properties.
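
One way to picture that combination is a hybrid layout: the homogeneous part (coordinates) goes into a flat typed array with per-feature offsets, and the heterogeneous part (properties) stays as ordinary JSON objects. This is only a sketch under assumptions: the `HybridStore` shape and the function names are hypothetical, it handles LineStrings only, and it is not how memory-geojson or any of the formats above are actually laid out.

```typescript
// Hypothetical types and names for illustration only.
type Properties = Record<string, unknown>;

interface LineFeature {
  type: "Feature";
  properties: Properties;
  geometry: { type: "LineString"; coordinates: number[][] };
}

interface HybridStore {
  coords: Float64Array; // every vertex of every feature: x0, y0, x1, y1, ...
  offsets: Uint32Array; // feature i occupies coords[offsets[i] .. offsets[i + 1])
  properties: Properties[]; // heterogeneous data stays as plain objects
}

function buildStore(features: LineFeature[]): HybridStore {
  // First pass: compute where each feature's numbers start.
  const offsets = new Uint32Array(features.length + 1);
  for (let i = 0; i < features.length; i++) {
    offsets[i + 1] = offsets[i] + features[i].geometry.coordinates.length * 2;
  }
  // Second pass: copy the coordinates into one contiguous array.
  const coords = new Float64Array(offsets[features.length]);
  for (let i = 0; i < features.length; i++) {
    let cursor = offsets[i];
    const positions = features[i].geometry.coordinates;
    for (let j = 0; j < positions.length; j++) {
      coords[cursor++] = positions[j][0];
      coords[cursor++] = positions[j][1];
    }
  }
  return { coords, offsets, properties: features.map((f) => f.properties) };
}

/** A view (not a copy) of one feature's coordinates. */
function coordsOf(store: HybridStore, index: number): Float64Array {
  return store.coords.subarray(store.offsets[index], store.offsets[index + 1]);
}

const sampleFeatures: LineFeature[] = [
  {
    type: "Feature",
    properties: { name: "a", tags: ["x", "y"] },
    geometry: { type: "LineString", coordinates: [[0, 0], [1, 1]] },
  },
  {
    type: "Feature",
    properties: { count: 3 },
    geometry: { type: "LineString", coordinates: [[2, 2], [3, 3], [4, 4]] },
  },
];
const sampleStore = buildStore(sampleFeatures);
console.log(Array.from(sampleStore.offsets), coordsOf(sampleStore, 1).length);
```

The coordinates fit a columnar layout well; the properties do not, which is the mismatch with formats built around a fixed schema.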

Sizes (minzipped):

  • apache-arrow: 49.4kB
  • flatgeobuf: 14.2kB

Issues

The goals of using a GeoJSON alternative format in Placemark would be:

  • Reduce memory overhead
  • Make the editing loop faster by reducing deserialization / transformation steps

I am wary of a few gotchas, in particular the thing about "zero-copy" memory management. You'll probably need at least one copy of transformed data to display on the map, and you'll need to decode to GeoJSON to let people edit features as GeoJSON. But how many more copies? It's easy to do something that supports an efficient file format, but transforms it into an inefficient format, which immediately negates the advantage.
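
As a small illustration of how easily an extra copy sneaks in, here is the difference between a view and a copy on a standard typed array. `subarray` shares memory with the original, while `slice` allocates new memory. The variable names are arbitrary.

```typescript
const allCoords = new Float64Array([0, 0, 1, 1, 2, 2]);

// subarray: a view onto the same underlying buffer, no bytes copied.
const sharedView = allCoords.subarray(2, 4);

// slice: a new allocation holding a copy of those two numbers.
const detachedCopy = allCoords.slice(2, 4);

// Writing to the original is visible through the view but not the copy.
allCoords[2] = 99;

console.log(sharedView[0]); // 99
console.log(detachedCopy[0]); // 1
console.log(sharedView.buffer === allCoords.buffer); // true
console.log(detachedCopy.buffer === allCoords.buffer); // false
```

Decoding to nested GeoJSON arrays is the `slice` side of this picture: every decode is another full copy, in a less compact form.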

So ideally, concretely, I'd want

  1. To communicate with the server using this format
  2. To encode tiles relatively "directly" with the format, without a GeoJSON go-between
  3. And for there to be, most of the time, ways to only load the necessary geometries into memory

And some of the issues with that:

  • Placemark currently communicates updates via a combination of Server-Sent Events and "pull" operations. SSE is text based and pulls are JSON based. Using SSE for binary data is possible with hacks but is not very efficient. Using WebSockets we could send binary data. Also, the "updates" have some other wrapper features, which means - would I need a layer on top of something like Arrow to communicate updates?
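
For a rough sense of the SSE cost, assume the "hack" is base64-encoding the binary payload into the text stream (an assumption; other encodings are possible). Base64 emits 4 characters for every 3 input bytes, with padding, so the size is easy to estimate. The function name and the example payload are hypothetical.

```typescript
/** Length in characters of the padded base64 encoding of `byteLength` bytes. */
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Example payload: 1000 vertices, 2 float64 numbers each.
const payloadBytes = 1000 * 2 * 8; // 16000 bytes
const encodedChars = base64Length(payloadBytes); // 21336 characters

console.log(payloadBytes, encodedChars, (encodedChars / payloadBytes).toFixed(2));
```

That is roughly a one-third size increase before any SSE framing, plus an encode and decode step on each side. A WebSocket can send an `ArrayBuffer` directly, which avoids both.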

memory-geojson

memory-geojson is my vehicle for exploring this idea so far. It checks a few boxes: it's JavaScript-first, has a tiny implementation, could be used "in-place," and has a viable path to supporting updates.

That said, it needs to justify its existence - the risk of it being a NIH project is high. But, on the other hand, it is sort of like MapShaper's implementation of an in-place memory representation of features.

Plan of attack

I don't think that memory-geojson, or any other approach listed, will be immediately viable in Placemark. But it may be demo-viable.

Anyway, I think research is the next step. Thoroughly understand MapShaper, read through GeoParquet, etc.
