@FeepingCreature
Last active January 7, 2024 19:33
You Are Doing JSON APIs Wrong

You are doing JSON APIs wrong.

When you use JSON to call an API - not a REST API, but something like JSON-RPC - you will usually want to encode one of several possible messages.

Your request body looks like this:

{
    "type": "MessageWithA",
    "data": {
        "a": 5
    }
}

Or like this:

{
    "type": "MessageWithB",
    "b": 5
}

However, that's bad. Here's how you should do it:

{
    "messageWithA": {
        "a": 5
    }
}

Or, if you want to keep a redundant "type" field:

{
    "type": "MessageWithA",
    "data": {
        "messageWithA": {
            "a": 5
        }
    }
}

Why? Stream parsers.

Stream parsers?

There are two ways of processing a JSON message. The first is to read the message into a JSON object, and then process the object outside-in. This usually takes the form of

if (event["type"] == "MessageWithA") {
    handleAMessage(event["data"]);
}

However, this has the unavoidable overhead of allocating an object for every part of the JSON tree. Especially if you are decoding into a well-typed internal data structure, you allocate this object just to throw it away shortly after. This creates unnecessary memory overhead.
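For illustration, here is what the read-then-dispatch approach looks like in Python (a sketch; the message and handler names are invented for this example):

```python
import json
from dataclasses import dataclass

# The target of the decode: a well-typed internal structure.
@dataclass
class MessageWithA:
    a: int

def decode(text):
    # json.loads allocates a full tree of dicts/lists/strings...
    event = json.loads(text)
    if event["type"] == "MessageWithA":
        # ...which we copy into the typed structure and then throw away.
        return MessageWithA(a=event["data"]["a"])
    raise ValueError(f'unknown type {event["type"]!r}')

print(decode('{"type": "MessageWithA", "data": {"a": 5}}'))
# prints MessageWithA(a=5)
```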

A faster way is with a stream parser. A stream parser lexes the input text into a stream of JSON tokens, such as BeginObject, KeyString, BeginArray, String, String, Int, EndArray, EndObject for { "a": ["b", "c", 5]}. These tokens are then consumed by a recursive parser, usually generated, that produces the internal data structure directly. In other words, there never exists a recursive data structure for the JSON datagram.

This has the advantage of not requiring any allocation for values that we are not interested in. However, it also means we cannot access "type"; rather, we have to react to "type" as we come across it in the input stream.

As a consequence, if MessageWithA and MessageWithB have a different format for "data", as they usually do, we have to decode the message twice: Once, only decoding the "type" field, and then a second time, only decoding the "data" field.

Without knowing the value of the type field, the data field is unparseable! The stream parser will not know which recursive function it is supposed to call.
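To see why, consider two hypothetical message types whose "data" objects reuse a field name at different types (names invented for this example). The same bytes demand different decoders:

```python
import json

# Both message types carry a "value" field in "data", but at different
# types: MessageWithA wants an int, MessageWithB keeps a string.
def decode_data_for_a(data):
    return int(data["value"])   # "value" must be parsed as an int

def decode_data_for_b(data):
    return str(data["value"])   # "value" stays a string

data = json.loads('{"value": "5"}')
print(decode_data_for_a(data))  # 5 (an int)
print(decode_data_for_b(data))  # 5 (a string)
# A stream parser must pick one of these functions before it consumes the
# first token of "data" - which it cannot do until it has seen "type".
```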

You may think that you can just read the "type" field first. But your message may look like this:

{
    "data": { .... "message": "severalMegabytesOfBase64Data" },
    "type": "VeryLargeMessage"
}

Now your stream parser has to skip over a large segment of the input text! Not only is this slow, it also requires holding the entire message in memory. If we knew from the start what the type of every field was, then we could consume the message in small chunks, improving cache efficiency as well as throughput.

Let's look at our alternative:

{
    "data": {
        "veryLargeMessage": { ... }
    },
    "type": "VeryLargeMessage"
}

This time, the JSON parser has it easy. No field in this message has an indeterminate type, so without ever searching for the "type" field, we can create the internal data structure for this message in one go. There is one remaining ambiguity: the key inside "data" may contradict the "type" field. The fix is to leave out "type" entirely:

{
    "veryLargeMessage": { ... }
}
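With this shape, a decoder can dispatch on the first key it sees, before reading anything else. A sketch (handler names are invented, and `json.loads` stands in for the per-type recursive stream decoder):

```python
import json
import re

# The first key of the top-level object decides the message type.
FIRST_KEY_RE = re.compile(r'\s*\{\s*"([^"]*)"')

def handle_message_with_a(data):
    return ("A", data["a"])

def handle_very_large_message(data):
    return ("large", len(data["message"]))

DECODERS = {
    "messageWithA": handle_message_with_a,
    "veryLargeMessage": handle_very_large_message,
}

def decode(text):
    # One token of lookahead is enough to pick the decoder.
    key = FIRST_KEY_RE.match(text).group(1)
    return DECODERS[key](json.loads(text)[key])

print(decode('{"messageWithA": {"a": 5}}'))
# prints ('A', 5)
```

A real implementation would hand the rest of the token stream to the chosen decoder instead of calling `json.loads`, but the dispatch logic is the point: the type is known as soon as the first key arrives.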

In summary, one simple rule:

The type of every JSON object field should be uniquely determined by its field name.

@FeepingCreature
Author

Correction: process the object outside-in, of course.

@nlyan

nlyan commented May 4, 2022

Looks like the JSON serializations for both Protocol Buffers and FlatBuffers have this problem.

In the Protobuf 3 JSON serialization, Any fields have an "@type" field embedded into the value object (when set to a message type), and as a sibling to the "value" field for simple types.

FlatBuffers uses "foo_type" as a sibling of the "foo" field when foo is a union, so that could end up out of order as well.

Interestingly, Apache Avro uses "foo": {"FooType": { ... } }, which should work just fine with a streaming parser (which Avro's C++ implementation actually does have). Overall I think this is the best option.
