@jroper
Last active August 29, 2015 14:02
RFC7159 Encoding

RFC7159 introduced a change to RFC4627, as noted in its Appendix A:

Changed the definition of "JSON text" so that it can be any JSON
  value, removing the constraint that it be an object or array.

This meant that the heuristic in Section 3 of RFC4627 was no longer valid:

Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8

since it no longer holds that the first two characters of a JSON text will always be ASCII characters. For example, if the JSON text is a String, its first character is the ASCII quotation mark, but its second character can be any non-ASCII Unicode character. Hence that heuristic was removed from RFC7159.
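To make the failure concrete, here is a minimal sketch of the RFC4627 null-pattern heuristic quoted above. The function name `detect_rfc4627` and the choice to return `None` when no pattern matches are my own assumptions for illustration; RFC4627 does not define an API.

```python
from typing import Optional

# Null patterns from Section 3 of RFC4627 (True means the octet is 0x00).
_PATTERNS = {
    (True, True, True, False): "utf-32-be",    # 00 00 00 xx
    (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",    # xx 00 00 00
    (False, True, False, True): "utf-16-le",   # xx 00 xx 00
    (False, False, False, False): "utf-8",     # xx xx xx xx
}


def detect_rfc4627(octets: bytes) -> Optional[str]:
    """Guess the encoding from the nulls in the first four octets.

    Returns None (an assumption of this sketch) when fewer than four
    octets are available or when no pattern in the table matches.
    """
    if len(octets) < 4:
        return None
    nulls = tuple(b == 0 for b in octets[:4])
    return _PATTERNS.get(nulls)


# An object starts with two ASCII characters, so the heuristic works.
print(detect_rfc4627('{"a":1}'.encode("utf-16-le")))

# A top-level String whose second character is non-ASCII matches no pattern.
print(detect_rfc4627('"日本"'.encode("utf-16-le")))

# A top-level number can be shorter than four octets in UTF-16.
print(detect_rfc4627("1".encode("utf-16-be")))
```

Under RFC4627 the last two inputs could not occur, because a JSON text had to be an object or an array. Under RFC7159 both are valid JSON texts, and the table has no answer for them.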

RFC4627 also did not define a required or optional charset parameter for the application/json media type. None was needed, because the above heuristic existed. RFC7159 not only kept that omission, but also made an explicit exclusion for charset:

Note:  No "charset" parameter is defined for this registration.
  Adding one really has no effect on compliant recipients.

On encoding, RFC7159 clarifies the statement from RFC4627 that JSON is to be encoded in Unicode, saying:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.

So, RFC7159 allows three different encodings, removes the possibility of automatically detecting which encoding is used, and specifically disallows specifying which encoding is used. This is an inconsistency: either exactly one encoding should be mandated, or a mechanism for specifying the encoding should be provided.

I suggest making one of the following changes to the spec:

1. The least invasive solution

This makes no change to JSON itself, but just updates the wording in the spec. Section 8 says:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).

As it stands, this statement is wrong. It is not true that "there are many implementations that cannot successfully read texts in other encodings" when no mechanism is provided to publish the encoding and the default is UTF-8: every implementation that receives JSON from a source with which it has no out-of-band encoding agreement must parse that JSON as UTF-8. A more accurate statement would therefore be "there are no spec-conforming implementations that can successfully read texts in other encodings".

I suggest updating this wording to point out that, without out-of-band agreement on encoding, an interoperable implementation must use UTF-8, since no in-band mechanism is provided to specify the encoding. My proposed wording is:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8. Unless two implementations have out-of-band
agreement to use UTF-16 or UTF-32, interoperable implementations
MUST use UTF-8, because there is no in-band mechanism provided for
conveying which encoding is to be used.
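As a rough sketch of what a receiver following that wording might look like: decode as UTF-8 unless the caller passes an encoding that was agreed out of band. The function name `parse_json` and the `agreed_encoding` parameter are hypothetical, chosen only for this example.

```python
import json
from typing import Any, Optional

# Encodings permitted by RFC7159, using Python codec names.
_ALLOWED = {
    "utf-8",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def parse_json(octets: bytes, agreed_encoding: Optional[str] = None) -> Any:
    """Parse a JSON text, assuming UTF-8 unless an encoding was agreed.

    `agreed_encoding` stands in for whatever out-of-band agreement the
    two parties have; nothing in the octets themselves is inspected.
    """
    encoding = (agreed_encoding or "utf-8").lower()
    if encoding not in _ALLOWED:
        raise ValueError(f"encoding not permitted for JSON: {encoding}")
    return json.loads(octets.decode(encoding))


# No agreement: the text must be UTF-8.
print(parse_json('"日本"'.encode("utf-8")))

# With agreement: UTF-16 works because both sides already know.
print(parse_json('"日本"'.encode("utf-16-le"), agreed_encoding="utf-16-le"))
```

Without the agreement, the same UTF-16 octets fail with a `UnicodeDecodeError` when decoded as UTF-8, which is the interoperability problem the proposed wording makes explicit.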

2. The ideal solution

The ideal solution, in my opinion, is to disallow UTF-16 and UTF-32. While this may break existing implementations that use or provide the ability to use UTF-16 or UTF-32, those implementations have already been broken by the fact that the only mechanism RFC4627 allowed for encoding detection is no longer valid. By mandating UTF-8, the spec becomes simpler, and implementations become simpler. This, in my opinion, is a win for everyone.

3. The other solution

The other solution is to make charset an optional parameter for the application/json media type, defaulting to UTF-8 if not present. This solution is not ideal because it does not address byte ordering.
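For illustration only, this is roughly how a receiver could read such a parameter if it existed. RFC7159 defines no charset parameter, so the behaviour here is the hypothetical one proposed in this option; the function name `json_charset` is also made up for the example. The header parsing uses Python's standard `email.message.Message`.

```python
from email.message import Message


def json_charset(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value.

    Falls back to "utf-8" when the parameter is absent, as this option
    proposes. This is hypothetical: RFC7159 defines no such parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset lower-cases the value and returns the
    # fallback when no charset parameter is present.
    return msg.get_content_charset("utf-8")


print(json_charset("application/json"))
print(json_charset("application/json; charset=UTF-16"))
```

A value such as `utf-16` still says nothing about byte order on its own, so the receiver would need some further signal to decode the text.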
