IPLD Strings, round 4

IPLD Strings

For an IPLD library to have "Complete" support for the IPLD Data Model, Strings MUST support the full range of 8-bit bytes. Strings SHOULD use UTF-8, and library designers may encourage this however they see fit. In particular, NFC-normalized form is encouraged. However, this MUST NOT result in the inability of a library to handle non-UTF-8 byte sequences where Strings are handled, nor should libraries apply normalizations to data they read. (In other words: using normalized forms when creating new data is encouraged, but applying normalization to data being parsed is abhorrent. Normalization is mutation.)

Libraries MAY choose to support only some domains of strings, such as only allowing Unicode characters, or only allowing UTF-8, or only allowing UTF-8 with NFC normalization. Such libraries are known as "Limited Domain" libraries (in contrast to "Complete" libraries, which follow the "MUST"s above to their full extent). "Limited Domain" libraries are a part of the IPLD ecosystem, but should document themselves as such to avoid confusion.

Similarly, a Codec MAY limit its support to only some domains of strings, and such a Codec is known as a "Limited Domain" Codec (more specifically, "incomplete(stringmangling)" -- see Codecs and Completeness). As with libraries, such "Limited Domain" codecs are a part of the IPLD ecosystem (there are many, in fact! and many codecs are "Limited Domain" for other reasons as well; sometimes this arises from other tradeoffs), but as with libraries, such codecs should document themselves as such to avoid confusion.

Why is this definition like this?

We use the word "MUST" to describe things we want implementations to support and build deeply into their understanding. We might describe what end-users should do with these features separately, but describe what libraries must support in somewhat stronger terms. We use "MUST" around our description of what data sequences libraries must be able to describe, because it is a high priority for the IPLD ecosystem that IPLD libraries be able to round-trip data without loss. (At the same time, we often describe what we recommend end-users do with this support range to be a much smaller recommendation.)

We use the word "SHOULD" to describe what we think implementations should encourage, but which we recognize cannot be strict rules because their unconditional enforcement would either result in a loss of functionality, or result in great difficulty in interoperability, or result in such great difficulty in practical enforcement that we would rather not create split ecosystems based on strictness. For example, we think that strings should be greatly encouraged to use UTF-8 encodings -- but we're unwilling to mandate this unconditionally, because it would mean failing to regard data which falls outside of that range, which would be a loss of functionality.

We use the word "MAY" where we want to make it clear that some libraries can make different choices. We avoid the use of "MAY" where "SHOULD" or "MUST" can be used; all cases where we use the word "MAY" are in paragraphs which describe what happens (and how we talk about it) when one of the "SHOULD" or "MUST" clauses has been disregarded by an implementation. (Namely: we use "SHOULD" and "MUST" for talking about complete, fully spec-complying implementations; and we use "MAY" to talk about implementations which have made choices outside those bounds.)

REVIEW: this use of "MAY" may (heh) be sketchy. But how else could we phrase this?

Alternative phrasing

TODO: write the spec paragraphs again, but with separate sentences back to back which make distinct discussion of what libraries must support vs what end users should do.

Effects

Effects on Libraries

The main concern in the design of IPLD libraries is how to bridge the potential gap between the IPLD Data Model definition of a "string", and any definition that the language standard library might have of "string". For languages that allow arbitrary string encodings in their standard library string type, the question is trivial. For languages that have stricter or more complicated opinions about the data ranges supported by their standard library string type, things can be more intricate.

Fortunately, there are some general patterns libraries can use to work with this challenge:

IPLD Libraries typically have a Node type (as discussed in https://github.com/ipld/specs/blob/master/design/libraries/nodes-and-kinds.md). This Node type is a reasonable place to have AsString() -> StdlibString "unboxing" methods. In languages whose standard library string types have limited domains in some way, this method may return errors in order to clarify what happens when the domain limit is encountered: perhaps something like AsString() -> (StdlibStrictString|Error). In these cases, another method like AsBytes() -> (Bytes) could still be used to access the raw data without loss. In some languages and library environments, additional and more varied methods for accessing the string content in both filtered and unfiltered ways may be useful.
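As a concrete illustration, here is a minimal sketch in Go of the pattern above. The type and method names (Node, AsString, AsBytes) are assumptions for illustration, not any particular library's real API:

```go
// A hypothetical sketch (not any real library's API) of a Node type
// exposing both a strict accessor and a lossless accessor for string-kind data.
package ipldsketch

import (
	"errors"
	"unicode/utf8"
)

// Node is a minimal stand-in for an IPLD library's node type; raw holds
// the string content exactly as the codec delivered it.
type Node struct {
	raw []byte
}

// AsString is the "limited domain" accessor: it errors when the content
// falls outside the stricter domain (here: valid UTF-8).
func (n Node) AsString() (string, error) {
	if !utf8.Valid(n.raw) {
		return "", errors.New("string content is not valid UTF-8")
	}
	return string(n.raw), nil
}

// AsBytes is the lossless accessor: it returns the raw content regardless
// of domain, so round-tripping is always possible.
func (n Node) AsBytes() []byte {
	return n.raw
}
```

(In Go specifically, native strings can already hold arbitrary bytes, so the error path above is a policy choice; in languages with validating string types, such as Rust's String, the strict accessor reflects a real type-system boundary.)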

Library documentation should identify which methods should be used for handling string data losslessly. End-users creating applications can use whichever methods they see fit, since they know their own data domain and can thus make such choices. Writers of reusable functions working on the IPLD Data Model will need to look at which functions allow lossless handling of string data, and prefer to use those functions exclusively if they want their functions to work on all data domains. (It should be possible to identify one such "reading" method and one such "writing" method which will always work losslessly for string-kind data.)

Effects on Specifications

You're lookin' at 'em.

Effects on Codecs

Codecs generally have three choices:

  1. Support sequences of 8-bit bytes as strings directly;
  2. If embedding IPLD strings within the subspace of UTF-8 strings: use an escaping mechanism;
    • One option is to use familiar patterns like "\xHH" hex escapes (see the sketch after this list);
    • Another option, which embeds even within formats that don't allow new escape sequences, is to use UTF8-C8;
  3. Declare the codec to be "limited domain", and document which sequences are allowed and disallowed.

(Further, note that some codecs are "limited domain" by specification; some are "limited domain" by (some!) implementations. It is always possible for someone to produce a library of code which handles a limited domain of strings, even if the specification for that codec is not limited; such an implementation must simply be documented as such, to avoid confusion.)
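To make option 2's first variant concrete, here is a hedged sketch (in Go, with a hypothetical function name) of "\xHH"-style escaping: bytes that do not participate in valid UTF-8 are rewritten as hex escapes, and literal backslashes are doubled so the output stays unambiguous. As noted below for DAG-JSON, this output is only usable in formats that actually permit such escape sequences.

```go
// A hypothetical sketch of "\xHH"-style escaping for option 2.
package codecsketch

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// EscapeHexStyle walks the input and escapes any byte that is not part of
// a valid UTF-8 encoding. A literal backslash is also escaped, so the
// output remains unambiguous.
func EscapeHexStyle(raw []byte) string {
	var sb strings.Builder
	for len(raw) > 0 {
		r, size := utf8.DecodeRune(raw)
		switch {
		case r == utf8.RuneError && size == 1:
			// Invalid byte: escape it, then resynchronize on the next byte.
			fmt.Fprintf(&sb, `\x%02X`, raw[0])
		case r == '\\':
			sb.WriteString(`\\`)
		default:
			sb.Write(raw[:size])
		}
		raw = raw[size:]
	}
	return sb.String()
}
```

On the ew-bang fixture discussed under "Fixture data" below, EscapeHexStyle([]byte{0xC3, 0x21}) yields `\xC3!`.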

Effects on DAG-JSON

There is only one reasonable choice for DAG-JSON, by process of exclusion:

  • It's desirable for DAG-JSON to support the full IPLD Data Model (or as much as it can; other existing compromises notwithstanding), so we do not want to choose option 3 (go "limited domain") for this codec.
  • Supporting 8-bit byte sequences "directly" in JSON simply isn't defined; JSON strings are specified as sequences of Unicode characters, not bytes.
  • Escaping via introduction of "\xHH" sequences is not valid JSON and many widely used JSON systems will reject this data.

So: DAG-JSON should use UTF8-C8 as an escaping mechanism.

This means current implementations of DAG-JSON which do not implement UTF8-C8 escaping are partial implementations and technically should be flagged as "limited domain".

Effects on DAG-CBOR

DAG-CBOR should support sequences of 8-bit-bytes as strings directly.

This breaks from the CBOR specification's statement that strings should be UTF-8. However, the value we gain from this is significant.

DAG-CBOR is a binary protocol, and one where we're significantly interested in its performance. Simply putting all IPLD Data Model strings directly without escaping into the string "major type" in CBOR encoding results in a satisfactory and high-performance outcome (both in speed and in serial compactness compared to escaping-based approaches).
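To illustrate how little mechanism this requires, here is a hedged sketch of emitting an arbitrary byte sequence under CBOR's text-string major type (major type 3). The function name is hypothetical, and only the short-length header form (lengths under 24) is handled, for brevity:

```go
// A minimal sketch of writing an arbitrary byte sequence directly into
// CBOR's text-string major type, with no validation or escaping.
package cborsketch

import "fmt"

// AppendShortTextString appends a CBOR major-type-3 header and the raw
// content bytes to dst.
func AppendShortTextString(dst, content []byte) ([]byte, error) {
	if len(content) >= 24 {
		return nil, fmt.Errorf("sketch only handles lengths < 24, got %d", len(content))
	}
	// 0x60 = major type 3 in the high 3 bits; the length fits in the low 5 bits.
	dst = append(dst, 0x60|byte(len(content)))
	return append(dst, content...), nil
}
```

For example, AppendShortTextString(nil, []byte{0xC3, 0x21}) yields the bytes 62 C3 21 -- exactly the string portion of the ew-bang fixture discussed under "Fixture data" below.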

At the same time, there are few downsides to defining DAG-CBOR as encoding all string sequences directly. There are many fewer implementations of CBOR tools we care about cross-compatibility with; those existing tools also often eschew strict unicode checking anyway (which is in line with our view); and it's simply easy to do: the CBOR wire format supports this trivially, and it actually means writing less code.

The alternative would be to also use UTF8-C8 in DAG-CBOR, but the result would be inefficient for non-unicode bytes and would take more effort to implement (notably, it would probably require forking any existing CBOR libraries in order to emit it), and overall it's difficult to see any reason to prefer this path. (If we did specify DAG-CBOR in this way, it would likely prompt an immediate follow-up proposal for a new DAG-CBOR-HARDRR codec which does encode strings directly; and thereafter -- either because we encourage it or because it happens naturally as people seek higher-performance systems -- we'd see large swaths of the IPLD ecosystem move to DAG-CBOR-HARDRR; and at that point we'd have to ask ourselves if the new codec was worth it, if its only purpose was to emaciate the older codec, at the cost of adding more code to our libraries and making a bigger migration headache for older data.)

Remarks and Reasoning

We make some choices here based on various constraints of various strengths:

  • First and foremost: what gives us the widest range of support in what our data model can handle;
  • More specifically, whether what our data model can handle is inclusive of common use cases we need to describe with it (filenames are a particularly pressing example of this);
  • Whether we can make these things work cleanly and clearly with existing codecs;
  • Whether we can make these things work fast in practical implementations;
  • Whether these things are easy to implement as a developer of a new IPLD library;
  • Whether these things are amenable to partial implementations;
  • How well these things will fare when handling real world data (including how resilient libraries are at handling data which may have come from partial implementations, and been produced with less strict rules).

Variations in strength of these constraints are numerous.

This is particularly true when it comes to codecs. It is important to note that DAG-CBOR is not CBOR, and DAG-JSON is not JSON. It behooves us to keep each of these things close to its origins, but they are distinct specs from their inspirations, so the questions of interoperability are practical ones, and we can make trades based on practical realities.

For example, we consider some constraints very differently on different codecs:

  • We consider some of the compulsions of JSON compatibility for DAG-JSON to be particularly strong, because we are familiar with many JSON tools that predate IPLD and through which we would like to run IPLD data without difficulty. Since DAG-JSON is our bridge to doing so, we care a great deal that DAG-JSON data be very easy to handle with existing JSON systems.
  • In contrast to the above: we consider the compulsions for CBOR compatibility for DAG-CBOR to be somewhat more conditional. There are fewer pieces of CBOR tooling outside of IPLD that we have known interest in directly interfacing with; and, in this particular situation regarding strings, a number of those tools are already willing to accept and process non-unicode strings.
  • The consideration of performance in the DAG-JSON codec is relatively weak, particularly for strings: since escaping is already in the nature of the codec, adding more escaping (such as UTF8-C8) is not a major lapse of boundaries or change of expectations about the performance of the codec.
  • The consideration of performance in the DAG-CBOR codec is relatively strong, both because we typically describe DAG-CBOR in IPLD circles as being the fast, binary codec that you should reach for if you want speed; and also because being a binary codec, it's already naturally possible for DAG-CBOR implementations to make direct steps when handling string data as binary (whereas such a thing simply isn't on the table in the first place for non-binary codecs), meaning that to add escaping systems or validations with linear costs would be a major change of boundaries and change of expectations about the performance of the codec.

Fixture data

We should establish fixtures of interesting data, starting with general description of what's interesting, and then specializing it per codec.

"Interesting" should include examples of:

  • non-unicode sequences
  • sequences which are unicode but not particularly (and perhaps arguably) renderable -- such as the infamous U+200C zero-width non-joiner.
  • perhaps sequences which are valid UTF-16 but not UTF-8? etc.

ew-bang

An interesting fixture is the sequence "\xC3\x21".

This is a non-unicode sequence. The most reasonable interpretation of it is "\xC3!", because the first byte cannot be interpreted as Unicode, but in a resynchronizing encoding like UTF-8, the subsequent bytes can still be interpreted (in this case, as an ascii-plane exclamation point).
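A minimal demonstration of this resynchronizing behavior, using Go's standard UTF-8 decoder on the ew-bang bytes:

```go
// Demonstrates resynchronization: the invalid lead byte is reported alone,
// and decoding recovers on the very next byte.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	raw := []byte{0xC3, 0x21}
	for len(raw) > 0 {
		r, size := utf8.DecodeRune(raw)
		if r == utf8.RuneError && size == 1 {
			fmt.Printf("invalid byte 0x%02X\n", raw[0])
		} else {
			fmt.Printf("rune %q\n", r)
		}
		raw = raw[size:]
	}
	// Output:
	// invalid byte 0xC3
	// rune '!'
}
```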

ew-bang in DAG-CBOR

The following hex sequence contains the ew-bang fixture as a key in a CBOR map: A162C32101.

IPLD libraries should be able to parse this as a map containing one entry, and should report that the Data Model Kind of the entry key is string and the Data Model Kind of the value is int. Whether the key is marked as invalid UTF-8 or not is an implementation detail left up to the library; however, the raw binary form should be accessible to library users, regardless of whether it's considered invalid.
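To keep the byte layout explicit (and independent of any particular CBOR library), here is a hand-rolled walk through the fixture bytes; the masks follow the CBOR initial-byte layout (major type in the high 3 bits, additional info in the low 5):

```go
// A hand-rolled decode of the ew-bang-in-DAG-CBOR fixture.
package main

import (
	"encoding/hex"
	"fmt"
)

func main() {
	fixture, _ := hex.DecodeString("A162C32101")

	// 0xA1: major type 5 (map), 1 entry.
	fmt.Printf("map header: 0x%02X (entries: %d)\n", fixture[0], fixture[0]&0x1F)

	// 0x62: major type 3 (text string), length 2. The content bytes are
	// 0xC3 0x21 -- not valid UTF-8, but carried through losslessly.
	keyLen := int(fixture[1] & 0x1F)
	key := fixture[2 : 2+keyLen]
	fmt.Printf("key (string kind): % X\n", key)

	// 0x01: major type 0 (unsigned int), value 1.
	fmt.Printf("value (int kind): %d\n", fixture[2+keyLen])
}
```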

Appendix: UTF8-C8

UTF8-C8 -- short for UTF8 "Clean 8" -- is a specification for escaping arbitrary bytes into a UTF8 string using certain unicode characters as the escape marks. A reference document is here: https://docs.raku.org/language/unicode#UTF8-C8

Thus, UTF8-C8 is one of the mechanisms that can be used to create a codec that supports arbitrary data in strings, even if other constraints dictate that the codec can't do so in more direct ways. It is not, however, the only way to create such an escaping mechanism. If UTF8-C8 is used by a codec, it is at that codec's choice and must be documented; IPLD does not presume any such escaping mechanism in a global way.

UTF8-C8 has reasonably nice semantics for a human reader: the escape character is typically rendered as a question-mark-in-diamond; the following character is "x", which hints "hexadecimal is coming" to a savvy human reader; and the following two characters are uppercase hexadecimal digits. This mechanism thus contains no unrenderable bytes, no non-unicode bytes, no bytes such as "\" which are likely to conflict with other escaping mechanisms, and despite all that does contain a representation of the original bytes.
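The following is a simplified, one-way sketch (in Go) of the escaping shape described above. Raku's real UTF8-C8 has considerably more machinery (notably around round-tripping), so this only illustrates the encoded form:

```go
// A simplified, one-way sketch of UTF8-C8-style escaping: each byte that is
// not part of valid UTF-8 becomes U+10FFFD, 'x', and two uppercase hex digits.
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const escapeMark = '\U0010FFFD'

func escapeC8Style(raw []byte) string {
	var sb strings.Builder
	for len(raw) > 0 {
		r, size := utf8.DecodeRune(raw)
		if r == utf8.RuneError && size == 1 {
			fmt.Fprintf(&sb, "%cx%02X", escapeMark, raw[0])
		} else {
			// Note: a literal U+10FFFD in the input passes through unescaped
			// here, which is exactly the ambiguity flagged in the REVIEW
			// note at the end of this appendix.
			sb.Write(raw[:size])
		}
		raw = raw[size:]
	}
	return sb.String()
}

func main() {
	out := escapeC8Style([]byte{0xC3, 0x21})
	fmt.Printf("% X\n", []byte(out)) // F4 8F BF BD 78 43 33 21
}
```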

There may be reasons to consider and possibly prefer other escaping mechanisms. Compare UTF8-C8 to "\xHH" encoding:

  • UTF8-C8 was concerned with readability, and did reasonably well on that goal. But some may consider "\xHH" to be more readable than "{unencodable placeholderglyph}xHH" (or more literally, if your browser renders it well: "􏿽xHH").
  • The encoding size expansion factor is notable:
    • UTF8-C8's encoding is simply large:
      • four bytes for the 0x10FFFD escape character (it encodes as F4 8F BF BD in UTF-8)
      • three more bytes for the 'x' literal and then the two hex chars
      • seven bytes in total per original raw byte!
    • By contrast: the "\xHH" strategy is four bytes in total.

REVIEW: We should verify the UTF8-C8 spec does reasonable things if it encounters another 0x10FFFD in the data. Does it escape it? Does it escape it only if it is followed by an "x"? The details matter. A test should verify that the escaping can nest without issue.
