@cygx

cygx/rfc.md Secret

Last active October 12, 2016 13:37
RFC: Generalize the utf8-c8 approach to arbitrary encodings

This is meant to allow handling two different types of questionable input: outright invalid byte sequences on the one hand and non-canonical byte sequences on the other. In the case of UTF-8, examples of 'non-canonical' byte sequences would be overlong encodings, encoded surrogate pairs, sequences of 5 or 6 bytes and potentially the 66 Unicode noncharacters as well. The same mechanism will also be leveraged to deal with non-normalized input.
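To make the invalid versus non-canonical distinction concrete, here is a minimal Python sketch that classifies a single candidate UTF-8 sequence. The function name and the three labels are hypothetical and not part of the RFC. It assumes the original 6-byte bit layout so that 5- and 6-byte forms count as well-formed but non-canonical, and it does not treat the noncharacters specially, since the text above leaves that open.

```python
def classify_utf8_sequence(seq: bytes) -> str:
    """Return 'canonical', 'non-canonical' or 'invalid' for one sequence.

    Hypothetical helper: decodes using the old 1..6 byte layout, then
    checks the decoded value against modern UTF-8 restrictions.
    """
    if not seq:
        return "invalid"
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead < 0xE0:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        length, value = 4, lead & 0x07
    elif 0xF8 <= lead < 0xFC:
        length, value = 5, lead & 0x03
    elif 0xFC <= lead < 0xFE:
        length, value = 6, lead & 0x01
    else:
        # Stray continuation byte (0x80..0xBF), 0xFE or 0xFF.
        return "invalid"
    if len(seq) != length:
        return "invalid"
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            return "invalid"  # not a continuation byte
        value = (value << 6) | (byte & 0x3F)
    # Smallest value that actually needs this many bytes.
    minimum = (0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)[length - 1]
    if value < minimum:
        return "non-canonical"  # overlong encoding
    if length > 4 or value > 0x10FFFF:
        return "non-canonical"  # 5/6-byte form or beyond Unicode range
    if 0xD800 <= value <= 0xDFFF:
        return "non-canonical"  # encoded surrogate
    return "canonical"
```

Under these assumptions, `b"\xc0\x80"` (overlong NUL) and `b"\xed\xa0\xbd"` (an encoded high surrogate) are both well-formed bit patterns that a decoder could choose to tolerate, whereas `b"\xff"` has no interpretation at all.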

Any decoder should be able to operate in any of the following modes:

  • strict, which dies on invalid byte sequences as well as non-canonical sequences and automatically normalizes input as part of the conversion to NFG

  • warn, the default, which still dies on invalid sequences and also silently normalizes, but warns on non-canonical input

  • lax, which still dies on invalid sequences, but does not warn for non-canonical ones

  • compat, which accepts any input sequence, decodes it losslessly and round-trips if the resulting string gets re-encoded; this is achieved by generating synthetic graphemes as necessary to represent non-normalized, non-canonical as well as entirely invalid input
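As a rough analogy for the two extremes of this list, Python's standard codec error handlers can stand in for `strict` and `compat`. This is only a sketch under stated assumptions: the function names are hypothetical, Python's `surrogateescape` handler (PEP 383) smuggles undecodable bytes through as lone surrogates rather than synthetic graphemes, Python's decoder does not distinguish non-canonical from invalid input, and the `warn` and `lax` modes are not modeled.

```python
import unicodedata


def decode_strict(data: bytes) -> str:
    # Raises UnicodeDecodeError on anything the UTF-8 decoder rejects,
    # then normalizes (NFC here, as Python has no NFG).
    return unicodedata.normalize("NFC", data.decode("utf-8", errors="strict"))


def decode_compat(data: bytes) -> str:
    # Accepts any byte sequence; undecodable bytes become U+DC80..U+DCFF.
    return data.decode("utf-8", errors="surrogateescape")


def encode_compat(text: str) -> bytes:
    # Reverses decode_compat, restoring the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9 \xc0\x80 ok"  # Latin-1 byte plus an overlong NUL
assert encode_compat(decode_compat(raw)) == raw
```

The round-trip assertion is the property the compat mode is after; how the intermediate string represents the odd bytes is exactly the design question this RFC raises.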

@cowens

cowens commented Oct 9, 2016

What would the output be? Anything other than strict would have to be Uni I think (based on my understanding of the current desire in #perl6-dev).

@cygx
Author

cygx commented Oct 10, 2016

@cowens

No, the return value would be Str, with synthetic codepoints representing denormal grapheme clusters or invalid input sequences. The exact design still needs some thought, but I'd imagine something like this:

For denormal input, you store an additional Uni with the grapheme. String equality should still happen in terms of the normalized value, but when encoding, that codepoint sequence will be silently inserted. A string containing denormal clusters could have a different type (CompatStr?) to distinguish it from regular strings, but that's something we may not want. Canonically equivalent CompatStrs would still claim to be eq, but would fail ===.
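A minimal Python sketch of that idea, assuming a hypothetical `CompatStr` class that keeps the original codepoints next to the normalized form. For brevity it stores them per string rather than per grapheme; `==` plays the role of `eq` and the `identical` method plays the role of `===`.

```python
import unicodedata


class CompatStr:
    """Hypothetical string that remembers its non-normalized codepoints."""

    def __init__(self, text: str):
        self.original = text                                   # as decoded
        self.normalized = unicodedata.normalize("NFC", text)  # for comparison

    def __eq__(self, other):
        # Like eq: canonically equivalent strings compare equal.
        if not isinstance(other, CompatStr):
            return NotImplemented
        return self.normalized == other.normalized

    def __hash__(self):
        return hash(self.normalized)

    def identical(self, other: "CompatStr") -> bool:
        # Like ===: the stored codepoint sequences must match too.
        return self.original == other.original

    def encode(self, encoding: str = "utf-8") -> bytes:
        # Encoding silently emits the original codepoint sequence.
        return self.original.encode(encoding)


decomposed = CompatStr("e\u0301")  # e + COMBINING ACUTE ACCENT
composed = CompatStr("\u00e9")     # precomposed form
assert decomposed == composed
assert not decomposed.identical(composed)
```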

For invalid sequences, you would store a blob8, blob16 or blob32 (depending on unit size of the input encoding) instead of a Uni. Such graphemes compare equal if these buffers are.

In addition to .NFC, .NFD, ..., I would add a method .Uni to Str, which is the same as .NFC for regular strings, but returns the denormal sequence for a CompatStr. Calling .Uni on a string that contains invalid sequences should probably die (but could also insert the values within the buffers as codepoints).

@cygx
Author

cygx commented Oct 10, 2016

To clarify, warnings would not be generated for denormal input, but for stuff like Modified UTF-8 (which in particular encodes NUL as an overlong), CESU-8 and the original FSS-UTF, which encodes values up to 2^31.
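For reference, a small Python sketch of what two of these variants look like on the wire. The helper names are hypothetical; the point is only that a strict UTF-8 decoder (Python's, in this case) rejects both, although each has an obvious intended meaning.

```python
MODIFIED_UTF8_NUL = b"\xc0\x80"  # NUL as a 2-byte overlong


def cesu8_encode(ch: str) -> bytes:
    """Encode one character the CESU-8 way: go through UTF-16, then
    write each 16-bit unit (surrogates included) as ordinary UTF-8."""
    utf16 = ch.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # 'surrogatepass' lets Python emit the 3-byte form of a lone surrogate.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def strict_utf8_rejects(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


assert strict_utf8_rejects(MODIFIED_UTF8_NUL)
assert strict_utf8_rejects(cesu8_encode("\U0001F600"))
```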

@cygx
Author

cygx commented Oct 12, 2016

Modes warn and lax use replacement chars instead of generating compatibility sequences.
