This is supposed to allow handling two different types of questionable input: outright invalid byte sequences on the one hand and non-canonical byte sequences on the other. In the case of UTF-8, examples of 'non-canonical' byte sequences would be overlong encodings, encoded surrogate pairs, sequences of 5 or 6 bytes and potentially the 66 Unicode noncharacters as well. The same mechanism will also be leveraged to deal with non-normalized input.
Any decoder should be able to operate in any of the following modes:
- `strict`, which dies on invalid byte sequences as well as non-canonical sequences and automatically normalizes input as part of the conversion to NFG
- `warn`, the default, which still dies on invalid sequences and also silently normalizes, but warns on non-canonical input
- `lax`, which still dies on invalid sequences, but does not warn for non-canonical ones
- `compat`, which accepts any input sequence, encodes losslessly and round-trips if the resulting string gets re-encoded; this is achieved by generating synthetic graphemes as necessary to represent non-normalized, non-canonical as well as entirely invalid input
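To make the four modes concrete, here is a rough sketch in Python rather than Perl 6, and it is only an approximation of the proposal. The names `decode`, `NonCanonicalInput` and `is_noncharacter` are made up for illustration. Several assumptions apply: Python's built-in UTF-8 decoder already rejects overlongs, encoded surrogates and 5- or 6-byte sequences as invalid, so the only 'non-canonical' class this sketch can detect is the 66 noncharacters; NFC stands in for NFG; `lax` is assumed to normalize like `warn` does; and Python's `surrogateescape` error handler stands in for synthetic graphemes in `compat` mode.

```python
import unicodedata
import warnings


class NonCanonicalInput(ValueError):
    """Hypothetical error for valid but non-canonical input."""


def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF plus the last two
    # code points of each of the 17 planes (U+xxFFFE, U+xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def decode(data: bytes, mode: str = "warn") -> str:
    if mode not in ("strict", "warn", "lax", "compat"):
        raise ValueError(f"unknown mode: {mode!r}")

    if mode == "compat":
        # Accept anything. Undecodable bytes are mapped to lone
        # surrogates (a stand-in for synthetic graphemes), and no
        # normalization is applied, so re-encoding is lossless.
        return data.decode("utf-8", errors="surrogateescape")

    # All other modes die on invalid byte sequences
    # (raises UnicodeDecodeError).
    text = data.decode("utf-8", errors="strict")

    if any(is_noncharacter(ord(ch)) for ch in text):
        if mode == "strict":
            raise NonCanonicalInput("non-canonical input")
        if mode == "warn":
            warnings.warn("non-canonical input")
        # mode == "lax": accept silently

    # NFC used here as an approximation of normalizing to NFG.
    return unicodedata.normalize("NFC", text)
```

In this sketch a `compat` result is re-encoded with `text.encode("utf-8", errors="surrogateescape")`, which reproduces the original bytes; the other three modes are lossy by design, because they normalize.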
What would the output be? Anything other than `strict` would have to be `Uni`, I think (based on my understanding of the current desire in #perl6-dev).