cygx/rfc.md Secret

# RFC: Generalize the utf8-c8 approach to arbitrary encodings

The aim is to handle two different kinds of questionable input: outright invalid byte sequences on the one hand and non-canonical byte sequences on the other. In the case of UTF-8, examples of non-canonical byte sequences are overlong encodings, UTF-8-encoded surrogates (such as surrogate pairs), sequences of 5 or 6 bytes and potentially the 66 Unicode noncharacters as well. The same mechanism will also be used to deal with non-normalized input.
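
To make the two categories concrete, here is a minimal Python sketch that classifies a single UTF-8 byte sequence. The names `classify`, `is_noncharacter` and `flag_nonchars` are made up for illustration and are not part of any proposed API. Two things are assumptions of the sketch rather than statements of the RFC: noncharacters are only flagged on request (the text above says "potentially"), and 4-byte sequences that decode above U+10FFFF are treated like the 5- and 6-byte forms.

```python
# Smallest code point that legitimately needs a sequence of the given length.
MIN_CP = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane (66 in total)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE)


def classify(seq: bytes, flag_nonchars: bool = False) -> str:
    """Classify one byte sequence as 'canonical', 'non-canonical' or 'invalid'."""
    if not seq:
        return "invalid"
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, cp = 4, lead & 0x07
    elif 0xF8 <= lead <= 0xFB:
        length, cp = 5, lead & 0x03
    elif 0xFC <= lead <= 0xFD:
        length, cp = 6, lead & 0x01
    else:
        return "invalid"  # stray continuation byte, 0xFE or 0xFF
    if len(seq) != length:
        return "invalid"  # truncated or over-long input for this lead byte
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            return "invalid"  # not a continuation byte
        cp = (cp << 6) | (b & 0x3F)
    if cp < MIN_CP[length]:
        return "non-canonical"  # overlong encoding
    if 0xD800 <= cp <= 0xDFFF:
        return "non-canonical"  # encoded surrogate
    if cp > 0x10FFFF:
        return "non-canonical"  # 5/6-byte forms; 4-byte case is an assumption
    if flag_nonchars and is_noncharacter(cp):
        return "non-canonical"
    return "canonical"
```

For example, `classify(b"\xc0\x80")` (an overlong NUL) yields `'non-canonical'`, while `classify(b"\x80")` (a stray continuation byte) yields `'invalid'`.
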
Any decoder should be able to operate in any of the following modes:


- strict, which dies on invalid byte sequences as well as on non-canonical sequences and automatically normalizes input as part of the conversion to NFG
- warn, the default, which still dies on invalid sequences and also silently normalizes, but only warns on non-canonical input
- lax, which still dies on invalid sequences, but does not warn on non-canonical ones
- compat, which accepts any input sequence, represents it losslessly and round-trips if the resulting string gets re-encoded; this is achieved by generating synthetic graphemes as necessary to represent non-normalized, non-canonical as well as entirely invalid input
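
The four modes can be sketched roughly as follows for UTF-8. This is an illustration under several assumptions, not the proposed implementation: `scan` and `decode` are hypothetical names, NFC stands in for NFG (Python has neither NFG nor synthetic graphemes), compat keeps offending bytes as lone surrogates U+DC80..U+DCFF in the style of Python's `surrogateescape` error handler and simply skips normalization, lax and warn replace values above U+10FFFF with U+FFFD, and noncharacters are not flagged.

```python
import unicodedata
import warnings

MODES = ("strict", "warn", "lax", "compat")
MIN_CP = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}
LEADS = ((0x80, 1, 0x7F), (0xC0, 0, 0), (0xE0, 2, 0x1F), (0xF0, 3, 0x0F),
         (0xF8, 4, 0x07), (0xFC, 5, 0x03), (0xFE, 6, 0x01))


def scan(data: bytes):
    """Yield (kind, code point, raw bytes); kind is 'ok', 'non-canonical' or 'invalid'."""
    i = 0
    while i < len(data):
        lead = data[i]
        n, cp = 0, 0  # n == 0 marks a byte that cannot start a sequence
        for limit, length, mask in LEADS:
            if lead < limit:
                n, cp = length, lead & mask
                break
        tail = data[i + 1:i + n]
        if n == 0 or len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            yield "invalid", None, data[i:i + 1]  # resynchronize one byte later
            i += 1
            continue
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        bad = cp < MIN_CP[n] or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF
        yield ("non-canonical" if bad else "ok"), cp, data[i:i + n]
        i += n


def decode(data: bytes, mode: str = "warn") -> str:
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}")
    out = []
    for kind, cp, raw in scan(data):
        if kind == "ok":
            out.append(chr(cp))
        elif mode == "compat":
            # Assumption: lone surrogates stand in for synthetic graphemes.
            out.extend(chr(0xDC00 + b) for b in raw)
        elif kind == "invalid" or mode == "strict":
            raise ValueError(f"{kind} byte sequence {raw.hex()}")
        else:
            if mode == "warn":
                warnings.warn(f"non-canonical byte sequence {raw.hex()}")
            # Assumption: values Python cannot represent become U+FFFD.
            out.append(chr(cp) if cp <= 0x10FFFF else "\ufffd")
    text = "".join(out)
    # compat skips normalization so that re-encoding reproduces the input.
    return text if mode == "compat" else unicodedata.normalize("NFC", text)
```

With this sketch, `decode(b"e\xcc\x81", "strict")` normalizes to the single code point U+00E9, whereas `decode(data, "compat").encode("utf-8", "surrogateescape")` returns `data` unchanged for any byte string, including ones such as `b"a\xff\xc0\x80"` that the other three modes reject or alter.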