@GrabYourPitchforks
Created March 23, 2018 20:55
Utf8String design philosophy

Usage, usability, and behaviors

The goal of this project is to make a type that mirrors System.String as much as practical. It should be a heapable, immutable, indexable, and pinnable type. The data may contain embedded null characters. When pinned, the pointer should represent a null-terminated UTF-8 string.

We should provide conversions between String and Utf8String, though due to the expense of conversion we should avoid these operations when possible. There are a few ways to avoid these, including:

  • Adding Utf8String-based overloads to existing APIs like Console.WriteLine, File.WriteAllText, etc.
  • Adding ToUtf8String methods on existing types like Int32.
  • Implementing utility classes like Utf8StringBuilder.
  • Not providing implicit or explicit conversion operators that could perform expensive transcodings, instead offering constructor overloads or some other obvious "this may be expensive" mechanism.
  • Adding support for marshaling Utf8String instances directly, even to methods which expect LPCWSTR.
  • Adding language support for literal UTF-8 strings to assist with comparisons, including automatic conversion of a literal UTF-16 string to a literal UTF-8 string at compile time when the compiler can deduce correct usage.

Not all behaviors must be consistent between String and Utf8String. For example, pinning a null or empty String will result in a null char*. We can choose to implement a different behavior, e.g., pinning a null Utf8String will result in a null Utf8Char*; but pinning an empty Utf8String will result in a non-null Utf8Char*.

Ideally at some point in the future we can have full globalization support for UTF-8 sequences, including culture-aware sorting and case conversion routines. This will likely require a sizeable change to the globalization APIs, so it's possible that such a feature would be several versions out. We should at minimum support limited globalization-related operations on UTF-8 sequences, including Ordinal and OrdinalIgnoreCase comparisons, ToUpperInvariant and friends, and allowing the invariant culture to be passed to ToUtf8String.
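The Ordinal comparisons called out above are cheap precisely because UTF-8 byte order matches scalar-value order, so no decoding is needed to compare ordinally. A minimal sketch of this property, in Python purely for concreteness (the proposed Utf8String API is hypothetical and cannot be run directly):

```python
# UTF-8 preserves scalar-value order: comparing the raw UTF-8 bytes
# ordinally gives the same result as comparing the decoded code points.
# This is what lets Ordinal comparisons run over UTF-8 without decoding.
words = ["zebra", "Ärger", "éclair", "日本語", "abc"]

by_code_points = sorted(words, key=lambda s: [ord(c) for c in s])
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

assert by_code_points == by_utf8_bytes
```

Culture-aware sorting, by contrast, needs collation tables regardless of encoding, which is why it is the part deferred to a larger globalization work item.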

Finally, we will want to make it simple for developers to expose UTF-8 data as raw binary data so that it can be easily sent across I/O, but we also want to draw a distinction that the conversion is unidirectional. This means that perhaps we need a small bifurcation in our APIs. This could involve work like providing an implicit conversion from Utf8String to both ReadOnlySpan<Utf8Char> and ReadOnlySpan<byte>.

Performance

Utf8String should have complexity characteristics similar to those of String: constant-time indexing, linear-time allocation and searching, etc. For marshaling, we may wish to consider optimizations similar to those that currently exist for UTF-16 strings, e.g., stack-copying small objects rather than pinning the object in the managed heap. It is not a goal to provide constant-time indexing of scalar values or graphemes within either a UTF-8 or a UTF-16 string.

While Utf8String is useful for representing incoming UTF-8 data without the need for transcoding, it does still incur the cost of an allocation per instance. As part of this work we may want to consider making StringSlice or Utf8StringSlice first-class types in the framework. One could imagine these types as being thin wrappers (perhaps aliases?) for ReadOnlyMemory<char> and ReadOnlyMemory<Utf8Char> along with most (but not all) of the instance methods on String and Utf8String.

Security

UTF-8 processing has traditionally been a source of security vulnerabilities for applications and frameworks. There are subtleties in data processing that commonly lead to buffer overflows or exceptions in unexpected places.

We have had similar vulnerabilities in our own frameworks in the past, where the UTF-16 processing logic could be subverted, leading to undefined or undesirable behavior in application code. Thankfully, these vulnerabilities are fairly rare. It's generally difficult for ill-formed UTF-16 sequences to make their way into the system: client-submitted data on the wire is normally in UTF-8 format, and the conversion process from UTF-8 to UTF-16 will naturally replace invalid sequences with a replacement character. When vulnerabilities have been found, the culprit has generally been serializers like JSON readers which blindly splat unmatched "\uXXXX" surrogate sequences into a String rather than going through a proper encode / decode class.
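The unpaired-surrogate failure mode is easy to reproduce. A small Python sketch (Python strings, like .NET strings, can hold a lone surrogate code point, but no well-formed UTF-8 representation of it exists):

```python
# A serializer that splats a "\uXXXX" escape into a string without
# pairing surrogates can produce a value like this, which no strict
# UTF-8 encoder will accept.
lone_surrogate = "\ud800"

try:
    lone_surrogate.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True

assert raised  # strict encoding refuses the ill-formed string
```

The dangerous window is everything between the point where the bad string is created and the point where something finally tries to encode or decode it.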

UTF-8 is much more prone to misuse because remote client input is already expected to be in UTF-8 format. Since there's no need for transcoding, there's a greater temptation to blit the provided data directly into a UTF-8 container without running it through a verifier. This behavior generally leads to problems like those mentioned in the earlier CVE link. Therefore, as much as possible, we should strive to ensure that Utf8String instances represent well-formed UTF-8 data, where well-formedness is defined by The Unicode Standard, Chapter 3, Table 3-7 (PDF link), also duplicated below.

Code points          First byte   Second byte   Third byte   Fourth byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF       80..BF
U+0800..U+0FFF       E0           A0..BF        80..BF
U+1000..U+CFFF       E1..EC       80..BF        80..BF
U+D000..U+D7FF       ED           80..9F        80..BF
U+E000..U+FFFF       EE..EF       80..BF        80..BF
U+10000..U+3FFFF     F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF     F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF   F4           80..8F        80..BF       80..BF
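For illustration, a naive validator driven directly by the table above, sketched in Python (not the proposed framework API; a production implementation would be vectorized and allocation-free):

```python
# Each row pairs a lead-byte range with the allowed ranges for its
# continuation bytes. Overlong encodings, surrogates (U+D800..U+DFFF),
# and values above U+10FFFF are rejected structurally: they match no row.
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]

def is_well_formed_utf8(data: bytes) -> bool:
    i = 0
    while i < len(data):
        for (lo, hi), tails in TABLE_3_7:
            if lo <= data[i] <= hi:
                chunk = data[i + 1 : i + 1 + len(tails)]
                if len(chunk) == len(tails) and all(
                    t_lo <= b <= t_hi for b, (t_lo, t_hi) in zip(chunk, tails)
                ):
                    i += 1 + len(tails)
                    break
        else:
            return False  # no row matched at position i: ill-formed
    return True

assert is_well_formed_utf8("résumé".encode("utf-8"))
assert not is_well_formed_utf8(b"\xc0\xaf")      # overlong encoding of '/'
assert not is_well_formed_utf8(b"\xed\xa0\x80")  # UTF-8-encoded surrogate
```

Note that the overlong `C0 AF` sequence, a classic directory-traversal bypass, never reaches a decoder: it fails at the lead byte because C0 and C1 appear in no row.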

Any Utf8String factory (where "factory" is anything that returns a Utf8String instance) should perform validation on its inputs, replacing ill-formed sequences with the replacement character U+FFFD. The validation logic should be compatible with the existing UTF8Encoding class in the full framework. Furthermore, any component which transcodes or enumerates (as scalars) the UTF-8 data must validate the source data regardless.
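As a point of reference, Python's UTF-8 decoder applies a similar "replace ill-formed sequences with U+FFFD" policy, which makes the intended factory behavior easy to demonstrate:

```python
# 0xE2 is a three-byte lead byte, but 0x28 ('(') is not a valid
# continuation byte, so the decoder substitutes U+FFFD rather than
# letting the ill-formed bytes through.
bad = b"abc\xe2\x28\xa1def"
fixed = bad.decode("utf-8", errors="replace")

assert "\ufffd" in fixed
assert fixed.startswith("abc") and fixed.endswith("def")
```

The key property is that a consumer of the result never observes ill-formed data, only well-formed text containing visible U+FFFD markers.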

There are a handful of exceptions to this rule. Some callers may know that the input data is already well-formed, perhaps because it has been loaded from a trusted source (like a resource string) or because it has already been validated. There must be "no-validate" equivalents of the factories to allow the caller to avoid the performance hit.

One other exception is Substring and related APIs. While it's true that this could theoretically be used to split a Utf8String in the middle of a multibyte sequence, in practice developers tend to use this API in a safe fashion. Consider the following two examples.

// Example 1: trimming a known prefix
Utf8String str;
if (str.StartsWith("Foo")) { str = str.Substring(3); }

// Example 2: splitting at the offset of a match
Utf8String str;
int idx = str.IndexOf("Foo");
if (idx >= 0) { str = str.Substring(idx); }

In both cases, the string is split at a proper scalar boundary because the target string is well-formed. And since the target string is almost always a literal (or itself a Utf8String, which we assume to be well-formed), the split string will likewise be well-formed. Since this represents the typical use case of Substring, we can optimistically avoid validation on this and related calls.
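The boundary argument can be checked mechanically: UTF-8 is self-synchronizing (a lead byte can never be mistaken for a continuation byte), so a well-formed needle can only match a well-formed haystack at a scalar boundary. A Python sketch of the IndexOf-then-Substring pattern over raw UTF-8 bytes:

```python
# Haystack: "Foo" followed by é (U+00E9), 日 (U+65E5), and "bar".
haystack = "Foo\u00e9\u65e5bar".encode("utf-8")
needle = "Foo".encode("utf-8")

idx = haystack.find(needle)
assert idx >= 0
suffix = haystack[idx + len(needle):]

# Strict decoding succeeds: the slice did not split a multibyte sequence.
assert suffix.decode("utf-8") == "\u00e9\u65e5bar"
```

Slicing at an arbitrary integer offset, by contrast, can land mid-sequence; it is only the match-relative offsets used by this idiom that are guaranteed safe.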

Validation and inspection

We should expose APIs that allow developers to gather useful information about arbitrary UTF-8 sequences (not just Utf8String instances), including validation, transcoding, and enumeration of these sequences. There are three kinds of enumeration that are useful for both UTF-16 strings and UTF-8 strings.

  • Enumeration by code unit (Utf8Char or Char) - Provides access to the raw bit data of the string.
  • Enumeration by scalar (UnicodeScalar) - Provides access to the decoded data of the string. Can be used for transcoding purposes or to make ordinal comparisons between strings of different representations.
  • Enumeration by text element (type TBD) - Provides access to the displayed graphemes of the string. Can be used to extract individual "linguistic characters" from the string, including allowing manipulation such as string reversal.
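The first two enumeration levels can be illustrated with standard-library Python, which, notably, also stops short of a built-in grapheme enumerator:

```python
# 'e' + COMBINING ACUTE ACCENT + '!': three scalars, four UTF-8 code
# units, but only two graphemes ("é" and "!") as a user perceives them.
s = "e\u0301!"

code_units = list(s.encode("utf-8"))   # enumeration by UTF-8 code unit
scalars = [ord(c) for c in s]          # enumeration by scalar

assert code_units == [0x65, 0xCC, 0x81, 0x21]
assert scalars == [0x65, 0x301, 0x21]
```

Grapheme enumeration requires the UAX #29 segmentation rules and their data tables, which is exactly why it is the level whose home (framework or external component) is an open question below.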

The APIs we provide should be powerful and low-level enough for developers to build their own higher-level APIs on top, adding value where those developers see fit. As a concrete example, we needn't provide an API which says "the next scalar in the input string is CYRILLIC SMALL LETTER IOTIFIED A". But we should have an API which allows the developer to see that the next scalar in the input string is U+A657, allowing the developer to build their own higher-level API which then maps U+A657 to "CYRILLIC SMALL LETTER IOTIFIED A" (see code chart PDF).
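Python's standard library happens to demonstrate this same layering: the core string type yields scalar values, and a separate module (unicodedata, which carries the UCD tables) maps them to character names on top:

```python
import unicodedata

# The low-level layer yields the scalar value; the name lookup is a
# higher-level facility built over the Unicode Character Database.
scalar = 0xA657
name = unicodedata.name(chr(scalar))

assert name == "CYRILLIC SMALL LETTER IOTIFIED A"
```

Keeping the name tables out of the scalar enumeration API keeps the core small while leaving the richer facility buildable by anyone.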

Open question: should the framework provide a text element / grapheme enumerator? Or does it perhaps fall into the "separate component provides this facility using our lower-level APIs as implementation details" category?

See GitHub samples (first, second) for more preliminary thoughts on a validation / inspection API.
