@GrabYourPitchforks · Created December 13, 2018
Utf8Char and the .NET ecosystem

Motivations and driving principles behind the Utf8Char proposal

Utf8Char is analogous to Char: they represent a single UTF-8 code unit and a single UTF-16 code unit, respectively. They are distinct from the integral types Byte and UInt16 in that sequences of the UTF-* code unit types are meant to represent textual data, while sequences of the integral types are meant to represent binary data.

Drawing this distinction is important. With UTF-16 data (String, Char[]), this distinction historically hasn't been a source of confusion. Developers are generally cognizant of the fact that aside from RPC, most i/o involves some kind of transcoding mechanism. Binary data doesn't come in from disk or the network in a format that can be trivially projected as a textual string; it must go through validation, recombining, and substitution. Similarly, when writing a string to disk or the network, a trivial projection is again impossible. The transcoding step must run in reverse to get the text data into the correct binary format expected by i/o.
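
To make that round trip concrete, here's a minimal sketch using today's Encoding APIs (the file names are hypothetical): incoming bytes are decoded - validated, with substitution if necessary - into a UTF-16 string, and the string is re-encoded into bytes before being written back out.

using System.IO;
using System.Text;

byte[] incoming = File.ReadAllBytes("request.dat");   // binary data from i/o
string text = Encoding.UTF8.GetString(incoming);      // transcode to UTF-16 (validates, substitutes U+FFFD)
byte[] outgoing = Encoding.UTF8.GetBytes(text);       // transcode back to the binary format i/o expects
File.WriteAllBytes("response.dat", outgoing);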

A brief interlude on conformance and security

There is a key aspect here that is often lost in nuance. The purpose of the transcoding step isn't simply to "shrink" a string of UTF-16 code units into a string of UTF-8 code units (conveniently the same size as octets!) so that it can be blasted across the wire, or vice versa. It is to do so in such a manner that the receiver can reconstruct the original string with full fidelity.

With UTF-8, it is tempting to perform a trivial projection between the binary i/o layer (bytes) and the textual layer (UTF-8 code units). The elemental data types are the same shape, after all, so a reinterpret cast seems legal at first glance. The problem with this design is that at a certain point, one or more components will need to operate on this text. If the text is ill-formed, the components may produce undefined behavior, or they may attempt to fix up the text on-the-fly but may disagree on the final shape of the fixed-up text. This violates the "with full fidelity" aspect mentioned in the previous paragraph.

As a concrete example, consider a web application that blindly treats all incoming form data as ReadOnlySpan<byte> and attempts to interpret it as UTF-8. Within the context of this single web application, there may not be a problem with this design. If the buffer contains ill-formed UTF-8, all of the APIs in the web application process might have undefined behavior as they're working with it, but they likely have consistent undefined behavior.

Web applications almost never exist as a single isolated process, however. There is undoubtedly a persistent data store - a database or other backend service. If the web application forwards the ReadOnlySpan<byte> (containing ill-formed UTF-8) through to these layers, the backend layers could look at the same sequence of bytes and process them differently. Perhaps Component A is using varchar(UTF8) for its backend storage but Component B is using nvarchar for its backend storage. There is now a mismatch - a loss of fidelity - between these two systems.
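
The divergence is easy to demonstrate with today's APIs. In the sketch below (the byte values are arbitrary), two consumers of the same ill-formed payload legitimately produce different results depending on their fallback configuration:

using System;
using System.Text;

byte[] illFormed = { 0x68, 0x69, 0xC0 }; // "hi" followed by the invalid byte C0

// Component A: default UTF-8 decoding substitutes U+FFFD for the bad byte.
Console.WriteLine(Encoding.UTF8.GetString(illFormed)); // "hi\uFFFD"

// Component B: a strict decoder rejects the payload outright.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
try
{
    strict.GetString(illFormed);
}
catch (DecoderFallbackException)
{
    Console.WriteLine("payload rejected"); // A persisted "hi\uFFFD"; B persisted nothing
}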

This places us into a somewhat peculiar position with respect to security. We generally think of CVEs as affecting individual frameworks or individual applications, but this underrepresents a class of issues best described as "the API surface leads developers to writing applications which appear secure in isolation but which are in fact dangerous when used in conjunction with other applications."

Some examples of where vulnerabilities arise due to the interplay of components which handle ill-formed UTF-8 sequences differently:

  • CVE-2017-7653, where Eclipse Mosquitto is not itself exploitable but where it can be leveraged by a malicious user to forward an ill-formed payload to a victim user, resulting in DoS against that user (not against the service).
  • CVE-2015-3438, where WordPress sites are vulnerable to XSS if a commenter submits a comment containing ill-formed UTF-8 and WordPress sends that ill-formed sequence to MySQL, which before persistence will modify the text in a manner unanticipated by WordPress.

These issues tend to go underreported in the public sphere because the attack often must be tailored to a specific deployment or configuration of an application.

Back to Utf8Char

The proposal ultimately is to have ReadOnlySpan<Utf8Char> represent well-formed UTF-8 text data as much as possible. This mirrors ReadOnlySpan<Char>, which generally represents well-formed UTF-16 text data. In both cases it's possible for a developer to intentionally create ill-formed payloads by populating a Utf8Char[] or Char[] with garbage and then producing a span over that buffer. But since creating such a payload requires deliberate effort, developers are unlikely to end up with one inadvertently, so this shouldn't be a problem in practice. The standard way of getting a ReadOnlySpan<Utf8Char> from a ReadOnlySpan<byte> would be to use a factory that validates (and massages if necessary) the input data. This matches the behavior developers already expect when going from a byte sequence to a UTF-16 char sequence.
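
No such factory exists today; the sketch below shows one possible shape. The names Utf8Text.FromBytes, Utf8Utility.IsWellFormed, and Utf8Utility.ReplaceInvalidSequences are illustrative assumptions, not part of the proposal.

using System;
using System.Runtime.InteropServices;

public static class Utf8Text
{
    // Returns a span guaranteed to contain well-formed UTF-8, substituting
    // U+FFFD for any ill-formed subsequences in the input.
    public static ReadOnlySpan<Utf8Char> FromBytes(ReadOnlySpan<byte> bytes)
    {
        if (Utf8Utility.IsWellFormed(bytes)) // hypothetical validator
        {
            // Happy path: zero-copy projection of already-valid data.
            return MemoryMarshal.Cast<byte, Utf8Char>(bytes);
        }

        // Fix-up path: copy with replacement (hypothetical helper).
        byte[] fixedUp = Utf8Utility.ReplaceInvalidSequences(bytes);
        return MemoryMarshal.Cast<byte, Utf8Char>(fixedUp);
    }
}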

There are cases where the text span may not represent a standalone well-formed sequence. This can occur when text data is being operated on in a chunked fashion, and an ill-formed span is generated from a larger well-formed sequence that has been split across a multi-code unit boundary. A UTF-16 example is the well-formed sequence [ D808 DF45 ] chunked into the ill-formed subsequences [ D808 ] and [ DF45 ]. A UTF-8 example is the well-formed sequence [ F0 92 8D 85 ] chunked into the ill-formed subsequences [ F0 ] and [ 92 8D 85 ]. The Framework should provide OperationStatus-based APIs as much as possible to allow for chunked inputs, with the expectation that, given unlimited memory, the concatenation of all chunks would result in a well-formed supersequence.
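
Today's stateful Decoder already demonstrates the chunking requirement. In the sketch below, the UTF-8 example above is split across the same multi-code unit boundary, and the decoder buffers the partial sequence rather than treating it as ill-formed. An OperationStatus-based API would surface the same condition as NeedMoreData.

using System;
using System.Text;

// The well-formed sequence [ F0 92 8D 85 ] (U+12345) split into the
// chunks [ F0 ] and [ 92 8D 85 ], fed to a stateful decoder.
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[2];

int written = decoder.GetChars(new byte[] { 0xF0 }, 0, 1, chars, 0, flush: false);
Console.WriteLine(written); // 0 - the partial sequence is buffered, not replaced

written = decoder.GetChars(new byte[] { 0x92, 0x8D, 0x85 }, 0, 3, chars, 0, flush: true);
Console.WriteLine(written); // 2 - the surrogate pair [ D808 DF45 ] for U+12345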

Of special note is that some text operations cannot be performed in a chunked fashion. APIs like case conversion (ToUpper, ToLower) and transcoding can be designed to allow for chunking, but comparison APIs (CompareTo, StartsWith) cannot. In the concrete example below, chunking causes StartsWith to return a false positive result.

using System;

// Assumes current culture is en-US
static void Main(string[] args)
{
    string theString = "e\u0301";
    Console.WriteLine(theString.StartsWith("e")); // prints "False"

    theString = theString.Substring(0, 1); // chunk
    Console.WriteLine(theString.StartsWith("e")); // prints "True"
}

Generally speaking, Framework APIs which operate on ReadOnlySpan<byte> as UTF-8 input must not assume the input is well-formed and must have a well-defined behavior if ill-formed UTF-8 is encountered. The API may choose to take any number of actions - throw, return a failure code, perform replacement - as long as the behavior is part of the API contract and the caller understands this contract.

Framework APIs which operate on ReadOnlySpan<Utf8Char> should validate the input for well-formedness if such checks do not impose a hardship on the method implementation. There may be certain performance-sensitive routines which cannot incur that cost; such routines may assume the input is well-formed and may have undefined behavior if this invariant is violated, short of that behavior causing an access violation or other runtime corruption. For example, if a routine is given the single-element input [ C2 ], it mustn't attempt to read off the end of the source buffer. Routines which require well-formed input must be contracted as such. Chunking APIs must at the very least continue to check for boundary conditions, even if they don't check for other ill-formedness in the sequence.
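
A minimal sketch of that boundary check follows. Utf8Char is of course the proposed type; the lead-byte classification is simplified and assumes the lead byte itself is valid, since the routine is contracted to receive otherwise well-formed input.

// Hypothetical helper: reports how many code units the first UTF-8
// sequence in 'span' requires, without ever reading past the end.
static bool TryGetFirstSequenceLength(ReadOnlySpan<Utf8Char> span, out int sequenceLength)
{
    sequenceLength = 0;
    if (span.IsEmpty)
    {
        return false;
    }

    byte lead = (byte)span[0]; // assumes the proposal's Utf8Char -> byte conversion
    int expected =
        (lead < 0x80) ? 1 :    // ASCII
        (lead < 0xE0) ? 2 :    // 2-byte lead (C2..DF)
        (lead < 0xF0) ? 3 :    // 3-byte lead (E0..EF)
                        4;     // 4-byte lead (F0..F4)

    if (span.Length < expected)
    {
        return false; // e.g. the single-element input [ C2 ]: report incompleteness, don't overread
    }

    sequenceLength = expected;
    return true;
}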

For more information on conformance, validation, and the distinction between binary data and textual data, see the Conformance chapter (Chapter 3) of the Unicode Standard, which defines the well-formedness requirements for the UTF-* encoding forms.

Projections between Utf8Char and Byte

The UTF-8 code unit type Utf8Char does not attempt to validate its input.

Utf8Char c = (Utf8Char)(byte)0xC0; // creates a Utf8Char with the value C0
Rune r = new Rune(0xD800); // throws at runtime

The first line above creates a Utf8Char instance with the value C0, even though the Unicode Specification expressly states that C0 is never a valid value for a UTF-8 code unit. Contrast this with the Rune type, whose constructor prohibits creating instances from values outside the valid Unicode scalar range, which is why the second line throws.

It is possible to project (reinterpret cast) a {ReadOnly}Span<Utf8Char> to a ReadOnlySpan<byte>. This is useful for operations like writing UTF-8 text directly to an i/o pipe.

ReadOnlySpan<Utf8Char> utf8 = ...;
ReadOnlySpan<byte> bytes = utf8.AsBytes();
stream.Write(bytes);

The projections Span<Utf8Char> -> Span<byte> and {ReadOnly}Span<byte> -> {ReadOnly}Span<Utf8Char> should also be possible. We do not want to prevent developers from removing any safety rails we provide within the Framework, but we also don't want developers to remove those rails inadvertently. Projections which blur the lines between textual representation and binary representation in a "dangerous" manner should require an affirmative action from the developer. One possible way to get this affirmation is to require use of the existing reinterpret_cast-like API.

// Assumes 'roBytes' is a ReadOnlySpan<byte>, 'bytes' is a Span<byte>,
// and 'utf8' is a Span<Utf8Char>.
ReadOnlySpan<Utf8Char> a = MemoryMarshal.Cast<byte, Utf8Char>(roBytes);
Span<Utf8Char> b = MemoryMarshal.Cast<byte, Utf8Char>(bytes);
Span<byte> c = MemoryMarshal.Cast<Utf8Char, byte>(utf8);

The methods Span<T>.ToString and Memory<T>.ToString (and their read-only equivalents) will be enlightened for T = Utf8Char, just as they're enlightened for T = char today. The behavior of the method will be to transcode the data to UTF-16 (with invalid sequence replacement if necessary) and to return the expected String instance. This enlightenment will not extend to the case where T = byte.
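
For reference, the char enlightenment behaves as follows today; the proposed Utf8Char behavior would mirror it (the second half is illustrative, since the type does not yet exist):

ReadOnlySpan<char> utf16 = "héllo".AsSpan();
string s1 = utf16.ToString(); // "héllo" - enlightened for T = char

// Proposed equivalent (illustrative):
// ReadOnlySpan<Utf8Char> utf8 = ...;
// string s2 = utf8.ToString(); // transcodes UTF-8 -> UTF-16, replacing invalid sequences as needed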

Unlike Span<T>, Memory<T> instances cannot be projected to a different type Memory<U>. This means that there is no way to cast between Memory<Utf8Char> and Memory<byte> (or their read-only equivalents) on-the-fly.

Comparing to other languages

In Go 1.x, string and []byte are distinct sliceable types. Developers generally use strings to store textual data and byte slices to store binary data. This distinction is sometimes a bit blurry and developers may require external information (documentation, context, method names) to determine exactly what kind of information is represented by the slice, akin to using traditional char* pointers in C.

The biggest difference between the two types is that string represents truly immutable data (not just an immutable view into mutable data), whereas []byte represents mutable data. Thus no trivial projection is possible between the two, and any conversion must be implemented as a copy. There are proposals for read-only slices in a future version of Go, though to the best of my knowledge these proposals have not been approved. If such a feature comes to fruition, it seems a non-copying projection string -> <readonly> []byte would be allowed implicitly, but the reverse projection <readonly> []byte -> string would still require a copy. (See golang/go#20443 and https://groups.google.com/forum/#!topic/golang-dev/Y7j4B2r_eDw/discussion for further information.)

In Swift, it is possible to create a UTF-* view over any String instance. The corresponding types are String.UTF8View, String.UTF16View, and String.UTF32View. These are specialized text sequence types distinct from normal binary data sequence types, though their element types are UInt8, UInt16, and UInt32, respectively. This means that it is not possible to project trivially between String.UTF8View and [UInt8]; a copy must take place. (See https://developer.apple.com/documentation/swift/string/utf8view for further information.)

Alternative proposals

Utf8Slice

Instead of introducing a Utf8Char type and allowing ReadOnlySpan<Utf8Char> to represent a slice of UTF-8 textual data, one could imagine introducing a Utf8Slice type which is a thin wrapper around ReadOnlySpan<Byte>. Inspection or manipulation methods would operate on this type rather than exist as specialized extensions on MemoryExtensions. Utf8Slice would be indexable (with Byte as the elemental type).

There is some prior art here in that it's similar to how the Go language operates. But this leads to a problem in that Utf8Slice instances would be limited in functionality. They'd be immutable, requiring manipulation APIs to bounce through a separate byte sequence and wrap a new Utf8Slice around it. We'd have to determine if we'd want a heapable (ReadOnlyMemory<Byte>-based) sibling type. There would be confusion as to why there's asymmetry between this and the UTF-16 types. After these and other considerations we're basically reinventing the Utf8String proposal, so there's minimal benefit to Utf8Slice as proposed here.

Use ReadOnlySpan<byte> for everything

This is tempting from the perspective of a system that wants to treat everything as pass-through as much as possible, but I don't believe it's appropriate from the perspective of a framework. There are two main issues I have with this approach.

The first is that it interferes with the general concept of a type system and makes it more difficult to reason about code. If a developer has a Byte[] in their code, they shouldn't need the additional bookkeeping overhead of asking themselves "does this represent binary data like a JPG, or does this represent UTF-8 text?" Text-based extension methods (Contains, ToUpper, etc.) also shouldn't begin appearing for arbitrary binary data sequences.

The second is that this blurs the line between binary data and textual data, leading to the validation and conformance problems mentioned earlier. I don't want the framework to encourage developers to play fast and loose with this, potentially leading to undefined behavior in their applications. This is still subject to the earlier caveats: power developers should absolutely be able to project the data with minimal fuss, but this should be an affirmative action.

GrabYourPitchforks commented Dec 13, 2018

This has been moved to https://github.com/dotnet/corefx/issues/34094 for formal API review.
