The goal of this project is to make a type that mirrors `System.String` as much as practical. It should be a heapable, immutable, indexable, and pinnable type. The data may contain embedded null characters. When pinned, the pointer should represent a null-terminated UTF-8 string.
We should provide conversions between `String` and `Utf8String`, though due to the expense of conversion we should avoid these operations when possible. There are a few ways to avoid these, including:

- Adding `Utf8String`-based overloads to existing APIs like `Console.WriteLine`, `File.WriteAllText`, etc.
- Adding `ToUtf8String` methods on existing types like `Int32`.
- Implementing utility classes like `Utf8StringBuilder`.
- Not having implicit or explicit conversion operators that could perform expensive transcodings, but instead having constructor overloads or some other obvious "this may be expensive" mechanism.
- Adding support for marshaling `Utf8String` instances directly, even to methods which expect `LPCWSTR`.
- Adding language support for literal UTF-8 strings to assist with comparisons, including automatic conversion of a literal UTF-16 string to a literal UTF-8 string at compile time if the compiler can deduce correct usage.
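As a sketch of how the "no hidden conversions" idea might look at a call site, consider the following. Every member shown here is hypothetical, not committed API surface; the point is that each potentially expensive transcoding is visible at the call site.

```csharp
// Hypothetical API shapes; none of these names are final.
Utf8String u8 = new Utf8String("Hello");   // constructor overload: visibly transcodes UTF-16 -> UTF-8
string s16 = u8.ToString();                // explicit UTF-8 -> UTF-16 conversion
Console.WriteLine(u8);                     // proposed overload: no transcoding needed at all
```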
Not all behaviors must be consistent between `String` and `Utf8String`. For example, pinning a null or empty `String` will result in a null `char*`. We can choose to implement a different behavior, e.g., pinning a null `Utf8String` will result in a null `Utf8Char*`, but pinning an empty `Utf8String` will result in a non-null `Utf8Char*`.
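Under that choice, pinning might behave as sketched below (`Utf8Char` and `Utf8String.Empty` are assumed names from this proposal):

```csharp
unsafe
{
    // Hypothetical: pinning an empty Utf8String yields a non-null pointer
    // whose first element is the null terminator, so the pointer is always
    // directly usable as a null-terminated UTF-8 string.
    fixed (Utf8Char* p = Utf8String.Empty)
    {
        // p != null here, and *p is the terminating null.
    }
}
```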
Ideally at some point in the future we can have full globalization support for UTF-8 sequences, including culture-aware sorting and case conversion routines. This will likely require a sizeable change to the globalization APIs, so it's possible that such a feature would be several versions out. We should at minimum support limited globalization-related operations on UTF-8 sequences, including `Ordinal` and `OrdinalIgnoreCase` comparisons, `ToUpperInvariant` and friends, and allowing the invariant culture to be passed to `ToUtf8String`.
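A minimal sketch of what that limited, ordinal-and-invariant-only surface might look like (all member names here are assumptions, not proposed API):

```csharp
Utf8String a = new Utf8String("Hello");
Utf8String b = new Utf8String("HELLO");

// Ordinal operations compare code units directly and so don't require
// full globalization support.
bool same = Utf8String.Equals(a, b, StringComparison.OrdinalIgnoreCase); // assumed overload
Utf8String upper = a.ToUpperInvariant();                                 // assumed method
```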
Finally, we will want to make it simple for developers to expose UTF-8 data as raw binary data so that it can be easily sent across I/O, but we also want to draw a distinction that the conversion is unidirectional. This means that perhaps we need a small bifurcation in our APIs. This could involve work like providing an implicit conversion from `Utf8String` to both `ReadOnlySpan<Utf8Char>` and `ReadOnlySpan<byte>`.
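A sketch of how that unidirectional exposure might look at a call site (the implicit operator is the one proposed above; `stream` stands in for any `System.IO.Stream`):

```csharp
Utf8String payload = new Utf8String("GET / HTTP/1.1\r\n");

// Unidirectional: Utf8String flows out as raw bytes for I/O...
ReadOnlySpan<byte> raw = payload;   // proposed implicit conversion
stream.Write(raw);

// ...but there is deliberately no implicit conversion from bytes back to
// Utf8String, since that direction requires validation.
```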
`Utf8String` should have similar complexity characteristics as `String`: constant time indexing, linear time allocation and searching, etc. For marshaling, we may wish to consider similar optimizations as currently exist for UTF-16 strings, e.g., stack-copying small objects rather than pinning the object in the managed heap. It is not a goal to provide constant time indexing of scalar values or graphemes within either a UTF-8 or a UTF-16 string.

While `Utf8String` is useful for representing incoming UTF-8 data without the need for transcoding, it does still incur the cost of an allocation per instance. As part of this work we may want to consider making `StringSlice` or `Utf8StringSlice` first-class types in the framework. One could imagine these types as being thin wrappers (perhaps aliases?) for `ReadOnlyMemory<char>` and `ReadOnlyMemory<Utf8Char>` along with most (but not all) of the instance methods on `String` and `Utf8String`.
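A speculative shape for such a slice type, assuming the `Utf8Char` element type from this proposal:

```csharp
// Speculative: a non-allocating view over UTF-8 data, mirroring a subset
// of the Utf8String instance methods.
public readonly struct Utf8StringSlice
{
    private readonly ReadOnlyMemory<Utf8Char> _memory;

    public Utf8StringSlice(ReadOnlyMemory<Utf8Char> memory) => _memory = memory;

    public int Length => _memory.Length;

    // Most (but not all) Utf8String members would be mirrored here,
    // forwarding to span-based implementations.
    public Utf8StringSlice Slice(int start, int length)
        => new Utf8StringSlice(_memory.Slice(start, length));
}
```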
UTF-8 processing has traditionally been a source of security vulnerabilities for applications and frameworks. There are subtleties in data processing that commonly lead to buffer overflows or exceptions in unexpected places.
We have had similar vulnerabilities in our own frameworks in the past where the UTF-16 processing logic can be subverted, leading to undefined or undesirable behavior in application code. However, these vulnerabilities are thankfully fairly rare. It's generally difficult for ill-formed UTF-16 sequences to make their way into the system because client-submitted data on the wire is normally in UTF-8 format, and the conversion process from UTF-8 to UTF-16 will naturally replace invalid sequences with a replacement character. When vulnerabilities have been found, the culprit has generally been serializers like JSON readers which blindly splat `"\uXXXX"` unmatched surrogate sequences into a `String` rather than go through a proper encode / decode class.
UTF-8 is much more prone to misuse due to the fact that remote client input is already expected to be in UTF-8 format. Since there's no need for transcoding, there's a greater temptation to blit the provided data directly into a UTF-8 container without running it through a verifier. This behavior generally leads to problems like those mentioned in the earlier CVE link. Therefore, as much as possible, we should strive to ensure that `Utf8String` instances represent well-formed UTF-8 data, where well-formedness is defined by The Unicode Standard, Chapter 3, Table 3-7 (PDF link), also duplicated below.
| Code points | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| `U+0000..U+007F` | `00..7F` | | | |
| `U+0080..U+07FF` | `C2..DF` | `80..BF` | | |
| `U+0800..U+0FFF` | `E0` | `A0..BF` | `80..BF` | |
| `U+1000..U+CFFF` | `E1..EC` | `80..BF` | `80..BF` | |
| `U+D000..U+D7FF` | `ED` | `80..9F` | `80..BF` | |
| `U+E000..U+FFFF` | `EE..EF` | `80..BF` | `80..BF` | |
| `U+10000..U+3FFFF` | `F0` | `90..BF` | `80..BF` | `80..BF` |
| `U+40000..U+FFFFF` | `F1..F3` | `80..BF` | `80..BF` | `80..BF` |
| `U+100000..U+10FFFF` | `F4` | `80..8F` | `80..BF` | `80..BF` |
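To make the table concrete, here is a minimal, unoptimized sketch of a validator that implements Table 3-7 directly. It uses only existing framework types; the `Utf8Validator.IsWellFormed` name is our own invention for illustration, not a proposed API, and a production implementation would likely be vectorized.

```csharp
using System;

static class Utf8Validator
{
    // Returns true if 'data' is well-formed UTF-8 per Table 3-7.
    public static bool IsWellFormed(ReadOnlySpan<byte> data)
    {
        int i = 0;
        while (i < data.Length)
        {
            byte b0 = data[i];
            int len;
            byte lo = 0x80, hi = 0xBF; // default continuation-byte range

            if (b0 <= 0x7F) { i++; continue; }           // U+0000..U+007F
            else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;  // U+0080..U+07FF
            else if (b0 == 0xE0) { len = 3; lo = 0xA0; } // U+0800..U+0FFF (no overlong)
            else if (b0 >= 0xE1 && b0 <= 0xEC) len = 3;  // U+1000..U+CFFF
            else if (b0 == 0xED) { len = 3; hi = 0x9F; } // U+D000..U+D7FF (no surrogates)
            else if (b0 >= 0xEE && b0 <= 0xEF) len = 3;  // U+E000..U+FFFF
            else if (b0 == 0xF0) { len = 4; lo = 0x90; } // U+10000..U+3FFFF (no overlong)
            else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;  // U+40000..U+FFFFF
            else if (b0 == 0xF4) { len = 4; hi = 0x8F; } // U+100000..U+10FFFF
            else return false;                           // C0, C1, F5..FF are never valid

            if (i + len > data.Length) return false;     // truncated sequence

            // The restricted range applies only to the second byte;
            // remaining continuation bytes are always 80..BF.
            if (data[i + 1] < lo || data[i + 1] > hi) return false;
            for (int j = 2; j < len; j++)
                if (data[i + j] < 0x80 || data[i + j] > 0xBF) return false;

            i += len;
        }
        return true;
    }
}
```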
Any `Utf8String` factory (where "factory" is anything that returns a `Utf8String` instance) should perform validation on its inputs, replacing ill-formed sequences with the replacement character `U+FFFD`. The validation logic should be compatible with the existing `UTF8Encoding` class in the full framework. Furthermore, any component which transcodes or enumerates (as scalars) the UTF-8 data must validate the source data regardless.
There are a handful of exceptions to this rule. Some callers may know that the input data is already well-formed, perhaps because it has been loaded from a trusted source (like a resource string) or because it has already been validated. There must be "no-validate" equivalents of the factories to allow the caller to avoid the performance hit.
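This pairing might be sketched as follows (both factory names here are placeholders rather than proposed API names, and `untrustedBytes` / `trustedBytes` are hypothetical inputs):

```csharp
// Validating factory: ill-formed sequences are replaced with U+FFFD.
Utf8String a = Utf8String.Create(untrustedBytes);

// Non-validating factory: the caller asserts the data is already
// well-formed, e.g. loaded from a trusted resource or validated earlier.
Utf8String b = Utf8String.CreateWithoutValidation(trustedBytes);
```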
One other exception is `Substring` and related APIs. While it's true that these could theoretically be used to split a `Utf8String` in the middle of a multibyte sequence, in practice developers tend to use these APIs in a safe fashion. Consider the following two examples.
```csharp
Utf8String str;
if (str.StartsWith("Foo")) { str = str.Substring(3); }
```

```csharp
Utf8String str;
int idx = str.IndexOf("Foo");
if (idx >= 0) { str = str.Substring(idx); }
```
In both cases, the string is split at a proper scalar boundary due to the fact that the target string is well-formed. And since the target string is almost always a literal (or itself a `Utf8String`, which we assume to be well-formed), the split string will likewise be well-formed. Since this represents the typical use case of `Substring`, we can optimistically avoid validation on this and related calls.
We should expose APIs that allow developers to gather useful information about arbitrary UTF-8 sequences (not just `Utf8String` instances), including validation, transcoding, and enumeration of these sequences. There are three kinds of enumeration that are useful for both UTF-16 strings and UTF-8 strings.

- Enumeration by code unit (`Utf8Char` or `Char`): provides access to the raw bit data of the string.
- Enumeration by scalar (`UnicodeScalar`): provides access to the decoded data of the string. Can be used for transcoding purposes or to make ordinal comparisons between strings of different representations.
- Enumeration by text element (type TBD): provides access to the displayed graphemes of the string. Can be used to extract individual "linguistic characters" from the string, including allowing manipulation such as string reversal.
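A hypothetical sketch of the three modes side by side (the property names `Chars`, `Scalars`, and `TextElements` are assumptions, not proposed surface area):

```csharp
Utf8String str = new Utf8String("hello");

foreach (Utf8Char unit in str.Chars) { /* raw UTF-8 code units */ }
foreach (UnicodeScalar scalar in str.Scalars) { /* decoded code points */ }
foreach (var element in str.TextElements) { /* displayed graphemes; element type TBD */ }
```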
The APIs we provide should be powerful and low-level enough for developers to build their own higher-level APIs on top, adding value where those developers see fit. As a concrete example, we needn't provide an API which says "the next scalar in the input string is CYRILLIC SMALL LETTER IOTIFIED A". But we should have an API which allows the developer to see that the next scalar in the input string is `U+A657`, allowing the developer to build their own higher-level API which then maps `U+A657` to "CYRILLIC SMALL LETTER IOTIFIED A" (see code chart PDF).
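For instance, such an inspection loop might look like this (a sketch; `Scalars` and `UnicodeScalar.Value` are assumed names):

```csharp
foreach (UnicodeScalar scalar in str.Scalars)
{
    if (scalar.Value == 0xA657)
    {
        // The framework reports only the scalar value U+A657; mapping it
        // to the name "CYRILLIC SMALL LETTER IOTIFIED A" is left to
        // higher-level libraries built on this low-level API.
    }
}
```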
Open question: should the framework provide a text element / grapheme enumerator? Or does it perhaps fall into the "separate component provides this facility using our lower-level APIs as implementation details" category?
See GitHub samples (first, second) for more preliminary thoughts on a validation / inspection API.