@GrabYourPitchforks
Last active September 14, 2019 17:38
UTF8 design for LDM

Utf8String design overview

Audience and scenarios

Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or I/O in general). Today these applications spend CPU cycles and memory transcoding data into formats (such as UTF-16) that aren't particularly useful to them.

A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular byte[] instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant();. This defeats the purpose of using a typed language.

We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a full drop-in replacement for its sibling type string. For example, we might add Utf8String-related overloads to existing APIs in the System.IO namespace, but we wouldn't add an overload Assembly.LoadFrom(Utf8String assemblyName).

In addition to networking and I/O scenarios, we expect an audience that wants to use Utf8String for interop, especially when interoperating with components written in Rust or Go. Both of these languages use UTF-8 as their native string representation, and providing a type which can be used as a data exchange type for that audience will make their scenarios a bit easier.

Finally, we should afford power developers the opportunity to improve their throughput and memory utilization by limiting data copying where feasible. This doesn't imply that we must be allocation-free or zero-copy for every scenario. But it does imply that we should investigate common operations and consider alternative ways of performing these tasks as long as it doesn't compromise the usability of the mainline scenarios.

It's important to call out that Utf8String is not intended to be a replacement for string. The standard UTF-16 string will remain the core primitive type used throughout the .NET ecosystem and will enjoy the largest supported API surface area. We expect that developers who use Utf8String in their code bases will do so deliberately, either because they're working in one of the aforementioned scenarios or because they find other aspects of Utf8String (such as its API surface or behavior guarantees) desirable.

Design decisions and type API

To make internal Utf8String implementation details easier, and to allow consumers to better reason about the type's behavior, the Utf8String type maintains the following invariants:

  • Instances are immutable. Once data is copied to the Utf8String instance, it is unchanging for the lifetime of the instance. All members on Utf8String are thread-safe.

  • Instances are heap-allocated. This is a standard reference type, like string and object.

  • The backing data is guaranteed well-formed UTF-8. It can be round-tripped through string (or any other Unicode-compatible encoding) and back without any loss of fidelity. It can be passed verbatim to any other component whose contract requires that it operate only on well-formed UTF-8 data.

  • The backing data is null-terminated. If the Utf8String instance is pinned, the resulting byte* can be passed to any API which takes an LPCUTF8STR parameter. (Like string, Utf8String instances can contain embedded nulls.)
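
As a rough illustration of the well-formedness and round-trip invariants, the sketch below uses only existing BCL types (UTF8Encoding configured to throw on invalid input); the proposed Utf8String type is not used. The helper name IsWellFormedUtf8 is hypothetical, and the code is written with C# top-level statements.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of the proposal): reports whether a buffer
// holds well-formed UTF-8.
static bool IsWellFormedUtf8(ReadOnlySpan<byte> data)
{
    // A UTF8Encoding created with throwOnInvalidBytes: true throws instead of
    // silently substituting U+FFFD for ill-formed sequences.
    var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strict.GetCharCount(data);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}

// Well-formed data survives a UTF-8 -> UTF-16 -> UTF-8 round trip unchanged.
byte[] original = Encoding.UTF8.GetBytes("h\u00E9llo, \u4E16\u754C");
byte[] roundTripped = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(original));
bool sameBytes = original.AsSpan().SequenceEqual(roundTripped);

// A truncated four-byte sequence is not well-formed.
byte[] truncated = { 0xF0, 0x9F, 0x92 };

Console.WriteLine(IsWellFormedUtf8(original));   // True
Console.WriteLine(sameBytes);                    // True
Console.WriteLine(IsWellFormedUtf8(truncated));  // False
```

A type that enforces this check once at construction lets every later consumer skip it, which is the point of making well-formedness an invariant rather than a convention.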

These invariants help shape the proposed API and usage examples as described throughout this document.

[Serializable]
public sealed class Utf8String : IComparable<Utf8String>, IEquatable<Utf8String>, ISerializable
{
    public static readonly Utf8String Empty; // matches String.Empty

    /*
     * CTORS AND FACTORIES
     *
     * These ctors all have "throw on invalid data" behavior since it's intended that data should
     * be faithfully retained and should be round-trippable back to its original encoding.
     */

    public Utf8String(byte[]? value, int startIndex, int length);
    public Utf8String(char[]? value, int startIndex, int length);
    public Utf8String(ReadOnlySpan<byte> value);
    public Utf8String(ReadOnlySpan<char> value);
    public Utf8String(string value);

    // These ctors expect null-terminated UTF-8 or UTF-16 input.
    // They'll compute strlen / wcslen on the caller's behalf.

    public unsafe Utf8String(byte* value);
    public unsafe Utf8String(char* value);

    public static Utf8String Create<TState>(int length, TState state, SpanAction<byte, TState> action);

    // "Try" factories are non-throwing equivalents of the above methods. They use a try pattern instead
    // of throwing if invalid input is detected.

    public static bool TryCreateFrom(ReadOnlySpan<byte> buffer, out Utf8String? value);
    public static bool TryCreateFrom(ReadOnlySpan<char> buffer, out Utf8String? value);

    // "Loose" factories also perform validation, but if an invalid sequence is detected they'll
    // silently fix it up by performing U+FFFD substitution in the returned Utf8String instance
    // instead of throwing.

    public static Utf8String CreateFromLoose(ReadOnlySpan<byte> buffer);
    public static Utf8String CreateFromLoose(ReadOnlySpan<char> buffer);
    public static Utf8String CreateLoose<TState>(int length, TState state, SpanAction<byte, TState> action);

    // "Unsafe" factories skip validation entirely. It's up to the caller to uphold the invariant
    // that Utf8String instances only ever contain well-formed UTF-8 data.

    [RequiresUnsafe]
    public static Utf8String UnsafeCreateWithoutValidation(ReadOnlySpan<byte> utf8Contents);
    [RequiresUnsafe]
    public static Utf8String UnsafeCreateWithoutValidation<TState>(int length, TState state, SpanAction<byte, TState> action);

    /*
     * ENUMERATION
     *
     * Since there's no this[int] indexer on Utf8String, these properties allow enumeration
     * of the contents as UTF-8 code units (Bytes), as UTF-16 code units (Chars), or as
     * Unicode scalar values (Runes). The enumerable struct types are defined at the bottom
     * of this type.
     */

    public ByteEnumerable Bytes { get; }
    public CharEnumerable Chars { get; }
    public RuneEnumerable Runes { get; }

    // Also allow iterating over extended grapheme clusters (not yet ready).
    // public GraphemeClusterEnumerable GraphemeClusters { get; }

    /*
     * COMPARISON
     *
     * All comparisons are Ordinal unless the API takes a parameter such
     * as a StringComparison or CultureInfo.
     */

    // The "AreEquivalent" APIs compare UTF-8 data against UTF-16 data for equivalence, where
    // equivalence is defined as "the texts would transcode as each other".
    // (Shouldn't these methods really be on a separate type?)

    public static bool AreEquivalent(Utf8String? utf8Text, string? utf16Text);
    public static bool AreEquivalent(Utf8Span utf8Text, ReadOnlySpan<char> utf16Text);
    public static bool AreEquivalent(ReadOnlySpan<byte> utf8Text, ReadOnlySpan<char> utf16Text);
    
    public int CompareTo(Utf8String? other);
    public int CompareTo(Utf8String? other, StringComparison comparisonType);

    public override bool Equals(object? obj); // 'obj' must be Utf8String, not string
    public static bool Equals(Utf8String? left, Utf8String? right);
    public static bool Equals(Utf8String? left, Utf8String? right, StringComparison comparisonType);
    public bool Equals(Utf8String? value);
    public bool Equals(Utf8String? value, StringComparison comparisonType);

    public static bool operator !=(Utf8String? left, Utf8String? right);
    public static bool operator ==(Utf8String? left, Utf8String? right);

    /*
     * SEARCHING
     *
     * Like comparisons, all searches are Ordinal unless the API takes a
     * parameter dictating otherwise.
     */
    
    public bool Contains(char value);
    public bool Contains(char value, StringComparison comparisonType);
    public bool Contains(Rune value);
    public bool Contains(Rune value, StringComparison comparisonType);
    public bool Contains(Utf8String value);
    public bool Contains(Utf8String value, StringComparison comparisonType);

    public bool EndsWith(char value);
    public bool EndsWith(char value, StringComparison comparisonType);
    public bool EndsWith(Rune value);
    public bool EndsWith(Rune value, StringComparison comparisonType);
    public bool EndsWith(Utf8String value);
    public bool EndsWith(Utf8String value, StringComparison comparisonType);

    public bool StartsWith(char value);
    public bool StartsWith(char value, StringComparison comparisonType);
    public bool StartsWith(Rune value);
    public bool StartsWith(Rune value, StringComparison comparisonType);
    public bool StartsWith(Utf8String value);
    public bool StartsWith(Utf8String value, StringComparison comparisonType);

    // TryFind is the equivalent of IndexOf. It returns a Range instead of an integer
    // index because there's no this[int] indexer on the Utf8String type, and encouraging
    // developers to slice by integer indices will almost certainly lead to bugs.
    // More on this later.

    public bool TryFind(char value, out Range range);
    public bool TryFind(char value, StringComparison comparisonType, out Range range);
    public bool TryFind(Rune value, out Range range);
    public bool TryFind(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFind(Utf8String value, out Range range);
    public bool TryFind(Utf8String value, StringComparison comparisonType, out Range range);

    public bool TryFindLast(char value, out Range range);
    public bool TryFindLast(char value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Rune value, out Range range);
    public bool TryFindLast(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Utf8String value, out Range range);
    public bool TryFindLast(Utf8String value, StringComparison comparisonType, out Range range);

    /*
     * SLICING
     *
     * All slicing operations uphold the "well-formed data" invariant and
     * validate that creating the new substring instance will not split a
     * multi-byte UTF-8 subsequence. This check is O(1).
     */

    public Utf8String this[Range range] { get; }

    public (Utf8String Before, Utf8String? After) SplitOn(char separator);
    public (Utf8String Before, Utf8String? After) SplitOn(char separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOn(Rune separator);
    public (Utf8String Before, Utf8String? After) SplitOn(Rune separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOn(Utf8String separator);
    public (Utf8String Before, Utf8String? After) SplitOn(Utf8String separator, StringComparison comparisonType);

    public (Utf8String Before, Utf8String? After) SplitOnLast(char separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(char separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Rune separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Rune separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Utf8String separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Utf8String separator, StringComparison comparisonType);

    /*
     * INSPECTION & MANIPULATION
     */

    // some number of overloads to help avoid allocation in the common case
    public static Utf8String Concat<T>(params IEnumerable<T> values);
    public static Utf8String Concat<T0, T1>(T0 value0, T1 value1);
    public static Utf8String Concat<T0, T1, T2>(T0 value0, T1 value1, T2 value2);

    public bool IsAscii();

    public bool IsNormalized(NormalizationForm normalizationForm = NormalizationForm.FormC);

    public static Utf8String Join<T>(char separator, params IEnumerable<T> values);
    public static Utf8String Join<T>(Rune separator, params IEnumerable<T> values);
    public static Utf8String Join<T>(Utf8String? separator, params IEnumerable<T> values);

    public Utf8String Normalize(NormalizationForm normalizationForm = NormalizationForm.FormC);

    // Do we also need Insert, Remove, etc.?

    public Utf8String Replace(char oldChar, char newChar); // Ordinal
    public Utf8String Replace(char oldChar, char newChar, StringComparison comparison);
    public Utf8String Replace(char oldChar, char newChar, bool ignoreCase, CultureInfo culture);
    public Utf8String Replace(Rune oldRune, Rune newRune); // Ordinal
    public Utf8String Replace(Rune oldRune, Rune newRune, StringComparison comparison);
    public Utf8String Replace(Rune oldRune, Rune newRune, bool ignoreCase, CultureInfo culture);
    public Utf8String Replace(Utf8String oldText, Utf8String newText); // Ordinal
    public Utf8String Replace(Utf8String oldText, Utf8String newText, StringComparison comparison);
    public Utf8String Replace(Utf8String oldText, Utf8String newText, bool ignoreCase, CultureInfo culture);

    public Utf8String ToLower(CultureInfo culture);
    public Utf8String ToLowerInvariant();

    public Utf8String ToUpper(CultureInfo culture);
    public Utf8String ToUpperInvariant();

    // The Trim* APIs only trim whitespace for now. When we figure out how to trim
    // additional data we can add the appropriate overloads.

    public Utf8String Trim();
    public Utf8String TrimStart();
    public Utf8String TrimEnd();

    /*
     * PROJECTING
     */

    public ReadOnlySpan<byte> AsBytes(); // perhaps an extension method instead?
    public static explicit operator ReadOnlySpan<byte>(Utf8String? value);
    public static implicit operator Utf8Span(Utf8String? value);

    /*
     * MISCELLANEOUS
     */
    
    public override int GetHashCode(); // Ordinal
    public int GetHashCode(StringComparison comparisonType);

    // Used for pinning and passing to p/invoke. If the input Utf8String
    // instance is empty, returns a reference to the null terminator.

    [EditorBrowsable(EditorBrowsableState.Never)]
    public ref readonly byte GetPinnableReference();

    public static bool IsNullOrEmpty(Utf8String? value);
    public static bool IsNullOrWhiteSpace(Utf8String? value);

    public override string ToString(); // transcode to UTF-16

    /*
     * SERIALIZATION
     * (Throws an exception on deserialization if data is invalid.)
     */
    
    // Could also use an IObjectReference if we didn't want to implement the deserialization ctor.
    private Utf8String(SerializationInfo info, StreamingContext context);
    void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context);

    /*
     * HELPER NESTED STRUCTS
     */

    public readonly struct ByteEnumerable : IEnumerable<byte> { /* ... */ }
    public readonly struct CharEnumerable : IEnumerable<char> { /* ... */ }
    public readonly struct RuneEnumerable : IEnumerable<Rune> { /* ... */ }
}

public static class MemoryExtensions
{
    public static ReadOnlyMemory<byte> AsMemory(Utf8String value);
    public static ReadOnlyMemory<byte> AsMemory(Utf8String value, int offset);
    public static ReadOnlyMemory<byte> AsMemory(Utf8String value, int offset, int count);
}
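
To make the distinction between the Bytes, Chars, and Runes projections concrete, here is a minimal sketch that counts each unit for the same text using existing BCL APIs (Encoding.UTF8 and System.Text.Rune). It does not use the proposed enumerable types; CountRunes is a hypothetical helper written for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper: counts Unicode scalar values in a UTF-8 buffer
// using the existing Rune API.
static int CountRunes(ReadOnlySpan<byte> utf8)
{
    int count = 0;
    while (!utf8.IsEmpty)
    {
        // DecodeFromUtf8 consumes at least one byte of non-empty input and
        // yields U+FFFD for an ill-formed sequence, so the loop always advances.
        Rune.DecodeFromUtf8(utf8, out _, out int bytesConsumed);
        utf8 = utf8.Slice(bytesConsumed);
        count++;
    }
    return count;
}

string text = "a\u20AC\U0001F4A3"; // 'a', EURO SIGN, BOMB
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

Console.WriteLine(utf8Bytes.Length);      // 8 UTF-8 code units (1 + 3 + 4)
Console.WriteLine(text.Length);           // 4 UTF-16 code units (1 + 1 + 2)
Console.WriteLine(CountRunes(utf8Bytes)); // 3 Unicode scalar values
```

The three counts differ for any text outside ASCII, which is one reason an integer indexer on a UTF-8 string type would be ambiguous.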

Non-allocating types

While Utf8String is an allocating, heap-based, null-terminated type, there are scenarios where a developer may want to represent a segment (or "slice") of UTF-8 data from an existing buffer without incurring an allocation.

The Utf8Segment (alternative name: Utf8Memory) and Utf8Span types can be used for this purpose. They represent a view into UTF-8 data, with the following guarantees:

  • They are immutable views into immutable data.
  • They are guaranteed well-formed UTF-8 data. (Tearing will be covered shortly.)

These types have Utf8String-like methods hanging off of them as instance methods where appropriate. Additionally, they can be projected as ReadOnlyMemory<byte> (ROM<byte>) and ReadOnlySpan<byte> (ROS<byte>) for developers who want to deal with the data at the raw binary level or who want to call existing extension methods on the ROM and ROS types.

Since Utf8Segment and Utf8Span are standalone types distinct from ROM and ROS, they can have behaviors that developers have come to expect from string-like types. For example, Utf8Segment (unlike ROM<char> or ROM<byte>) can be used as a key in a dictionary without jumping through hoops:

Dictionary<Utf8Segment, int> dict = ...;

Utf8String theString = u"hello world";
Utf8Segment segment = theString.AsMemory(0, 5); // u"hello"

if (dict.TryGetValue(segment, out int value))
{
    Console.WriteLine(value);
}
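
For contrast, the sketch below shows the "hoops" needed today when ReadOnlyMemory<byte> is used as a dictionary key: its default equality compares the backing object, offset, and length rather than the contents. Only existing BCL APIs are used; note the assumption that EqualityComparer<T>.Create is available, which requires .NET 8 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

ReadOnlyMemory<byte> key1 = Encoding.UTF8.GetBytes("hello");
ReadOnlyMemory<byte> key2 = Encoding.UTF8.GetBytes("hello"); // same contents, different array

// Default equality: two views are equal only if they cover the same region
// of the same backing object, so a lookup by content fails.
var byIdentity = new Dictionary<ReadOnlyMemory<byte>, int> { [key1] = 1 };
bool foundByIdentity = byIdentity.ContainsKey(key2);

// The workaround: a content-based comparer supplied by the caller.
var contentComparer = EqualityComparer<ReadOnlyMemory<byte>>.Create(
    (x, y) => x.Span.SequenceEqual(y.Span),
    m =>
    {
        var hash = new HashCode();
        hash.AddBytes(m.Span);
        return hash.ToHashCode();
    });

var byContent = new Dictionary<ReadOnlyMemory<byte>, int>(contentComparer) { [key1] = 1 };
bool foundByContent = byContent.ContainsKey(key2);

Console.WriteLine(foundByIdentity); // False
Console.WriteLine(foundByContent);  // True
```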

Utf8Span instances can be compared against each other:

Utf8Span data1 = ...;
Utf8Span data2 = ...;

int hashCode = data1.GetHashCode(); // Marvin32 hash

if (data1 == data2) { /* ordinal comparison of contents */ }

An alternative design that was considered was to introduce a type Char8 that would represent an 8-bit code unit - it would serve as the elemental type of Utf8String and its slices. However, ReadOnlyMemory<Char8> and ReadOnlySpan<Char8> were a bit unwieldy for a few reasons.

First, there was confusion as to what ROS<Char8> actually meant when the developer could use ROS<byte> for everything. Was ROS<Char8> actually providing guarantees that ROS<byte> couldn't? (No.) When would I ever want to use a lone Char8 by itself rather than as part of a larger sequence? (You probably wouldn't.)

Second, it introduced a complication that if you had a ROM<Char8>, it couldn't be converted to a ROM<byte>. This impacted the ability to perform text manipulation and then act on the data in a binary fashion, such as sending it across the network.

Creating segment types

Segment types can be created safely from Utf8String backing objects. As mentioned earlier, we enforce that data in the UTF-8 segment types is well-formed. This implies that an instance of a segment type cannot represent data that has been sliced in the middle of a multibyte boundary. Calls to slicing APIs will throw an exception if the caller tries to slice the data in such a manner.
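
One way such a boundary check can run in constant time is sketched below, assuming the buffer is already known to be well-formed: a slice offset is acceptable if it is at either end of the buffer or if the byte at that offset is not a UTF-8 continuation byte. The helper name IsSliceBoundary is hypothetical; the idea is similar in spirit to Rust's str::is_char_boundary and is not necessarily how the proposed types would implement it.

```csharp
using System;
using System.Text;

// Hypothetical helper. Assumes 'utf8' is already known to be well-formed.
static bool IsSliceBoundary(ReadOnlySpan<byte> utf8, int index)
{
    if ((uint)index > (uint)utf8.Length)
    {
        throw new ArgumentOutOfRangeException(nameof(index));
    }

    // The two ends are always valid. Elsewhere, the byte at 'index' must not
    // be a continuation byte (bit pattern 10xxxxxx).
    return index == 0 || index == utf8.Length || (utf8[index] & 0xC0) != 0x80;
}

byte[] data = Encoding.UTF8.GetBytes("a\u20ACb"); // [ 61 E2 82 AC 62 ]

for (int i = 0; i <= data.Length; i++)
{
    Console.WriteLine($"{i}: {IsSliceBoundary(data, i)}");
}
// Offsets 0, 1, 4, and 5 are boundaries; 2 and 3 fall inside the three-byte sequence.
```

Because the check inspects at most one byte per slice endpoint, its cost does not depend on the length of the data.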

The Utf8Segment type introduces additional complexity in that it could be torn in a multi-threaded application, and that tearing may invalidate the well-formedness assumption by causing the torn segment to begin or end in the middle of a multi-byte UTF-8 subsequence. To resolve this issue, any instance method on Utf8Segment (including its projection to ROM<byte>) must first validate that the instance has not been torn. If the instance has been torn, an exception is thrown. This check has O(1) algorithmic complexity.

It is possible that the developer will want to create a Utf8Segment or Utf8Span instance from an existing buffer (such as a pooled buffer). There are zero-cost APIs to allow this to be done; however, they are unsafe because they easily allow the developer to violate invariants held by these types.

If the developer wishes to call the unsafe factories, they must ensure that the following three invariants hold.

  1. The provided buffer (ROM<byte> or ROS<byte>) remains "alive" and immutable for the duration of the Utf8Segment or Utf8Span's existence. Whichever component receives a Utf8Segment or Utf8Span - however the instance has been created - must never observe that the underlying contents change or that dereferencing the contents might result in an AV or other undefined behavior.

  2. The provided buffer contains only well-formed UTF-8 data, and the boundaries of the buffer do not split a multibyte UTF-8 sequence.

  3. For Utf8Segment in particular, the caller must not create a Utf8Segment instance wrapped around a ROM<byte> in circumstances where the component which receives the newly created Utf8Segment might tear it. The reason for this is that the "check that the Utf8Segment instance was not torn across a multi-byte subsequence" protection is only reliable when the Utf8Segment instance is backed by a Utf8String. The Utf8Segment type makes a best effort to offer protection for other backing buffers, but this protection is not ironclad in those scenarios. This could lead to a violation of invariant (2) immediately above.

The type design here - including the constraints placed on segment types and the elimination of the Char8 type - also draws inspiration from the Go, Swift, and Rust communities.

public readonly ref struct Utf8Span
{
    public Utf8Span(Utf8String? value);

    // This "Unsafe" ctor wraps a Utf8Span around an arbitrary span. It is non-copying.
    // The caller must uphold Utf8Span's invariants: that it's immutable and well-formed
    // for the lifetime that any component might be consuming the Utf8Span instance.
    // Consumers (and Utf8Span's own internal APIs) rely on this invariant, and
    // violating it could lead to undefined behavior at runtime.

    [RequiresUnsafe]
    public static Utf8Span UnsafeCreateWithoutValidation(ReadOnlySpan<byte> buffer);

    // The equality operators and GetHashCode() operate on the underlying buffers.
    // Two Utf8Span instances containing the same data will return equal and have
    // the same hash code, even if they're referencing different memory addresses.

    [EditorBrowsable(EditorBrowsableState.Never)]
    [Obsolete("Equals(object) on Utf8Span will always throw an exception. Use Equals(Utf8Span) or == instead.")]
    public override bool Equals(object? obj);
    public bool Equals(Utf8Span other);
    public bool Equals(Utf8Span other, StringComparison comparison);
    public static bool Equals(Utf8Span left, Utf8Span right);
    public static bool Equals(Utf8Span left, Utf8Span right, StringComparison comparison);
    public override int GetHashCode();
    public int GetHashCode(StringComparison comparison);
    public static bool operator !=(Utf8Span left, Utf8Span right);
    public static bool operator ==(Utf8Span left, Utf8Span right);

    public bool IsEmpty { get; }

    // Unlike Utf8String.GetPinnableReference, Utf8Span.GetPinnableReference returns
    // a null reference if the span is zero-length. This is because we're not guaranteed
    // that the backing data has a null terminator at the end, so we don't know whether
    // it's safe to dereference the element just past the end of the span.

    [EditorBrowsable(EditorBrowsableState.Never)]
    public ref readonly byte GetPinnableReference();

    // For the most part, Utf8Span's remaining APIs mirror APIs already on Utf8String.
    // There are some exceptions: methods like ToUpperInvariant have a non-allocating
    // equivalent that allows the caller to specify the buffer which should
    // contain the result of the operation. Like Utf8String, all APIs are assumed
    // Ordinal unless the API takes a parameter which provides otherwise.

    public static Utf8Span Empty { get; }

    public ReadOnlySpan<byte> Bytes { get; } // returns ROS<byte>, not custom enumerable
    public CharEnumerable Chars { get; }
    public RuneEnumerable Runes { get; }

    // Also allow iterating over extended grapheme clusters (not yet ready).
    // public GraphemeClusterEnumerable GraphemeClusters { get; }

    public int CompareTo(Utf8Span other);
    public int CompareTo(Utf8Span other, StringComparison comparison);

    public bool Contains(char value);
    public bool Contains(char value, StringComparison comparison);
    public bool Contains(Rune value);
    public bool Contains(Rune value, StringComparison comparison);
    public bool Contains(Utf8Span value);
    public bool Contains(Utf8Span value, StringComparison comparison);

    public bool EndsWith(char value);
    public bool EndsWith(char value, StringComparison comparison);
    public bool EndsWith(Rune value);
    public bool EndsWith(Rune value, StringComparison comparison);
    public bool EndsWith(Utf8Span value);
    public bool EndsWith(Utf8Span value, StringComparison comparison);

    public bool IsAscii();

    public bool IsEmptyOrWhiteSpace();

    public bool IsNormalized(NormalizationForm normalizationForm = NormalizationForm.FormC);

    public Utf8String Normalize(NormalizationForm normalizationForm = NormalizationForm.FormC);
    public int Normalize(Span<byte> destination, NormalizationForm normalizationForm = NormalizationForm.FormC);

    public Utf8Span this[Range range] { get; }

    public SplitResult SplitOn(char separator);
    public SplitResult SplitOn(char separator, StringComparison comparisonType);
    public SplitResult SplitOn(Rune separator);
    public SplitResult SplitOn(Rune separator, StringComparison comparisonType);
    public SplitResult SplitOn(Utf8String separator);
    public SplitResult SplitOn(Utf8String separator, StringComparison comparisonType);

    public SplitResult SplitOnLast(char separator);
    public SplitResult SplitOnLast(char separator, StringComparison comparisonType);
    public SplitResult SplitOnLast(Rune separator);
    public SplitResult SplitOnLast(Rune separator, StringComparison comparisonType);
    public SplitResult SplitOnLast(Utf8String separator);
    public SplitResult SplitOnLast(Utf8String separator, StringComparison comparisonType);

    public bool StartsWith(char value);
    public bool StartsWith(char value, StringComparison comparison);
    public bool StartsWith(Rune value);
    public bool StartsWith(Rune value, StringComparison comparison);
    public bool StartsWith(Utf8Span value);
    public bool StartsWith(Utf8Span value, StringComparison comparison);

    public int ToChars(Span<char> destination);

    public Utf8String ToLower(CultureInfo culture);
    public int ToLower(Span<byte> destination, CultureInfo culture);

    public Utf8String ToLowerInvariant();
    public int ToLowerInvariant(Span<byte> destination);

    public override string ToString();

    public Utf8String ToUpper(CultureInfo culture);
    public int ToUpper(Span<byte> destination, CultureInfo culture);

    public Utf8String ToUpperInvariant();
    public int ToUpperInvariant(Span<byte> destination);

    public Utf8String ToUtf8String();

    // Should we also have Trim* overloads that return a range instead
    // of the span directly? Does this actually enable any new scenarios?

    public Utf8Span Trim();
    public Utf8Span TrimStart();
    public Utf8Span TrimEnd();

    public bool TryFind(char value, out Range range);
    public bool TryFind(char value, StringComparison comparisonType, out Range range);
    public bool TryFind(Rune value, out Range range);
    public bool TryFind(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFind(Utf8Span value, out Range range);
    public bool TryFind(Utf8Span value, StringComparison comparisonType, out Range range);

    public bool TryFindLast(char value, out Range range);
    public bool TryFindLast(char value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Rune value, out Range range);
    public bool TryFindLast(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Utf8Span value, out Range range);
    public bool TryFindLast(Utf8Span value, StringComparison comparisonType, out Range range);

    /*
     * HELPER NESTED STRUCTS
     */

    public readonly ref struct CharEnumerable { /* pattern match for 'foreach' */ }
    public readonly ref struct RuneEnumerable { /* pattern match for 'foreach' */ }

    public readonly ref struct SplitResult
    {
        private SplitResult();

        [EditorBrowsable(EditorBrowsableState.Never)]
        public void Deconstruct(out Utf8Span before, out Utf8Span after);
    }
}

public readonly struct Utf8Segment : IComparable<Utf8Segment>, IEquatable<Utf8Segment>
{
    private readonly ReadOnlyMemory<byte> _data;

    public Utf8Span Span { get; }

    // Not all span-based APIs are present. APIs on Utf8Span that would
    // return a new Utf8Span (such as Trim) should be present here, but
    // other APIs that return bool / int (like Contains, StartsWith)
    // should only be present on the Span type to discourage heavy use
    // of APIs hanging directly off of this type.

    public override bool Equals(object? other); // ok to call
    public bool Equals(Utf8Segment other); // defaults to Ordinal
    public bool Equals(Utf8Segment other, StringComparison comparison);

    public override int GetHashCode(); // Ordinal
    public int GetHashCode(StringComparison comparison);

    // Caller is responsible for ensuring:
    // - Input buffer contains well-formed UTF-8 data.
    // - Input buffer is immutable and accessible for the lifetime of this Utf8Segment instance.
    public static Utf8Segment UnsafeCreateWithoutValidation(ReadOnlyMemory<byte> data);
}

Supporting types

Like StringComparer, there's also a Utf8StringComparer which can be passed into the Dictionary<,> and HashSet<> constructors. This Utf8StringComparer also implements IEqualityComparer<Utf8Segment>, which allows using Utf8Segment instances directly as the keys inside dictionaries and other collection types.

The Dictionary<,> class is also being enlightened to understand that these types have both non-randomized and randomized hash code calculation routines. This allows dictionaries instantiated with TKey = Utf8String or TKey = Utf8Segment to enjoy the same performance optimizations as dictionaries instantiated with TKey = string.

Finally, the Utf8StringComparer type has convenience methods to compare Utf8Span instances against one another. This will make it easier to compare texts using specific cultures, even if that specific culture is not the current thread's active culture.

public abstract class Utf8StringComparer : IComparer<Utf8Segment>, IComparer<Utf8String?>, IEqualityComparer<Utf8Segment>, IEqualityComparer<Utf8String?>
{
    private Utf8StringComparer(); // all implementations are internal

    public static Utf8StringComparer CurrentCulture { get; }
    public static Utf8StringComparer CurrentCultureIgnoreCase { get; }
    public static Utf8StringComparer InvariantCulture { get; }
    public static Utf8StringComparer InvariantCultureIgnoreCase { get; }
    public static Utf8StringComparer Ordinal { get; }
    public static Utf8StringComparer OrdinalIgnoreCase { get; }

    public static Utf8StringComparer Create(CultureInfo culture, bool ignoreCase);
    public static Utf8StringComparer Create(CultureInfo culture, CompareOptions options);
    public static Utf8StringComparer FromComparison(StringComparison comparisonType);

    public abstract int Compare(Utf8Segment x, Utf8Segment y);
    public abstract int Compare(Utf8String? x, Utf8String? y);
    public abstract int Compare(Utf8Span x, Utf8Span y);
    public abstract bool Equals(Utf8Segment x, Utf8Segment y);
    public abstract bool Equals(Utf8String? x, Utf8String? y);
    public abstract bool Equals(Utf8Span x, Utf8Span y);
    public abstract int GetHashCode(Utf8Segment obj);
    public abstract int GetHashCode(Utf8String obj);
    public abstract int GetHashCode(Utf8Span obj);
}

Manipulating UTF-8 data

CoreFX and Azure scenarios

  • What exchange types do we use when passing around UTF-8 data into and out of Framework APIs?

  • How do we generate UTF-8 data in a low-allocation manner?

  • How do we apply a series of transformations to UTF-8 data in a low-allocation manner?

    • Leave everything as Span<byte>, use a special Utf8StringBuilder type, or something else?

    • Do we need to support UTF-8 string interpolation?

    • If we have builders, who is ultimately responsible for lifetime management?

    • Perhaps we should look at ValueStringBuilder for inspiration.

    • A MutableUtf8Buffer type would be promising, but we'd need to be able to generate Utf8Span slices from it, and if the buffer is being modified continually the spans could end up holding invalid data. Example below:

      MutableUtf8Buffer buffer = GetBuffer();
      Utf8Span theSpan = buffer[0..1];
      
      buffer.InsertAt(0, utf8("💣")); // U+1F4A3 ([ F0 9F 92 A3 ])
      
      // 'theSpan' now contains only the first byte ([ F0 ]).
      // Trying to use it could corrupt the application.
      //
      // Any such mutable UTF-8 type would necessarily be unsafe. This
      // also matches Rust's semantics: direct byte manipulation can only
      // take place within an unsafe context.
      // See:
      // * https://doc.rust-lang.org/std/string/struct.String.html#method.as_mut_vec
      // * https://doc.rust-lang.org/std/primitive.str.html#method.as_bytes_mut
  • Some folks will want to perform operations in-place.
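The stale-slice hazard in the MutableUtf8Buffer example above can be reproduced with types that exist today. The sketch below is illustrative only: it substitutes a List<byte> for the hypothetical MutableUtf8Buffer and a Range for the Utf8Span slice (StaleSliceDemo is a made-up name), then uses Rune.DecodeFromUtf8 to show that the bytes at the old range no longer form a complete UTF-8 sequence after the insertion.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

public static class StaleSliceDemo
{
    // Returns the decode status of the bytes at range [0..1) after a
    // multi-byte scalar value has been inserted at the front of the buffer.
    public static OperationStatus DecodeStaleSlice()
    {
        // Stand-in for the hypothetical MutableUtf8Buffer.
        var buffer = new List<byte>(Encoding.UTF8.GetBytes("A"));

        // Stand-in for 'Utf8Span theSpan = buffer[0..1]'. At this point it covers "A".
        Range slice = 0..1;

        // Insert U+1F4A3 ([ F0 9F 92 A3 ]) at the front of the buffer.
        buffer.InsertRange(0, Encoding.UTF8.GetBytes("\U0001F4A3"));

        // The same range now covers only the lead byte [ F0 ].
        ReadOnlySpan<byte> stale = buffer.ToArray().AsSpan(slice);
        return Rune.DecodeFromUtf8(stale, out _, out _);
    }
}
```

The returned status is OperationStatus.NeedMoreData rather than Done, which is the kind of partial sequence a Utf8Span is supposed to guarantee it never contains.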

Sample operations on arbitrary buffers

(Devs may want to perform these operations on arbitrary byte buffers, even if those buffers aren't guaranteed to contain valid UTF-8 data.)

  • Validate that buffer contains well-formed UTF-8 data.

  • Convert ASCII data to upper / lower in-place, leaving all non-ASCII data untouched.

  • Split on byte patterns. (Probably shouldn't split on runes or UTF-8 string data, since we can't guarantee data is well-formed UTF-8.)

These operations could be on the newly-introduced System.Text.Unicode.Utf8 static class. They would take ROS<byte> and Span<byte> as input parameters because they can operate on arbitrary byte buffers. Their runtime performance would be subpar compared to similar methods on Utf8String, Utf8Span, or other types where we can guarantee that no invalid data will be seen, as the APIs which operate on raw byte buffers would need to be defensive and would probably operate over the input in an iterative fashion rather than in bulk. One potential behavior could be skipping over invalid data and leaving it unchanged as part of the operation.
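As a rough illustration of the first two operations listed above, the following sketch works directly on raw byte buffers. The type and method names (RawUtf8Helpers, IsWellFormedUtf8, ToUpperAsciiInPlace) are hypothetical and are not part of the proposed System.Text.Unicode.Utf8 surface; the code relies only on the existing Rune.DecodeFromUtf8 API. It is deliberately iterative and defensive, which is the performance trade-off described in the paragraph above.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class RawUtf8Helpers
{
    // Walks the buffer one scalar value at a time and reports whether
    // every sequence decodes cleanly.
    public static bool IsWellFormedUtf8(ReadOnlySpan<byte> data)
    {
        while (!data.IsEmpty)
        {
            if (Rune.DecodeFromUtf8(data, out _, out int bytesConsumed) != OperationStatus.Done)
            {
                return false; // ill-formed or truncated sequence
            }
            data = data.Slice(bytesConsumed);
        }
        return true;
    }

    // Converts ASCII 'a'..'z' to 'A'..'Z' in place. Every byte >= 0x80
    // (including invalid data) is left untouched, so the operation is
    // safe on buffers that are not guaranteed to be valid UTF-8.
    public static void ToUpperAsciiInPlace(Span<byte> data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            if (b >= (byte)'a' && b <= (byte)'z')
            {
                data[i] = (byte)(b - 0x20);
            }
        }
    }
}
```

Because UTF-8 never reuses bytes below 0x80 inside a multi-byte sequence, the ASCII-only conversion cannot turn well-formed input into ill-formed output.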

Sample Utf8StringBuilder implementation for private use

internal ref struct Utf8StringBuilder
{
    public void Append<T>(T value) where T : IUtf8Formattable;
    public void Append<T>(T value, string format, CultureInfo culture) where T : IUtf8Formattable;

    public void Append(Utf8String value);
    public void Append(Utf8Segment value);
    public void Append(Utf8Span value);

    // Some other Append methods, resize methods, etc.
    // Methods to query the length.

    public Utf8String ToUtf8String();

    public void Dispose(); // when done with the instance
}

// Would be implemented by numeric types (int, etc.),
// DateTime, String, Utf8String, Guid, other primitives,
// Uri, and anything else we might want to throw into
// interpolated data.
internal interface IUtf8Formattable
{
    void Append(ref Utf8StringBuilder builder);
    void Append(ref Utf8StringBuilder builder, string format, CultureInfo culture);
}
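To make the lifetime question more concrete, here is a minimal sketch of how such a builder could sit on top of ArrayPool<byte>. It is not the proposed implementation: PooledUtf8Builder is a hypothetical name, it returns a byte[] because Utf8String is not available, and it formats only int and UTF-16 text rather than going through IUtf8Formattable. The point is only to show the rent / grow / return pattern that makes Dispose necessary.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

// Hypothetical stand-in for the proposed builder. Must be created through
// the constructor; a default(PooledUtf8Builder) instance is not usable.
public ref struct PooledUtf8Builder
{
    private byte[] _buffer;
    private int _length;

    public PooledUtf8Builder(int initialCapacity)
    {
        _buffer = ArrayPool<byte>.Shared.Rent(Math.Max(initialCapacity, 16));
        _length = 0;
    }

    public int Length => _length;

    // Transcodes UTF-16 input directly into the pooled buffer.
    public void Append(ReadOnlySpan<char> utf16)
    {
        EnsureCapacity(Encoding.UTF8.GetByteCount(utf16));
        _length += Encoding.UTF8.GetBytes(utf16, _buffer.AsSpan(_length));
    }

    // Formats the number as UTF-8 without an intermediate string.
    public void Append(int value)
    {
        EnsureCapacity(11); // "-2147483648" needs 11 bytes
        if (!Utf8Formatter.TryFormat(value, _buffer.AsSpan(_length), out int written))
        {
            throw new InvalidOperationException("Destination too small.");
        }
        _length += written;
    }

    // The single allocation for the final result.
    public byte[] ToArray() => _buffer.AsSpan(0, _length).ToArray();

    // Returns the rented array to the pool; the caller owns this call.
    public void Dispose()
    {
        byte[] toReturn = _buffer;
        _buffer = Array.Empty<byte>();
        _length = 0;
        if (toReturn.Length > 0) { ArrayPool<byte>.Shared.Return(toReturn); }
    }

    private void EnsureCapacity(int additional)
    {
        if (_buffer.Length - _length >= additional) { return; }
        byte[] bigger = ArrayPool<byte>.Shared.Rent(Math.Max(_buffer.Length * 2, _length + additional));
        _buffer.AsSpan(0, _length).CopyTo(bigger);
        if (_buffer.Length > 0) { ArrayPool<byte>.Shared.Return(_buffer); }
        _buffer = bigger;
    }
}
```

In this sketch the code that creates the builder is responsible for disposing it, which is one possible answer to the lifetime-management question raised earlier.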

Code samples and metadata representation

The C# compiler could detect support for UTF-8 strings by looking for the existence of the System.Utf8String type and the appropriate helper APIs on RuntimeHelpers as called out in the samples below. If these APIs don't exist, then the target framework does not support the concept of UTF-8 strings.

Literals

Literal UTF-8 strings would appear as regular strings in source code, but would be prefixed by a u as demonstrated below. The u prefix would denote that the return type of this literal string expression should be Utf8String instead of string.

Utf8String myUtf8String = u"A literal string!";
// Normal ldstr to literal UTF-16 string in PE string table, followed by
// call to helper method which translates this to a UTF-8 string literal.
// The end result of these calls is that a Utf8String instance sits atop
// the stack.

ldstr "A literal string!"
call class System.Utf8String System.Runtime.CompilerServices.RuntimeHelpers.InitializeUtf8StringLiteral(string)

The u prefix would also be combinable with the @ prefix and the $ prefix (more on this below).

Additionally, literal UTF-8 strings must be well-formed Unicode strings.

// Below line would be a compile-time error since it contains ill-formed Unicode data.
Utf8String myUtf8String = u"A malformed \ud800 literal string!";

Three alternative designs were considered. One was to use RVA statics (through ldsflda) instead of literal UTF-16 strings (through ldstr) before calling a "load from RVA" method on RuntimeHelpers. The overhead of using RVA statics is somewhat greater than the overhead of using the normal UTF-16 string table, so the normal UTF-16 string literal table should still be the more optimized case for small-ish strings, which we believe to be the common case.

Another alternative considered was to introduce a new opcode ldstr.utf8, which would act as a UTF-8 equivalent to the normal ldstr opcode. This would be a breaking change to the .NET tooling ecosystem, and the ultimate decision was that there would be too much pain to the ecosystem to justify the benefit.

The third alternative considered was to smuggle UTF-8 data in through a normal UTF-16 string in the string table, then call a RuntimeHelpers method to reinterpret the contents. This would result in a "garbled" string for anybody looking at the raw IL. While that in itself isn't terrible, there is the possibility that smuggling UTF-8 data in this manner could result in a literal string which has ill-formed UTF-16 data. Not all .NET tooling is resilient to this. For example, xunit's test runner produces failures if it sees attributes initialized from literal strings containing ill-formed UTF-16 data. There is a risk that other tooling would behave similarly, potentially modifying the DLL in such a manner that errors only manifest themselves at runtime. This could result in difficult-to-diagnose bugs.

We may wish to reconsider this decision in the future. For example, if we see that it is common for developers to use large UTF-8 literal strings, maybe we'd want to dynamically switch to using RVA statics for such strings. This would lower the resulting DLL size. However, this would add extra complexity to the compilation process, so we'd want to tread lightly here.

Constant handling

class MyClass
{
    public const Utf8String MyConst = u"A const string!";
}
// Literal field initialized to literal UTF-16 value. The runtime doesn't care about
// this (modulo FieldInfo.GetRawConstantValue, which perhaps we could fix up), so
// only the C# compiler would need to know that this is a UTF-8 constant and that
// references to it should get the same (ldstr, call) treatment as stated above.

.field public static literal class System.Utf8String MyConst = "A const string!";

String concatenation

There would be APIs on Utf8String which mirror the string.Concat APIs. The compiler should special-case the + operator to call the appropriate n-ary overload of Concat.

Utf8String a = ...;
Utf8String b = ...;

Utf8String c = a + u", " + b; // calls Utf8String.Concat(...)

Since we expect use of Utf8String to be "deliberate" when compared to string (see the beginning of this document), we should consider that a developer who is using UTF-8 wants to stay in UTF-8 during concatenation operations. This means that if there's a line which involves the concatenation of both a Utf8String and a string, the final type post-concatenation should be Utf8String.

Utf8String a = ...;
string b = ...;

Utf8String concatFoo = a + b;
string concatBar = (object)a + b; // compiler can't statically determine that any argument is Utf8String

This is still open for discussion, as the behavior may be surprising to people. Another alternative is to produce a build warning if somebody tries to mix-and-match UTF-8 strings and UTF-16 strings in a single concatenation expression.

If string interpolation is added in the future, this shouldn't result in ambiguity. The $ interpolation operator will be applied to a literal Utf8String or a literal string, and that would dictate the overall return type of the operation.

Equality comparisons

There are standard == and != operators defined on the Utf8String class.

public static bool operator ==(Utf8String a, Utf8String b);
public static bool operator !=(Utf8String a, Utf8String b);

The C# compiler should special-case when either side of an equality expression is known to be a literal null object, and if so the compiler should emit a referential check against the null object instead of calling the operator method. This matches the if (myString == null) behavior that the string type enjoys today.

Additionally, equality / inequality comparisons between Utf8String and string should produce compiler warnings, as they can only succeed when both operands are null.

Utf8String a = ...;
string b = ...;

// Below line should produce a warning since it will end up being the equivalent
// of Object.ReferenceEquals, which will only succeed if both arguments are null.
// This probably wasn't what the developer intended to check.

if (a == b) { /* ... */ }

I attempted to define operator ==(Utf8String a, string b) so that I could slap [Obsolete] on it and generate the appropriate warning, but this had the side effect of disallowing the user to write the code if (myUtf8String == null) since the compiler couldn't figure out which overload of operator == to call. This was also one of the reasons I had opened dotnet/csharplang#2340.

Marshaling behaviors

Like the string type, the Utf8String type shall be marshalable across p/invoke boundaries. The corresponding unmanaged type shall be LPCUTF8 (equivalent to a BYTE* pointing to null-terminated UTF-8 data) unless a different unmanaged type is specified in the p/invoke signature.

If a different [MarshalAs] representation is specified, the stub routine creates a temporary copy in the desired representation, performs the p/invoke, then destroys the temporary copy or allows the GC to reclaim the temporary copy.

class NativeMethods
{
    [DllImport]
    public static extern int MyPInvokeMethod(
        [In] Utf8String marshaledAsLPCUTF8,
        [In, MarshalAs(UnmanagedType.LPUTF8Str)] Utf8String alsoMarshaledAsLPCUTF8,
        [In, MarshalAs(UnmanagedType.LPWStr)] Utf8String marshaledAsLPCWSTR,
        [In, MarshalAs(UnmanagedType.BStr)] Utf8String marshaledAsBSTR);
}

If a Utf8String must be marshaled from native-to-managed (e.g., a reverse p/invoke takes place on a delegate which has a Utf8String parameter), the stub routine is responsible for fixing up invalid UTF-8 data before creating the Utf8String instance (or it may let the Utf8String constructor perform the fixup automatically).

Unmanaged routines must not modify the contents of any Utf8String instance marshaled across the p/invoke boundary. Utf8String instances are assumed to be immutable once created, and violating this assumption could cause undefined behaviors within the runtime.

There is no default marshaling behavior for Utf8Segment or Utf8Span since they are not guaranteed to be null-terminated. If in the future the runtime allows marshaling {ReadOnly}Span<T> across a p/invoke boundary (presumably as a non-null-terminated array equivalent), library authors may fetch the underlying ReadOnlySpan<byte> from the Utf8Segment or Utf8Span instance and directly marshal that span across the p/invoke boundary.
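The "temporary null-terminated copy" idea can be approximated with marshaling helpers that already exist for string. The sketch below is not the stub the runtime would generate for Utf8String; it only shows, using Marshal.StringToCoTaskMemUTF8 and Marshal.PtrToStringUTF8, what a null-terminated UTF-8 buffer handed to native code looks like and who frees it. Utf8MarshalDemo and RoundTrip are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8MarshalDemo
{
    // Copies 'value' into unmanaged memory as null-terminated UTF-8,
    // reads it back, and reports the byte found just past the payload.
    public static string RoundTrip(string value, out byte terminator)
    {
        IntPtr native = Marshal.StringToCoTaskMemUTF8(value);
        try
        {
            int byteCount = Encoding.UTF8.GetByteCount(value);

            // The byte immediately after the UTF-8 payload is the terminator.
            terminator = Marshal.ReadByte(native, byteCount);

            // A native callee would receive 'native' as its BYTE* argument.
            return Marshal.PtrToStringUTF8(native) ?? string.Empty;
        }
        finally
        {
            // The temporary copy is destroyed once the call completes.
            Marshal.FreeCoTaskMem(native);
        }
    }
}
```

A Utf8String that is already stored as null-terminated UTF-8 could presumably skip this copy in the default LPCUTF8 case, which is part of the appeal of the type for interop.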

Automatic coercion of UTF-16 literals to UTF-8 literals

If possible, it would be nice if UTF-16 literals (not arbitrary string instances) could be automatically coerced to UTF-8 literals (via the ldstr / call routines mentioned earlier). This coercion would only be considered if attempting to leave the data as a string would have caused a compilation error. This could help eliminate some errors resulting from developers forgetting to put the u prefix in front of the string literal, and it could make the code cleaner. Some examples follow.

// String literal being assigned to a member / local of type Utf8String.
public const Utf8String MyConst = "A literal!";

public void Foo(string s);
public void Foo(Utf8String s);

public void FooCaller()
{
    // Calls Foo(string) since it's an exact match.
    Foo("A literal!");
}

public void Bar(object o);
public void Bar(Utf8String s);

public void BarCaller()
{
    // Calls Bar(object), passing in the string literal,
    // since it's the closest match.
    Bar("A literal!");
}

public void Baz(int i);
public void Baz(Utf8String s);

public void BazCaller1()
{
    // Calls Baz(Utf8String), passing in the UTF-8 literal,
    // since there's no closer match.
    Baz("A literal!");
}

public void BazCaller2(string someInput)
{
    // Compiler error. The input isn't a literal, so no auto-coercion
    // takes place. Dev should call Baz(new Utf8String(someInput)).
    Baz(someInput);
}

public void Quux<T>(ReadOnlySpan<T> value);
public void Quux(Utf8String s);

public void QuuxCaller()
{
    // Calls Quux<char>(ReadOnlySpan<char>), passing in the string literal,
    // since string satisfies the constraints.
    Quux("A literal!");
}

public void Glomp(Utf8Span value);

public void GlompCaller()
{
    // Calls Glomp(Utf8Span), passing in the UTF-8 literal, since there's
    // no closer match and Utf8String can be implicitly cast to Utf8Span.
    Glomp("A literal!");
}

UTF-8 String interpolation

The string interpolation feature is undergoing significant churn (see dotnet/csharplang#2302). I envision that when a final design is chosen, there would be a UTF-8 counterpart for symmetry. The internal IUtf8Formattable interface as proposed above is being designed partly with this feature in mind in order to allow single-allocation Utf8String interpolation.

ustring contextual language keyword

For simplicity, we may want to consider a contextual language keyword which corresponds to the System.Utf8String type. The exact name is still up for debate, as is whether we'd want it at all, but we could consider something like the below.

Utf8String a = u"Some UTF-8 string.";

// 'ustring' and 'System.Utf8String' are aliases, as shown below.

ustring b = a;
Utf8String c = b;

The name ustring is intended to evoke "Unicode string". Another leading candidate was utf8. We may wish not to ship with this keyword support in v1 of the Utf8String feature. If we opt not to do so we should be mindful of how we might be able to add it in the future without introducing breaking changes.

An alternative design would be to use a u suffix instead of a u prefix. I'm mostly impartial to this, but there is a nice symmetry to having the characters u, $, and @ all available as prefixes on literal strings.

We could also drop the u prefix entirely and rely solely on type targeting:

ustring a = "Literal string type-targeted to UTF-8.";
object b = (ustring)"Another literal string type-targeted to UTF-8.";

This has implications for string interpolation, as it wouldn't be possible to prepend both the (ustring) coercion hint and the $ interpolation operator simultaneously.

Switching and pattern matching

If a value whose type is statically known to be Utf8String is passed to a switch statement, the corresponding case statements should allow the use of literal Utf8String values.

Utf8String value = ...;

switch (value)
{
    case u"Some literal": /* ... */
    case u"Some other literal": /* ... */
    case "Yet another literal": /* target typing also works */
}

Since pattern matching operates on input values of arbitrary types, I'm pessimistic that pattern matching will be able to take advantage of target typing. This may instead require that developers specify the u prefix on Utf8String literals if they wish such values to participate in pattern matching.

A brief interlude on indexers and IndexOf

Utf8String and related types do not expose an elemental indexer (this[int]) or a typical IndexOf method because they're trying to rid the developer of the notion that bytewise indices into UTF-8 buffers can be treated equivalently as charwise indices into UTF-16 buffers. Consider the naïve implementation of a typical "string split" routine as presented below.

void SplitString(string source, string target, StringComparison comparisonType, out string beforeTarget, out string afterTarget)
{
    // Locates 'target' within 'source', splits on it, then populates the two out parameters.
    // ** NOTE ** This code has a bug, as will be explained in detail below.

    int index = source.IndexOf(target, comparisonType);
    if (index < 0) { throw new Exception("Target string not found!"); }

    beforeTarget = source.Substring(0, index);
    afterTarget = source.Substring(index + target.Length, source.Length - index - target.Length);
}

One subtlety of the above code is that when culture-sensitive or case-insensitive comparers are used (such as OrdinalIgnoreCase in the above example), the target string doesn't have to be an exact char-for-char match of a sequence present in the source string. For example, consider the UTF-16 string "GREEN" ([ 0047 0052 0045 0045 004E ]). Performing an OrdinalIgnoreCase search for the substring "e" ([ 0065 ]) will result in a match, as 'e' (U+0065) and 'E' (U+0045) compare as equal under an OrdinalIgnoreCase comparer.

As another example, consider the UTF-16 string "preſs" ([ 0070 0072 0065 017F 0073 ]), whose fourth character is the Latin long s 'ſ' (U+017F). Performing an OrdinalIgnoreCase search for the substring "S" ([ 0053 ]) will result in a match, as 'ſ' (U+017F) and 'S' (U+0053) compare as equal under an OrdinalIgnoreCase comparer.

There are also scenarios where the length of the match within the search string might not be equal to the length of the target string. Consider the UTF-16 string "encyclopædia" ([ 0065 006E 0063 0079 0063 006C 006F 0070 00E6 0064 0069 0061 ]), whose ninth character is the ligature 'æ' (U+00E6). Performing an InvariantCultureIgnoreCase search for the substring "ae" ([ 0061 0065 ]) will result in a match at index 8, as "æ" ([ 00E6 ]) and "ae" ([ 0061 0065 ]) compare as equal under an InvariantCultureIgnoreCase comparer.

This result is interesting and should give us pause. Since "æ".Length == 1 and "ae".Length == 2, the arithmetic at the end of the method will actually result in the wrong substrings being returned to the caller.

beforeTarget = source.Substring(0, 8 /* index */); // = "encyclop"
afterTarget = source.Substring(
    10 /* index + target.Length */,
    2 /* source.Length - index - target.Length */); // = "ia" (expected "dia"!)

Due to the nature of UTF-16 (used by string), when performing an Ordinal or an OrdinalIgnoreCase comparison, the length of the matched substring within the source will always have a char count equal to target.Length. The length mismatch as demonstrated by "encyclopædia" above can only happen with a culture-sensitive comparer or any of the InvariantCulture comparers.

However, in UTF-8, these same guarantees do not hold. Under UTF-8, only when performing an Ordinal comparison is there a guarantee that the length of the matched substring within the source will have a byte count equal to the target. All other comparers - including OrdinalIgnoreCase - have the behavior that the byte length of the matched substring can change (either shrink or grow) when compared to the byte length of the target string.

As an example of this, consider the string "preſs" from earlier, but this time in its UTF-8 representation ([ 70 72 65 C5 BF 73 ]). Performing an OrdinalIgnoreCase search for the target UTF-8 string "S" ([ 53 ]) will match on the ([ C5 BF ]) portion of the source string. (This is the UTF-8 representation of the letter 'ſ'.) To properly split the source string along this search target, the caller needs to know not only where the match was, but also how long the match was within the original source string.

This fundamental problem is why Utf8String and related types don't expose a standard IndexOf function or a standard this[int] indexer. It's still possible to index directly into the underlying byte buffer by using an API which projects the data as a ROS<byte>. But for splitting operations, these types instead offer a simpler API that performs the split on the caller's behalf, handling the length adjustments appropriately. For callers who want the equivalent of IndexOf, the types instead provide TryFind APIs that return a Range instead of a typical integral index value. This Range represents the matching substring within the original source string, and new C# language features make it easy to take this result and use it to create slices of the original source input string.

This also addresses feedback that was given in a previous prototype: users weren't sure how to interpret the result of the IndexOf method. (Is it a byte count? Is it a char count? Is it something else?) Similarly, there was confusion as to what parameters should be passed to a this[int] indexer or a Substring(int, int) method. By having the APIs promote use of Range and related C# language features, this confusion should subside. Power developers can inspect the Range instance directly to extract raw byte offsets if needed, but most devs shouldn't need to query such information.
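The "where and how long" idea behind the Range-returning APIs can be tried out today on UTF-16 strings. The sketch below is an assumption-laden illustration, not part of the proposal: it rewrites the earlier SplitString routine using the CompareInfo.IndexOf overload that reports a match length (available in .NET 5 and later) and always uses the invariant culture. SafeSplit and TrySplit are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class SafeSplit
{
    // Splits 'source' around the first match of 'target'. The slice after
    // the match is computed from the length of the match in 'source',
    // not from target.Length.
    public static bool TrySplit(
        string source, string target, CompareOptions options,
        out string beforeTarget, out string afterTarget)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

        int index = compareInfo.IndexOf(
            source.AsSpan(), target.AsSpan(), options, out int matchLength);

        if (index < 0)
        {
            beforeTarget = string.Empty;
            afterTarget = string.Empty;
            return false;
        }

        beforeTarget = source.Substring(0, index);
        afterTarget = source.Substring(index + matchLength);
        return true;
    }
}
```

Whether a linguistic search for "ae" matches "æ" depends on the collation data in use (NLS or ICU), so that case is not asserted here. When it does match, matchLength is 1 and the routine yields "dia" rather than "ia".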

API usage samples

Scenario: Split an incoming string of the form "LastName, FirstName" into individual FirstName and LastName components.

// Using Utf8String input and producing Utf8String instances
void SplitSample(ustring input)
{
    // Method 1: Use the Split API to find the ',' char, then trim manually.

    (ustring lastName, ustring firstName) = input.Split(',');
    if (firstName is null) { /* ERROR: no ',' detected in input */ }

    lastName = lastName.Trim();
    firstName = firstName.Trim();

    // Method 2: Use the Split API to find the ", " target string, assuming no trim needed.

    (ustring lastName, ustring firstName) = input.Split(u", ");
    if (firstName is null) { /* ERROR: no ", " detected in input */ }
}

// Using Utf8Span input and producing Utf8Span instances
void SplitSample(Utf8Span input)
{
    // Method 1: Use the Split API to find the ',' char, then trim manually.

    (Utf8Span lastName, Utf8Span firstName) = input.Split(',');
    lastName = lastName.Trim();
    firstName = firstName.Trim();
    if (firstName.IsEmpty) { /* ERROR: trailing ',', or no ',' detected in input */ }

    // Method 2: Use the Split API to find the ", " target string, assuming no trim needed.

    (Utf8Span lastName, Utf8Span firstName) = input.Split(", ");
    if (firstName.IsEmpty) { /* ERROR: trailing ", ", or no ", " detected in input */ }
}

Additionally, the SplitResult struct returned by Utf8Span.Split implements both a standard IEnumerable<T> pattern and the C# deconstruct pattern, which allows it to be used separately from enumeration for simple cases where only a small handful of values are returned.

Utf8Span str = ...;

// The result of Utf8Span.Split can be used in an enumerator

foreach (Utf8Span substr in str.Split(','))
{
    /* operate on substr */
}

// Or it can be used in tuple deconstruction
// (See docs for description of behavior for each arity.)

(Utf8Span before, Utf8Span after) = str.Split(',');
(Utf8Span part1, Utf8Span part2, Utf8Span part3, ...) = str.Split(',');

Scenario: Split a comma-delimited input into substrings, then perform an operation with each substring.

// Using Utf8String input and producing Utf8String instances
// The Utf8Span code would look identical (sub. 'Utf8Span' for 'ustring')

void SplitSample(ustring input)
{
    while (input.Length > 0)
    {
        // 'TryFind' is the 'IndexOf' equivalent. It returns a Range instead
        // of an integer index because there's no this[int] indexer on Utf8String.

        if (!input.TryFind(',', out Range matchedRange))
        {
            // The remainder of the input string is non-empty, but no comma
            // was found in the remaining portion. Process the remainder
            // of the input string, then finish.

            ProcessValue(input);
            break;
        }

        // We found a comma! Substring and process.
        // The 'matchedRange' local contains the range for the ',' that we found.

        ProcessValue(input[..matchedRange.Start]); // fetch segment to the left of the comma, then process it
        input = input[matchedRange.End..]; // set 'input' to the remainder of the input string and loop
    }

    // Could also have an IEnumerable<ustring>-returning version if we wanted, I suppose.
}
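For comparison, the same loop can be written against raw UTF-8 bytes with APIs that exist today. This is only a sketch under stated assumptions: Utf8CommaSplitter and SplitOnComma are hypothetical names, the input is assumed to be well-formed UTF-8, and each segment is transcoded to string purely so the result is easy to inspect. Searching for the single byte 0x2C is safe because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf8CommaSplitter
{
    // Mirrors the TryFind-based loop: find the next ',', process the
    // segment to its left, then continue with the remainder.
    public static List<string> SplitOnComma(ReadOnlySpan<byte> utf8)
    {
        var results = new List<string>();

        while (!utf8.IsEmpty)
        {
            int comma = utf8.IndexOf((byte)',');
            if (comma < 0)
            {
                // Non-empty remainder with no comma: process it and stop.
                results.Add(Encoding.UTF8.GetString(utf8));
                break;
            }

            results.Add(Encoding.UTF8.GetString(utf8.Slice(0, comma)));
            utf8 = utf8.Slice(comma + 1);
        }

        return results;
    }
}
```

As in the sample above, a trailing comma does not produce a final empty segment, because the loop exits as soon as the remainder is empty.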

Miscellaneous topics and open questions

What about comparing UTF-16 and UTF-8 data?

Currently there is a set of APIs Utf8String.AreEquivalent which will decode sequences of UTF-16 and UTF-8 data and compare them for ordinal equality. The general code pattern is below.

ustring a = ...;
string b = ...;

// The below line fails to compile because there's no operator==(Utf8String, string) defined.

bool result = (a == b);

// The below line is probably what the developer intended to write.

bool result = ustring.AreEquivalent(a, b);

// The below line should compile since literal strings can be type targeted to Utf8String.

bool result = (a == "Hello!");

Do we want to add an operator==(Utf8String, string) overload which would allow easy == comparison of UTF-8 and UTF-16 data? There are three main downsides to this which caused me to vote no, but I'm open to reconsideration.

  1. The compiler would need to special-case if (myUtf8String == null), which would now be ambiguous between the two overloads. (If the compiler is already special-casing null checks, this is a non-issue.)

  2. The performance of UTF-16 to UTF-8 comparison is much worse than the performance of UTF-16 to UTF-16 (or UTF-8 to UTF-8) comparison. When the representation is the same on both sides, certain shortcuts can be implemented to avoid the O(n) comparison, and even the O(n) comparison itself can be implemented as a simple memcmp operation. When the representations are heterogeneous, the opportunity for taking shortcuts is much more restricted, and the O(n) comparison itself has a higher constant factor. Developers might not expect such a performance characteristic from an equality operator.

  3. Comparing a Utf8String against a literal string would no longer go through the fast path, as target typing would cause the compiler to emit a call to operator==(Utf8String, string) instead of operator==(Utf8String, Utf8String). The comparison itself would then have the lower performance described by bullet (2) above.

One potential upside to having such a comparison is that it would prevent developers from using the antipattern if (myUtf8String.ToString() == someString), which would result in unnecessary allocations. If we are concerned about this antipattern one way to address it would be through a Code Analyzer.
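To illustrate why the heterogeneous comparison described in point (2) costs more than a memcmp, here is a minimal allocation-free sketch that decodes both sides one scalar value at a time. It is not the actual Utf8String.AreEquivalent implementation: HeterogeneousCompare is a hypothetical name, the method works on spans, and treating any ill-formed input as "not equivalent" is an assumption of this sketch.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class HeterogeneousCompare
{
    // Ordinal comparison of UTF-8 bytes against UTF-16 chars. Both inputs
    // must be decoded in lockstep, so there is no bulk-compare shortcut and
    // not even a cheap length check (the code unit counts differ).
    public static bool AreEquivalent(ReadOnlySpan<byte> utf8, ReadOnlySpan<char> utf16)
    {
        while (!utf8.IsEmpty && !utf16.IsEmpty)
        {
            if (Rune.DecodeFromUtf8(utf8, out Rune fromUtf8, out int bytesConsumed) != OperationStatus.Done)
            {
                return false; // ill-formed UTF-8
            }
            if (Rune.DecodeFromUtf16(utf16, out Rune fromUtf16, out int charsConsumed) != OperationStatus.Done)
            {
                return false; // ill-formed UTF-16 (e.g., a lone surrogate)
            }
            if (fromUtf8 != fromUtf16)
            {
                return false;
            }

            utf8 = utf8.Slice(bytesConsumed);
            utf16 = utf16.Slice(charsConsumed);
        }

        // Equivalent only if both inputs were exhausted together.
        return utf8.IsEmpty && utf16.IsEmpty;
    }
}
```

A helper of this shape also avoids the ToString() allocation in the antipattern mentioned above, at the cost of the per-scalar decoding work.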

What if somebody passes invalid data to the "skip validation" factories?

When calling the "unsafe" APIs, callers are fully responsible for ensuring that the invariants are maintained. Our debug builds could double-check some of these invariants (such as the initial Utf8String creation consisting only of well-formed data). We could also consider allowing applications to opt-in to these checks at runtime by enabling an MDA or other diagnostic facility. But as a guiding principle, when "unsafe" APIs are called the Framework should trust the developer and should have as little overhead as possible.

Consider consolidating the unsafe factory methods under a single unsafe type.

This would prevent pollution of the type's normal API surface and could help write tools which audit use of a single "unsafe" type.

Some of the methods may need to be extension methods instead of normal static factories. (Example: Unsafe slicing routines, should we choose to expose them.)

Potential APIs to enlighten

System namespace

Include Utf8String / Utf8Span overloads on Console.WriteLine. Additionally, perhaps introduce an API Console.ReadLineUtf8.

System.Data.* namespace

Include generalized support for serializing Utf8String properties as a primitive with appropriate mapping to nchar or nvarchar.

System.Diagnostics.* namespace

Enlighten EventSource so that a caller can write Utf8String / Utf8Span instances cheaply. Additionally, some types like ActivitySpanId already have ROS<byte> ctors; overloads can be introduced here.

System.Globalization.* namespace

The CompareInfo type has many members which operate on string instances. These should be spanified foremost, and Utf8String / Utf8Span overloads should be added. Good candidates are Compare, GetHashCode, IndexOf, IsPrefix, and IsSuffix.

The TextInfo type has members which should be treated similarly. ToLower and ToUpper are good candidates. Can we get away without enlightening ToTitleCase?

System.IO.* namespace

BinaryReader and BinaryWriter should have overloads which operate on Utf8String and Utf8Span. These overloads could potentially be cheaper than the normal string / ROS<char> based overloads, since the reader / writer instances may in fact be backed by UTF-8 under the covers. If this is the case then writing is simple projection, and reading is validation (faster than transcoding).

File: WriteAllLines, WriteAllText, AppendAllText, etc. are good candidates for overloads to be added. On the read side, there's ReadAllTextUtf8 and ReadAllLinesUtf8.

TextReader.ReadLine and TextWriter.Write are also good candidates to overload. This follows the same general premise as BinaryReader and BinaryWriter as mentioned above.

Should we also enlighten SerialPort or GPIO APIs? I'm not sure if UTF-8 is a bottleneck here.

System.Net.Http.* namespace

Introduce Utf8StringContent, which automatically sets the charset header. This type already exists in the System.Utf8String.Experimental package.

System.Text.* namespace

UTF8Encoding: Overload candidates are GetChars, GetString, and GetCharCount (of Utf8String or Utf8Span). These would be able to skip validation after transcoding as long as the developer hasn't subclassed the type.

Rune: Add ToUtf8String API. Add IsDefined API to query the OS's NLS tables (could help with databases and other components that need to adhere to strict case / comparison processing standards).

TextEncoder: Add Encode(Utf8String): Utf8String and FindFirstIndexToEncode(Utf8Span): Index. This is useful for HTML-escaping, JSON-escaping, and related operations.

Utf8JsonReader: Add read APIs (GetUtf8String) and overloads to both the ctor and ValueTextEquals.

JsonEncodedText: Add an EncodedUtf8String property.

Regex is a bit of a special case because there has been discussion about redoing the regex stack all-up. If we did proceed with redoing the stack, then it would make sense to add first-class support for UTF-8 here.
