Using Rune

Introduction

Most .NET developers are familiar with the string class as a means to represent and manipulate text. A string is logically a sequence of 16-bit values called chars, and the string.Length property returns the number of chars that are present in the string instance.

Consider the sample function below, which prints out all the chars in a string.

public static void PrintChars(string s)
{
    Console.WriteLine("\"{0}\".Length = {1}", s, s.Length);
    for (int i = 0; i < s.Length; i++)
    {
        Console.WriteLine("s[{0}] = '{1}' ('\\u{2:x4}')", i, s[i], (int)s[i]);
    }
}

Calling PrintChars("Hello"); produces the following output:

"Hello".Length = 5
s[0] = 'H' ('\u0048')
s[1] = 'e' ('\u0065')
s[2] = 'l' ('\u006c')
s[3] = 'l' ('\u006c')
s[4] = 'o' ('\u006f')

A single char value can represent characters from most of the world's writing systems, as demonstrated by the call to PrintChars("你好");. (你好 translates to nǐ hǎo, Chinese for Hello.)

"你好".Length = 2
s[0] = '你' ('\u4f60')
s[1] = '好' ('\u597d')

However, things are not so straightforward for strings containing characters from less-common writing systems, or for strings containing certain symbols or emoji. Consider the string "𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟", which means "Osage" in the Osage language.

"𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟".Length = 17
s[0] = '�' ('\ud801')
s[1] = '�' ('\udccf')
s[2] = '�' ('\ud801')
s[3] = '�' ('\udcd8')
s[4] = '�' ('\ud801')
s[5] = '�' ('\udcfb')
s[6] = '�' ('\ud801')
s[7] = '�' ('\udcd8')
s[8] = '�' ('\ud801')
s[9] = '�' ('\udcfb')
s[10] = '�' ('\ud801')
s[11] = '�' ('\udcdf')
s[12] = ' ' ('\u0020')
s[13] = '�' ('\ud801')
s[14] = '�' ('\udcbb')
s[15] = '�' ('\ud801')
s[16] = '�' ('\udcdf')

Or consider the string "🐂", which is a single emoji ox.

"🐂".Length = 2
s[0] = '�' ('\ud83d')
s[1] = '�' ('\udc02')

This behavior might be surprising to many developers, and it demonstrates a fundamental point about string and char.

Breaking down a string into individual chars and inspecting those chars one-by-one doesn't always provide meaningful results. Similarly, the value string.Length doesn't necessarily correlate to what a user will perceive as the number of characters displayed when rendering a string to the screen.

The basics behind Unicode encodings

It's at this point we need to dive a bit more into how data is encoded in Unicode. The Unicode Standard defines over 1.1 million code points. These code points are abstractions which might represent characters like a - z, symbols like €, or emoji like 🛫. Some code points don't directly correspond to display characters. They might instead perform other actions such as modify the appearance of surrounding text, swap between left-to-right and right-to-left text, or overlay two adjacent characters on top of each other so that they display as a single character. Most code points are unassigned or reserved for future use.

At their most basic, code points are just integers. They are referred to using the syntax U+xxxx, where xxxx is their hex-encoded identifier. The full range of Unicode code points is the range U+0000..U+10FFFF, inclusive. This gives a total number of 1,114,112 possible code points that can be assigned by Unicode, though as mentioned earlier most code points are not yet assigned. Some examples of assigned code points are listed below.

  • U+0061 LATIN SMALL LETTER A ('a')
  • U+0232 LATIN CAPITAL LETTER Y WITH MACRON ('Ȳ')
  • U+6C34 CJK UNIFIED IDEOGRAPH-6C34 ('水')
  • U+10C43 OLD TURKIC LETTER ORKHON AT ('𐱃')
  • U+1F339 ROSE ('🌹')

As mentioned earlier, a .NET char is a 16-bit value. This means that each char can represent a single code point in the range U+0000..U+FFFF. This range of 65,536 code points is often referred to as the Unicode Basic Multilingual Plane ("BMP"). While this is enough to cover the majority of the world's writing systems, it cannot represent supplementary code points, which are code points in the range U+10000..U+10FFFF.

To address this limitation, there is a special range of code points U+D800..U+DFFF called the surrogate code points. When a high surrogate code point (U+D800..U+DBFF) is immediately followed by a low surrogate code point (U+DC00..U+DFFF), they are interpreted as a supplementary code point by using the following formula.

actual = ((hi - 0xD800) * 0x0400) + (lo - 0xDC00) + 0x10000

For example, given the 2-char string "\ud83c\udf39", the actual code point which results from this pair is computed as:

actual = ((    hi - 0xD800) * 0x0400) + (    lo - 0xDC00) + 0x10000
       = ((0xD83C - 0xD800) * 0x0400) + (0xDF39 - 0xDC00) + 0x10000
       = (          0x003C  * 0x0400) +           0x0339  + 0x10000
       =                      0xF000  +           0x0339  + 0x10000
       = 0x1F339

This demonstrates that "\ud83c\udf39" is the char-based encoding of the U+1F339 ROSE ('🌹') code point mentioned previously.
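
You can verify this arithmetic in C# with char.ConvertToUtf32, which performs the same computation on a surrogate pair:

int codePoint = char.ConvertToUtf32('\ud83c', '\udf39');
Console.WriteLine("U+{0:X4}", codePoint); // prints "U+1F339"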

In the C# programming language, the syntax "\Uxxxxxxxx" (a backslash, an uppercase U, and 8 hexadecimal digits) can also be used to represent a supplementary code point. So the lines string s = "I have a \ud83c\udf39."; and string s = "I have a \U0001F339."; are equivalent.

The Unicode Standard also defines a concept called a Unicode scalar value, which is any Unicode code point except the set of surrogate code points. In other words, a Unicode scalar value is any code point in the range U+0000..U+D7FF, inclusive, or U+E000..U+10FFFF, inclusive. Typical Unicode operations are defined only in terms of scalar values, not arbitrary code points.

For example, consider the following two code points. (These code points are also scalar values.)

  • U+10421 DESERET CAPITAL LETTER ER ('𐐡')
    • As chars: [ D801 DC21 ]
  • U+10449 DESERET SMALL LETTER ER ('𐑉')
    • As chars: [ D801 DC49 ]

Given an API IsUpperCase(int codePoint), such an API would return true if passed 0x10421. Similarly, IsLowerCase(int codePoint) would return true if passed 0x10449. But what if we were to pass the surrogate code points to such an API? Querying IsUpperCase(0xD801) is meaningless, as without the second half of the surrogate pair it's impossible to know what supplementary code point would have resulted. Similarly, querying IsLowerCase(0xDC49) is meaningless because the first half of the surrogate pair is missing.

This situation occurs more commonly in applications than one might initially expect. For instance, consider a naïve method like the one shown below, where a string instance is converted to uppercase.

// THIS SAMPLE SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
public static string ConvertToUpper(string input)
{
    StringBuilder builder = new StringBuilder(input.Length);
    for (int i = 0; i < input.Length; i++) /* or 'foreach' */
    {
        builder.Append(char.ToUpperInvariant(input[i]));
    }
    return builder.ToString();
}

In this sample, if input contains the substring "𐑉", the result will still contain the lowercase form "𐑉" instead of the uppercase form "𐐡". The reason for this is that when char.ToUpperInvariant sees char values corresponding to surrogate code points, it does not have enough information to perform the conversion properly. (Calling string.ToUpperInvariant on the original input string rather than iterating char-by-char would produce the correct results, as string.ToUpperInvariant can inspect the entire contents of the string instance.)

Likewise, some applications may perform inappropriate string splitting, perhaps to insert newlines into documents.

// THIS SAMPLE SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
public static string InsertNewlinesEveryEightyChars(string input)
{
    StringBuilder builder = new StringBuilder();

    // First, append chunks in multiples of 80 chars
    // followed by a newline.

    int i = 0;
    for (; i < input.Length - 80; i += 80)
    {
        builder.Append(input, i, 80);
        builder.AppendLine(); // newline
    }

    // Then append any leftover data followed by
    // one final newline.

    builder.Append(input, i, input.Length - i);
    builder.AppendLine(); // newline
    
    return builder.ToString();
}

In the above sample, if a surrogate pair happens to straddle an 80-char boundary, the pair will be split and a newline injected between them. This introduces data corruption, as the resulting document is not "well-formed" and the consuming application could encounter errors. (More on well-formedness in the Q & A section later in this document.)
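
One way to avoid this particular corruption is sketched below: before cutting each chunk, check whether the cut would fall between a high surrogate and its low surrogate, and if so shorten the chunk by one char. (This is a minimal sketch that only keeps surrogate pairs together; it does not attempt to keep grapheme clusters together, a separate concern covered in the Q & A section later in this document.)

// Sketch of a surrogate-aware variant of the previous sample.
public static string InsertNewlinesEveryEightyChars(string input)
{
    StringBuilder builder = new StringBuilder();

    int i = 0;
    while (input.Length - i > 80)
    {
        int chunkLength = 80;

        // If the last char of this chunk is a high surrogate, its matching low
        // surrogate is the first char of the next chunk; shrink the chunk by one
        // so the pair isn't split across the newline.
        if (char.IsHighSurrogate(input[i + chunkLength - 1]))
        {
            chunkLength--;
        }

        builder.Append(input, i, chunkLength);
        builder.AppendLine(); // newline

        i += chunkLength;
    }

    // Append any leftover data followed by one final newline.
    builder.Append(input, i, input.Length - i);
    builder.AppendLine(); // newline

    return builder.ToString();
}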

These samples reinforce the point made earlier. Consider the query "Is this code point an uppercase letter?" This query is only meaningful when the input is a scalar value. The question is unanswerable when the input is a surrogate code point, as a surrogate code point only provides half the information necessary.

The Rune type as a scalar value

To help address these issues and to provide a more reliable programming pattern, .NET Core 3 introduces the Rune type. The Rune type corresponds exactly to a Unicode scalar value. The Rune constructors validate that the resulting instance is a valid Unicode scalar value; if not, they throw an exception.

Rune a = new Rune('a'); // OK, 'a' (U+0061) is a valid scalar value.
Rune b = new Rune(0x0061); // OK, this is a valid scalar value.
Rune c = new Rune('\ud801'); // Throws, U+D801 is not a valid scalar value.
Rune d = new Rune(0x10421); // OK, this is a valid scalar value.
Rune e = new Rune('\ud801', '\udc21'); // OK, this is equivalent to the above.
Rune f = new Rune(0x12345678); // Throws, outside the range of valid scalar values.

Since a Rune is a valid Unicode scalar value, it can be queried or manipulated and the result will be well-defined.

// The code below is ok.
// Using char.ToUpperInvariant(char) could produce incorrect results.
public static string ConvertToUpper(string input)
{
    StringBuilder builder = new StringBuilder(input.Length);
    foreach (Rune rune in input.EnumerateRunes())
    {
        builder.Append(Rune.ToUpperInvariant(rune));
    }
    return builder.ToString();
}

// The code below is ok.
// Using char.IsLetter(char) could produce incorrect results.
public static bool StringConsistsEntirelyOfLetters(string input)
{
    foreach (Rune rune in input.EnumerateRunes())
    {
        if (!Rune.IsLetter(rune))
        {
            return false;
        }
    }
    return true;
}

For developers accustomed to the static APIs on the char type, the Rune type exposes analogs of many of those APIs. Methods like Rune.IsWhiteSpace, Rune.IsLetterOrDigit, and Rune.GetUnicodeCategory mirror the static APIs available on the char class.

To get the raw scalar value from a Rune instance, use the Rune.Value property. To convert a Rune instance back to a sequence of chars, use Rune.ToString or Rune.EncodeToUtf16. Since any Unicode scalar value is representable by a single char (for BMP scalar values) or by a surrogate pair (for supplementary scalar values), any Rune instance can be represented by at most 2 chars. Use Rune.Utf16SequenceLength to see how many chars are required to hold the representation of this Rune instance.
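
For example, for a supplementary scalar value such as U+1F339 ROSE (mentioned earlier), these members report the following:

Rune rose = new Rune(0x1F339);               // U+1F339 ROSE ('🌹')
int value = rose.Value;                      // 0x1F339 (127801 decimal)
int utf16Length = rose.Utf16SequenceLength;  // 2 (a surrogate pair is required)
string asString = rose.ToString();           // "🌹", i.e. "\ud83c\udf39"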

A brief digression on Unicode strings and UTF-* encodings

Let's pause a moment and discuss how text is represented in applications. Unicode text is logically a sequence of Unicode scalar values, but to be usable by an application those scalar values must be represented in memory somehow. .NET's string type uses a representation known as "UTF-16", in which the scalar values are encoded as a sequence of 16-bit elements. These elements - known to .NET programmers as chars - are called code units in the Unicode Standard.

A code unit is simply an integer that serves as the fundamental building block of a particular UTF-* encoding. .NET developers will be most familiar with UTF-16, which uses 16-bit code units. Other encodings like UTF-8 and UTF-32 also exist; these encodings use 8-bit code units and 32-bit code units, respectively.

Depending on the particular UTF-* encoding being used, it might take multiple code units to represent a single scalar value. Some examples of this follow.

Scalar: U+0061 LATIN SMALL LETTER A ('a')
UTF-8 : [ 61 ]           (1x  8-bit code unit  = 8 bits total)
UTF-16: [ 0061 ]         (1x 16-bit code unit  = 16 bits total)
UTF-32: [ 00000061 ]     (1x 32-bit code unit  = 32 bits total)

Scalar: U+0429 CYRILLIC CAPITAL LETTER SHCHA ('Щ')
UTF-8 : [ D0 A9 ]        (2x  8-bit code units = 16 bits total)
UTF-16: [ 0429 ]         (1x 16-bit code unit  = 16 bits total)
UTF-32: [ 00000429 ]     (1x 32-bit code unit  = 32 bits total)

Scalar: U+A992 JAVANESE LETTER GA ('ꦒ')
UTF-8 : [ EA A6 92 ]     (3x  8-bit code units = 24 bits total)
UTF-16: [ A992 ]         (1x 16-bit code unit  = 16 bits total)
UTF-32: [ 0000A992 ]     (1x 32-bit code unit  = 32 bits total)

Scalar: U+104CC OSAGE CAPITAL LETTER TSHA ('𐓌')
UTF-8 : [ F0 90 93 8C ]  (4x  8-bit code units = 32 bits total)
UTF-16: [ D801 DCCC ]    (2x 16-bit code units = 32 bits total)
UTF-32: [ 000104CC ]     (1x 32-bit code unit  = 32 bits total)

In these examples, the UTF-8 standalone code unit [ EA ] is meaningless unless it occurs as part of a longer sequence. Similarly, the UTF-16 standalone code unit [ D801 ] is meaningless unless it occurs as part of a longer sequence. So while code units might serve as building blocks for a particular UTF-* encoding, they're not always meaningful in isolation. This is analogous to calls like char.ToUpperInvariant('\ud801') being meaningless, as mentioned previously.

For .NET strings, the UTF-16 code units (chars) are stored in contiguous memory as a sequence of 16-bit integers. And, as with any primitive integer data type, individual code units are laid out according to the endianness of the current architecture. On a little-endian architecture, the string consisting of the UTF-16 code units [ D801 DCCC ] would actually be laid out in memory as the bytes [ 0x01, 0xD8, 0xCC, 0xDC ], but on a big-endian architecture that same string would be laid out in memory as the bytes [ 0xD8, 0x01, 0xDC, 0xCC ].

In a connected world, computer systems which communicate with each other must agree on the representation of data crossing the wire. Most network protocols use UTF-8 as a standard when transmitting text, partly for bandwidth savings, but also to solve issues that might result from a big-endian machine communicating with a little-endian machine. The string consisting of the UTF-8 code units [ F0 90 93 8C ] will always be represented as the bytes [ 0xF0, 0x90, 0x93, 0x8C ] regardless of endianness. This means that it is convenient to represent UTF-8 data as a series of bytes and to transmit those raw bytes directly across the wire. This is also the reason that code such as the sample below is so common in .NET applications.

// writes a string to the network
string stringToWrite = GetString();
byte[] stringAsUtf8Bytes = Encoding.UTF8.GetBytes(stringToWrite);
await outputStream.WriteAsync(stringAsUtf8Bytes, 0, stringAsUtf8Bytes.Length);

In the above sample, the method Encoding.UTF8.GetBytes decodes the UTF-16 string back into a series of Unicode scalar values, then it re-encodes those scalar values into UTF-8 and places the resulting sequence into a byte[]. The method Encoding.UTF8.GetString performs the opposite transform, converting a UTF-8 byte[] to a UTF-16 string.
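
For completeness, the read direction might look like the sketch below. ReadAllBytesFromNetworkAsync is a hypothetical placeholder for whatever I/O the application actually performs, not a real API.

// reads a string from the network (reverse of the sample above);
// ReadAllBytesFromNetworkAsync is a hypothetical helper, not a real API
byte[] utf8BytesFromWire = await ReadAllBytesFromNetworkAsync();
string stringFromWire = Encoding.UTF8.GetString(utf8BytesFromWire);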

There is a word of warning here. Since UTF-8 is commonplace on the internet, it is tempting to read raw bytes from the wire and to treat those bytes as if they were UTF-8. However, not all byte sequences are well-formed UTF-8. (See the section on "well-formed" data in the Q & A at the end of this document.) If a malicious client submits ill-formed UTF-8 to your service and the service attempts to operate on that data as if it were well-formed UTF-8, it could cause errors or security holes inside your application. Before operating on such data you should validate that it is indeed well-formed UTF-8. Alternatively, use a method like Encoding.UTF8.GetString, which will perform validation while converting the incoming data to a string.

Typical Rune usage and behaviors

To recap, here are the basic definitions once again.

  • A code point is any Unicode value within the range U+0000..U+10FFFF. Some - but not all - code points correspond to display characters.

  • A scalar value (Rune) is any code point that is not a surrogate code point. That is, a scalar value is any code point within the range U+0000..U+D7FF or U+E000..U+10FFFF. It is meaningful to query scalar values for their properties, such as "Does this scalar value represent an uppercase letter?" Every scalar value can be represented in any given UTF-* encoding.

  • Unicode defines three encodings UTF-8, UTF-16, and UTF-32 to define how sequences of scalar values are to be represented in-memory. The string type uses UTF-16 as its representation. Many network protocols use UTF-8 as the wire format. Methods like Encoding.UTF8.GetBytes and Encoding.UTF8.GetString allow conversion between these different representations.

  • A code unit is the elemental building block of UTF-* encoded text. UTF-8 uses an 8-bit code unit (usually interchangeable with byte). UTF-16 uses a 16-bit code unit (char). UTF-32 uses a 32-bit code unit and is not commonly seen within .NET applications except during limited p/invoke scenarios. Depending on the UTF-* encoding in use and the particular scalar value being encoded, multiple code units may be required to represent the scalar value.

Having covered these concepts, we can now take a deeper dive into the capabilities of the Rune type and demonstrate common usage patterns.

Reading a Rune from existing data

There are several ways to get a Rune instance. One way is to use the constructor to create a Rune directly from a code point, a single char, or a surrogate char pair.

// The calls below all create a Rune with value U+20AC EURO SIGN ('€')
Rune a = new Rune('€');
Rune b = new Rune('\u20ac');
Rune c = new Rune(0x20AC);

// The calls below all create a Rune with value U+1F52E CRYSTAL BALL ('🔮')
Rune d = new Rune('\ud83d', '\udd2e');
Rune e = new Rune(0x1F52E);

All of the above constructors will throw an ArgumentException if the input argument does not represent a valid Unicode scalar value. There are also Rune.TryCreate methods available for callers who want a try-style operation and who don't want exceptions to be thrown on failure.
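
As a quick sketch of the try-style pattern:

// Rune.TryCreate returns false instead of throwing when the input
// is not a valid Unicode scalar value.
if (Rune.TryCreate(0x10421, out Rune er))
{
    Console.WriteLine(er.ToString()); // "𐐡" (U+10421 DESERET CAPITAL LETTER ER)
}

bool ok = Rune.TryCreate('\ud801', out _); // false; U+D801 is a surrogate code point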

To get the integer code point value of a Rune, use the Rune.Value property.

Rune rune = new Rune('\ud83d', '\udd2e'); // U+1F52E CRYSTAL BALL ('🔮')
int codePoint = rune.Value; // = 128302 decimal (= 0x1F52E hexadecimal)

Rune instances can also be read from existing input sequences. For instance, given a ReadOnlySpan<char> which represents UTF-16 data, the Rune.DecodeFromUtf16 method will decode and return the first Rune which occurs at the beginning of the input span. The Rune.DecodeFromUtf8 method operates similarly, accepting a ReadOnlySpan<byte> parameter which represents UTF-8 data. There are equivalent methods to read from the end of the span instead of the beginning of the span.
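
As a minimal sketch of both decode methods, reusing U+104CC OSAGE CAPITAL LETTER TSHA from the table earlier in this document (the OperationStatus return type lives in the System.Buffers namespace):

ReadOnlySpan<char> utf16Source = "\ud801\udccc".AsSpan();              // [ D801 DCCC ]
OperationStatus status = Rune.DecodeFromUtf16(utf16Source, out Rune rune, out int charsConsumed);
// status = OperationStatus.Done, rune.Value = 0x104CC, charsConsumed = 2

ReadOnlySpan<byte> utf8Source = new byte[] { 0xF0, 0x90, 0x93, 0x8C }; // [ F0 90 93 8C ]
status = Rune.DecodeFromUtf8(utf8Source, out rune, out int bytesConsumed);
// status = OperationStatus.Done, rune.Value = 0x104CC, bytesConsumed = 4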

Querying properties of a Rune

Many of the static APIs available on the char type are also available on the Rune type. For instance, Rune.IsWhiteSpace and Rune.GetUnicodeCategory are Rune-based equivalents to the static char.IsWhiteSpace and char.GetUnicodeCategory methods. These methods can be used to operate with Rune instances in a manner similar to code which operates directly on chars, but with full support for the wide range of Unicode scalar values that are not representable by individual char elements.

Below is a sample method which takes a ReadOnlySpan<char> as input and trims from both the start and the end of the span every Rune which isn't a letter or a digit.

public static ReadOnlySpan<char> TrimNonLettersAndNonDigits(ReadOnlySpan<char> span)
{
    // First, trim from the front. If any Rune can't be decoded (return value is anything other
    // than "Done"), or if the Rune is a letter or digit, stop trimming from the front and
    // instead work from the end.
    while (Rune.DecodeFromUtf16(span, out Rune rune, out int charsConsumed) == OperationStatus.Done)
    {
        if (Rune.IsLetterOrDigit(rune)) { break; }
        span = span[charsConsumed..];
    }

    // Next, trim from the end. If any Rune can't be decoded, or if the Rune is a letter or digit,
    // break from the loop, and we're finished.
    while (Rune.DecodeLastFromUtf16(span, out Rune rune, out int charsConsumed) == OperationStatus.Done)
    {
        if (Rune.IsLetterOrDigit(rune)) { break; }
        span = span[..^charsConsumed];
    }

    return span; // this is now trimmed on both sides
}

There are some API differences between char and Rune. For example, char.IsSurrogate(char) will return true for inputs in the range '\ud800' to '\udfff', inclusive. However, since Rune instances are by construction scalar values and can never be surrogate code points, there is no equivalent API Rune.IsSurrogate(Rune). Additionally, while char.GetUnicodeCategory does not always return the same result as CharUnicodeInfo.GetUnicodeCategory (see Remarks), the Rune.GetUnicodeCategory method will always return the same result as the CharUnicodeInfo.GetUnicodeCategory method.

Converting a Rune to UTF-8 or UTF-16

Since a Rune is a Unicode scalar value, it can be converted to any UTF-* encoding losslessly. The Rune type has built-in support for conversion to UTF-8 and UTF-16.

To query the number of UTF-16 code units (chars) that would result from representing a particular Rune as UTF-16, use the Rune.Utf16SequenceLength property. The method Rune.EncodeToUtf16 can then be used to write the resulting char values. Similar methods exist for UTF-8 conversion.

Rune rune = GetRune();

// Convert to UTF-16 char[]

char[] chars = new char[rune.Utf16SequenceLength];
int numCharsWritten = rune.EncodeToUtf16(chars);
Debug.Assert(numCharsWritten == chars.Length);

// Shortcut to convert to UTF-16 string

string theString = rune.ToString();

// Convert to UTF-8 byte[]

byte[] bytes = new byte[rune.Utf8SequenceLength];
int numBytesWritten = rune.EncodeToUtf8(bytes);
Debug.Assert(numBytesWritten == bytes.Length);

Both the EncodeToUtf16 and the EncodeToUtf8 methods will return the actual number of elements written, and they'll throw an exception if the destination buffer is too short to contain the result. There are non-throwing TryEncode* methods as well for callers who want to avoid exceptions in the case where the destination buffer is too short.
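
As a small sketch of the non-throwing variant:

Rune rune = new Rune(0x1F52E); // U+1F52E CRYSTAL BALL ('🔮')

Span<char> destination = stackalloc char[rune.Utf16SequenceLength];
if (rune.TryEncodeToUtf16(destination, out int charsWritten))
{
    // charsWritten = 2; destination now contains [ D83D DD2E ]
}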

See the Rune documentation for more information on the available API surface.

When to use Rune in your code

Iterating through some text char-by-char and calling static methods on char

If your code iterates through a string or a ReadOnlySpan<char> char-by-char and calls any of the below methods, consider replacing these calls with code that uses the Rune type directly. (This list is intended to be representative, not exhaustive.)

  • char.GetNumericValue
  • char.GetUnicodeCategory
  • char.IsDigit
  • char.IsLetter
  • char.IsLetterOrDigit
  • char.IsLower
  • char.IsNumber
  • char.IsPunctuation
  • char.IsSymbol
  • char.IsUpper

For example, consider a method which counts the number of letters in a string or in a ReadOnlySpan<char>.

// THIS SAMPLE SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
public static int CountLettersInString(string s)
{
    int letterCount = 0;

    foreach (char ch in s)
    {
        if (char.IsLetter(ch)) { letterCount++; }
    }

    return letterCount;
}

// THIS SAMPLE SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
public static int CountLettersInSpan(ReadOnlySpan<char> span)
{
    int letterCount = 0;

    foreach (char ch in span)
    {
        if (char.IsLetter(ch)) { letterCount++; }
    }

    return letterCount;
}

The above methods might appear to work correctly with some languages like English: CountLettersInString("Hello") = 5. But they won't work correctly for the Osage example shown earlier in this document: CountLettersInString("𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟") = 0. The reason these methods return incorrect results for Osage text is that most of the individual chars in the Osage string are surrogate code points, and it is meaningless to query "Does this code point represent a letter?" for surrogate code points.

Changing this code to use Rune instead of char will result in the method returning the correct value.

// This sample shows correct usage of the Rune type.
public static int CountLettersInString(string s)
{
    int letterCount = 0;

    foreach (Rune rune in s.EnumerateRunes())
    {
        if (Rune.IsLetter(rune)) { letterCount++; }
    }

    return letterCount;
}

// This sample shows correct usage of the Rune type.
public static int CountLettersInSpan(ReadOnlySpan<char> span)
{
    int letterCount = 0;

    foreach (Rune rune in span.EnumerateRunes())
    {
        if (Rune.IsLetter(rune)) { letterCount++; }
    }

    return letterCount;
}

With these updated samples, CountLettersInString("Hello") = 5 and CountLettersInString("𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟") = 8, as expected.

Custom code dealing with surrogate chars

If your code contains explicit calls to any of the below methods, consider replacing these calls with code that uses the Rune type directly. (This list is intended to be representative, not exhaustive.)

  • char.IsSurrogate
  • char.IsSurrogatePair
  • char.IsHighSurrogate
  • char.IsLowSurrogate
  • char.ConvertFromUtf32
  • char.ConvertToUtf32

For example, consider a method such as the one below, which already has special logic to deal with surrogate char pairs.

// Example of code which performs manual char surrogate checks.
public static void ProcessString(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        if (!char.IsSurrogate(s[i]))
        {
            ProcessCodePoint(s[i]);
        }
        else if (i + 1 < s.Length && char.IsSurrogatePair(s[i], s[i + 1]))
        {
            int codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
            ProcessCodePoint(codePoint);
            i++; // so that when the loop iterates it's actually +2
        }
        else
        {
            throw new Exception("String was not well-formed UTF-16.");
        }
    }
}

private static void ProcessCodePoint(int codePoint) { /* ... */ }

Such a method can be more easily written in terms of Rune, as below.

// Example of code which uses Rune instead of performing manual char surrogate checks.
public static void ProcessString(string s)
{
    for (int i = 0; i < s.Length;)
    {
        if (!Rune.TryGetRuneAt(s, i, out Rune rune))
        {
            throw new Exception("String was not well-formed UTF-16.");
        }

        ProcessCodePoint(rune.Value);
        i += rune.Utf16SequenceLength; // increment the iterator by the number of chars in this Rune
    }
}
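
If throwing on ill-formed input isn't a requirement, the loop can be simplified further with EnumerateRunes. Note the behavioral difference: as sketched below, ill-formed subsequences are surfaced as U+FFFD REPLACEMENT CHARACTER ('�') instead of causing an exception.

// Example of code which uses EnumerateRunes; ill-formed UTF-16 subsequences
// are reported as U+FFFD rather than causing an exception to be thrown.
public static void ProcessStringLenient(string s)
{
    foreach (Rune rune in s.EnumerateRunes())
    {
        ProcessCodePoint(rune.Value);
    }
}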

When not to use Rune in your code

Iterating through a string char-by-char looking for exact char matches

Consider the following code which iterates through a string looking for specific characters, returning the index where the first match occurs in the string. There is no need to change this code to use Rune, as the code is searching for a set of characters that the developer already knows can each be represented by a single char.

// This code returns the index of the first char 'A' - 'Z' that appears in the string.
// It demonstrates valid usage of the char type.
public static int GetIndexOfFirstAToZ(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        char thisChar = s[i];
        if ('A' <= thisChar && thisChar <= 'Z')
        {
            return i; // found a match
        }
    }

    return -1; // didn't find 'A' - 'Z' in the input string
}

Splitting a string on a known char

It's common to call string.Split passing constant delimiters such as ' ' (space) or ',' (comma). This is acceptable for the same reason as in the previous example: the code is searching for a set of characters that the developer already knows can each be represented by a single char.

// These lines demonstrate valid usage of the string.Split method and the char type.

string inputString = GetInputString();
string[] splitOnSpace = inputString.Split(' ');
string[] splitOnComma = inputString.Split(',');

Counting the number of display characters in a string

A Rune does not necessarily correlate directly to a display character, so counting the number of Runes in a string will not always match the number of user-perceivable characters shown when displaying a string.

However, since Rune instances represent Unicode scalar values, components which follow the Unicode text segmentation guidelines can use Rune as a building block for counting display characters or locating word or sentence boundaries. The .NET type StringInfo can also assist with this.

For more information, see the discussion on Rune vs. "character" in the Q & A section which follows.

Q & A

What's the relationship between a .NET Rune and a "character"?

There's no universally agreed-upon definition of a "character" across programming languages, and even the Unicode glossary lists multiple definitions for "character".

For the purposes of this discussion, we'll define a character as what a reader logically perceives as a single display element. This is often referred to as a grapheme cluster. A grapheme cluster consists of one or more Runes.

Consider the strings "a", "é", and "👩🏽‍🚒". These strings should each appear as a single logical unit (depending on your operating system), so we'll say that each string consists of a single grapheme cluster. But they each take a different number of Runes to encode.

The string "a" consists of one Rune:

  • U+0061 LATIN SMALL LETTER A

The string "é" consists of one Rune:

  • U+00E9 LATIN SMALL LETTER E WITH ACUTE

Alternatively, "é" could instead be written to consist of two Runes:

  • U+0065 LATIN SMALL LETTER E
  • U+0301 COMBINING ACUTE ACCENT

Finally, "👩🏽‍🚒" is represented as four Runes:

  • U+1F469 WOMAN
  • U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
  • U+200D ZERO WIDTH JOINER
  • U+1F692 FIRE ENGINE

In this particular example, querying "👩🏽‍🚒".Length returns 7 because the string consists of 7 chars, which are the UTF-16 encoding of the 4 Runes listed above.

In some of the above samples - such as the combining accent modifier or the skin tone modifier - the code point does not display as a standalone element on the screen. Rather, it serves to modify the appearance of an element that came before it. These examples show that a Rune can be thought of as a basic building block of a piece of text, but it might take multiple Runes to contribute to what we think of as a logical singular "character" (grapheme cluster).

If you're interested in enumerating the grapheme clusters of a string instance, you can do so via the StringInfo class. For developers familiar with Swift, .NET's StringInfo type is conceptually similar to Swift's Character type. For technical information on how grapheme cluster boundaries are determined, see the Unicode Standard Annex #29, Section 3.

// This method demonstrates counting the number of display characters in a string.
// In the Unicode technical documentation, these are called "grapheme clusters".
// .NET refers to these as "text elements".
public static int CountTextElements(string s)
{
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);

    int textElementCount = 0;
    while (enumerator.MoveNext())
    {
        textElementCount++;
    }

    return textElementCount;
}

int charCount = "👩🏽‍🚒".Length;                    // returns 7
int runeCount = "👩🏽‍🚒".EnumerateRunes().Count();  // returns 4
int textElementCount = CountTextElements("👩🏽‍🚒"); // returns 1

The above sample will only return the correct values on .NET 5 and later. The .NET Framework and previous versions of .NET Core have older implementations of the StringInfo class which do not fully handle grapheme clusters correctly. See https://github.com/dotnet/corefx/issues/41324 for more information.

Where did the term "rune" come from?

The term dates back to the creation of UTF-8 in 1992 by Rob Pike and Ken Thompson. The pair were looking for a term to describe what would eventually become known as a code point. With no prior art to go by, they settled on the term "rune". That became the de facto term used throughout the Plan 9 operating system code base, on which they both worked at the time. The term has stuck due to historical precedent within UTF-8 and due to Rob Pike's later influence over the Go programming language, which repopularized the term amongst developers.

The Unicode Standard does not define or recognize the term "rune". The term "rune" is not to be confused with the Runic range (U+16A0..U+16FF) defined by the Unicode Standard.

Is a .NET Rune the same as a Go rune?

No. In .NET, the System.Text.Rune type corresponds precisely to a Unicode scalar value. This means that it is impossible for a Rune instance to contain a value that is not a legal Unicode scalar value.

// The line below will throw an exception since 0x00ABCDEF is not
// a valid Unicode scalar value.
Rune rune = new Rune(0x00ABCDEF);

In Go, the rune type is an alias for int32. A Go rune is intended to represent a Unicode code point, but in practice it can contain any 32-bit value, including values which are not legal Unicode code points.

For similar types in other programming languages, see Rust's primitive char type or Swift's Unicode.Scalar type, both of which represent Unicode scalar values. They are similar to .NET's Rune type in terms of functionality and in that they disallow construction around values which are not legal Unicode scalar values.

Is a .NET Rune the same as a Unicode code point?

No. A Unicode code point is a value in the range U+0000..U+10FFFF. A Rune is a Unicode scalar value and is any Unicode code point except the range U+D800..U+DFFF. The set of all valid Rune instances is therefore a strict subset of all valid Unicode code points.

In practice, what this means is that while every char value is a valid Unicode code point, not every char value is a valid Rune. This is because a char could represent a standalone high surrogate or a standalone low surrogate code point, while a Rune cannot.

Console.WriteLine(char.GetUnicodeCategory('\ud800')); // prints "Surrogate"

Since a Rune can never represent a surrogate code point (see next question), APIs like Rune.GetUnicodeCategory will never return Surrogate as in the above example.

What is the relationship between a surrogate code point and a supplementary code point?

A supplementary code point is any code point in the range U+10000..U+10FFFF. In UTF-8, supplementary code points are encoded as 4 code units. For example, the code point U+1F970 (the emoji character 🥰) is encoded in UTF-8 as [ F0 9F A5 B0 ]. In UTF-32, this would be encoded as [ 0001F970 ], as all scalar values (including supplementary code points) are encoded as 1 code unit each under UTF-32.

In UTF-16, supplementary code points are encoded as 2 code units: a high surrogate code point (U+D800..U+DBFF) followed by a low surrogate code point (U+DC00..U+DFFF). Collectively, this range U+D800..U+DFFF is referred to as the surrogate code point range. Continuing the previous example, the supplementary code point U+1F970 would be encoded in UTF-16 as [ D83E DD70 ].
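
Going the other direction, the surrogate pair can be recovered from a supplementary code point by inverting the formula shown earlier in this document.

hi = ((actual - 0x10000) / 0x0400) + 0xD800
lo = ((actual - 0x10000) % 0x0400) + 0xDC00

For U+1F970, this gives:

hi = ((0x1F970 - 0x10000) / 0x0400) + 0xD800 = 0x003E + 0xD800 = 0xD83E
lo = ((0x1F970 - 0x10000) % 0x0400) + 0xDC00 = 0x0170 + 0xDC00 = 0xDD70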

Surrogate code points are only allowed within UTF-16 text, and only when they occur in pairs of a high surrogate code point immediately followed by a low surrogate code point. (Back-to-back sequences such as "high-low-high-low" are also legal.)

Surrogate code points cannot be represented in UTF-8 or UTF-32. Because of this, surrogate code points are not legal Unicode scalar values. In fact, surrogate code points are the only code points which are not legal Unicode scalar values.

What does it mean for UTF-16 data to be "well-formed"?

In Unicode, a well-formed string is a string that can be decoded unambiguously and without error into a sequence of Unicode scalar values. This also means that such a string can be transcoded freely back and forth between UTF-8, UTF-16, and UTF-32.

For example, consider a string containing the Spanish phrase "¡Hola!" ("Hello!"). Including the leading and trailing punctuation, there are 6 characters. This would be encoded as follows.

         ¡        H        o        l        a        !     (characters)
[   U+00A1   U+0048   U+006F   U+006C   U+0061   U+0021 ]   (scalars)
[    C2 A1       48       6F       6C       61       21 ]   UTF-8
[     00A1     0048     006F     006C     0061     0021 ]   UTF-16
[ 000000A1 00000048 0000006F 0000006C 00000061 00000021 ]   UTF-32

The three samples above are "well-formed" because they follow the rules for UTF-8, UTF-16, and UTF-32 encoding, and a sequence of Unicode scalar values can be extracted from it.

If there is any data that cannot be decoded from the string into a sequence of Unicode scalar values, we say that the string is "ill-formed". The three samples below show some examples of this.

  • In UTF-8, the sequence [ 6C C2 61 ] is ill-formed because C2 cannot be followed by 61.

  • In UTF-16, the sequence [ DC00 DD00 ] (or, in C#, the string "\udc00\udd00") is ill-formed because the low surrogate DC00 cannot be followed by another low surrogate DD00.

  • In UTF-32, the sequence [ 0011ABCD ] is ill-formed because 0011ABCD is outside the range of legal Unicode scalar values.

Importantly, well-formedness does not imply that the data is contextually meaningful. For instance, the string "\uffff\u200d" is well-formed because it is a valid representation of Unicode scalar sequence [ U+FFFF U+200D ]. That particular sequence of scalars is the Unicode equivalent of gibberish, but because the string was able to be decoded to scalars at all it's considered well-formed.

Also remember: while string instances in .NET almost always contain well-formed data, they are not strictly required to be well-formed. The examples below are valid C#.

// specifying an ill-formed literal
const string s = "\ud800";

// substringing in the middle of a supplementary code point
string x = "\ud83e\udd70"; // "🥰"
string y = x.Substring(1, 1); // "\udd70" standalone low surrogate

In general, it is very rare for there to be ill-formed string instances in .NET applications. APIs like Encoding.UTF8.GetString will never return ill-formed string instances. The presence of ill-formed string instances is almost always a library bug (such as a mistake in a network deserializer) or due to calling string.Substring with incorrect arguments, as in the previous example.

Methods like Encoding.GetString and Encoding.GetBytes detect ill-formed sequences in the input and perform character substitution when generating the output. For example, if Encoding.ASCII.GetString(byte[]) sees a non-ASCII byte in the input, it will insert a '?' into the returned string instance. Encoding.UTF8.GetString(byte[]) will replace ill-formed UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER ('�') in the returned string instance. For the technical details of how ill-formed sequences are detected and how the substitution is performed, see the Unicode Standard, Sections 5.22 and 3.9.
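
As a small sketch of this substitution behavior, using the default Encoding.UTF8 instance and the ill-formed UTF-8 sequence from the earlier example:

byte[] illFormedUtf8 = new byte[] { 0x6C, 0xC2, 0x61 }; // [ 6C C2 61 ] from the example above
string decoded = Encoding.UTF8.GetString(illFormedUtf8);
// decoded = "l\ufffda" ("l�a"): the truncated C2 lead byte was replaced with U+FFFD,
// and the surrounding well-formed bytes were decoded normally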

The built-in Encoding classes can also be configured to throw an exception rather than perform character substitution when ill-formed sequences are seen. This is often used in security-sensitive applications where character substitution might not be acceptable.

byte[] utf8Bytes = ReadFromNetwork();
UTF8Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
string asString = encoding.GetString(utf8Bytes); // will throw if 'utf8Bytes' is ill-formed
@daveaglick commented Nov 14, 2019

This was excellent - best article I've read about .NET Runes all week. I actually don't have a lot of feedback because it was so informative and complete, but did make a few notes as I was reading through:

  • I wonder if a simple graphic like a horizontal bar showing the full Unicode range and delineating BMP code points, surrogate pairs, the range of a char, supplementary code points, etc. would help. At least that’s how I ended up visualizing everything as I was reading.
  • In what situations would I encounter surrogate pairs as opposed to a full supplementary Unicode scalar value? I.e., if I open a Unicode-encoded text file and read it into a string, does .NET “convert” the supplementary Unicode scalar values outside the range of char into surrogate pairs?
  • I wrote that last question before getting to the section about UTF encoding - now I’m wondering what the relationship is between surrogate pairs and UTF code units if they’re both intended to represent a 32-bit Unicode scalar value in 16-bit space. Why have both abstractions? How do they relate?
@GrabYourPitchforks (author) commented Nov 22, 2019

Thanks @daveaglick for the comments! I'm not quite sure how to work them into the document just yet, so I'll at least drop answers here so that we can consider this for future drafts.

I wonder if a simple graphic like a horizontal bar showing the full Unicode range and delineating BMP code points, surrogate pairs, the range of a char, supplementary code points, etc. would help. At least that’s how I ended up visualizing everything as I was reading.

Oh, definitely. Wikipedia has a similar graphic, and even some of the Unicode Standard (such as Ch. 3, Table 3-5) has useful tables and diagrams that would be good to pull in. When I ran the concept of this article by the docs team a while back I had mentioned that I'd need help creating diagrams. They hopefully have more talent than I do as far as these things go. :)

In what situations would I encounter surrogate pairs as opposed to a full supplementary Unicode scalar value? I.e., if I open a Unicode-encoded text file and read it into a string, does .NET “convert” the supplementary Unicode scalar values outside the range of char into surrogate pairs?

In the string type, you'll always see surrogates instead of supplementary characters. This is a consequence of char being a 16-bit data type, so it can't represent any numeric value beyond 65,535 (0xFFFF). When reading a file from disk, such as via File.ReadAllText, the runtime will attempt to automatically determine the UTF-* encoding that was used to save the file. By default, we assume the file was saved using UTF-8, but if there's a special marker at the beginning of the file stating that a different UTF-* encoding was used we'll honor that marker instead. Under the covers, what's happening is that the runtime is going through the file, decoding individual Unicode scalar values (Runes) from the file contents. These Rune instances are then essentially concatenated together and turned into a single string. When generating this final string, any Rune instances that are within the BMP range U+0000..U+FFFF and which can be represented as a single char will remain a single char in the returned string. Any Rune instances that are within the supplementary range U+10000..U+10FFFF will get exploded into two chars - a UTF-16 surrogate pair - and this pair will be present in the returned string.

If you wanted to see this in practice for yourself, check out the Rune.ToString method. For BMP Runes, this method returns a single-char string. For supplementary Runes, this method returns a two-char string whose elements are the UTF-16 surrogate pair.

Logically, this means that to form a string from a sequence of Rune values, it's equivalent to call Rune.ToString on each value and to concatenate the intermediate results together into a final result.

Example:

Rune[] runes = new Rune[3]
{
    new Rune('I'),
    new Rune('\ud83d', '\ude18'), // U+1F618 FACE THROWING A KISS (😘)
    new Rune('U')
};

string a = runes[0].ToString(); // = "I"
string b = runes[1].ToString(); // = "😘" = "\ud83d\ude18" (surrogate pair)
string c = runes[2].ToString(); // = "U"

string concated = string.Concat(runes); // = "I😘U"

I wrote that last question before getting to the section about UTF encoding - now I’m wondering what the relationship is between surrogate pairs and UTF code units if they’re both intended to represent a 32-bit Unicode scalar value in 16-bit space. Why have both abstractions? How do they relate?

Any Unicode string requires some in-memory representation. For UTF-8 Unicode strings, the "string" is a sequence of 8-bit elements. For UTF-16 Unicode strings, it's a sequence of 16-bit elements. And for UTF-32, it's a sequence of 32-bit elements. These elements are code units. They're primarily useful for thinking of a "string" as a contiguous in-memory block of data, and you would index into the string by code units. The width of the code unit depends on the particular UTF-* encoding we're talking about.

They're also useful for determining the total size (in bytes) of the in-memory representation of the string. If a UTF-8 string consists of 17 code units, its total size is 17 bytes. If a UTF-16 string consists of 11 code units, its total size is 22 bytes. And if a UTF-32 string consists of 9 code units, its total size is 36 bytes. It's a typical totalByteSizeOf(T[] t) = t.ElementCount * sizeof(T); calculation.

(This is also the definition of char - it's the elemental type of our UTF-16 string type. Therefore a char is also a UTF-16 code unit.)

Since code units are really just arbitrary integers of a given width, they can't always be treated as scalar values. Consider what was outlined earlier in this document: a single char (UTF-16 code unit) might not be sufficient to represent a full Unicode scalar value. Similarly, since a code unit could have any integer value of a given width, there's no guarantee that it's well-formed. For example, the byte 0xFF is an 8-bit code unit, but the byte 0xFF can never appear anywhere in well-formed UTF-8 text. That byte is always forbidden. Similarly, the value 0xDEADBEEF is a 32-bit code unit, but it can never appear anywhere in well-formed UTF-32 text.

A scalar value (Rune) is guaranteed to exist in the Unicode code space and is guaranteed not to be a reserved UTF-16 surrogate code point. This means that there's a precise, unambiguous, and lossless mapping from Rune to any given UTF-* code unit sequence. It also means that if you can successfully create a Rune instance from a given UTF-* code unit sequence, that code unit sequence was well-formed, and you can then query the Rune for properties about the data it represents. This ability to convert to/from anything and to query it for information makes it a substantially powerful API.

@GrabYourPitchforks (author) commented Nov 23, 2019

I'm also trying to work an "aBOMination" pun in here somewhere, but as of yet to no avail.

@ufcpp commented Dec 7, 2019

3 years ago, I wrote an article about Unicode history (Unicode itself and .NET characters) in Japanese. Diagrams/illustrations in the article are drawn by using PowerPoint. I hope this pptx helps you.

@Serentty commented Jan 17, 2020

I think this is a very good write-up. There's one aspect that I disagree with however, and that's the recommendation to use char instead when you're sure that the character will be representable as a single UTF-16 code unit. I think this is an unnecessary complication to the mental model, and also makes it harder to switch the backing encoding of a string (say, to a Utf8String) without breaking code. I think that going forward, it makes more sense to avoid treating char as an entire character, even when it is known to be. When searching for a character in a string, users shouldn't have to look up whether or not that character is in the BMP when it is simpler to just use Rune.
