GrabYourPitchforks/using_rune.md

Last active March 28, 2021 01:44

Using Rune

This article has moved to the official .NET Docs site.

See https://docs.microsoft.com/dotnet/standard/base-types/character-encoding-introduction.

Author

GrabYourPitchforks commented Nov 22, 2019

Thanks @daveaglick for the comments! I'm not quite sure how to work them into the document just yet, so I'll at least drop answers here so that we can consider this for future drafts.

> I wonder if a simple graphic like a horizontal bar showing the full Unicode range and delineating BMP code points, surrogate pairs, the range of a char, supplementary code points, etc. would help. At least that’s how I ended up visualizing everything as I was reading.

Oh, definitely. Wikipedia has a similar graphic, and even some of the Unicode Standard (such as Ch. 3, Table 3-5) has useful tables and diagrams that would be good to pull in. When I ran the concept of this article by the docs team a while back I had mentioned that I'd need help creating diagrams. They hopefully have more talent than I do as far as these things go. :)

> In what situations would I encounter surrogate pairs as opposed to a full supplementary Unicode scalar value? I.e., if I open a Unicode-encoded text file and read it into a string, does .NET “convert” the supplementary Unicode scalar values outside the range of char into surrogate pairs?

In the string type, you'll always see surrogates instead of supplementary characters. This is a consequence of char being a 16-bit data type: it can't represent any numeric value beyond 65,535 (0xFFFF). When reading a file from disk, such as via File.ReadAllText, the runtime will attempt to automatically determine the UTF-* encoding that was used to save the file. By default, we assume the file was saved using UTF-8, but if there's a special marker (a byte order mark, or BOM) at the beginning of the file stating that a different UTF-* encoding was used, we'll honor that marker instead.

Under the covers, the runtime goes through the file, decoding individual Unicode scalar values (Runes) from the file contents. These Rune instances are then essentially concatenated together and turned into a single string. Any Rune in the BMP range U+0000..U+FFFF can be represented as a single char and remains a single char in the returned string. Any Rune in the supplementary range U+10000..U+10FFFF gets split into two chars, a UTF-16 surrogate pair, and that pair appears in the returned string.
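To make the decoding step concrete, here is a minimal sketch. It assumes `Encoding.UTF8.GetString` on an in-memory byte array is a fair stand-in for reading a UTF-8 file from disk; the byte values are the UTF-8 encoding of "I😘U".

```csharp
using System;
using System.Text;

// UTF-8 bytes for "I😘U": U+0049 (1 byte), U+1F618 (4 bytes), U+0055 (1 byte).
byte[] utf8 = { 0x49, 0xF0, 0x9F, 0x98, 0x98, 0x55 };

// Decoding produces a UTF-16 string; the supplementary scalar value becomes two chars.
string text = Encoding.UTF8.GetString(utf8);

Console.WriteLine(text.Length);                   // 4 chars, even though there are 3 scalar values
Console.WriteLine(char.IsHighSurrogate(text[1])); // True (0xD83D)
Console.WriteLine(char.IsLowSurrogate(text[2]));  // True (0xDE18)

// EnumerateRunes walks the string one scalar value at a time.
int runeCount = 0;
foreach (Rune rune in text.EnumerateRunes())
{
    Console.WriteLine($"U+{rune.Value:X4}");      // U+0049, U+1F618, U+0055
    runeCount++;
}
```

Indexing the string by char sees the two halves of the pair separately, while enumerating by Rune recovers the original scalar value.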

If you want to see this in practice, check out the Rune.ToString method. For BMP Runes, this method returns a single-char string. For supplementary Runes, it returns a two-char string whose elements are the UTF-16 surrogate pair.

Logically, this means that forming a string from a sequence of Rune values is equivalent to calling Rune.ToString on each value and concatenating the results.

Example:

```csharp
using System.Text;

Rune[] runes = new Rune[3]
{
    new Rune('I'),
    new Rune('\ud83d', '\ude18'), // U+1F618 FACE THROWING A KISS (😘)
    new Rune('U')
};

string a = runes[0].ToString(); // = "I"
string b = runes[1].ToString(); // = "😘" = "\ud83d\ude18" (surrogate pair)
string c = runes[2].ToString(); // = "U"

// string.Concat calls ToString on each element and joins the results.
string concatenated = string.Concat(runes); // = "I😘U"
```

> I wrote that last question before getting to the section about UTF encoding - now I’m wondering what the relationship is between surrogate pairs and UTF code units if they’re both intended to represent a 32-bit Unicode scalar value in 16-bit space. Why have both abstractions? How do they relate?

Any Unicode string requires some in-memory representation. For UTF-8 Unicode strings, the "string" is a sequence of 8-bit elements. For UTF-16 Unicode strings, it's a sequence of 16-bit elements. And for UTF-32, it's a sequence of 32-bit elements. These elements are code units. They're primarily useful for thinking of a "string" as a contiguous in-memory block of data, and you would index into the string by code units. The width of the code unit depends on the particular UTF-* encoding we're talking about.

They're also useful for determining the total size (in bytes) of the in-memory representation of the string. If a UTF-8 string consists of 17 code units, its total size is 17 bytes. If a UTF-16 string consists of 11 code units, its total size is 22 bytes. And if a UTF-32 string consists of 9 code units, its total size is 36 bytes. It's a typical totalByteSizeOf(T[] t) = t.ElementCount * sizeof(T); calculation.
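As a rough illustration of that size calculation, the sketch below asks the built-in `Encoding` classes how many bytes the same three-scalar string occupies in each encoding. The sample string is an assumption chosen for illustration; `GetByteCount` does not include any byte order mark.

```csharp
using System;
using System.Text;

string s = "I\ud83d\ude18U"; // "I😘U": three scalar values

int utf8Bytes  = Encoding.UTF8.GetByteCount(s);    // 6 code units * 1 byte each
int utf16Bytes = Encoding.Unicode.GetByteCount(s); // 4 code units * 2 bytes each
int utf32Bytes = Encoding.UTF32.GetByteCount(s);   // 3 code units * 4 bytes each

Console.WriteLine($"{utf8Bytes} {utf16Bytes} {utf32Bytes}"); // 6 8 12

// For the UTF-16 string type, the same number falls out of Length * sizeof(char).
int viaLength = s.Length * sizeof(char);           // 4 * 2 = 8
```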

(This is also the definition of char - it's the elemental type of our UTF-16 string type. Therefore a char is also a UTF-16 code unit.)

Since code units are really just arbitrary integers of a given width, they can't always be treated as scalar values. Consider what was outlined earlier in this document: a single char (UTF-16 code unit) might not be sufficient to represent a full Unicode scalar value. Similarly, since a code unit could have any integer value of a given width, there's no guarantee that it's well-formed. For example, the byte 0xFF is an 8-bit code unit, but the byte 0xFF can never appear anywhere in well-formed UTF-8 text. That byte is always forbidden. Similarly, the value 0xDEADBEEF is a 32-bit code unit, but it can never appear anywhere in well-formed UTF-32 text.
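The three ill-formed cases above can be checked with the Rune APIs. This is only a sketch; the variable names are made up for illustration.

```csharp
using System;
using System.Buffers;
using System.Text;

// 0xFF can never appear in well-formed UTF-8, so decoding reports invalid data.
byte[] badUtf8 = { 0xFF };
OperationStatus status = Rune.DecodeFromUtf8(badUtf8, out Rune decoded, out int bytesConsumed);
Console.WriteLine(status);                        // InvalidData
Console.WriteLine(decoded == Rune.ReplacementChar); // True: U+FFFD is handed back instead

// 0xDEADBEEF is outside the Unicode code space (max U+10FFFF), so no Rune exists for it.
bool validUtf32 = Rune.TryCreate(0xDEADBEEF, out Rune fromUtf32);
Console.WriteLine(validUtf32);                    // False

// A lone surrogate char is a perfectly good UTF-16 code unit but not a scalar value.
bool validLoneSurrogate = Rune.TryCreate('\ud83d', out Rune fromLoneSurrogate);
Console.WriteLine(validLoneSurrogate);            // False
```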

A scalar value (Rune) is guaranteed to exist in the Unicode code space and is guaranteed not to be a reserved UTF-16 surrogate code point. This means that there's a precise, unambiguous, and lossless mapping from Rune to any given UTF-* code unit sequence. It also means that if you can successfully create a Rune instance from a given UTF-* code unit sequence, that code unit sequence was well-formed, and you can then query the Rune for properties about the data it represents. This ability to convert to and from any encoding, and to query the value for information, is what makes Rune such a powerful API.
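A small sketch of that convert-and-query ability, using the same emoji as earlier in this thread; the variable names are illustrative only.

```csharp
using System;
using System.Text;

Rune kiss = new Rune(0x1F618); // U+1F618 FACE THROWING A KISS

// Query the scalar value for facts about itself.
Console.WriteLine(kiss.IsBmp);               // False
Console.WriteLine(kiss.Plane);               // 1
Console.WriteLine(kiss.Utf8SequenceLength);  // 4
Console.WriteLine(kiss.Utf16SequenceLength); // 2

// Encode to UTF-8, then decode again: the mapping is lossless.
byte[] utf8 = new byte[kiss.Utf8SequenceLength];
int written = kiss.EncodeToUtf8(utf8);
Rune.DecodeFromUtf8(utf8, out Rune roundTripped, out _);
bool same = roundTripped == kiss;
Console.WriteLine(same);                     // True
```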

Author

GrabYourPitchforks commented Nov 23, 2019

I'm also trying to work an "aBOMination" pun in here somewhere, but as of yet to no avail.

ufcpp commented Dec 7, 2019

graphic

Three years ago, I wrote an article in Japanese about Unicode history (both Unicode itself and .NET characters). The diagrams and illustrations in the article were drawn in PowerPoint. I hope this pptx helps you.

Serentty commented Jan 17, 2020 (edited)

I think this is a very good write-up. There's one aspect I disagree with, however: the recommendation to use char instead when you're sure that the character will be representable as a single UTF-16 code unit. I think this is an unnecessary complication to the mental model, and it also makes it harder to switch the backing encoding of a string (say, to a Utf8String) without breaking code. I think that going forward, it makes more sense to avoid treating char as an entire character, even when it is known to be one. When searching for a character in a string, users shouldn't have to look up whether or not that character is in the BMP when it is simpler to just use Rune.
