This article has moved to the official .NET Docs site.
See https://docs.microsoft.com/dotnet/standard/base-types/character-encoding-introduction.
Thanks @daveaglick for the comments! I'm not quite sure how to work them into the document just yet, so I'll at least drop answers here so that we can consider this for future drafts.
I wonder if a simple graphic like a horizontal bar showing the full Unicode range and delineating BMP code points, surrogate pairs, the range of a char, supplementary code points, etc. would help. At least that’s how I ended up visualizing everything as I was reading.
Oh, definitely. Wikipedia has a similar graphic, and even some of the Unicode Standard (such as Ch. 3, Table 3-5) has useful tables and diagrams that would be good to pull in. When I ran the concept of this article by the docs team a while back I had mentioned that I'd need help creating diagrams. They hopefully have more talent than I do as far as these things go. :)
In what situations would I encounter surrogate pairs as opposed to a full supplementary Unicode scalar value? I.e., if I open a Unicode-encoded text file and read it into a string, does .NET “convert” the supplementary Unicode scalar values outside the range of char into surrogate pairs?
In the string type, you'll always see surrogates instead of supplementary characters. This is a consequence of char being a 16-bit data type: it can't represent any numeric value beyond 65,535 (0xFFFF).

When reading a file from disk, such as via File.ReadAllText, the runtime will attempt to automatically determine the UTF-* encoding that was used to save the file. By default, we assume the file was saved using UTF-8, but if there's a special marker at the beginning of the file stating that a different UTF-* encoding was used, we'll honor that marker instead. Under the covers, what's happening is that the runtime goes through the file, decoding individual Unicode scalar values (Runes) from the file contents. These Rune instances are then essentially concatenated together and turned into a single string. When generating this final string, any Rune instance that is within the BMP range U+0000..U+FFFF and can be represented as a single char will remain a single char in the returned string. Any Rune instance that is within the supplementary range U+10000..U+10FFFF will get exploded into two chars - a UTF-16 surrogate pair - and this pair will be present in the returned string.
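To illustrate (this snippet is my own sketch, not part of the original answer): decoding the UTF-8 bytes for U+1F618 FACE THROWING A KISS yields a string whose two chars form a surrogate pair.

using System;
using System.Text;

byte[] utf8Bytes = { 0xF0, 0x9F, 0x98, 0x98 };  // the UTF-8 encoding of U+1F618 (😘)
string s = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(s.Length);                    // 2 - two UTF-16 code units
Console.WriteLine(((int)s[0]).ToString("X4"));  // D83D - high surrogate
Console.WriteLine(((int)s[1]).ToString("X4"));  // DE18 - low surrogate
Console.WriteLine(char.IsHighSurrogate(s[0]));  // True
Console.WriteLine(char.IsLowSurrogate(s[1]));   // True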
If you wanted to see this in practice for yourself, check out the Rune.ToString method. For BMP Runes, this method returns a single-char string. For supplementary Runes, it returns a two-char string whose elements are the UTF-16 surrogate pair.
Logically, this means that forming a string from a sequence of Rune values is equivalent to calling Rune.ToString on each value and concatenating the intermediate results into a final result.
Example:
Rune[] runes = new Rune[3]
{
new Rune('I'),
new Rune('\ud83d', '\ude18'), // U+1F618 FACE THROWING A KISS (😘)
new Rune('U')
};
string a = runes[0].ToString(); // = "I"
string b = runes[1].ToString(); // = "😘" = "\ud83d\ude18" (surrogate pair)
string c = runes[2].ToString(); // = "U"
string concated = string.Concat(runes); // = "I😘U"
I wrote that last question before getting to the section about UTF encoding - now I’m wondering what the relationship is between surrogate pairs and UTF code units if they’re both intended to represent a 32-bit Unicode scalar value in 16-bit space. Why have both abstractions? How do they relate?
Any Unicode string requires some in-memory representation. For UTF-8 Unicode strings, the "string" is a sequence of 8-bit elements. For UTF-16 Unicode strings, it's a sequence of 16-bit elements. And for UTF-32, it's a sequence of 32-bit elements. These elements are code units. They're primarily useful for thinking of a "string" as a contiguous in-memory block of data, and you would index into the string by code units. The width of the code unit depends on the particular UTF-* encoding we're talking about.
They're also useful for determining the total size (in bytes) of the in-memory representation of the string. If a UTF-8 string consists of 17 code units, its total size is 17 bytes. If a UTF-16 string consists of 11 code units, its total size is 22 bytes. And if a UTF-32 string consists of 9 code units, its total size is 36 bytes. It's a typical totalByteSizeOf(T[] t) = t.ElementCount * sizeof(T); calculation.
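As a quick sketch of that calculation (my own example, not from the original comment), the built-in Encoding classes make the code-unit-to-byte relationship easy to see:

using System;
using System.Text;

string text = "I😘U";  // three scalar values: U+0049, U+1F618, U+0055

int utf8Bytes  = Encoding.UTF8.GetByteCount(text);   // 6  bytes  (6 code units * 1 byte each)
int utf16Bytes = text.Length * sizeof(char);         // 8  bytes  (4 code units * 2 bytes each)
int utf32Bytes = Encoding.UTF32.GetByteCount(text);  // 12 bytes  (3 code units * 4 bytes each)

Console.WriteLine((utf8Bytes, utf16Bytes, utf32Bytes));  // (6, 8, 12)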
(This is also the definition of char - it's the elemental type of our UTF-16 string type. Therefore a char is also a UTF-16 code unit.)
Since code units are really just arbitrary integers of a given width, they can't always be treated as scalar values. Consider what was outlined earlier in this document: a single char (UTF-16 code unit) might not be sufficient to represent a full Unicode scalar value. Similarly, since a code unit could have any integer value of a given width, there's no guarantee that it's well-formed. For example, the byte 0xFF is an 8-bit code unit, but the byte 0xFF can never appear anywhere in well-formed UTF-8 text. That byte is always forbidden. Similarly, the value 0xDEADBEEF is a 32-bit code unit, but it can never appear anywhere in well-formed UTF-32 text.
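Here's a small illustration of that (a sketch of mine, not part of the original reply), using Rune.DecodeFromUtf8 to show how the decoder treats a forbidden byte:

using System;
using System.Buffers;
using System.Text;

byte[] invalidUtf8 = { 0xFF };  // 0xFF can never appear in well-formed UTF-8

OperationStatus status = Rune.DecodeFromUtf8(invalidUtf8, out Rune decoded, out int bytesConsumed);

Console.WriteLine(status);                           // InvalidData
Console.WriteLine(bytesConsumed);                    // 1 - the single invalid byte
Console.WriteLine(decoded == Rune.ReplacementChar);  // True - the decoder hands back U+FFFD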
A scalar value (Rune) is guaranteed to exist in the Unicode code space and is guaranteed not to be a reserved UTF-16 surrogate code point. This means that there's a precise, unambiguous, and lossless mapping from Rune to any given UTF-* code unit sequence. It also means that if you can successfully create a Rune instance from a given UTF-* code unit sequence, that code unit sequence was well-formed, and you can then query the Rune for properties about the data it represents. This ability to convert to/from anything and to query it for information makes it a substantially powerful API.
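For example (again, a sketch of mine rather than anything from the article), Rune creation enforces those guarantees up front, and a valid Rune can then be queried and transcoded:

using System;
using System.Text;

Console.WriteLine(Rune.TryCreate(0xD800, out _));    // False - a lone surrogate code point is rejected
Console.WriteLine(Rune.TryCreate(0x110000, out _));  // False - outside the Unicode code space

Rune kiss = new Rune(0x1F618);                       // U+1F618 FACE THROWING A KISS
Console.WriteLine(Rune.GetUnicodeCategory(kiss));    // OtherSymbol
Console.WriteLine(kiss.Utf16SequenceLength);         // 2 (chars when encoded as UTF-16)
Console.WriteLine(kiss.Utf8SequenceLength);          // 4 (bytes when encoded as UTF-8)

Span<byte> utf8 = stackalloc byte[kiss.Utf8SequenceLength];
int bytesWritten = kiss.EncodeToUtf8(utf8);          // writes F0 9F 98 98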
I'm also trying to work an "aBOMination" pun in here somewhere, but as of yet to no avail.
Three years ago, I wrote an article about Unicode history (Unicode itself and .NET characters) in Japanese. The diagrams/illustrations in the article were drawn using PowerPoint. I hope this pptx helps you.
I think this is a very good write-up. There's one aspect that I disagree with, however, and that's the recommendation to use char instead of Rune when you're sure that the character will be representable as a single UTF-16 code unit. I think this is an unnecessary complication to the mental model, and it also makes it harder to switch the backing encoding of a string (say, to a Utf8String) without breaking code. I think that going forward, it makes more sense to avoid treating char as an entire character, even when it is known to be one. When searching for a character in a string, users shouldn't have to look up whether or not that character is in the BMP when it is simpler to just use Rune.
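To make that suggestion concrete, here's a hypothetical helper (my own sketch, including the ContainsRune name) that searches by Rune so the caller never has to care whether the target is in the BMP:

using System;
using System.Text;

static bool ContainsRune(string text, Rune target)
{
    foreach (Rune r in text.EnumerateRunes())
    {
        if (r == target)
            return true;
    }
    return false;
}

Console.WriteLine(ContainsRune("I😘U", new Rune('I')));      // True - a BMP character
Console.WriteLine(ContainsRune("I😘U", new Rune(0x1F618)));  // True - a supplementary character, same call shape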
This was excellent - the best article I've read about .NET Runes all week. I actually don't have a lot of feedback because it was so informative and complete, but I did make a few notes as I was reading through: