This article has moved to the official .NET Docs site.
See https://docs.microsoft.com/dotnet/standard/base-types/character-encoding-introduction.
I'm also trying to work an "aBOMination" pun in here somewhere, but so far to no avail.
Three years ago, I wrote an article in Japanese about the history of Unicode (both Unicode itself and .NET characters). The diagrams/illustrations in the article were drawn using PowerPoint. I hope this pptx helps you.
I think this is a very good write-up. There's one aspect that I disagree with however, and that's the recommendation to use `char` instead when you're sure that the character will be representable as a single UTF-16 code unit. I think this is an unnecessary complication to the mental model, and it also makes it harder to switch the backing encoding of a string (say, to a `Utf8String`) without breaking code. I think that going forward, it makes more sense to avoid treating `char` as an entire character, even when it is known to be one. When searching for a character in a string, users shouldn't have to look up whether or not that character is in the BMP when it is simpler to just use `Rune`.
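To illustrate the point, here's a small sketch (the sample string and search targets are arbitrary, not from the write-up): searching by `char` only works when the target fits in a single UTF-16 code unit, while searching by `Rune` works identically for BMP and supplementary characters.

```csharp
using System;
using System.Text;

string s = "I ❤ .NET 😀";

// Searching by char works only when the target is a single UTF-16 code unit.
// '❤' (U+2764) is in the BMP, so a char literal exists for it.
Console.WriteLine(s.Contains('❤'));  // True
// '😀' (U+1F600) is supplementary - it can't even be written as a char literal.

// Searching by Rune works uniformly, regardless of which plane the target is in.
Rune heart = new Rune(0x2764);
Rune smiley = new Rune(0x1F600);
bool foundHeart = false, foundSmiley = false;
foreach (Rune r in s.EnumerateRunes())
{
    if (r == heart) foundHeart = true;
    if (r == smiley) foundSmiley = true;
}
Console.WriteLine(foundHeart);   // True
Console.WriteLine(foundSmiley);  // True
```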
Thanks @daveaglick for the comments! I'm not quite sure how to work them into the document just yet, so I'll at least drop answers here so that we can consider this for future drafts.
Oh, definitely. Wikipedia has a similar graphic, and even the Unicode Standard itself (such as Ch. 3, Table 3-5) has useful tables and diagrams that would be good to pull in. When I ran the concept of this article by the docs team a while back, I mentioned that I'd need help creating diagrams. They hopefully have more talent than I do as far as these things go. :)
In the `string` type, you'll always see surrogates instead of supplementary characters. This is a consequence of `char` being a 16-bit data type, so it can't represent any numeric value beyond 65,535 (`0xFFFF`).

When reading a file from disk, such as via `File.ReadAllText`, the runtime will attempt to automatically determine the UTF-* encoding that was used to save the file. By default, we assume the file was saved using UTF-8, but if there's a special marker at the beginning of the file (a byte order mark, or BOM) stating that a different UTF-* encoding was used, we'll honor that marker instead. Under the covers, what's happening is that the runtime is going through the file, decoding individual Unicode scalar values (`Rune`s) from the file contents. These `Rune` instances are then essentially concatenated together and turned into a single `string`. When generating this final `string`, any `Rune` instances that are within the BMP range `U+0000..U+FFFF` and which can be represented as a single `char` will remain a single `char` in the returned `string`. Any `Rune` instances that are within the supplementary range `U+10000..U+10FFFF` will get exploded into two `char`s - a UTF-16 surrogate pair - and this pair will be present in the returned `string`.

If you want to see this in practice for yourself, check out the `Rune.ToString` method. For BMP `Rune`s, this method returns a single-`char` `string`. For supplementary `Rune`s, this method returns a two-`char` `string` whose elements are the UTF-16 surrogate pair.

Logically, this means that forming a `string` from a sequence of `Rune` values is equivalent to calling `Rune.ToString` on each value and concatenating the intermediate results together into a final result.

Example:
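Here's a minimal sketch of that equivalence (the particular scalar values chosen are arbitrary):

```csharp
using System;
using System.Text;

// U+00E9 'é' is in the BMP; U+1F600 '😀' is supplementary.
Rune bmp = new Rune(0x00E9);
Rune supplementary = new Rune(0x1F600);

Console.WriteLine(bmp.ToString().Length);            // 1 (a single char)
Console.WriteLine(supplementary.ToString().Length);  // 2 (a UTF-16 surrogate pair)

// Forming a string from a Rune sequence is equivalent to
// concatenating each Rune.ToString() result.
string s = bmp.ToString() + supplementary.ToString();
Console.WriteLine(s.Length);                   // 3
Console.WriteLine(char.IsHighSurrogate(s[1])); // True
Console.WriteLine(char.IsLowSurrogate(s[2]));  // True
```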
Any Unicode string requires some in-memory representation. For UTF-8 Unicode strings, the "string" is a sequence of 8-bit elements. For UTF-16 Unicode strings, it's a sequence of 16-bit elements. And for UTF-32, it's a sequence of 32-bit elements. These elements are code units. They're primarily useful for thinking of a "string" as a contiguous in-memory block of data, and you would index into the string by code units. The width of the code unit depends on the particular UTF-* encoding we're talking about.
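For instance, you can count the code units of the same text in each encoding (a sketch using the standard `System.Text.Encoding` classes; the sample string is arbitrary):

```csharp
using System;
using System.Text;

string s = "aé😀"; // 'a' U+0061, 'é' U+00E9, '😀' U+1F600

// UTF-16 code units: the string's char count.
Console.WriteLine(s.Length);                           // 4 (U+1F600 needs a surrogate pair)

// UTF-8 code units are bytes: 1 + 2 + 4 for these three scalar values.
Console.WriteLine(Encoding.UTF8.GetByteCount(s));      // 7

// UTF-32 code units are 4 bytes each: exactly one per scalar value.
Console.WriteLine(Encoding.UTF32.GetByteCount(s) / 4); // 3
```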
They're also useful for determining the total size (in bytes) of the in-memory representation of the string. If a UTF-8 string consists of 17 code units, its total size is 17 bytes. If a UTF-16 string consists of 11 code units, its total size is 22 bytes. And if a UTF-32 string consists of 9 code units, its total size is 36 bytes. It's a typical `totalByteSizeOf(T[] t) = t.ElementCount * sizeof(T);` calculation.

(This is also the definition of `char` - it's the elemental type of our UTF-16 `string` type. Therefore a `char` is also a UTF-16 code unit.)

Since code units are really just arbitrary integers of a given width, they can't always be treated as scalar values. Consider what was outlined earlier in this document: a single `char` (UTF-16 code unit) might not be sufficient to represent a full Unicode scalar value. Similarly, since a code unit could have any integer value of a given width, there's no guarantee that it's well-formed. For example, the byte `0xFF` is an 8-bit code unit, but it can never appear anywhere in well-formed UTF-8 text. That byte is always forbidden. Similarly, the value `0xDEADBEEF` is a 32-bit code unit, but it can never appear anywhere in well-formed UTF-32 text.

A scalar value (`Rune`) is guaranteed to exist in the Unicode code space and is guaranteed not to be a reserved UTF-16 surrogate code point. This means that there's a precise, unambiguous, and lossless mapping from `Rune` to any given UTF-* code unit sequence. It also means that if you can successfully create a `Rune` instance from a given UTF-* code unit sequence, that code unit sequence was well-formed, and you can then query the `Rune` for properties about the data it represents. This ability to convert to/from anything and to query it for information makes it a substantially powerful API.
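A sketch of that round-tripping and querying (the API calls are real `System.Text.Rune` members; the input byte sequences are arbitrary examples):

```csharp
using System;
using System.Buffers;
using System.Text;

// A well-formed UTF-8 code unit sequence decodes successfully to a Rune...
byte[] good = { 0xF0, 0x9F, 0x98, 0x80 }; // UTF-8 encoding of U+1F600
OperationStatus ok = Rune.DecodeFromUtf8(good, out Rune rune, out int consumed);
Console.WriteLine(ok);                       // Done
Console.WriteLine(rune.Value.ToString("X")); // 1F600
Console.WriteLine(consumed);                 // 4

// ...which can then be queried for properties about the data it represents.
Console.WriteLine(Rune.IsLetter(rune));      // False (U+1F600 is a symbol, not a letter)

// An ill-formed sequence is rejected: 0xFF can never appear in UTF-8 text.
byte[] bad = { 0xFF };
OperationStatus err = Rune.DecodeFromUtf8(bad, out _, out _);
Console.WriteLine(err);                      // InvalidData
```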