zyedidia/unicode.md

## unicode.md

      
    Raw
  

              unicode.md
            
          
    Text Encoding

How are text files encoded, and what are the pros and cons of different
encodings? This document introduces the ASCII and Unicode encodings to answer
these questions. Some short pieces of Go code are provided which you can run
yourself to see how strings are being encoded. Go is used because it has
built-in support for Unicode, but you should be able to follow without any
knowledge of Go.
ASCII

ASCII is a character encoding which uses a 7-bit number to encode a single
character. In ASCII, we associate each character with a unique number between 0
and 127. On modern machines, each character is stored as a byte where the most
significant bit is always 0. Let's see it in action!
// Go uses UTF-8 for string encoding, but this string will be encoded in ASCII,
// trust me (and keep reading to learn why!).
hello := "Hello!"
fmt.Println([]byte(hello)) // Prints the string hello as an array of bytes
// Prints:
// [72 101 108 108 111 33]
We can deduce from this test program that the character H is encoded as 72,
e is 101, etc...
For the full ASCII table showing every character and its corresponding
encoding, see man ascii, or http://www.asciitable.com.
The ASCII encoding is very compact, with 1 byte per character, but it only has
space for 128 different characters. This is enough for digits, the English
alphabet (lowercase and uppercase), certain punctuation, and spacing/control
characters. For representing languages other than English, this is not good
enough.
Unicode

Enter Unicode, the one code to rule them all. Unicode is a standard for text
encoding and representation. The Unicode standard defines a way to represent
1,112,064 possible characters. Of these, there are roughly 140,000 characters
in use currently. The Unicode standard defines several different encodings:
UTF-32, UTF-16, and UTF-8 (and some other ones we won't discuss). UTF-8 is the
most widely used, but we'll talk about UTF-32 first.
Terminology

Things get confusing because it's no longer clear what a "character" means. In
C, a char is 1 byte. In text, a character is a symbol, and not necessarily
encoded as one byte. To make things clearer we will use the following
terminology:

A "byte" is one byte (8 bits).
A "code point" or "rune" is one of the 1,112,064 possible symbols defined by
the Unicode standard. To refer to a code point, we use the notation
U+xxxxxx, where xxxxxx is a hexadecimal number referring to that code
point's index in the overall list of code points.
A "grapheme" is one cell of text. For example, as we will see later certain
symbols such as characters with accents can be one grapheme but multiple code
points.

UTF-32

Since we only need to encode 1,112,064 possible code points, the simplest
choice is to use a 32-bit integer to encode a code point. This is clearly large
enough to encode any possible code point. In Go, a "rune" refers to a Unicode
code point, and is exactly the same as the "int32" datatype, which means an
array of runes is using a UTF-32 encoding.
Let's try it out:
symbols := "αβ"
fmt.Println([]rune(symbols)) // Prints symbols as an array of runes/code points
// Prints:
// [945 946]
Great, we can encode any Unicode code point now! Plus, it is a fixed width
encoding, so determining the location of the nth code point in a string is
simple. There are some downsides though:

Every symbol takes 4 bytes to encode, which means every file encoded in
UTF-32 will be 4 times larger than the same file encoded in ASCII (if it's
possible to encode the file in ASCII).
Any file that was encoded in ASCII before will have to be converted to
UTF-32. Likewise, UTF-32 encoded files cannot be viewed by a program that
can only decode ASCII (even if the UTF-32 file only contains code points
that can be represented by ASCII).

UTF-8

With UTF-8, we will try to fix the two downsides incurred by UTF-32. UTF-8 is a
variable-width encoding. This means that some code points are represented using
anywhere between 1 and 4 bytes of space. Let's look at how this works (using a
table from Wikipedia). The table shows how various ranges of code points are
represented in the UTF-8 encoding.


Number of bytes
First code point
Last code point
Byte 1
Byte 2
Byte 3
Byte 4


1
U+0000
U+007F
0xxxxxxx


2
U+0080
U+07FF
110xxxxx
10xxxxxx


3
U+0800
U+FFFF
1110xxxx
10xxxxxx
10xxxxxx


4
U+10000
U+10FFFF
11110xxx
10xxxxxx
10xxxxxx
10xxxxxx


What does this mean? To encode a code point from U+0000 to U+007F (0 to 127),
we just use one byte. This is also extremely convenient because it is the same
encoding as ASCII for the first 127 code points, so any ASCII file will also be
a valid UTF-8 file.
For later code points, we use prefixes on the first byte to know how many bytes
this code point will use for encoding. Since all prefixes uniquely identify the
kind of byte being encoded (e.g. first byte of a 2-byte encoded code point, or
first byte of a 3-byte encoded code point...) there is never ambiguity during
decoding about how many bytes to look at. For code points with multi-byte
encodings, each later byte begins with 10. This is to prevent problems where
starting to read the text in the middle of a code point (because of an error)
would completely throw off the decoding of later code points as well.
Let's see it in action:
symbols := "αβ" // in Go, strings are encoded in UTF-8 by default
fmt.Println([]byte(symbols))
// Prints:
// [206 177 206 178]
For the two symbols αβ, we see that bytes with values greater than 127 are
used. For a string like "Hello", we would see one byte per code point, since
all of the code points in that string are between U+0000 and U+007F.
This is great because it allows our files to be compact in the common case
(most files use English characters), and it is backwards compatible with ASCII!
The downside is that figuring out where the nth code point begins requires
decoding until that code point is found. Finding the number of code points in a
string is also an O(n) algorithm, even if the length in bytes of the string is
known.
Graphemes

One difficulty in Unicode is encoding accents. Should we make a separate code
point for every character with an accent? For example, should é, á, ´
a, e all have one unique code point? Surely they need to be encoded
differently, but dedicating a codepoint for every possible combination of
character and accent starts using up many code points. Unicode solves this with
the concept of a "combining code point." When the code point for a is
followed by the code point for ´, the accent is combined with the character
by the program that is displaying the text. This is because the code point for
the accent is defined to be a "combining" code point.
Let's look at the bytes in Go:
symbols := "X̿" // in Go, strings are encoded in UTF-8 by default
fmt.Println([]byte(symbols))
fmt.Println([]rune(symbols))
// Prints:
// [88 204 191]
// [88 831]
We convert to UTF-32 by casting to a rune array to look at each individual
code point. Here's what we can determine: this string has two code points even
though it displays as one cell, with the accent symbol above the symbol encoded
before it, and the accent symbol's code point is encoded using two bytes. Cool!
Why did I suddenly pick X̿ for my example instead of á like I was talking
about previously? This is because for á the story is more complicated: this
accent is more common, so for space efficiency it actually does get a unique
code point. This means there are two ways of encoding the character á: using
an a and a combining ´, or using the dedicated code point á. Let's see it
in Go:
combiningAccent := "á"
fmt.Println("Combining:")
fmt.Println([]byte(combiningAccent))
fmt.Println([]rune(combiningAccent))
// Prints:
// Combining:
// [97 204 129]
// [97 769]

dedicatedAccent := "á"
fmt.Println("Dedicated:")
fmt.Println([]byte(dedicatedAccent))
fmt.Println([]rune(dedicatedAccent))
// Prints:
// Dedicated:
// [195 161]
// [225]
Even though these two strings look exactly the same when displayed by your
editor, they are encoded differently! To read more about this, look up "text
normalization." If you try pasting this code into your editor, you might see
different results if your computer is automatically changing the combined
accent to the dedicated one. You can try directly printing the bytes from the
combined accent above to see the combining accent in action:
b := []byte{97, 204, 129}
fmt.Println(string(b))
// Prints:
// á
More than one combining character can be applied to form a single grapheme.
For example, here is an interesting string:
interesting := "H̼̥̯͇͙̕͘͞e̸̦̞̠̣̰͙̼̥̦̼̖̬͕͕̰̯̫͇̕ĺ̜̠̩̯̯͙̼̭̠͕̮̞͜l̶͓̫̞̮͈͞ͅo̸͔͙̳̠͈̮̼̳͙̥̲͜͠"
fmt.Printf("Number of bytes: %d\n", len(interesting))
fmt.Printf("Number of code points: %d\n", utf8.RuneCountInString(interesting))
fmt.Printf("Number of graphemes: %d\n", uniseg.GraphemeClusterCount(interesting))
// Prints:
// Number of bytes: 132
// Number of code points: 68
// Number of graphemes: 5
Wow! This string is 132 bytes, 68 code points, and only 5 graphemes! To count
the number of graphemes, I used the uniseg library, which you can find at
https://github.com/rivo/uniseg.
There are other ways characters can be marked to display differently from
normal. Some characters are "double-width" which means they take up two
text cells when displayed. For example, many chinese characters are double
width: ⾖.
UTF-16

UTF-16 is an encoding similar to UTF-8 but it uses blocks of 16 bits instead of
8 bits. Back when Unicode only supported fewer than 65536 code points, this
worked fairly well, but since then the standard has grown.  This means it
incurs most of the downsides of both UTF-32 and UTF-8. It still sees use as a
legacy encoding format because it existed before UTF-8.
Number of bytes	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
1	U+0000	U+007F	0xxxxxxx
2	U+0080	U+07FF	110xxxxx	10xxxxxx
3	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx