@fasiha
Last active December 30, 2016 01:08
A quick refresher on UTF-8 for those familiar with its general outlines, with examples

Aficionados of information theory will recognize UTF-8 as a self-synchronizing prefix-free code, where “code” means a way to encode letters of an alphabet with bits. The alphabet consists of the Unicode code points, of which there are 1,114,112 possible values (U+0 through U+10FFFF); each one is a “letter” in this alphabet.
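To make “self-synchronizing” concrete: every non-initial byte of a multi-byte sequence has the bit pattern `10xxxxxx`, so a reader dropped at an arbitrary byte offset can find the start of the current code point just by backing up past those bytes. Here is a minimal sketch of that idea. The helper names `is_continuation` and `resync` are made up for illustration, and the code assumes its input is well-formed UTF-8 and that `index` is in bounds.

```rust
/// True if `byte` is a continuation byte (bit pattern 10xxxxxx).
fn is_continuation(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

/// Back up from `index` to the first byte of the code point containing it.
/// Assumes `bytes` is well-formed UTF-8 and `index < bytes.len()`.
fn resync(bytes: &[u8], mut index: usize) -> usize {
    while index > 0 && is_continuation(bytes[index]) {
        index -= 1;
    }
    index
}
```

For the bytes of `"x月😊"` (`78`, `E6 9C 88`, `F0 9F 98 8A`), calling `resync` with offset 2 backs up to offset 1, the `E6` that starts 月. At most three steps backward are ever needed.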

Based on this highly voted Stack Overflow answer and the Base-122 writeup, here’s how UTF-8 works. If the first half-byte of a byte (the first hexadecimal digit) is:

  • between 0x0 and 0x7, this byte encodes a one-byte code point (ASCII). Seven bits are used for the actual code point, which lives in [U+0, U+7F].
  • between 0x8 and 0xB, this is a continuation byte: the second, third, or fourth byte of a multi-byte code point. It contributes six bits.
  • 0xC or 0xD (12 or 13), this is the start of a two-byte code point. 5+6 or eleven bits (of sixteen) are used for the actual code point, which lives in [U+80, U+7FF].
  • 0xE (14), this is the start of a three-byte code point. 4+6+6 or sixteen bits (of twenty-four) are used for the actual code point, which lives in [U+800, U+FFFF], excluding the surrogates U+D800 through U+DFFF.
  • 0xF (15) and the second half-byte is ≤0x7, this is the start of a four-byte code point. 3+6+6+6 or twenty-one bits (of thirty-two) are used for the actual code point, which lives in [U+10000, U+10FFFF]. (Because code points stop at U+10FFFF, only the lead bytes 0xF0 through 0xF4 occur in valid UTF-8.)
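The rules in this list translate almost directly into a `match` on the high half-byte. The following is a rough sketch, not a validator: `sequence_length` is a name invented here, and it only looks at the structure of a single byte, so it will happily accept lead bytes such as `C0`, `C1`, or `F5` that never occur in well-formed UTF-8.

```rust
/// Classify a byte by its first half-byte (high nibble).
/// Returns `Some(n)` if the byte starts an n-byte sequence, `None` otherwise.
/// Structural check only: it does not reject lead bytes like C0, C1, or F5..F7.
fn sequence_length(byte: u8) -> Option<usize> {
    match byte >> 4 {
        0x0..=0x7 => Some(1),                       // 0xxxxxxx: ASCII
        0xC | 0xD => Some(2),                       // 110xxxxx
        0xE => Some(3),                             // 1110xxxx
        0xF if (byte & 0x0F) <= 0x7 => Some(4),     // 11110xxx
        _ => None, // 0x8..=0xB are continuation bytes; F8..FF are never used
    }
}
```

For example, `sequence_length(0xE6)` gives `Some(3)`, while `sequence_length(0x9C)` gives `None` because `9C` can only appear in the middle of a sequence.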

(In the list above, whenever I’ve said, e.g., “3+6+6+6” bits, each of those numbers represents how many bits of each byte combine to yield the code point. Those bits come from the least significant end of each byte; the most significant bits of each byte are taken up by a prelude.)
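As an illustration of how those payload bits combine, here is a small decoder sketch for the first code point in a byte slice. The function name `decode_first` is hypothetical, and the code takes shortcuts: it trusts that the trailing bytes really are continuation bytes and does not reject overlong or out-of-range sequences.

```rust
/// Decode the first code point in `bytes` by stripping each byte's prelude
/// and concatenating the remaining payload bits.
/// Sketch only: continuation bytes are not validated.
fn decode_first(bytes: &[u8]) -> Option<u32> {
    let lead = *bytes.first()?;
    // The number of leading 1 bits in the lead byte gives the sequence
    // length; the mask keeps the 7, 5, 4, or 3 payload bits.
    let (len, lead_mask): (usize, u8) = match lead.leading_ones() {
        0 => (1, 0b0111_1111),
        2 => (2, 0b0001_1111),
        3 => (3, 0b0000_1111),
        4 => (4, 0b0000_0111),
        _ => return None, // continuation byte or unused pattern
    };
    let mut code_point = u32::from(lead & lead_mask);
    // Each following byte contributes its six low bits.
    for &byte in bytes.get(1..len)? {
        code_point = (code_point << 6) | u32::from(byte & 0b0011_1111);
    }
    Some(code_point)
}
```

Feeding it `[0xE6, 0x9C, 0x88]` yields `Some(0x6708)`: the payloads `0110`, `011100`, and `001000` concatenate to `0110 0111 0000 1000`. A truncated input such as `[0xE6, 0x9C]` yields `None`.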

So, for example:

| character | UTF-8 bytes | Unicode code points |
|---|---|---|
| x | 78 | \U{78} |
| Å | C3 85 | \U{C5} |
| 月 | E6 9C 88 | \U{6708} |
| 😊 | F0 9F 98 8A | \U{1F60A} |
| Ā̂ | C4 80 CC 82 | \U{100}\U{302} |
| 👍 | F0 9F 91 8D | \U{1F44D} |
| 👍🏽️ | F0 9F 91 8D F0 9F 8F BD EF B8 8F | \U{1F44D}\U{1F3FD}\U{FE0F} |
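Going the other way, the byte column can be reproduced by hand. The sketch below encodes a single code point by picking the length from its range and then slicing its bits into a lead byte and six-bit continuation bytes. `encode` is a made-up helper for illustration; in real code, Rust's built-in `char::encode_utf8` does this job.

```rust
/// Encode one code point as UTF-8 bytes, or `None` if it is a surrogate
/// or lies beyond U+10FFFF. Illustrative only; prefer `char::encode_utf8`.
fn encode(cp: u32) -> Option<Vec<u8>> {
    // A continuation byte: prelude 10, then six bits of `cp` starting at `shift`.
    let cont = |shift: u32| 0b1000_0000 | ((cp >> shift) & 0b0011_1111) as u8;
    match cp {
        0x0000..=0x007F => Some(vec![cp as u8]),
        0x0080..=0x07FF => Some(vec![0b1100_0000 | (cp >> 6) as u8, cont(0)]),
        0xD800..=0xDFFF => None, // surrogates have no UTF-8 encoding
        0x0800..=0xFFFF => Some(vec![0b1110_0000 | (cp >> 12) as u8, cont(6), cont(0)]),
        0x1_0000..=0x10_FFFF => Some(vec![
            0b1111_0000 | (cp >> 18) as u8,
            cont(12),
            cont(6),
            cont(0),
        ]),
        _ => None, // beyond the Unicode range
    }
}
```

With this, `encode(0x6708)` produces `E6 9C 88` and `encode(0x1F60A)` produces `F0 9F 98 8A`, matching the third and fourth rows.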

(Experiment with the code that generated this at the Rust Playground!)

The first four rows show examples of code points that, in UTF-8, take up 1–4 bytes. The rest show that Unicode is more complicated than can be expressed in this gist 😜.
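One way to see that complication in numbers is to compare how many bytes and how many code points a string has with how many characters it appears to contain on screen. This is a small sketch using only the standard library; `sizes` is a name invented here. Note that the standard library does not segment text into user-perceived characters (grapheme clusters), so that third count is not computed.

```rust
/// Returns (number of UTF-8 bytes, number of code points) for a string.
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For the last row of the table, `sizes("\u{1F44D}\u{1F3FD}\u{FE0F}")` is `(11, 3)`: eleven bytes and three code points for what renders as a single emoji. Likewise `sizes("\u{100}\u{302}")` is `(4, 2)`.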

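```rust
// Run live at https://is.gd/YUiDHj

/// Print one Markdown table row for `s`: the string itself, its UTF-8 bytes
/// in hex, and its code points as upper-cased Unicode escapes.
fn show(s: &str) {
    println!("| {} | {} | {} |",
             s,
             s.as_bytes().iter().map(|x| format!("{:02X} ", x)).collect::<String>().trim(),
             s.chars()
                 .map(|c| c.escape_unicode().collect::<String>())
                 .collect::<String>()
                 .to_uppercase());
}

fn main() {
    // One table row per comma-separated sample string.
    let strings = "x,Å,月,😊,Ā̂,👍,👍🏽️".split(",");
    for s in strings {
        show(s);
    }
}
```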