@barend
Last active May 29, 2017 19:18
How does UTF8 work, anyway?
The following is the Black Female Astronaut emoji as encoded
in UTF8, shown in hex:
F0 9F 91 A9 F0 9F 8F BF E2 80 8D F0 9F 9A 80   byte value
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15   byte number
As you can tell, it's fifteen bytes. If you express the hex
digits in binary you can see how UTF8 encoding works, and
you can see it's made up of four characters (code points).
F0 9F 91 A9 F0 9F 8F BF E2 80 8D F0 9F 9A 80
11110000    |           |        |
10011111    |           |        |
10010001    |           |        |
10101001    |           |        |
            11110000    |        |
            10011111    |        |
            10001111    |        |
            10111111    |        |
                        11100010 |
                        10000000 |
                        10001101 |
                                 11110000
                                 10011111
                                 10011010
                                 10000000
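The hex and binary views above can be reproduced with a few lines of Python. This is only a sketch: the name ASTRONAUT and the escape-sequence spelling of the emoji are choices made here for illustration, not part of the original note.

```python
# The emoji from the text, written as escapes so the source stays ASCII.
# ASTRONAUT is an illustrative name, not something from the original note.
ASTRONAUT = "\U0001F469\U0001F3FF\u200D\U0001F680"

encoded = ASTRONAUT.encode("utf-8")

# Hex view, matching the "byte value" row.
hex_row = " ".join(f"{b:02X}" for b in encoded)

# Binary view, one 8-bit string per byte.
binary_rows = [f"{b:08b}" for b in encoded]

print(len(encoded), "bytes")
print(hex_row)
for number, bits in enumerate(binary_rows, start=1):
    print(f"{number:2d}  {bits}")
```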
For every multi-byte UTF8 character, the number of leading
1-bits in the first byte tells you how many bytes the
character spans in total: 1110 starts a three-byte character,
11110 a four-byte one. Every following byte starts with 10.
The nul-byte and the 127 other characters of the original
7-bit ASCII set take up one byte each.
All single-byte UTF8 characters have a 0 for the first bit.
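The leading-bits rule can be sketched in Python as follows. The helper names sequence_length, split_characters and code_point are made up for this example, and the code assumes well-formed input: it does not check the following bytes or reject overlong encodings the way a real decoder must.

```python
from typing import List


def sequence_length(lead: int) -> int:
    """Return how many bytes a character spans, judged by its first byte."""
    if lead & 0b1000_0000 == 0:            # 0xxxxxxx: single byte
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: two bytes
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: three bytes
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:  # 11110xxx: four bytes
        return 4
    raise ValueError(f"{lead:#04x} is not a valid first byte")


def split_characters(data: bytes) -> List[bytes]:
    """Cut a UTF8 byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


def code_point(chunk: bytes) -> int:
    """Strip the marker bits and join the remaining payload bits."""
    n = len(chunk)
    if n == 1:
        return chunk[0]
    value = chunk[0] & (0xFF >> (n + 1))   # payload bits of the first byte
    for b in chunk[1:]:
        value = (value << 6) | (b & 0b0011_1111)  # six bits per following byte
    return value


data = bytes.fromhex("F09F91A9F09F8FBFE2808DF09F9A80")
chunks = split_characters(data)
code_points = [code_point(c) for c in chunks]
print([c.hex(" ").upper() for c in chunks])
print([f"U+{cp:04X}" for cp in code_points])
```

The resulting code points line up with the four fileformat.info pages listed below.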
These are the fileformat.info pages for the four characters
shown above:
http://www.fileformat.info/info/unicode/char/1f469/index.htm
http://www.fileformat.info/info/unicode/char/1f3ff/index.htm
http://www.fileformat.info/info/unicode/char/200d/index.htm
http://www.fileformat.info/info/unicode/char/1f680/index.htm
That's woman, modifier-fitzpatrick-type-6*, joiner, rocket.
Fitzpatrick Type, you say?
https://en.wikipedia.org/wiki/Fitzpatrick_scale
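The same four names can be looked up locally with Python's unicodedata module. A small sketch, assuming a Python build whose Unicode database includes the Fitzpatrick modifiers (Unicode 8.0 or later); ASTRONAUT is again just an illustrative name.

```python
import unicodedata

# The four code points of the emoji, written as escapes.
ASTRONAUT = "\U0001F469\U0001F3FF\u200D\U0001F680"

# Pair each code point with its official Unicode character name.
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in ASTRONAUT]

for cp, name in names:
    print(cp, name)
```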
👩🏿‍🚀