@uupaa
Created December 1, 2012 21:41
RFC3629 - UTF-8, a transformation format of ISO 10646
// RFC3629 - UTF-8, a transformation format of ISO 10646 -
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | Rule | Unicode   | Representation                      | UTF-8 1st | UTF-8 2nd | UTF-8 3rd | UTF-8 4th |
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [1]  |           |                   00000000 0zzzzzzz | 0zzz zzzz |           |           |           |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x000000~ |                   00000000 00000000 | 0000 0000 |           |           |           | (0x00)~
// |      | 0x00007F  |                   00000000 01111111 | 0111 1111 |           |           |           | (0x7F)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [2]  |           |                   00000yyy yyzzzzzz | 110y yyyy | 10zz zzzz |           |           |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x000080~ |                   00000000 10000000 | 1100 0010 | 1000 0000 |           |           | (0xC2,0x80)~
// |      | 0x0007FF  |                   00000111 11111111 | 1101 1111 | 1011 1111 |           |           | (0xDF,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [3]  |           |                   xxxxyyyy yyzzzzzz | 1110 xxxx | 10yy yyyy | 10zz zzzz |           |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x000800~ |                   00001000 00000000 | 1110 0000 | 1010 0000 | 1000 0000 |           | (0xE0,0xA0,0x80)~
// |      | 0x000FFF  |                   00001111 11111111 | 1110 0000 | 1011 1111 | 1011 1111 |           | (0xE0,0xBF,0xBF)
// |      |           |                                     |           |           |           |           |
// |      | 0x001000~ |                   00010000 00000000 | 1110 0001 | 1000 0000 | 1000 0000 |           | (0xE1,0x80,0x80)~
// |      | 0x00D7FF  |                   11010111 11111111 | 1110 1101 | 1001 1111 | 1011 1111 |           | (0xED,0x9F,0xBF)
// |      |           |                                     |           |           |           |           |
// |      | 0x00E000~ |                   11100000 00000000 | 1110 1110 | 1000 0000 | 1000 0000 |           | (0xEE,0x80,0x80)~
// |      | 0x00FFFF  |                   11111111 11111111 | 1110 1111 | 1011 1111 | 1011 1111 |           | (0xEF,0xBF,0xBF)
// |      | 0x00FEFF  | (BOM)             11111110 11111111 | 1110 1111 | 1011 1011 | 1011 1111 |           | (0xEF,0xBB,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [4]  |           |          000wwwxx xxxxyyyy yyzzzzzz | 1111 0www | 10xx xxxx | 10yy yyyy | 10zz zzzz |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x010000~ |          00000001 00000000 00000000 | 1111 0000 | 1001 0000 | 1000 0000 | 1000 0000 | (0xF0,0x90,0x80,0x80)~
// |      | 0x10FFFF  |          00010000 11111111 11111111 | 1111 0100 | 1000 1111 | 1011 1111 | 1011 1111 | (0xF4,0x8F,0xBF,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
//
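The four rules in the table above can be sketched in code. This is a minimal Python sketch and not part of the original gist: the function name `encode_code_point` is hypothetical, and it assumes the input is a single Unicode code point rather than a sequence of UTF-16 code units. Surrogate values (0xD800 ~ 0xDFFF) are rejected because RFC 3629 prohibits encoding them.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following rules [1]-[4].

    Hypothetical helper for illustration only.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate values must not be encoded on their own (RFC 3629).
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp <= 0x7F:
        # rule [1]: 0zzz zzzz
        return bytes([cp])
    if cp <= 0x7FF:
        # rule [2]: 110y yyyy, 10zz zzzz
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # rule [3]: 1110 xxxx, 10yy yyyy, 10zz zzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # rule [4]: 1111 0www, 10xx xxxx, 10yy yyyy, 10zz zzzz
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, `encode_code_point(0xFEFF)` yields the BOM bytes 0xEF,0xBB,0xBF listed under rule [3].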
// A first UTF-16 code unit (0xD800 ~ 0xDBFF) followed by a second UTF-16 code unit (0xDC00 ~ 0xDFFF) forms a surrogate pair.
//
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | Rule | UTF-16    | Surrogate Pairs to UTF-8            | UTF-8 1st | UTF-8 2nd | UTF-8 3rd | UTF-8 4th |
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [5]  |           |                                     |           |           |           |           |
// |      |           | 110110UU UUwwwwxx 110111yy yyzzzzzz | 1111 0uuu | 10uu wwww | 10xx yyyy | 10zz zzzz |
// |      |           +-------------------------------------+           |           |           |           |
// |      | D800,DC00 | 11011000 00000000 11011100 00000000 | 1111 0000 | 1001 0000 | 1000 0000 | 1000 0000 | (0xF0,0x90,0x80,0x80)~
// |      |           |                                     |           |           |           |           |
// |      | DBFF,DFFF | 11011011 11111111 11011111 11111111 | 1111 0100 | 1000 1111 | 1011 1111 | 1011 1111 | (0xF4,0x8F,0xBF,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
//
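// Steps to convert a surrogate pair to UTF-8:
//
// step1. Read two UTF-16 code units (first the high surrogate, then the low surrogate).
// step2. Add 1 to the 4-bit field UUUU to get the 5-bit field uuuuu (uuuuu = UUUU + 1).
// step3. Apply Rule [4] to the combined value uuuuu wwwwxx yyyy zzzzzz (0x010000 ~ 0x10FFFF).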
//
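The bit layout of rule [5] can also be applied directly to the two code units, without first building the code point. This is a minimal Python sketch under that assumption; `surrogate_pair_to_utf8` is a hypothetical name and not part of the original gist.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Convert one UTF-16 surrogate pair to 4 UTF-8 bytes via rule [5].

    Hypothetical helper for illustration only.
    """
    # step 1: read two UTF-16 code units and check their ranges
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")

    # high = 110110UU UUwwwwxx, low = 110111yy yyzzzzzz
    # step 2: the 4-bit UUUU plus 1 becomes the 5-bit uuuuu
    uuuuu = ((high >> 6) & 0x0F) + 1
    wwwwxx = high & 0x3F
    yyyy = (low >> 6) & 0x0F
    zzzzzz = low & 0x3F

    # step 3: lay the bits out as 1111 0uuu, 10uu wwww, 10xx yyyy, 10zz zzzz
    return bytes([
        0xF0 | (uuuuu >> 2),
        0x80 | ((uuuuu & 0x03) << 4) | (wwwwxx >> 2),
        0x80 | ((wwwwxx & 0x03) << 4) | yyyy,
        0x80 | zzzzzz,
    ])
```

`surrogate_pair_to_utf8(0xD800, 0xDC00)` gives 0xF0,0x90,0x80,0x80, matching the first row of the rule [5] table.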