Gist uupaa/4185267, created December 1, 2012 21:41
RFC3629 - UTF-8, a transformation format of ISO 10646
// RFC3629 - UTF-8, a transformation format of ISO 10646
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | Rule | Unicode   | Representation                      | UTF-8 1st | UTF-8 2nd | UTF-8 3rd | UTF-8 4th |
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [1]  |           |                   00000000 0zzzzzzz | 0zzz zzzz |           |           |           |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x000000~ |                   00000000 00000000 | 0000 0000 |           |           |           | (0x00)~
// |      | 0x00007F  |                   00000000 01111111 | 0111 1111 |           |           |           | (0x7F)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [2]  |           |                   00000yyy yyzzzzzz | 110y yyyy | 10zz zzzz |           |           |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x000080~ |                   00000000 10000000 | 1100 0010 | 1000 0000 |           |           | (0xC2,0x80)~
// |      | 0x0007FF  |                   00000111 11111111 | 1101 1111 | 1011 1111 |           |           | (0xDF,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [3]  |           |                   xxxxyyyy yyzzzzzz | 1110 xxxx | 10yy yyyy | 10zz zzzz |           |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x000800~ |                   00001000 00000000 | 1110 0000 | 1010 0000 | 1000 0000 |           | (0xE0,0xA0,0x80)~
// |      | 0x000FFF  |                   00001111 11111111 | 1110 0000 | 1011 1111 | 1011 1111 |           | (0xE0,0xBF,0xBF)
// |      |           |                                     |           |           |           |           |
// |      | 0x001000~ |                   00010000 00000000 | 1110 0001 | 1000 0000 | 1000 0000 |           | (0xE1,0x80,0x80)~
// |      | 0x00CFFF  |                   11001111 11111111 | 1110 1100 | 1011 1111 | 1011 1111 |           | (0xEC,0xBF,0xBF)
// |      |           |                                     |           |           |           |           |
// |      | 0x00D000~ |                   11010000 00000000 | 1110 1101 | 1000 0000 | 1000 0000 |           | (0xED,0x80,0x80)~
// |      | 0x00D7FF  |                   11010111 11111111 | 1110 1101 | 1001 1111 | 1011 1111 |           | (0xED,0x9F,0xBF)
// |      |           |                                     |           |           |           |           |
// |      | 0x00E000~ |                   11100000 00000000 | 1110 1110 | 1000 0000 | 1000 0000 |           | (0xEE,0x80,0x80)~
// |      | 0x00FFFF  |                   11111111 11111111 | 1110 1111 | 1011 1111 | 1011 1111 |           | (0xEF,0xBF,0xBF)
// |      | 0x00FEFF  |             (BOM) 11111110 11111111 | 1110 1111 | 1011 1011 | 1011 1111 |           | (0xEF,0xBB,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [4]  |           |          000wwwxx xxxxyyyy yyzzzzzz | 1111 0www | 10xx xxxx | 10yy yyyy | 10zz zzzz |
// |      |           +-------------------------------------+           |           |           |           |
// |      | 0x010000~ |          00000001 00000000 00000000 | 1111 0000 | 1001 0000 | 1000 0000 | 1000 0000 | (0xF0,0x90,0x80,0x80)~
// |      | 0x10FFFF  |          00010000 11111111 11111111 | 1111 0100 | 1000 1111 | 1011 1111 | 1011 1111 | (0xF4,0x8F,0xBF,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
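Rules [1] to [4] can be read as a small encoder. The following is a minimal sketch, assuming JavaScript (which the `//` comments suggest but the gist does not state); `encodeCodePoint` is a hypothetical name introduced only for illustration and is not part of the gist or of RFC 3629.

```javascript
// Hypothetical helper (not from the gist): encode one Unicode code point
// as an array of UTF-8 byte values, following Rules [1]-[4] of the table.
function encodeCodePoint(cp) {
  // Surrogate code units (0xD800 ~ 0xDFFF) are not valid on their own.
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF ||
      (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp <= 0x7F) {
    // Rule [1]: 0zzz zzzz
    return [cp];
  }
  if (cp <= 0x7FF) {
    // Rule [2]: 110y yyyy, 10zz zzzz
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp <= 0xFFFF) {
    // Rule [3]: 1110 xxxx, 10yy yyyy, 10zz zzzz
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // Rule [4]: 1111 0www, 10xx xxxx, 10yy yyyy, 10zz zzzz
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}
```

For example, `encodeCodePoint(0xFEFF)` yields the three BOM bytes listed in the table (0xEF, 0xBB, 0xBF).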
//
// A first UTF-16 code unit in 0xD800 ~ 0xDBFF (high surrogate) followed by a second
// UTF-16 code unit in 0xDC00 ~ 0xDFFF (low surrogate) forms a surrogate pair.
//
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | Rule | UTF-16    | Surrogate pair to UTF-8             | UTF-8 1st | UTF-8 2nd | UTF-8 3rd | UTF-8 4th |
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
// | [5]  |           |                                     |           |           |           |           |
// |      |           | 110110UU UUwwwwxx 110111yy yyzzzzzz | 1111 0uuu | 10uu wwww | 10xx yyyy | 10zz zzzz |
// |      |           +-------------------------------------+           |           |           |           |
// |      | D800,DC00 | 11011000 00000000 11011100 00000000 | 1111 0000 | 1001 0000 | 1000 0000 | 1000 0000 | (0xF0,0x90,0x80,0x80)~
// |      |     ~     |                                     |           |           |           |           |
// |      | DBFF,DFFF | 11011011 11111111 11011111 11111111 | 1111 0100 | 1000 1111 | 1011 1111 | 1011 1111 | (0xF4,0x8F,0xBF,0xBF)
// +------+-----------+-------------------------------------+-----------+-----------+-----------+-----------+
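The bit layout of Rule [5] can also be applied directly, without first building the full code point. The following is a minimal sketch, assuming JavaScript; `surrogatePairToUtf8` and its local variable names are hypothetical and simply mirror the letters used in the table.

```javascript
// Hypothetical helper (not from the gist): map a UTF-16 surrogate pair
// straight to four UTF-8 bytes using the bit fields of Rule [5].
function surrogatePairToUtf8(hi, lo) {
  if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF) {
    throw new RangeError("not a surrogate pair");
  }
  // High surrogate: 110110UU UUwwwwxx
  const UUUU = (hi >> 6) & 0x0F;
  const wwww = (hi >> 2) & 0x0F;
  const xx = hi & 0x03;
  // Low surrogate: 110111yy yyzzzzzz
  const yyyy = (lo >> 6) & 0x0F;
  const zzzzzz = lo & 0x3F;
  // The 4-bit field UUUU plus one gives the 5-bit field uuuuu.
  const uuuuu = UUUU + 1;
  return [
    0xF0 | (uuuuu >> 2),                   // 1111 0uuu
    0x80 | ((uuuuu & 0x03) << 4) | wwww,   // 10uu wwww
    0x80 | (xx << 4) | yyyy,               // 10xx yyyy
    0x80 | zzzzzz,                         // 10zz zzzz
  ];
}
```

Calling `surrogatePairToUtf8(0xD800, 0xDC00)` and `surrogatePairToUtf8(0xDBFF, 0xDFFF)` reproduces the first and last rows of the table.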
//
// Steps to convert a surrogate pair to UTF-8:
//
// step1. Read two UTF-16 code units (high surrogate, then low surrogate).
// step2. Compute uuuuu = UUUU + 1 (the 4-bit field UUUU becomes the 5-bit field uuuuu).
// step3. Apply Rule [4] to the resulting code point (0x010000 ~ 0x10FFFF).
//
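The three steps above can be sketched as a whole-string converter. This is an illustrative sketch, assuming JavaScript strings as the UTF-16 input; `utf16ToUtf8` is a hypothetical name, and throwing on unpaired surrogates is a design choice made here, not something the gist specifies.

```javascript
// Hypothetical helper (not from the gist): convert a JavaScript (UTF-16)
// string to an array of UTF-8 byte values.
function utf16ToUtf8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // step1: read the second UTF-16 code unit of the pair.
      const lo = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (lo < 0xDC00 || lo > 0xDFFF) {
        throw new Error("unpaired high surrogate at index " + i);
      }
      // step2: adding 0x10000 is the same as UUUU + 1 in bits 16-20.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
      i++; // the low surrogate has been consumed
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      throw new Error("unpaired low surrogate at index " + i);
    }
    if (cp <= 0x7F) {
      bytes.push(cp);                                            // Rule [1]
    } else if (cp <= 0x7FF) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));          // Rule [2]
    } else if (cp <= 0xFFFF) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));                            // Rule [3]
    } else {
      // step3: apply Rule [4] to the combined code point.
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

In environments that provide the standard `TextEncoder`, the result for well-formed input can be cross-checked against `new TextEncoder().encode(str)`.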