Created
October 25, 2011 17:02
UTF-8 string to block of Unicode codepoints conversion: REBOL
REBOL []
utf8-to-cps: func [ ; yields a block of integers >= 0 and < 1114112
    ; coding errors are skipped
    u [binary!]
    /local bcp b1 b2 b3 b4
][
    bcp: make block! length? u ; overestimated
    while [not tail? u][
        b1: u/1
        case [
            b1 < 128 [ ; 1-byte sequence (ASCII)
                insert tail bcp b1
                u: skip u 1
            ]
            b1 < 192 [ ; stray continuation byte: skip it
                u: skip u 1
            ]
            b1 < 224 [ ; lead byte of a 2-byte sequence
                either all [
                    not tail? skip u 1
                    (b2: u/2) >= 128 b2 < 192
                ][
                    insert tail bcp (shift/left b1 - 192 6) or (b2 - 128)
                    u: skip u 2
                ][
                    u: skip u 1
                ]
            ]
            b1 < 240 [ ; lead byte of a 3-byte sequence
                either all [
                    not tail? skip u 2
                    (b2: u/2) >= 128 b2 < 192
                    (b3: u/3) >= 128 b3 < 192
                ][
                    insert tail bcp (shift/left b1 - 224 12)
                        or (shift/left b2 - 128 6) or (b3 - 128)
                    u: skip u 3
                ][
                    u: skip u 1
                ]
            ]
            b1 < 248 [ ; lead byte of a 4-byte sequence
                either all [
                    not tail? skip u 3
                    (b2: u/2) >= 128 b2 < 192
                    (b3: u/3) >= 128 b3 < 192
                    (b4: u/4) >= 128 b4 < 192
                ][
                    insert tail bcp (shift/left b1 - 240 18)
                        or (shift/left b2 - 128 12)
                        or (shift/left b3 - 128 6) or (b4 - 128)
                    u: skip u 4
                ][
                    u: skip u 1
                ]
            ]
            true [ ; bytes 248..255 are never valid: skip, else the loop never ends
                u: skip u 1
            ]
        ]
    ]
    bcp
]
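As a quick illustration of how the function might be called, here is a minimal sketch. It assumes a REBOL 2 console (where `shift/left` is available) in which the `utf8-to-cps` definition above has already been evaluated. The sample input is made up for this example: it mixes 1-, 2- and 3-byte sequences with one stray continuation byte.

```rebol
; Assumes utf8-to-cps (defined above) is already loaded.
; Input bytes: "A" (41), a stray continuation byte (80),
; U+00E9 (C3 A9) and U+20AC (E2 82 AC).
probe utf8-to-cps #{4180C3A9E282AC}
; == [65 233 8364]   (the stray byte 80 is skipped)
```

The stray `80` byte produces no output, which matches the "coding errors are skipped" comment in the function header.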
Overlong sequences are blocked by the lexer, so I guess such a function should not expect them in its input.
This routine is a prelude to a general decoding routine written in Red/System that could become part of the runtime system and would support the native compiler. In that scenario, it should cope with everything. I will publish such a Red/System routine in a follow-up Gist.
See git://gist.github.com/1325840.git for a (hopefully complete) Red/System version.
Critical remarks: this simple-minded algorithm accepts so-called overlong sequences, e.g. C080 and E08080, which both encode U+0000. Also, it accepts 4-byte sequences that encode points beyond U+10FFFF, up to the (hypothetical) U+1FFFFF.
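One way to close both gaps would be to check each decoded value against the smallest code point that legitimately needs that many bytes, and against the U+10FFFF ceiling. The sketch below is only an illustration: `valid-cp?` is a hypothetical helper, not part of the gist, and a decoder would have to call it before appending to its result block.

```rebol
REBOL []

; Hypothetical helper (not in the original gist).
; cp  - code point decoded from a UTF-8 sequence
; len - number of bytes in that sequence (1 to 4)
; Returns true when the value is acceptable, none otherwise.
valid-cp?: func [cp [integer!] len [integer!]][
    all [
        cp >= pick [0 128 2048 65536] len  ; below the minimum means overlong
        cp <= 1114111                      ; above U+10FFFF means out of range
    ]
]

probe valid-cp? 0 2        ; C080 decodes to 0 from 2 bytes: overlong, gives none
probe valid-cp? 233 2      ; C3A9 decodes to 233: accepted, gives true
probe valid-cp? 2097151 4  ; U+1FFFFF: beyond U+10FFFF, gives none
```

The lookup block holds the smallest value that requires 1, 2, 3 and 4 bytes respectively, so a decoded value below the entry for its sequence length could have been encoded in fewer bytes.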