Created October 25, 2011 17:02
UTF-8 string to block of Unicode codepoints conversion: REBOL
REBOL []

utf8-to-cps: func [ ; yields a block of integers >= 0 and < 1114112
                    ; coding errors are skipped
    u [binary!]
    /local bcp b1 b2 b3 b4
][
    bcp: make block! length? u ; overestimated
    while [not tail? u][
        b1: u/1
        case [
            b1 < 128 [ ; single byte (ASCII)
                insert tail bcp b1
                u: skip u 1
            ]
            b1 < 192 [ ; stray continuation byte: skip it
                u: skip u 1
            ]
            b1 < 224 [ ; lead byte of a 2-byte sequence
                either all [
                    not tail? skip u 1
                    (b2: u/2) >= 128 b2 < 192
                ][
                    insert tail bcp (shift/left b1 - 192 6) or (b2 - 128)
                    u: skip u 2
                ][
                    u: skip u 1
                ]
            ]
            b1 < 240 [ ; lead byte of a 3-byte sequence
                either all [
                    not tail? skip u 2
                    (b2: u/2) >= 128 b2 < 192
                    (b3: u/3) >= 128 b3 < 192
                ][
                    insert tail bcp (shift/left b1 - 224 12)
                        or (shift/left b2 - 128 6) or (b3 - 128)
                    u: skip u 3
                ][
                    u: skip u 1
                ]
            ]
            b1 < 248 [ ; lead byte of a 4-byte sequence
                either all [
                    not tail? skip u 3
                    (b2: u/2) >= 128 b2 < 192
                    (b3: u/3) >= 128 b3 < 192
                    (b4: u/4) >= 128 b4 < 192 ; was (b3: u/4), which left b4 unset
                ][
                    insert tail bcp (shift/left b1 - 240 18)
                        or (shift/left b2 - 128 12)
                        or (shift/left b3 - 128 6) or (b4 - 128)
                    u: skip u 4
                ][
                    u: skip u 1
                ]
            ]
            true [ ; b1 >= 248 never starts a sequence: skip it so the loop terminates
                u: skip u 1
            ]
        ]
    ]
    bcp
]
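A short usage sketch may help show what the function returns. It assumes REBOL 2 (where `u/1` on a binary! yields an integer and `shift/left` is available) and that `utf8-to-cps` from the listing above has already been loaded. The sample bytes are an illustrative choice, not part of the original Gist: "A", a stray continuation byte, "é" (2 bytes) and "€" (3 bytes).

```rebol
; Illustrative input (an assumption, not from the original Gist):
;   41        "A"                 -> 65
;   80        stray continuation  -> skipped as a coding error
;   C3 A9     "é"  (U+00E9)       -> 233
;   E2 82 AC  "€"  (U+20AC)       -> 8364
probe utf8-to-cps #{4180C3A9E282AC}
; expected: [65 233 8364]
```

The input is written as a binary! literal rather than a string because the function only accepts binary! values.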
See git://gist.github.com/1325840.git for a (hopefully complete) Red/System version.
This routine is a prelude to a general decoding routine written in Red/System that could become part of the runtime system and would support the native compiler. In that role, it would have to cope with any input. I will publish such a Red/System routine in a later Gist.