@meijeru
Created October 25, 2011 17:02
UTF-8 string to block of Unicode codepoints conversion: REBOL
REBOL []
utf8-to-cps: func [ ; yields a block of integers >= 0 and < 1114112
    ; coding errors are skipped
    u [binary!]
    /local bcp b1 b2 b3 b4
][
    bcp: make block! length? u ; overestimated
    while [not tail? u][
        b1: u/1
        case [
            b1 < 128 [ ; 1-byte sequence (ASCII)
                insert tail bcp b1
                u: skip u 1
            ]
            b1 < 192 [ ; continuation byte without a lead byte: skip it
                u: skip u 1
            ]
            b1 < 224 [ ; lead byte of a 2-byte sequence
                either all [
                    not tail? skip u 1
                    (b2: u/2) >= 128 b2 < 192
                ][
                    insert tail bcp (shift/left b1 - 192 6) or (b2 - 128)
                    u: skip u 2
                ][
                    u: skip u 1
                ]
            ]
            b1 < 240 [ ; lead byte of a 3-byte sequence
                either all [
                    not tail? skip u 2
                    (b2: u/2) >= 128 b2 < 192
                    (b3: u/3) >= 128 b3 < 192
                ][
                    insert tail bcp (shift/left b1 - 224 12)
                        or (shift/left b2 - 128 6) or (b3 - 128)
                    u: skip u 3
                ][
                    u: skip u 1
                ]
            ]
            b1 < 248 [ ; lead byte of a 4-byte sequence
                either all [
                    not tail? skip u 3
                    (b2: u/2) >= 128 b2 < 192
                    (b3: u/3) >= 128 b3 < 192
                    (b4: u/4) >= 128 b4 < 192
                ][
                    insert tail bcp (shift/left b1 - 240 18)
                        or (shift/left b2 - 128 12)
                        or (shift/left b3 - 128 6) or (b4 - 128)
                    u: skip u 4
                ][
                    u: skip u 1
                ]
            ]
            true [ ; bytes 248 to 255 never start a sequence: skip them
                u: skip u 1
            ]
        ]
    ]
    bcp
]
@meijeru
Author

meijeru commented Oct 25, 2011

Critical remarks: this simple-minded algorithm accepts so-called overlong sequences, e.g. C0 80 and E0 80 80, which both encode U+0000. It also accepts 4-byte sequences that encode code points beyond U+10FFFF, up to the (hypothetical) U+1FFFFF.
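
A minimal sketch of how both cases could be rejected, assuming the decoder knows how many bytes each code point was decoded from. The helper name `valid-cp?` and its two arguments are hypothetical and not part of the gist; the thresholds are the smallest code points that need 1, 2, 3 and 4 bytes in UTF-8.

```rebol
REBOL []

; Hypothetical helper, not part of the gist.
; cp  = decoded code point
; len = number of bytes it was decoded from (1 to 4)
valid-cp?: func [cp [integer!] len [integer!]][
    found? all [
        cp >= pick [0 128 2048 65536] len  ; smallest value that needs len bytes
        cp <= 1114111                      ; U+10FFFF, the last Unicode code point
    ]
]

print valid-cp? 0 1        ; 00: shortest form, accepted
print valid-cp? 0 2        ; C0 80: overlong, rejected
print valid-cp? 233 2      ; C3 A9: shortest form, accepted
print valid-cp? 2097151 4  ; F7 BF BF BF: beyond U+10FFFF, rejected
```

In `utf8-to-cps`, such a check could guard each `insert tail bcp ...`, so that a failing sequence is skipped like any other coding error.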

@dockimbel

Overlong sequences are blocked by the lexer, so I guess such a function should not expect them in its input.

@meijeru
Author

meijeru commented Oct 26, 2011

This routine is a prelude to a general decoding routine written in Red/System that could become part of the runtime system and would support the native compiler. In that scenario, it should cope with everything. I will publish such a Red/System routine in a follow-up Gist.

@meijeru
Author

meijeru commented Oct 30, 2011

See git://gist.github.com/1325840.git for a (hopefully complete) Red/System version.
