@dockimbel
Created October 11, 2011 17:13
UTF-8 codepoints validation rules
overlong: charset "^(C0)^(C1)"
invalid-bytes: union overlong charset [#"^(F5)" - #"^(FF)"]
cont-byte: charset [#"^(80)" - #"^(BF)"]
one-byte-codepoint: charset [#"^(00)" - #"^(7F)"]
two-bytes-codepoint: reduce [
charset [#"^(C2)" - #"^(DF)"]
cont-byte
]
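three-bytes-codepoint: compose/deep [
[
#{E0} (charset [#"^(A0)" - #"^(BF)"])		; E0: second byte A0-BF, rejects overlong forms
| #{ED} (charset [#"^(80)" - #"^(9F)"])		; ED: second byte 80-9F, rejects surrogates U+D800-U+DFFF
| (charset [#"^(E1)" - #"^(EC)" #"^(EE)" - #"^(EF)"]) (cont-byte)
]
(cont-byte)
]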
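four-bytes-codepoint: compose/deep [
[
#{F0} (charset [#"^(90)" - #"^(BF)"])		; F0: second byte 90-BF, rejects overlong forms
| [#{F1} | #{F2} | #{F3}] (cont-byte)		; F1-F3: any continuation byte
| #{F4} (charset [#"^(80)" - #"^(8F)"])		; F4: second byte 80-8F, stays at or below U+10FFFF
]
(cont-byte)
(cont-byte)
]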