@dmolesUC
Created December 18, 2017 23:12
Match Unicode supplementary characters (U+10000–U+EFFFF) with regex in Java

A naive regex to match supplementary characters in the range U+10000–U+EFFFF produces nonsensical results:

jshell> Pattern.matches("[\u10000-\uEFFFF]+", "abc")
    ==> true

Likewise the version using regex escapes rather than character escapes:

jshell> Pattern.matches("[\\u10000-\\uEFFFF]+", "abc")
    ==> true

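This is because Java reads only the first four hex digits after \u as the character code and treats the fifth digit as a separate literal character. The character class therefore contains U+1000, the range '0' through U+EFFF, and the letter 'F', which covers every character in "abc". The string literals show the split: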

jshell> "\u10000".toCharArray()
    ==> char[2] { 'က', '0' } // '\u1000', '0'

jshell> "\uEFFFF".toCharArray()
    ==> char[2] { '', 'F' } // '\uEFFF', 'F'
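To make the misparse concrete, here is a minimal sketch that inspects the naive pattern string and checks what it accepts. The class and method names (`NaiveRangeDemo`, `matchesNaive`) are made up for illustration and are not part of the original gist.

```java
import java.util.regex.Pattern;

public class NaiveRangeDemo {
    // Same literal as the naive attempt. The compiler consumes exactly four
    // hex digits per escape, so the trailing '0' and 'F' become ordinary chars.
    static final String NAIVE = "[\u10000-\uEFFFF]+";

    // Hypothetical helper: true if the whole input matches the naive pattern.
    static boolean matchesNaive(String input) {
        return Pattern.matches(NAIVE, input);
    }

    public static void main(String[] args) {
        // 8 chars: '[', U+1000, '0', '-', U+EFFF, 'F', ']', '+'
        System.out.println(NAIVE.length());
        // 'a', 'b', 'c' all fall inside the accidental range '0'..U+EFFF
        System.out.println(matchesNaive("abc"));
        // '!' (U+0021) sorts below '0' (U+0030), so it is rejected
        System.out.println(matchesNaive("!"));
    }
}
```

The accidental range starts at '0', so only characters below U+0030 (plus anything above U+EFFF) fail to match, which is why the result looks nonsensical for ordinary text.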

According to Supplementary Characters in the Java Platform, the proper way to escape a supplementary character is as a surrogate pair of UTF-16 code units.
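The surrogate pair for a given code point does not have to be worked out by hand. The sketch below uses the standard `Character.toChars` method to derive it and prints the result in escape form; the class name `SurrogateEscapes` and the helper `escape` are hypothetical names chosen for this example.

```java
public class SurrogateEscapes {
    // Hypothetical helper: renders a code point as one or two four-digit
    // escape sequences, one per UTF-16 code unit.
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // toChars returns one char for BMP code points and a
        // high/low surrogate pair for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x10000)); // lower bound of the range
        System.out.println(escape(0xEFFFF)); // upper bound of the range
    }
}
```

Running it for the two bounds of the range is a quick way to cross-check hand-computed code units before pasting them into a pattern.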

In UTF-16, U+10000 is 0xD800 0xDC00, and U+EFFFF is 0xDB7F 0xDFFF. This gives us the regex "[\uD800\uDC00-\uDB7F\uDFFF]":

jshell> Pattern.matches("[\uD800\uDC00-\uDB7F\uDFFF]", "1")
    ==> false

jshell> Pattern.matches("[\uD800\uDC00-\uDB7F\uDFFF]", "\uD9BF\uDFFF") // U+7FFFF
    ==> true
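As a further check, the same character class can be assembled from the numeric code points instead of hand-written escapes and tested at the edges of the range. This is only a sketch: `SupplementaryRange` and `isAllInRange` are illustrative names, and the added `+` quantifier (so that whole strings can be tested) is an assumption that goes beyond the single-character pattern above.

```java
import java.util.regex.Pattern;

public class SupplementaryRange {
    // Converts a code point to its UTF-16 string form (a surrogate pair
    // for anything at or above U+10000).
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Character class spanning U+10000 through U+EFFFF, built from code points.
    static final Pattern RANGE =
            Pattern.compile("[" + cp(0x10000) + "-" + cp(0xEFFFF) + "]+");

    // Hypothetical helper: true if every character of s lies in the range.
    static boolean isAllInRange(String s) {
        return RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllInRange("abc"));       // BMP text
        System.out.println(isAllInRange(cp(0x10000))); // lower bound
        System.out.println(isAllInRange(cp(0x7FFFF))); // middle of the range
        System.out.println(isAllInRange(cp(0xEFFFF))); // upper bound
        System.out.println(isAllInRange(cp(0xF0000))); // just past the upper bound
    }
}
```

Building the class from code points keeps the bounds readable and avoids transcription mistakes in the surrogate values.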
