A naive regex to match supplementary characters in the range U+10000–U+EFFFF produces nonsensical results:
jshell> Pattern.matches("[\u10000-\uEFFFF]+", "abc")
==> true
The version using regex escapes rather than Java character escapes behaves the same way:
jshell> Pattern.matches("[\\u10000-\\uEFFFF]+", "abc")
==> true
This is presumably because Java interprets only the first four hex digits after \u as the character code, and treats the final digit as a separate character:
jshell> "\u10000".toCharArray()
==> char[2] { 'က', '0' } // '\u1000', '0'
jshell> "\uEFFFF".toCharArray()
==> char[2] { '', 'F' } // '\uEFFF', 'F'
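To double-check that reading, here is a small sketch (the class name EscapeWidthDemo and its fields are made up for illustration) that inspects the two literals and spells out the character class the naive pattern appears to describe: U+1000, the range '0' to U+EFFF, and 'F'.

```java
import java.util.regex.Pattern;

public class EscapeWidthDemo {
    // Only four hex digits are consumed by each escape, so these literals
    // are U+1000 followed by '0', and U+EFFF followed by 'F'.
    static final String LOWER = "\u10000";
    static final String UPPER = "\uEFFFF";

    // The naive pattern from the question.
    static final Pattern NAIVE = Pattern.compile("[\u10000-\uEFFFF]+");

    // The same class written piece by piece (an assumed equivalent):
    // the character U+1000, the range '0' to U+EFFF, and the letter 'F'.
    static final Pattern EXPLICIT =
            Pattern.compile("[" + "\\u1000" + "0-\\uEFFF" + "F" + "]+");

    public static void main(String[] args) {
        System.out.println(LOWER.length());                         // 2
        System.out.println(Integer.toHexString(LOWER.charAt(0)));   // 1000
        System.out.println(UPPER.length());                         // 2
        System.out.println(Integer.toHexString(UPPER.charAt(0)));   // efff
        System.out.println(NAIVE.matcher("abc").matches());         // true
        System.out.println(EXPLICIT.matcher("abc").matches());      // true
    }
}
```

Since 'a', 'b', and 'c' all fall between '0' and U+EFFF, the match on "abc" is expected rather than nonsensical.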
According to Supplementary Characters in the Java Platform, the proper way to escape a supplementary character is as a pair of UTF-16 surrogate code units.
In UTF-16, U+10000 is 0xD800 0xDC00, and U+EFFFF is 0xDB7F 0xDFFF. This gives us the regex "[\uD800\uDC00-\uDB7F\uDFFF]":
jshell> Pattern.matches("[\uD800\uDC00-\uDB7F\uDFFF]", "1")
==> false
jshell> Pattern.matches("[\uD800\uDC00-\uDB7F\uDFFF]", "\uD9BF\uDFFF") // U+7FFFF
==> true
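The surrogate values do not have to be worked out by hand. As a minimal sketch (SupplementaryRangeDemo, utf16Units, and inRange are illustrative names, not part of any library), Character.toChars can produce the UTF-16 code units for a code point, and the same call can build test inputs for the boundaries of the range:

```java
import java.util.regex.Pattern;

public class SupplementaryRangeDemo {
    // The surrogate-pair character class from above.
    static final Pattern RANGE = Pattern.compile("[\uD800\uDC00-\uDB7F\uDFFF]");

    // Formats a code point as the UTF-16 code units Java stores for it.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) unit));
        }
        return sb.toString();
    }

    // True if the single code point, taken as a whole string, matches the class.
    static boolean inRange(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x10000)); // 0xD800 0xDC00
        System.out.println(utf16Units(0xEFFFF)); // 0xDB7F 0xDFFF
        System.out.println(inRange(0x10000));    // true  (lower bound)
        System.out.println(inRange(0xEFFFF));    // true  (upper bound)
        System.out.println(inRange(0xFFFF));     // false (just below the range)
        System.out.println(inRange(0xF0000));    // false (just above the range)
    }
}
```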