dmolesUC/supplementary.md

## supplementary.md

      
A naive regex to match supplementary characters in the range U+10000–U+EFFFF produces nonsensical results:
jshell> Pattern.matches("[\u10000-\uEFFFF]+", "abc")
    ==> true

Likewise for the version that uses regex escapes (a doubled backslash) rather than Java character escapes:
jshell> Pattern.matches("[\\u10000-\\uEFFFF]+", "abc")
    ==> true

This is presumably because Java reads only the first four hex digits after `\u` as the character code, and treats the fifth digit as a separate literal character:
jshell> "\u10000".toCharArray()
    ==> char[2] { 'က', '0' } // '\u1000', '0'

jshell> "\uEFFFF".toCharArray()
    ==> char[2] { '', 'F' } // '\uEFFF', 'F'
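
Given that parse, the naive character class appears to contain three members: U+1000, the range from '0' to U+EFFF, and the letter 'F'. The lowercase letters fall inside that range, which would explain why "abc" matches. A minimal sketch of this reading follows; the class name `NaiveRangeDemo` and its helper method are made up for illustration, and the pattern string is the same one as in the regex-escape example, only split into its three assumed members:

```java
import java.util.regex.Pattern;

public class NaiveRangeDemo {
    // The naive class, concatenated from its three assumed members:
    // the character U+1000, the range '0' to U+EFFF, and the letter 'F'.
    static final Pattern EXPLICIT =
            Pattern.compile("[" + "\\u1000" + "0-\\uEFFF" + "F" + "]+");

    static boolean matchesExplicit(String s) {
        return EXPLICIT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // 'a', 'b' and 'c' all lie between '0' (0x30) and U+EFFF.
        System.out.println(matchesExplicit("abc"));
        // '!' (0x21) sorts below '0' and is neither U+1000 nor 'F'.
        System.out.println(matchesExplicit("!"));
    }
}
```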

According to Supplementary Characters in the Java Platform, the proper way to escape supplementary characters is as surrogate pairs of UTF-16 code units.
In UTF-16, U+10000 is 0xD800 0xDC00, and U+EFFFF is 0xDB7F 0xDFFF. This gives us the regex "[\uD800\uDC00-\uDB7F\uDFFF]":
jshell> Pattern.matches("[\uD800\uDC00-\uDB7F\uDFFF]", "1")
    ==> false
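
The UTF-16 code units quoted above do not have to be worked out by hand. As a sketch, `Character.toChars` from the standard library returns the surrogate pair for a supplementary code point, and the result can be formatted as escapes; the class name `SurrogatePairs` and the helper `toUtf16Escapes` are invented for this example:

```java
public class SurrogatePairs {
    /** Formats a code point as UTF-16 escapes, one four-digit escape per code unit. */
    static String toUtf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // toChars yields one char for BMP code points and two for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Escapes(0x10000)); // lower bound of the range
        System.out.println(toUtf16Escapes(0xEFFFF)); // upper bound of the range
    }
}
```

The two printed strings should be the same escapes used in the regex above.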

jshell> Pattern.matches("[\uD800\uDC00-\uDB7F\uDFFF]", "\uD9BF\uDFFF") // U+7FFFF
    ==> true
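
As an alternative to spelling out surrogate pairs, `java.util.regex.Pattern` on Java 7 and later also accepts the `\x{h...h}` escape, which takes a full code point. The following is a sketch under that assumption, not something the referenced article prescribes; `SupplementaryRange` and `isSupplementaryOnly` are names chosen for this example:

```java
import java.util.regex.Pattern;

public class SupplementaryRange {
    // The backslashes are doubled so that the regex engine, not the compiler,
    // interprets the code point escapes.
    static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\\x{10000}-\\x{EFFFF}]+");

    static boolean isSupplementaryOnly(String s) {
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String inRange = new String(Character.toChars(0x7FFFF));
        String aboveRange = new String(Character.toChars(0xF0000));
        System.out.println(isSupplementaryOnly("abc"));      // BMP letters only
        System.out.println(isSupplementaryOnly(inRange));    // inside the range
        System.out.println(isSupplementaryOnly(aboveRange)); // just past the upper bound
    }
}
```

This keeps the range bounds readable as code points, at the cost of requiring a newer runtime than the surrogate pair form.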