Skip to content

Instantly share code, notes, and snippets.

@michael-simons
Last active November 11, 2020 22:46
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save michael-simons/4586038 to your computer and use it in GitHub Desktop.
Save michael-simons/4586038 to your computer and use it in GitHub Desktop.
Java and handling of UTF-8 codepoints in the supplementary range
public class UTF8 {
public static void main(String[] args) {
// See http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
final String poo = "A pile of poo: 💩.";
System.out.println(poo);
// Length of chars doesn't equals the "real" length, that is: the number of actual codepoints
System.out.println(poo.length() + " vs " + poo.codePointCount(0, poo.length()));
// Iterating over all chars
for(int i=0; i<poo.length();++i) {
char c = poo.charAt(i);
// If there's a char left, we chan check if the current and the next char
// form a surrogate pair
if(i<poo.length()-1 && Character.isSurrogatePair(c, poo.charAt(i+1))) {
// if so, the codepoint must be stored on a 32bit int as char is only 16bit
int codePoint = poo.codePointAt(i);
// show the code point and the char
System.out.println(String.format("%6d:%s", codePoint, new String(new int[]{codePoint}, 0, 1)));
++i;
}
// else this can only be a "normal" char
else
System.out.println(String.format("%6d:%s", (int)c, c));
}
// constructing a string constant with two \\u unicode escape sequences
System.out.println("\ud83d\udca9".equals("💩"));
}
}
@sharmasourabh
Copy link

Line #8 would print following in Java version "1.8.0_161":

System.out.println(poo.length() + " vs " + poo.codePointCount(0, poo.length()));
20 vs 20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment