@vahidhedayati
Last active July 24, 2022 11:30
How to capture Unicode Special Characters for DB storage when DB can't store special characters.

There are a few different ways to work around this issue. I will walk through what I think is the best approach.

  1. The simplest is to convert each special character to its integer value and format that value as a four-digit hex string, which gives the character's \uXXXX unicode escape.

You will need to do this conversion at the point where the value is stored.

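import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

String testString = "well hello there ™ and then © at there ™ ©  ™ © ™ © ";
// Characters the database cannot store
List<String> specials = new ArrayList<>();
specials.add("©");
specials.add("™");
// Map each character to its escape sequence. The backslash is doubled on purpose:
// replaceAll() reads a double backslash in the replacement as one literal backslash.
Map<String, String> specialCharacters = new HashMap<>();
for (String entry : specials) {
  specialCharacters.put(entry, String.format("\\\\u%04x", (int) entry.charAt(0)));
}
System.out.println("----------------------------------- before " + testString);
// Replace every occurrence of each special character with its escape sequence
for (Map.Entry<String, String> entry : specialCharacters.entrySet()) {
  testString = testString.replaceAll(entry.getKey(), entry.getValue());
  System.out.println("- replaced " + entry.getKey() + " with " + entry.getValue());
}
System.out.println("----------------------------------- after " + testString);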

This produces

----------------------------------- before well hello there ™ and then © at there ™ ©  ™ © ™ © 
- replaced ™ with \\u2122
- replaced © with \\u00a9
----------------------------------- after well hello there \u2122 and then \u00a9 at there \u2122 \u00a9  \u2122 \u00a9 \u2122 \u00a9 

At the point of storing your record, you could run something like the above on any string that you expect to contain special characters. The converted string is shown on the final line above: each special character is replaced by its \u escape, which is plain ASCII and saves with no issues on a database that can't store the special unicode characters themselves.
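If you don't know in advance which special characters will turn up, the same idea can be applied to every non-ASCII character instead of a fixed list. The following is only a sketch of that variation: the class name UnicodeEscaper and the method escapeNonAscii are made up for illustration, and it assumes the input does not already contain literal \uXXXX text that must be kept as-is.

```java
public class UnicodeEscaper {

    // Hypothetical helper (not part of the original snippet): escapes every
    // character above the ASCII range instead of a hand-picked list.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 127) {
                // Four-digit lowercase hex of the UTF-16 code unit
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("well hello there ™ and then ©"));
    }
}
```

Running main prints the same escaped form as above: well hello there \u2122 and then \u00a9. Because this uses StringBuilder rather than replaceAll, the backslash in the format string does not need to be doubled.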

The easiest way to convert back is to use Apache commons-text. Add it to your pom.xml:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.9</version>
</dependency>
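
Then unescape the stored value in the field's getter:

import org.apache.commons.text.StringEscapeUtils;

public String getSomeField() {
    // Turns \uXXXX sequences in the stored value back into the real characters
    return StringEscapeUtils.unescapeJava(someField);
}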

Here is the value before and after conversion:

----------------------------------------someObject well hello there ™ and then © at there
#After conversion:
---------------------------------------someObject  well hello there \u2122 and then \u00a9 at there

The converted string can be stored without issues in a DB that can't hold the special characters, and the getter converts it back to the original value.

If you can't add the additional library, something like this would also convert back:

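  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // A literal backslash, the letter u, then exactly four hex digits
  private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

  private String forceUtf8Coding(String str) {
        Matcher matcher = UNICODE_ESCAPE.matcher(str);
        StringBuilder text = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the plain text before this escape unchanged
            text.append(str, last, matcher.start());
            // Decode the four hex digits into the character they stand for
            text.append((char) Integer.parseInt(matcher.group(1), 16));
            last = matcher.end();
        }
        // Copy whatever follows the last escape
        text.append(str, last, str.length());
        return text.toString();
    }

Matching only complete escape sequences keeps the text before the first escape and leaves any ordinary letter "u" alone. Splitting the string on "u" would drop the leading text and fail on words such as "unicode".

When displaying the value, you could run the string stored in the DB through the function above:

  System.out.println("----------------------------------- after "+forceUtf8Coding(testString));
----------------------------------- after well hello there ™ and then © at there ™ ©  ™ © ™ © 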
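
As a quick sanity check of the whole approach, here is a self-contained round-trip sketch using only the JDK. The class RoundTripCheck and its escape and unescape methods are hypothetical names for illustration. It assumes the stored text contains no literal \uXXXX sequences of its own, and it shows that ordinary words containing the letter u, as well as characters outside the Basic Multilingual Plane (stored as two escapes, one per surrogate), survive the trip.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoundTripCheck {

    // A literal backslash, the letter u, then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: escape every UTF-16 code unit above the ASCII range
    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c > 127) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Hypothetical helper: decode each escape back into its UTF-16 code unit
    static String unescape(String stored) {
        Matcher m = ESCAPE.matcher(stored);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(stored, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        return out.append(stored, last, stored.length()).toString();
    }

    public static void main(String[] args) {
        // The first two escapes in this literal are the surrogate pair for U+1F600
        String original = "emoji \uD83D\uDE00 and \u2122, plus the letter u in 'unicode'";
        String stored = escape(original);
        System.out.println(stored);
        System.out.println(unescape(stored).equals(original));
    }
}
```

The second line printed is true when the decoded value matches the original string exactly.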