@jlbelmonte
Last active December 17, 2015 03:29
Tiny utility method to check whether a text has probably been decoded with the wrong encoding. It only works for texts with a high incidence of the issue.
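/**
 * Heuristic: returns true when the share of Latin-1 Supplement or C1 control
 * characters is high enough to suggest the text was decoded with the wrong charset.
 */
public static boolean isProbablyWrongEncoded(String s) {
    if (s == null || s.isEmpty()) {
        return false; // nothing to measure; also avoids dividing by zero below
    }
    int charNum = s.length();
    float wordCount = 1.0f;      // number of spaces + 1
    float auxCharCount = 0.0f;   // Latin-1 Supplement characters, U+00A1..U+00FF
    float ctrlCharCount = 0.0f;  // C1 controls and NBSP, U+0080..U+00A0
    for (int i = 0; i < charNum; i++) {
        int cp = Character.codePointAt(s, i);
        if (s.charAt(i) == ' ') {
            wordCount++;
            continue;
        }
        if (cp >= 0x00A1 && cp <= 0x00FF) {
            auxCharCount++;
        } else if (cp >= 0x0080 && cp < 0x00A1) {
            ctrlCharCount++;
        }
    }
    float latinDensity = auxCharCount / charNum;
    float ctrlDensity = ctrlCharCount / charNum;
    // More than 10% Latin-1 Supplement characters.
    boolean containsTooManyExtendedChars = 0.1 < latinDensity;
    // More than 1% control characters is only suspicious together with long "words".
    boolean mayContainTooManyControlChars = 0.01 < ctrlDensity;
    // 3% or more control characters is enough on its own.
    boolean containsTooManyControlChars = 0.03 <= ctrlDensity;
    // Average of more than 12 chars per space-separated word.
    boolean isSuspiciousLen = 12 < charNum / wordCount;
    return containsTooManyExtendedChars
            || containsTooManyControlChars
            || (mayContainTooManyControlChars && isSuspiciousLen);
}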
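
A minimal usage sketch, not part of the original gist. The class name `EncodingCheckDemo`, the helper `garble`, and the sample sentences are assumptions made for illustration. The heuristic is repeated inside the class only so the block compiles on its own. `garble` simulates the issue by decoding UTF-8 bytes as ISO-8859-1, which turns every accented letter into two Latin-1 Supplement characters.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheckDemo {

    // Same heuristic as the gist method, repeated so this sketch is self-contained.
    public static boolean isProbablyWrongEncoded(String s) {
        int charNum = s.length();
        float wordCount = 1.0f;
        float auxCharCount = 0.0f;
        float ctrlCharCount = 0.0f;
        for (int i = 0; i < charNum; i++) {
            int cp = Character.codePointAt(s, i);
            if (s.charAt(i) == ' ') {
                wordCount++;
                continue;
            }
            if (cp >= 0x00A1 && cp <= 0x00FF) {
                auxCharCount++;
            } else if (cp >= 0x0080 && cp < 0x00A1) {
                ctrlCharCount++;
            }
        }
        float ctrlDensity = ctrlCharCount / charNum;
        boolean tooManyExtended = 0.1 < auxCharCount / charNum;
        boolean maybeTooManyCtrl = 0.01 < ctrlDensity;
        boolean tooManyCtrl = 0.03 <= ctrlDensity;
        boolean suspiciousLen = 12 < charNum / wordCount;
        return tooManyExtended || tooManyCtrl || (maybeTooManyCtrl && suspiciousLen);
    }

    // Hypothetical helper: reproduce the bug by reading UTF-8 bytes as ISO-8859-1.
    public static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Plain ASCII sample: no characters in U+0080..U+00FF at all.
        String clean = "The quick brown fox jumps over the lazy dog";
        // Spanish sample written with Unicode escapes so the source file encoding
        // does not matter; it contains six accented letters.
        String spanish = "El ni\u00f1o comi\u00f3 jam\u00f3n y pidi\u00f3 m\u00e1s caf\u00e9";

        System.out.println(isProbablyWrongEncoded(clean));           // prints "false"
        System.out.println(isProbablyWrongEncoded(garble(spanish))); // prints "true"
    }
}
```

In the garbled sample, 12 of 42 characters fall in U+00A1..U+00FF, well above the 10% threshold. Note that the correctly decoded Spanish sentence also exceeds it (6 of 36 characters), so densely accented text can be flagged even when nothing is wrong.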