@bosturbo
Created May 10, 2023 15:40
Japanese language detection
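// Default options are used; the locale and granularity arguments are left commented out
// ("grapheme" is the default granularity).
const segmenter = new Intl.Segmenter(/*'ja', {granularity: 'grapheme'}*/);
const sufficientJapaneseCharacterRatio = 0.5; // XXX about 0.3 is preferable
// Returns true when the share of grapheme clusters that contain a Japanese character
// reaches the threshold.
function isJapaneseIntl(text: string) {
  let chars = 0;
  let japanese = 0;
  for (const seg of segmenter.segment(text)) {
    const char = seg.segment;
    // console.log(char);
    // Hiragana, katakana, kanji (CJK ideographs), half-width punctuation and katakana
    if (/[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9faf\uff61-\uff9f]/.test(char)) japanese++;
    chars++;
  }
  // Empty input gives 0 / 0 = NaN, and NaN >= ratio is false.
  return japanese / chars >= sufficientJapaneseCharacterRatio;
}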
const texts: Array<string> = [
  "天才かっΣ(゚Д゚)",
  "家族👨‍👩‍👧‍👦",
  "(´~`)モグモグ",
];
for (const text of texts) {
  console.log("testing: " + text);
  console.log(" isJapaneseIntl() : " + isJapaneseIntl(text));
}
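
The XXX note on the threshold suggests the 0.5 cutoff may need tuning. A minimal sketch, assuming it helps to look at the raw ratio before choosing a cutoff: the names `graphemeSegmenter`, `japanesePattern`, `japaneseRatio` and `isJapaneseWithThreshold` are hypothetical and not part of the gist above. The character ranges are copied from the regex in `isJapaneseIntl`.

```typescript
// Hypothetical helpers (not in the original gist) for experimenting with the threshold.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Same character ranges as the regex used in isJapaneseIntl().
const japanesePattern = /[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9faf\uff61-\uff9f]/;

// Fraction of grapheme clusters that contain at least one character in the ranges above.
// Returns 0 for empty input instead of NaN.
function japaneseRatio(text: string): number {
  let total = 0;
  let japanese = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    if (japanesePattern.test(segment)) japanese++;
    total++;
  }
  return total === 0 ? 0 : japanese / total;
}

// Same decision as isJapaneseIntl(), with the threshold passed in as a parameter.
function isJapaneseWithThreshold(text: string, threshold: number = 0.5): boolean {
  return japaneseRatio(text) >= threshold;
}
```

Logging `japaneseRatio(text)` for each entry in `texts` shows how far each sample sits from 0.5 and from the suggested 0.3, which may make the choice between the two easier to judge.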