@phiresky
Last active September 24, 2021 20:22
CJK tokenization is hard, arguably impossible, to do perfectly, especially if you don't know the language or don't want to load megabytes of dictionaries. Here's a simple solution that gets you most of the way.
// cjk tokenization snippet:
"hello test 德国 をクリックしてください 안녕하세요 세계"
.replace(/(\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|\p{Script=Hang})/ug, "$1 ")
// Inserts a space after every Han, Hiragana, Katakana, or Hangul character, so a normal
// whitespace-based tokenizer will pick up each character as a separate "word".
// There doesn't seem to be a smarter dictionary-free alternative to this.
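A minimal sketch of how the snippet above might feed a plain whitespace tokenizer. The names `CJK_CHAR`, `spaceOutCjk`, and `tokenize` are hypothetical and not part of the original gist. Splitting on whitespace is an assumption that stands in for whatever "normal tokenizer" you already use.

```javascript
// Same pattern as the snippet above: any single Han, Hiragana, Katakana, or Hangul character.
// (Hypothetical constant name; "Hang" is the short alias for the Hangul script.)
const CJK_CHAR = /(\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|\p{Script=Hang})/gu;

// Insert a space after every matched character, exactly as the original one-liner does.
function spaceOutCjk(text) {
  return text.replace(CJK_CHAR, "$1 ");
}

// Assumed stand-in for a "normal" tokenizer: split on runs of whitespace
// and drop the empty string left behind by a trailing space.
function tokenize(text) {
  return spaceOutCjk(text).split(/\s+/).filter(Boolean);
}

console.log(tokenize("hello test 德国 안녕하세요"));
```

Because the space is only inserted after each character, a Latin word written directly before a CJK character with no space in between (for example `test德`) stays fused to that first character. If that matters for your data, replacing with `" $1 "` instead of `"$1 "` is one possible variation.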