Last active
September 24, 2021 20:22
CJK tokenization is hard or impossible to do perfectly, especially if you don't know the language or don't want to load megabytes of dictionaries. Here's a simple solution that gets you most of the way.
// cjk tokenization snippet:
"hello test 德国 をクリックしてください 안녕하세요 세계"
  .replace(/(\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|\p{Script=Hang})/ug, "$1 ")
// inserts spaces after every CJK character, so a normal tokenizer will pick up each character as
// a separate "word". there doesn't seem to be a smarter alternative to this.
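To show how the snippet might be used with a plain whitespace tokenizer, here is a minimal sketch. The names `CJK_CHAR`, `spaceOutCjk`, and `tokenize` are hypothetical helpers introduced for illustration; only the regex comes from the snippet above. It assumes a JavaScript engine that supports Unicode property escapes (`\p{...}` with the `u` flag).

```javascript
// Hypothetical helper names; the regex itself is the one from the snippet above.
const CJK_CHAR = /(\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|\p{Script=Hang})/gu;

// Insert a space after every Han, Hiragana, Katakana, or Hangul character.
function spaceOutCjk(text) {
  return text.replace(CJK_CHAR, "$1 ");
}

// Assumed "normal tokenizer": split on runs of whitespace and drop empty strings.
function tokenize(text) {
  return spaceOutCjk(text).split(/\s+/).filter(Boolean);
}

console.log(tokenize("hello test 德国"));
// → [ 'hello', 'test', '德', '国' ]
```

Latin words stay whole because they match none of the four scripts, while each CJK character becomes its own token. The same transformation would presumably need to be applied to both the indexed text and the search query so that the tokens line up.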