phiresky/cjk-tokenization.js

## cjk-tokenization.js
// cjk tokenization snippet:

"hello test 德国 をクリックしてください 안녕하세요 세계"
    .replace(/(\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|\p{Script=Hang})/ug, "$1 ")

// inserts a space after every CJK character (Han, Hiragana, Katakana, Hangul), so a normal
// whitespace-based tokenizer will pick up each character as a separate "word".
// there doesn't seem to be a smarter alternative to this.
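
A minimal sketch of how the replacement might feed a whitespace tokenizer. Only the regex comes from the snippet; the names `CJK_CHAR`, `spaceOutCjk`, and `tokenize` are hypothetical, and splitting on whitespace is an assumption about what a "normal tokenizer" does.

```javascript
// Same pattern as the snippet above: one capture group matching a single
// Han, Hiragana, Katakana, or Hangul character. The `u` flag is required
// for \p{...} property escapes.
const CJK_CHAR = /(\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|\p{Script=Hang})/gu;

// Hypothetical helper: append a space after every CJK character.
function spaceOutCjk(text) {
    return text.replace(CJK_CHAR, "$1 ");
}

// Hypothetical stand-in for a "normal" tokenizer: split on runs of
// whitespace and drop the empty strings left by leading/trailing spaces.
function tokenize(text) {
    return spaceOutCjk(text).split(/\s+/).filter(Boolean);
}

console.log(tokenize("hello test 德国 안녕"));
// → [ 'hello', 'test', '德', '国', '안', '녕' ]
```

Latin words stay intact because they match none of the four scripts, while each CJK character becomes its own token. Characters whose Unicode script is `Common` or `Inherited` (punctuation, for example) are not matched by this pattern and stay attached to whatever follows them.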