@korakot
Last active Aug 10, 2022
Thai Word Segmentation in JavaScript using V8 Break Iterator
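// Note: Intl.v8BreakIterator is a non-standard, V8-only API (Chrome, Node.js).
function* gen_words(text) {
  const it = new Intl.v8BreakIterator(['th'])
  it.adoptText(text)
  let start = it.first()
  while (true) {
    const end = it.next()
    if (end === -1) break  // -1 signals that no boundaries remain
    yield text.slice(start, end)
    start = end
  }
}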
// Usage:
// segment('สวัสดีครับ สบายดีไหม')
// ['สวัสดี', 'ครับ', ' ', 'สบาย', 'ดี', 'ไหม']
function segment(text) {
  return [...gen_words(text)]
}
// Internally, V8 implements this on top of ICU's BreakIterator:
// https://chromium.googlesource.com/external/v8-i18n/+/refs/heads/master/src/break-iterator.cc
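Since `Intl.v8BreakIterator` only exists in V8, the same idea can probably be expressed with the standard `Intl.Segmenter` API where the runtime supports it. The following is a minimal sketch under that assumption, not part of the original gist: the name `segmentWithSegmenter` is hypothetical, and the exact word boundaries depend on the ICU data bundled with the runtime, so they may not match the output shown above.

```javascript
// Hypothetical helper (not from the original gist): word segmentation
// using the standard Intl.Segmenter instead of the V8-only iterator.
function segmentWithSegmenter(text, locale = 'th') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  // Each item exposes `segment` (the substring) and `index` (its offset);
  // only the substring is kept here, including whitespace segments.
  return Array.from(segmenter.segment(text), (item) => item.segment)
}
```

Like `segment` above, this returns every piece of the input, spaces included, so joining the result gives back the original string. Each item also carries an `isWordLike` flag that could be used to filter out spaces and punctuation if only words are wanted.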