@korakot
Last active August 10, 2022 07:09
Thai Word Segmentation in JavaScript using V8 Break Iterator
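// Lazily yields each segment of `text` (words, plus whitespace and punctuation).
// Note: Intl.v8BreakIterator is a non-standard, V8-specific API.
function* gen_words(text) {
  const it = new Intl.v8BreakIterator(['th'])
  it.adoptText(text)
  let start = it.first()  // index of the first boundary
  while (true) {
    const end = it.next()  // index of the next boundary, or -1 when done
    if (end === -1) break
    yield text.slice(start, end)
    start = end
  }
}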
// Usage:
// segment('สวัสดีครับ สบายดีไหม')
// ['สวัสดี', 'ครับ', ' ', 'สบาย', 'ดี', 'ไหม']
function segment(text) {
  return [...gen_words(text)]
}
// Internally, it uses ICU
// https://chromium.googlesource.com/external/v8-i18n/+/refs/heads/master/src/break-iterator.cc
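
Because `Intl.v8BreakIterator` is a non-standard, V8-specific API, it may not be available in every JavaScript runtime. Below is a minimal sketch of the same idea using the standard `Intl.Segmenter` API instead. The function name `segmentStandard` and its `locale` parameter are hypothetical, introduced only for illustration. The exact word boundaries depend on the ICU data bundled with the runtime, so the output may differ from the example in the usage comment above.

```javascript
// Sketch only: word segmentation with the standard Intl.Segmenter API.
// `segmentStandard` is a hypothetical name, not part of the original gist.
function segmentStandard(text, locale = 'th') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  // segment() returns an iterable of objects; each has a `segment` string
  // (along with `index`, `input`, and `isWordLike`).
  return Array.from(segmenter.segment(text), (s) => s.segment)
}
```

As with `segment`, whitespace comes back as its own item. Filtering on the `isWordLike` property of each item is one way to keep only the words.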