@korakot
Last active January 18, 2020 06:43
Longest matching Thai word tokenization
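from marisa_trie import Trie

# wordlist = ...
trie = Trie(wordlist)

def lmcut(text):
    # trie.prefixes() lists the dictionary words that start `text`;
    # reversing it tries the longest candidate first.
    for w in reversed(trie.prefixes(text)):
        if w == text:
            # The whole remainder is a single word: segmentation complete.
            yield [w]
        else:
            # Segment what follows w; if that fails, fall back to a shorter w.
            for ww in lmcut(text[len(w):]):
                yield [w] + ww

# The first result prefers the longest word at each position.
words = next(lmcut("สวัสดีครับคุณครู"))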
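
The snippet above depends on marisa_trie and an unspecified wordlist, so it cannot be run as posted. The following is a minimal, dependency-free sketch of the same longest-match-with-backtracking idea. It assumes a plain Python set in place of the trie, and the helper `prefixes`, the extra `words` parameter, and both tiny word sets are hypothetical, chosen only for illustration.

```python
def prefixes(words, text):
    """Return every entry of `words` that is a prefix of `text`, shortest first.

    Hypothetical stand-in for a trie prefix lookup; a set scan is slower
    but needs no third-party package.
    """
    return [text[:i] for i in range(1, len(text) + 1) if text[:i] in words]


def lmcut(words, text):
    """Yield segmentations of `text`, trying the longest word first at each step."""
    for w in reversed(prefixes(words, text)):
        if w == text:
            yield [w]
        else:
            # Backtracking: if the remainder cannot be segmented, this inner
            # loop yields nothing and the outer loop tries a shorter w.
            for rest in lmcut(words, text[len(w):]):
                yield [w] + rest


# Illustrative word set (an assumption, not the gist's wordlist).
thai_words = {"สวัสดี", "ครับ", "คุณ", "ครู", "คุณครู"}
thai_result = next(lmcut(thai_words, "สวัสดีครับคุณครู"))
print(thai_result)  # ['สวัสดี', 'ครับ', 'คุณครู']

# Backtracking case: "abc" is the longest prefix, but it leaves "d",
# which is not a word, so the search falls back to "ab" + "cd".
backtrack_result = next(lmcut({"ab", "abc", "cd"}, "abcd"))
print(backtrack_result)  # ['ab', 'cd']

# When no segmentation exists the generator is empty; passing a default
# to next() avoids a StopIteration.
no_result = next(lmcut({"ab"}, "abx"), None)
print(no_result)  # None
```

Because `lmcut` is a generator, `next()` stops at the first complete segmentation, while iterating over it would enumerate the alternatives, with longer leading words first.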