@korakot
Last active January 19, 2023 04:12
Customize word tokenization: add and remove words from trie
!pip install pythainlp
from pythainlp import word_tokenize
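# the exported name may vary by PyThaiNLP version; the second snippet below uses DEFAULT_WORD_DICT_TRIE
from pythainlp.tokenize import DEFAULT_DICT_TRIE as trie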
# default behavior
print(word_tokenize('ฝนตกทั่วฟ้า')) # ['ฝนตก', 'ทั่ว', 'ฟ้า']
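# modify behavior: these calls mutate the shared default trie in place,
# so every later word_tokenize call in this session is affected
trie.remove('ฝนตก')
trie.add('ทั่วฟ้า')
print(word_tokenize('ฝนตกทั่วฟ้า')) # ['ฝน', 'ตก', 'ทั่วฟ้า']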
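# alternative: replace the whole default dictionary with the word list from the ttc corpus
from pythainlp.tokenize import word_tokenize, newmm
from pythainlp.corpus import ttc
from pythainlp.util import Trie
# word_freqs() yields (word, frequency) pairs; keep only the words
words = [w for w, _ in ttc.word_freqs()]
# rebind newmm's module-level default trie to the new dictionary
newmm.DEFAULT_WORD_DICT_TRIE = Trie(words)
print(word_tokenize('ฝนตกทั่วฟ้า')) # ['ฝน', 'ตก', 'ทั่วฟ้า']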
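To see why editing the dictionary changes the segmentation, here is a minimal, self-contained sketch of dictionary-based longest matching over a trie. SimpleTrie and longest_match_tokenize are hypothetical names written for this illustration only. They are not PyThaiNLP's Trie class or its newmm algorithm, which is more elaborate, and the toy dictionary holds just the five words needed for the example.

```python
from typing import Iterable, List


class SimpleTrie:
    """Toy prefix tree; illustrative only, not PyThaiNLP's Trie."""

    _END = ""  # sentinel key marking the end of a stored word

    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root: dict = {}
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def remove(self, word: str) -> None:
        # Only unmark the word; leftover empty nodes are harmless in this sketch.
        node = self.root
        for ch in word:
            if ch not in node:
                return
            node = node[ch]
        node.pop(self._END, None)

    def longest_prefix(self, text: str, start: int) -> int:
        """Return the end index of the longest stored word beginning at
        `start`, or `start` itself if no stored word begins there."""
        node = self.root
        best = start
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self._END in node:
                best = i + 1
        return best


def longest_match_tokenize(text: str, trie: SimpleTrie) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    pos = 0
    while pos < len(text):
        end = trie.longest_prefix(text, pos)
        if end == pos:
            end = pos + 1  # unknown character: emit it on its own
        tokens.append(text[pos:end])
        pos = end
    return tokens


text = "ฝนตกทั่วฟ้า"
toy = SimpleTrie(["ฝน", "ตก", "ฝนตก", "ทั่ว", "ฟ้า"])
before = longest_match_tokenize(text, toy)

toy.remove("ฝนตก")
toy.add("ทั่วฟ้า")
after = longest_match_tokenize(text, toy)

print(before)  # ['ฝนตก', 'ทั่ว', 'ฟ้า']
print(after)   # ['ฝน', 'ตก', 'ทั่วฟ้า']
```

Removing the compound forces the matcher to fall back to the two shorter words, and adding a longer entry lets it win over the shorter prefix. That is the same effect the trie.remove and trie.add calls above have on word_tokenize.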