@garfieldnate
Created July 14, 2022 09:42
Create a custom user dictionary for use in MeCab through fugashi
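# Generate a custom MeCab user dictionary (custom.dic) to be used with unidic-lite
import sys

import unidic_lite
from fugashi.fugashi import build_dictionary

# build_dictionary takes one command-line string; the first word stands in for argv[0].
#   -f / -t  charset of the input CSV / of the compiled dictionary
#   -d       directory of the system dictionary (here: unidic-lite)
#   -u       path of the user dictionary to write
args = (
    sys.argv[0]
    + f" -f utf8 -t utf8 -d {unidic_lite.DICDIR} -u custom.dic custom_entries.csv"
)
print(args)
build_dictionary(args)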
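MeCab's dictionary compiler does not accept comment lines in the entries CSV, so an annotated file has to be cleaned before the build step. A minimal sketch of that cleanup, assuming comments are whole lines starting with `#` (the helper name `strip_comment_lines` and the file name `annotated_entries.csv` are hypothetical, not part of fugashi or MeCab):

```python
from pathlib import Path


def strip_comment_lines(text: str) -> str:
    """Drop blank lines and lines whose first non-space character is '#'."""
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(kept) + "\n" if kept else ""


# Hypothetical usage before calling build_dictionary:
# source = Path("annotated_entries.csv").read_text(encoding="utf-8")
# Path("custom_entries.csv").write_text(strip_comment_lines(source), encoding="utf-8")
```

Note that this would also drop a real entry whose surface form begins with `#`, so it only suits files where that cannot happen.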
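# Remove these comment lines before building (MeCab does not allow comments in the CSV)
# Columns: surface form, left context ID, right context ID, cost, then the dictionary's feature fields
# see https://twitter.com/zakki/status/920977351059554304 for a description of the fields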
かい,830,830,6319,助詞,終助詞,*,*,*,*,カイ,かい,かい,カイ,かい,カイ,和,*,*,*,*
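To make the layout of the entry above easier to check, here is a small sketch that splits a row into the four leading MeCab columns (surface form, left context ID, right context ID, cost) and the remaining feature fields. The `UserDictEntry` type and `parse_entry` helper are hypothetical names for illustration; which feature fields follow, and how many, depends on the base dictionary.

```python
import csv
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    surface: str         # the text to match
    left_id: int         # left context ID (index into the connection matrix)
    right_id: int        # right context ID
    cost: int            # word cost; lower values are preferred
    features: List[str]  # dictionary-specific feature fields


def parse_entry(line: str) -> UserDictEntry:
    """Parse one user-dictionary CSV row into its leading columns and features."""
    row = next(csv.reader([line]))
    if len(row) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(row)}")
    surface, left_id, right_id, cost, *features = row
    return UserDictEntry(surface, int(left_id), int(right_id), int(cost), features)


entry = parse_entry(
    "かい,830,830,6319,助詞,終助詞,*,*,*,*,カイ,かい,かい,カイ,かい,カイ,和,*,*,*,*"
)
print(entry.surface, entry.cost, len(entry.features))  # prints: かい 6319 17
```

A check like this catches rows with missing columns or non-numeric IDs before the dictionary is built.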
from fugashi import Tagger # type: ignore
# use the generated custom dictionary (plus unidic-lite)
TAGGER = Tagger("-Owakati -u custom.dic")
print(TAGGER("皆様こんにちは。本日はですね とっても特別なお客様にお伺いしたいと思いま〜す"))
# expected output: [皆, 様, こんにちは, 。, 本日, は, です, ね,  , とっても, 特別, な, お, 客, 様, に, お, 伺い, し, たい, と, 思い, ま, 〜, す]