@naotokui
Created May 9, 2017 01:44
Splitting Japanese text into space-separated words (wakachigaki) with MeCab
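import MeCab

# Tagger configured with the ChaSen output format, as in the original snippet.
mt = MeCab.Tagger("-Ochasen")


def wakati_text_mecab(text):
    """Return `text` with its words separated by single spaces."""
    # Python 3 bindings (mecab-python3) accept str directly; no UTF-8 encoding step.
    res = mt.parseToNode(text)
    words = []
    try:
        while res:
            surface = res.surface
            # The first comma-separated feature field is the part of speech.
            part = res.feature.split(",")[0]
            # Skip the sentence boundary nodes.
            if part != "BOS/EOS":
                words.append(surface)
            res = res.next
    except Exception as ex:
        print(ex)
    return ' '.join(words)


wakati = wakati_text_mecab("原子番号92のウランより重い元素は全て人工的に合成され、118番まで発見の報告がある。")
print(wakati)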
# 原子 番号 9 2 の ウラン より 重い 元素 は 全て 人工 的 に 合成 さ れ 、 1 1 8 番 まで 発見 の 報告 が ある 。
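
The loop above walks the linked list of nodes returned by `parseToNode` and drops the `BOS/EOS` boundary nodes. A minimal sketch of that same traversal as a generator follows. `StubNode`, `chain`, and `iter_surfaces` are hypothetical names introduced here for illustration, and the stub only mimics the `surface`, `feature`, and `next` attributes used above, so the sketch runs without MeCab installed. The feature strings are shortened placeholders, not real dictionary output.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a MeCab node (not part of the MeCab API)."""
    surface: str
    feature: str
    next: Optional["StubNode"] = None


def iter_surfaces(node) -> Iterator[str]:
    """Yield surface forms from a linked list of nodes, skipping BOS/EOS."""
    while node:
        # The first comma-separated feature field is the part of speech.
        if node.feature.split(",")[0] != "BOS/EOS":
            yield node.surface
        node = node.next


def chain(pairs):
    """Link (surface, feature) pairs into a node list (illustration only)."""
    head = None
    for surface, feature in reversed(pairs):
        head = StubNode(surface, feature, head)
    return head


# Placeholder nodes: boundary markers around two nouns.
nodes = chain([
    ("", "BOS/EOS,*,*"),
    ("原子", "名詞,一般,*"),
    ("番号", "名詞,一般,*"),
    ("", "BOS/EOS,*,*"),
])
print(" ".join(iter_surfaces(nodes)))  # 原子 番号
```

Since the generator relies only on those three attributes, it should also accept the node returned by `mt.parseToNode(text)`, although that has not been checked against a real MeCab install here.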