@wakusei-meron-
Created March 13, 2020 12:15
{
"cells": [
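{
"cell_type": "markdown",
"metadata": {},
"source": [
"MeCab is the best-known tool for Japanese morphological analysis, but it needs a separately managed dictionary and is said to be somewhat heavy to run, so this notebook tries [janome](https://mocobeta.github.io/janome/), which ships with its dictionary built in.\n",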
"\n",
"https://sitest.jp/blog/?p=6812:embed:cite"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n",
"も\t助詞,係助詞,*,*,*,*,も,モ,モ\n",
"もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n",
"も\t助詞,係助詞,*,*,*,*,も,モ,モ\n",
"もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n",
"の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n",
"うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\n"
]
}
],
"source": [
"# Try morphological analysis\n",
"from janome.tokenizer import Tokenizer\n",
"t = Tokenizer()\n",
"\n",
"for token in t.tokenize(\"すもももももももものうち\"):\n",
"    print(token)"
]
},
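{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzer framework\n",
"The Analyzer framework handles preprocessing and post-processing around morphological analysis:\n",
"* Preprocessing such as character normalization → CharFilter\n",
"* Post-processing such as lowercasing and filtering tokens by part of speech → TokenFilter\n",
"* A custom analysis flow combining CharFilter, Tokenizer, and TokenFilter → Analyzer\n",
"\n",
"The example below applies Unicode normalization and regex-based character replacement as preprocessing. After tokenization, it merges consecutive nouns into compound nouns, filters tokens by part of speech, and lowercases the surface forms."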
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"janome\t名詞,固有名詞,組織,*,*,*,janome,*,*\n",
"pure\t名詞,固有名詞,組織,*,*,*,pure,*,*\n",
"python\t名詞,一般,*,*,*,*,python,*,*\n",
"な\t助動詞,*,*,*,特殊・ダ,体言接続,だ,ナ,ナ\n",
"形態素解析器\t名詞,複合,*,*,*,*,形態素解析器,ケイタイソカイセキキ,ケイタイソカイセキキ\n",
"です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス\n"
]
}
],
"source": [
"from janome.analyzer import Analyzer\n",
"from janome.charfilter import *\n",
"from janome.tokenfilter import *\n",
"\n",
"text = \"蛇の目はPure Pythonな形態素解析器です。\"\n",
"\n",
"char_filters = [UnicodeNormalizeCharFilter(), RegexReplaceCharFilter(\"蛇の目\", \"janome\")]\n",
"token_filters = [CompoundNounFilter(), POSStopFilter([\"記号\", \"助詞\"]), LowerCaseFilter()]\n",
"# Keyword arguments also work on janome versions where Analyzer takes keyword-only parameters\n",
"a = Analyzer(char_filters=char_filters, tokenizer=t, token_filters=token_filters)\n",
"\n",
"for token in a.analyze(text):\n",
"    print(token)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word count\n",
"TokenCountFilter counts how often each word occurs in the input string.\n",
"The example below keeps only nouns and counts the occurrences of each one."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"すもも, 1\n",
"もも, 2\n",
"うち, 1\n"
]
}
],
"source": [
"token_filters = [POSKeepFilter([\"名詞\"]), TokenCountFilter()]\n",
"a = Analyzer(token_filters=token_filters)\n",
"text = \"すもももももももものうち\"\n",
"for k, v in a.analyze(text):\n",
"    print(\"{}, {}\".format(k, v))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing sorted=True to TokenCountFilter returns the words in descending order of count."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"もも, 2\n",
"すもも, 1\n",
"うち, 1\n"
]
}
],
"source": [
"token_filters = [POSKeepFilter([\"名詞\"]), TokenCountFilter(sorted=True)]\n",
"a = Analyzer(token_filters=token_filters)\n",
"text = \"すもももももももものうち\"\n",
"for k, v in a.analyze(text):\n",
"    print(\"{}, {}\".format(k, v))"
]
},
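{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wakati-gaki mode\n",
"Wakati-gaki means splitting a sentence into words, as if it were written with spaces between them. Passing wakati=True to tokenize returns only the surface form of each token."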
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# list() also works on janome versions where tokenize() returns a generator\n",
"list(t.tokenize(\"すもももももももものうち\", wakati=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Also usable from the command line\n",
"```\n",
"$ janome\n",
"猫は液体である\n",
"猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n",
"は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n",
"液体\t名詞,一般,*,*,*,*,液体,エキタイ,エキタイ\n",
"で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ\n",
"ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル\n",
"\n",
"hoge\n",
"hoge\t名詞,固有名詞,組織,*,*,*,hoge,*,*\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
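
To make the word-count step in the notebook concrete, here is a minimal standard-library sketch of what the POSKeepFilter and TokenCountFilter combination computes. It is not janome's implementation: the helper names `keep_pos` and `count_tokens` are hypothetical, and the token list is copied by hand from the notebook's first output cell rather than produced by a tokenizer.

```python
from collections import Counter

# (surface form, part of speech) pairs copied by hand from the
# notebook's first output cell; no tokenizer is run here.
tokens = [
    ("すもも", "名詞"),
    ("も", "助詞"),
    ("もも", "名詞"),
    ("も", "助詞"),
    ("もも", "名詞"),
    ("の", "助詞"),
    ("うち", "名詞"),
]


def keep_pos(tokens, pos_list):
    """Hypothetical stand-in for POSKeepFilter: keep tokens whose POS is listed."""
    return [(surface, pos) for surface, pos in tokens if pos in pos_list]


def count_tokens(tokens, sort_by_count=False):
    """Hypothetical stand-in for TokenCountFilter: count surface forms."""
    counts = Counter(surface for surface, _ in tokens)
    items = list(counts.items())  # order of first appearance
    if sort_by_count:
        # Stable sort: words with equal counts keep their first-appearance order.
        items.sort(key=lambda kv: kv[1], reverse=True)
    return items


nouns = keep_pos(tokens, ["名詞"])
unsorted_counts = count_tokens(nouns)
sorted_counts = count_tokens(nouns, sort_by_count=True)

for word, count in sorted_counts:
    print("{}, {}".format(word, count))
```

With the hand-copied tokens, both orderings agree with the outputs of the notebook's two word-count cells. Whether janome breaks ties in the same way for other inputs is an assumption of this sketch.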