Created
March 13, 2020 12:15
{ | |
"cells": [ | |
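{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"MeCab is the best-known tool for Japanese morphological analysis, but it requires a separate dictionary for the analyzer and is reportedly somewhat heavy to run, so here I try [janome](https://mocobeta.github.io/janome/), which bundles its dictionary.\n", | |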
"\n", | |
"https://sitest.jp/blog/?p=6812:embed:cite" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n", | |
"も\t助詞,係助詞,*,*,*,*,も,モ,モ\n", | |
"もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n", | |
"も\t助詞,係助詞,*,*,*,*,も,モ,モ\n", | |
"もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n", | |
"の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n", | |
"うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\n" | |
] | |
} | |
], | |
"source": [ | |
"# Try out morphological analysis\n", | |
"from janome.tokenizer import Tokenizer\n", | |
"t = Tokenizer()\n", | |
"\n", | |
"for token in t.tokenize(\"すもももももももものうち\"):\n", | |
" print(token)" | |
] | |
}, | |
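{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Analyzer framework\n", | |
"Handles pre-processing and post-processing around morphological analysis:\n", | |
"* Pre-processing such as character normalization → CharFilter\n", | |
"* Post-processing such as lowercasing and filtering tokens by part of speech → TokenFilter\n", | |
"* A custom analysis flow combining CharFilter, Tokenizer, and TokenFilter → Analyzer\n", | |
"\n", | |
"The example below applies Unicode normalization and a regex-based string replacement as pre-processing; after morphological analysis runs, it merges consecutive nouns into compound nouns, filters tokens by part of speech, and lowercases the surface forms as post-processing." | |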
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"janome\t名詞,固有名詞,組織,*,*,*,janome,*,*\n", | |
"pure\t名詞,固有名詞,組織,*,*,*,pure,*,*\n", | |
"python\t名詞,一般,*,*,*,*,python,*,*\n", | |
"な\t助動詞,*,*,*,特殊・ダ,体言接続,だ,ナ,ナ\n", | |
"形態素解析器\t名詞,複合,*,*,*,*,形態素解析器,ケイタイソカイセキキ,ケイタイソカイセキキ\n", | |
"です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス\n" | |
] | |
} | |
], | |
"source": [ | |
"from janome.analyzer import Analyzer\n", | |
"from janome.charfilter import *\n", | |
"from janome.tokenfilter import *\n", | |
"\n", | |
"text = \"蛇の目はPure Pythonな形態素解析器です。\"\n", | |
"\n", | |
"char_filters = [UnicodeNormalizeCharFilter(), RegexReplaceCharFilter(\"蛇の目\", \"janome\")]\n", | |
"token_filters = [CompoundNounFilter(), POSStopFilter([\"記号\", \"助詞\"]), LowerCaseFilter()]\n", | |
"a = Analyzer(char_filters, t, token_filters)\n", | |
"\n", | |
"for token in a.analyze(text):\n", | |
" print(token)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Word count\n", | |
"TokenCountFilter counts how often each word occurs in the input string.\n", | |
"The example below keeps only nouns and counts their occurrences." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"すもも, 1\n", | |
"もも, 2\n", | |
"うち, 1\n" | |
] | |
} | |
], | |
"source": [ | |
"token_filters = [POSKeepFilter([\"名詞\"]), TokenCountFilter()]\n", | |
"a = Analyzer(token_filters=token_filters)\n", | |
"text = \"すもももももももものうち\"\n", | |
"for k, v in a.analyze(text):\n", | |
" print(\"{}, {}\".format(k, v))\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Passing sorted=True orders the results by count, most frequent first." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"もも, 2\n", | |
"すもも, 1\n", | |
"うち, 1\n" | |
] | |
} | |
], | |
"source": [ | |
"token_filters = [POSKeepFilter([\"名詞\"]), TokenCountFilter(sorted=True)]\n", | |
"a = Analyzer(token_filters=token_filters)\n", | |
"text = \"すもももももももものうち\"\n", | |
"for k, v in a.analyze(text):\n", | |
" print(\"{}, {}\".format(k, v))" | |
] | |
}, | |
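{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Wakati-gaki (word segmentation) mode\n", | |
"Wakati-gaki means writing a sentence with its words separated, as if spaces were placed between them. Passing wakati=True to tokenize returns only the surface forms of the segmented words." | |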
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"t.tokenize(\"すもももももももものうち\", wakati=True)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Also usable from the command line\n", | |
"```\n", | |
"$ janome\n", | |
"猫は液体である\n", | |
"猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n", | |
"は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n", | |
"液体\t名詞,一般,*,*,*,*,液体,エキタイ,エキタイ\n", | |
"で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ\n", | |
"ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル\n", | |
"\n", | |
"hoge\n", | |
"hoge\t名詞,固有名詞,組織,*,*,*,hoge,*,*\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |