@pon-x
Last active November 9, 2020 02:31
Trying out morphological analysis on Aozora Bunko texts.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural Language Processing\n",
"\n",
"## Morphological analysis with MeCab"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From Aozora Bunko, download the text files of<br>\n",
"吾輩は猫である (I Am a Cat) → https://www.aozora.gr.jp/cards/000148/card789.html<br>\n",
"こころ (Kokoro) → https://www.aozora.gr.jp/cards/000148/card773.html<br>\n",
"and then do the following.<br>\n",
"<br>\n",
"1. Read each file and run morphological analysis on it.<br>\n",
"2. Keep only nouns, excluding single-character words.<br>\n",
"3. Count how often each word occurs, sort the 10 most frequent in descending order, and plot them.<br>\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"import MeCab\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"\n",
"\n",
"class Aozora:\n",
"    # Noun-frequency analysis of one Aozora Bunko text file.\n",
"\n",
"    def __init__(self, path, encoding=None):\n",
"        # Aozora Bunko distributes its text files as Shift_JIS; pass\n",
"        # encoding=\"shift_jis\" if the platform default cannot decode them.\n",
"        with open(path, encoding=encoding) as f:  # read the whole file\n",
"            self.text = f.read()\n",
"\n",
"        self.doc = []  # nouns collected by words()\n",
"\n",
"    def words(self):\n",
"        # Return the 10 most frequent nouns of 2+ characters as a Series.\n",
"        tagger = MeCab.Tagger()\n",
"        self.doc = []  # reset so repeated calls do not double-count\n",
"\n",
"        for line in tagger.parse(self.text).split(\"\\n\"):\n",
"            if line == \"EOS\":  # end-of-sentence marker: stop\n",
"                break\n",
"            if \"\\t\" not in line:  # skip blank or malformed lines\n",
"                continue\n",
"\n",
"            word, feature = line.split(\"\\t\", 1)\n",
"            part = feature.split(\",\")[0]  # first feature field: part of speech\n",
"            if part == \"名詞\" and len(word) > 1:  # nouns of 2+ characters only\n",
"                self.doc.append(word)\n",
"\n",
"        counts = pd.Series(Counter(self.doc), dtype=\"int64\")\n",
"\n",
"        # Keep the top 10 in ascending order, so that barh() draws the\n",
"        # most frequent word at the top of the chart.\n",
"        return counts.sort_values()[-10:]\n",
"\n",
"    def graph(self, dic_s):\n",
"        # Draw a horizontal bar chart of the word counts.\n",
"        fig = plt.figure(figsize=(10, 6))\n",
"        ax = fig.add_subplot()\n",
"\n",
"        ax.barh(dic_s.index, dic_s.values)\n",
"\n",
"        plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = \"wagahaiwa_nekodearu.txt\"\n",
"neko = Aozora(path)\n",
"w_neko = neko.words()\n",
"\n",
"path = \"kokoro.txt\"\n",
"kokoro = Aozora(path)\n",
"w_kokoro = kokoro.words()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"neko.graph(w_neko)\n",
"\n",
"kokoro.graph(w_kokoro)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
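
The filtering and counting steps in `Aozora.words()` can be tried without installing MeCab by feeding in text that is already laid out like MeCab's default output: a surface form, a tab, then comma-separated features, with `EOS` closing the sentence. The sketch below works under that assumption. The `top_nouns` helper and the hand-written `SAMPLE` lines are hypothetical and not part of the notebook; real feature fields depend on the installed dictionary.

```python
from collections import Counter

# Hand-written sample imitating MeCab's "surface<TAB>features" layout.
# Illustrative only: the feature fields are abbreviated.
SAMPLE = "\n".join([
    "吾輩\t名詞,代名詞",
    "は\t助詞,係助詞",
    "猫\t名詞,一般",
    "で\t助動詞",
    "ある\t助動詞",
    "名前\t名詞,一般",
    "吾輩\t名詞,代名詞",
    "EOS",
    "",
])


def top_nouns(parsed, n=10):
    """Count nouns of 2+ characters and return the n most frequent."""
    counts = Counter()
    for line in parsed.split("\n"):
        if line == "EOS":        # end-of-sentence marker: stop
            break
        if "\t" not in line:     # skip blank or malformed lines
            continue
        word, feature = line.split("\t", 1)
        # The first feature field is taken to be the part of speech.
        if feature.split(",")[0] == "名詞" and len(word) > 1:
            counts[word] += 1
    return counts.most_common(n)  # (word, count) pairs, most frequent first


print(top_nouns(SAMPLE))  # [('吾輩', 2), ('名前', 1)]
```

Here `猫` is a noun but is dropped by the one-character rule, and `Counter.most_common` already returns the pairs in descending order of frequency, so no separate sort is needed.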