@yssymmt
Created February 18, 2023 04:37
{
"cells": [
{
"cell_type": "markdown",
"id": "3053d54b",
"metadata": {},
"source": [
"#04: janome"
]
},
{
"cell_type": "markdown",
"id": "944a887d",
"metadata": {},
"source": [
"####パッケージの読み込み"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "05a9311e",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sqlalchemy import create_engine\n",
"import teradatasqlalchemy\n",
"from janome.tokenizer import Tokenizer"
]
},
{
"cell_type": "markdown",
"id": "f430a4cd",
"metadata": {},
"source": [
"####Teradataへの接続、sqlalchemy エンジンを作成"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c7c403ef",
"metadata": {},
"outputs": [],
"source": [
"host = \"192.168.999.999\"\n",
"user = \"jumbo\"\n",
"password = \"mambo\"\n",
"connstr = \"teradatasql://{user}:{password}@{host}\".format(host=host, user=user, password=password)\n",
"engine = create_engine(connstr)"
]
},
{
"cell_type": "markdown",
"id": "e7871f2d",
"metadata": {},
"source": [
"####データを取得"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "96066a14",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>docid</th>\n",
" <th>cat</th>\n",
" <th>docdesc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>19</td>\n",
" <td>春日</td>\n",
" <td>ぼる塾の人と「まあねぇ」と「トゥース!」の掛け合いは面白かった</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>若林</td>\n",
" <td>山里亮太にはツッコミでは敵わないと思っている</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>若林</td>\n",
" <td>藤井青銅「ピンクのベストじゃない方がしゃべれるんだよ」</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>15</td>\n",
" <td>春日</td>\n",
" <td>普段は靴下を履かないので、足の裏が象のようになっている</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>若林</td>\n",
" <td>プライベートのバスケットで足を怪我した</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>13</td>\n",
" <td>春日</td>\n",
" <td>ピンクのセーターを着た後輩の芸人から、すいません、ピンク着させてもらってますと挨拶された</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3</td>\n",
" <td>若林</td>\n",
" <td>ナナメの夕暮れ他、本を出している</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>11</td>\n",
" <td>春日</td>\n",
" <td>六本木の社長からモンクレールのダウンをもらっていた</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1</td>\n",
" <td>若林</td>\n",
" <td>若槻千夏「幾つかのテレビの番組で司会を務めるが、本番以外では人見知りで話さない」</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>春日</td>\n",
" <td>茶々という名前のチワワ犬を飼っている</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>20</td>\n",
" <td>春日</td>\n",
" <td>スベる芸風なのに、スベるのを怖いと思っている</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>16</td>\n",
" <td>春日</td>\n",
" <td>バカリズム「存在が面白い。ウケるスベるとかじゃない」</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>18</td>\n",
" <td>若林</td>\n",
" <td>入船出身なのに築地出身ですと嘘をついたら、地元の人にお前入船だろとツッコミされた</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>14</td>\n",
" <td>春日</td>\n",
" <td>漫才ではボケを担当するが、ラジオやテレビでは全然ボケない</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>8</td>\n",
" <td>若林</td>\n",
" <td>mc.wakaとして、日本武道館、横浜アリーナなどで人の歌にラップで茶々を入れている</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>12</td>\n",
" <td>春日</td>\n",
" <td>ピンクベストを着て胸を張っていて、トゥースと大声で叫ぶ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>6</td>\n",
" <td>若林</td>\n",
" <td>星野源「日本、テレビ界の希望だと思う」</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10</td>\n",
" <td>春日</td>\n",
" <td>結婚直前に浮気がばれた</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>4</td>\n",
" <td>若林</td>\n",
" <td>深夜に一人でバスケットボールのスリーポイントを練習している</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2</td>\n",
" <td>若林</td>\n",
" <td>漫才ではツッコミを担当するが、「たりないふたり」ではボケを担当していた</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" docid cat docdesc\n",
"0 19 春日 ぼる塾の人と「まあねぇ」と「トゥース!」の掛け合いは面白かった\n",
"1 17 若林 山里亮太にはツッコミでは敵わないと思っている\n",
"2 7 若林 藤井青銅「ピンクのベストじゃない方がしゃべれるんだよ」\n",
"3 15 春日 普段は靴下を履かないので、足の裏が象のようになっている\n",
"4 5 若林 プライベートのバスケットで足を怪我した\n",
"5 13 春日 ピンクのセーターを着た後輩の芸人から、すいません、ピンク着させてもらってますと挨拶された\n",
"6 3 若林 ナナメの夕暮れ他、本を出している\n",
"7 11 春日 六本木の社長からモンクレールのダウンをもらっていた\n",
"8 1 若林 若槻千夏「幾つかのテレビの番組で司会を務めるが、本番以外では人見知りで話さない」\n",
"9 9 春日 茶々という名前のチワワ犬を飼っている\n",
"10 20 春日 スベる芸風なのに、スベるのを怖いと思っている\n",
"11 16 春日 バカリズム「存在が面白い。ウケるスベるとかじゃない」\n",
"12 18 若林 入船出身なのに築地出身ですと嘘をついたら、地元の人にお前入船だろとツッコミされた\n",
"13 14 春日 漫才ではボケを担当するが、ラジオやテレビでは全然ボケない\n",
"14 8 若林 mc.wakaとして、日本武道館、横浜アリーナなどで人の歌にラップで茶々を入れている\n",
"15 12 春日 ピンクベストを着て胸を張っていて、トゥースと大声で叫ぶ\n",
"16 6 若林 星野源「日本、テレビ界の希望だと思う」\n",
"17 10 春日 結婚直前に浮気がばれた\n",
"18 4 若林 深夜に一人でバスケットボールのスリーポイントを練習している\n",
"19 2 若林 漫才ではツッコミを担当するが、「たりないふたり」ではボケを担当していた"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with engine.connect() as conn:\n",
" df = pd.read_sql(\"\"\"\n",
" select *\n",
" from jumbo.aud03_neologdn\n",
" \"\"\", conn)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "6338ef23",
"metadata": {},
"source": [
"####配列に"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9966e4a8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[19, 'ぼる塾の人と「まあねぇ」と「トゥース!」の掛け合いは面白かった'],\n",
" [17, '山里亮太にはツッコミでは敵わないと思っている'],\n",
" [7, '藤井青銅「ピンクのベストじゃない方がしゃべれるんだよ」'],\n",
" [15, '普段は靴下を履かないので、足の裏が象のようになっている'],\n",
" [5, 'プライベートのバスケットで足を怪我した'],\n",
" [13, 'ピンクのセーターを着た後輩の芸人から、すいません、ピンク着させてもらってますと挨拶された'],\n",
" [3, 'ナナメの夕暮れ他、本を出している'],\n",
" [11, '六本木の社長からモンクレールのダウンをもらっていた'],\n",
" [1, '若槻千夏「幾つかのテレビの番組で司会を務めるが、本番以外では人見知りで話さない」'],\n",
" [9, '茶々という名前のチワワ犬を飼っている'],\n",
" [20, 'スベる芸風なのに、スベるのを怖いと思っている'],\n",
" [16, 'バカリズム「存在が面白い。ウケるスベるとかじゃない」'],\n",
" [18, '入船出身なのに築地出身ですと嘘をついたら、地元の人にお前入船だろとツッコミされた'],\n",
" [14, '漫才ではボケを担当するが、ラジオやテレビでは全然ボケない'],\n",
" [8, 'mc.wakaとして、日本武道館、横浜アリーナなどで人の歌にラップで茶々を入れている'],\n",
" [12, 'ピンクベストを着て胸を張っていて、トゥースと大声で叫ぶ'],\n",
" [6, '星野源「日本、テレビ界の希望だと思う」'],\n",
" [10, '結婚直前に浮気がばれた'],\n",
" [4, '深夜に一人でバスケットボールのスリーポイントを練習している'],\n",
" [2, '漫才ではツッコミを担当するが、「たりないふたり」ではボケを担当していた']], dtype=object)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfa = df.loc[:, ['docid','docdesc']].values\n",
"dfa"
]
},
{
"cell_type": "markdown",
"id": "9f45a0d1",
"metadata": {},
"source": [
"####形態素解析処理の関数"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "85d06564",
"metadata": {},
"outputs": [],
"source": [
"def wakati(text):\n",
" t = Tokenizer()\n",
" results = t.tokenize(text)\n",
" words = []\n",
" for token in results:\n",
" words.append(token.surface) \n",
" words.append(token.part_of_speech)\n",
" words.append(token.base_form)\n",
" words.append(\"★\")\n",
" return words"
]
},
{
"cell_type": "markdown",
"id": "a0622cf7",
"metadata": {},
"source": [
"####最終的な結果を出力するための空Data Frameを作成する"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6ab8b5dd",
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.DataFrame( columns=['docid','docdesc'] )"
]
},
{
"cell_type": "markdown",
"id": "2f2dddc6",
"metadata": {},
"source": [
"####形態素解析"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "c92dab38",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>docid</th>\n",
" <th>docdesc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>19</td>\n",
" <td>[ぼる, 動詞,自立,*,*, ぼる, ★, 塾, 名詞,一般,*,*, 塾, ★, の, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>[山里, 名詞,固有名詞,人名,姓, 山里, ★, 亮太, 名詞,固有名詞,人名,名, 亮太...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>[藤井, 名詞,固有名詞,人名,姓, 藤井, ★, 青銅, 名詞,一般,*,*, 青銅, ★...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>15</td>\n",
" <td>[普段, 名詞,副詞可能,*,*, 普段, ★, は, 助詞,係助詞,*,*, は, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>[プライベート, 名詞,一般,*,*, プライベート, ★, の, 助詞,連体化,*,*, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>13</td>\n",
" <td>[ピンク, 名詞,一般,*,*, ピンク, ★, の, 助詞,連体化,*,*, の, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3</td>\n",
" <td>[ナナメ, 名詞,一般,*,*, ナナメ, ★, の, 助詞,連体化,*,*, の, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>11</td>\n",
" <td>[六本木, 名詞,固有名詞,地域,一般, 六本木, ★, の, 助詞,連体化,*,*, の,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1</td>\n",
" <td>[若槻, 名詞,固有名詞,人名,姓, 若槻, ★, 千夏, 名詞,固有名詞,人名,名, 千夏...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>[茶々, 名詞,一般,*,*, 茶々, ★, という, 助詞,格助詞,連語,*, という, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>20</td>\n",
" <td>[スベ, 名詞,一般,*,*, スベ, ★, る, 助動詞,*,*,*, り, ★, 芸風,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>16</td>\n",
" <td>[バカ, 接頭詞,名詞接続,*,*, バカ, ★, リズム, 名詞,一般,*,*, リズム,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>18</td>\n",
" <td>[入船, 名詞,一般,*,*, 入船, ★, 出身, 名詞,一般,*,*, 出身, ★, な...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>14</td>\n",
" <td>[漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>8</td>\n",
" <td>[mc, 名詞,固有名詞,組織,*, mc, ★, ., 名詞,サ変接続,*,*, ., ★...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>12</td>\n",
" <td>[ピンク, 名詞,一般,*,*, ピンク, ★, ベスト, 名詞,一般,*,*, ベスト, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>6</td>\n",
" <td>[星野, 名詞,固有名詞,人名,姓, 星野, ★, 源, 名詞,固有名詞,人名,名, 源, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10</td>\n",
" <td>[結婚, 名詞,サ変接続,*,*, 結婚, ★, 直前, 名詞,一般,*,*, 直前, ★,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>4</td>\n",
" <td>[深夜, 名詞,副詞可能,*,*, 深夜, ★, に, 助詞,格助詞,一般,*, に, ★,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2</td>\n",
" <td>[漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" docid docdesc\n",
"0 19 [ぼる, 動詞,自立,*,*, ぼる, ★, 塾, 名詞,一般,*,*, 塾, ★, の, ...\n",
"1 17 [山里, 名詞,固有名詞,人名,姓, 山里, ★, 亮太, 名詞,固有名詞,人名,名, 亮太...\n",
"2 7 [藤井, 名詞,固有名詞,人名,姓, 藤井, ★, 青銅, 名詞,一般,*,*, 青銅, ★...\n",
"3 15 [普段, 名詞,副詞可能,*,*, 普段, ★, は, 助詞,係助詞,*,*, は, ★, ...\n",
"4 5 [プライベート, 名詞,一般,*,*, プライベート, ★, の, 助詞,連体化,*,*, ...\n",
"5 13 [ピンク, 名詞,一般,*,*, ピンク, ★, の, 助詞,連体化,*,*, の, ★, ...\n",
"6 3 [ナナメ, 名詞,一般,*,*, ナナメ, ★, の, 助詞,連体化,*,*, の, ★, ...\n",
"7 11 [六本木, 名詞,固有名詞,地域,一般, 六本木, ★, の, 助詞,連体化,*,*, の,...\n",
"8 1 [若槻, 名詞,固有名詞,人名,姓, 若槻, ★, 千夏, 名詞,固有名詞,人名,名, 千夏...\n",
"9 9 [茶々, 名詞,一般,*,*, 茶々, ★, という, 助詞,格助詞,連語,*, という, ...\n",
"10 20 [スベ, 名詞,一般,*,*, スベ, ★, る, 助動詞,*,*,*, り, ★, 芸風,...\n",
"11 16 [バカ, 接頭詞,名詞接続,*,*, バカ, ★, リズム, 名詞,一般,*,*, リズム,...\n",
"12 18 [入船, 名詞,一般,*,*, 入船, ★, 出身, 名詞,一般,*,*, 出身, ★, な...\n",
"13 14 [漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...\n",
"14 8 [mc, 名詞,固有名詞,組織,*, mc, ★, ., 名詞,サ変接続,*,*, ., ★...\n",
"15 12 [ピンク, 名詞,一般,*,*, ピンク, ★, ベスト, 名詞,一般,*,*, ベスト, ...\n",
"16 6 [星野, 名詞,固有名詞,人名,姓, 星野, ★, 源, 名詞,固有名詞,人名,名, 源, ...\n",
"17 10 [結婚, 名詞,サ変接続,*,*, 結婚, ★, 直前, 名詞,一般,*,*, 直前, ★,...\n",
"18 4 [深夜, 名詞,副詞可能,*,*, 深夜, ★, に, 助詞,格助詞,一般,*, に, ★,...\n",
"19 2 [漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は..."
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for i in range(dfa.shape[0]):\n",
" x = dfa[i][0]\n",
" w = wakati(dfa[i][1].replace(' ', ''))\n",
" df2 = pd.DataFrame({'docid':[x], 'docdesc':[w]})\n",
" df1 = pd.concat([df1,df2],ignore_index=True)\n",
"df1"
]
},
{
"cell_type": "markdown",
"id": "eedf4109",
"metadata": {},
"source": [
"####データフレーム内の列型変換"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "ede63611",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"docid int64\n",
"docdesc object\n",
"dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1[\"docid\"]=df1[\"docid\"].astype('int64')\n",
"df1[\"docdesc\"]=df1[\"docdesc\"].astype('str')\n",
"df1.dtypes"
]
},
{
"cell_type": "markdown",
"id": "bde0b41e",
"metadata": {},
"source": [
"####最大文字数を確認"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f051fd63",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"786"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max(map(len, df1['docdesc']))"
]
},
{
"cell_type": "markdown",
"id": "a0e83e46",
"metadata": {},
"source": [
"####格納用テーブルを用意"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0642df48",
"metadata": {},
"outputs": [],
"source": [
"with engine.connect() as conn:\n",
" x1 = pd.read_sql(\"\"\"\n",
" create multiset table jumbo.aud05_janome (\n",
" docid integer, \n",
" docdesc varchar(1000) character set unicode \n",
" ) primary index (docid) \n",
" \"\"\", conn)"
]
},
{
"cell_type": "markdown",
"id": "d693a997",
"metadata": {},
"source": [
"####形態素解析後データの格納"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d96b63de",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.to_sql('aud05_janome',engine,if_exists='append',index=False)"
]
},
{
"cell_type": "markdown",
"id": "9389ce43",
"metadata": {},
"source": [
"####いったんお掃除"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "57af4a7a",
"metadata": {},
"outputs": [],
"source": [
"df1 = df1.drop(range(len(df1)),inplace=True)\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "61a208d4",
"metadata": {},
"outputs": [],
"source": [
"df2 = df2.drop(range(len(df2)),inplace=True)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "d14b8f39",
"metadata": {},
"source": [
"####ユーザー辞書の利用1: MeCab IPADIC フォーマット(ぼる塾を追加)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "ec790da8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>docid</th>\n",
" <th>docdesc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>19</td>\n",
" <td>[ぼる塾, 名詞,固有名詞,一般,*, ぼる塾, ★, の, 名詞,非自立,一般,*, の,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>[山里, 名詞,固有名詞,人名,姓, 山里, ★, 亮太, 名詞,固有名詞,人名,名, 亮太...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>[藤井, 名詞,固有名詞,人名,姓, 藤井, ★, 青銅, 名詞,一般,*,*, 青銅, ★...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>15</td>\n",
" <td>[普段, 名詞,副詞可能,*,*, 普段, ★, は, 助詞,係助詞,*,*, は, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>[プライベート, 名詞,一般,*,*, プライベート, ★, の, 助詞,連体化,*,*, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>13</td>\n",
" <td>[ピンク, 名詞,一般,*,*, ピンク, ★, の, 助詞,連体化,*,*, の, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3</td>\n",
" <td>[ナナメ, 名詞,一般,*,*, ナナメ, ★, の, 助詞,連体化,*,*, の, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>11</td>\n",
" <td>[六本木, 名詞,固有名詞,地域,一般, 六本木, ★, の, 助詞,連体化,*,*, の,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1</td>\n",
" <td>[若槻, 名詞,固有名詞,人名,姓, 若槻, ★, 千夏, 名詞,固有名詞,人名,名, 千夏...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>[茶々, 名詞,一般,*,*, 茶々, ★, という, 助詞,格助詞,連語,*, という, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>20</td>\n",
" <td>[スベ, 名詞,一般,*,*, スベ, ★, る, 助動詞,*,*,*, り, ★, 芸風,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>16</td>\n",
" <td>[バカ, 接頭詞,名詞接続,*,*, バカ, ★, リズム, 名詞,一般,*,*, リズム,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>18</td>\n",
" <td>[入船, 名詞,一般,*,*, 入船, ★, 出身, 名詞,一般,*,*, 出身, ★, な...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>14</td>\n",
" <td>[漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>8</td>\n",
" <td>[mc, 名詞,固有名詞,組織,*, mc, ★, ., 名詞,サ変接続,*,*, ., ★...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>12</td>\n",
" <td>[ピンク, 名詞,一般,*,*, ピンク, ★, ベスト, 名詞,一般,*,*, ベスト, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>6</td>\n",
" <td>[星野, 名詞,固有名詞,人名,姓, 星野, ★, 源, 名詞,固有名詞,人名,名, 源, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10</td>\n",
" <td>[結婚, 名詞,サ変接続,*,*, 結婚, ★, 直前, 名詞,一般,*,*, 直前, ★,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>4</td>\n",
" <td>[深夜, 名詞,副詞可能,*,*, 深夜, ★, に, 助詞,格助詞,一般,*, に, ★,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2</td>\n",
" <td>[漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" docid docdesc\n",
"0 19 [ぼる塾, 名詞,固有名詞,一般,*, ぼる塾, ★, の, 名詞,非自立,一般,*, の,...\n",
"1 17 [山里, 名詞,固有名詞,人名,姓, 山里, ★, 亮太, 名詞,固有名詞,人名,名, 亮太...\n",
"2 7 [藤井, 名詞,固有名詞,人名,姓, 藤井, ★, 青銅, 名詞,一般,*,*, 青銅, ★...\n",
"3 15 [普段, 名詞,副詞可能,*,*, 普段, ★, は, 助詞,係助詞,*,*, は, ★, ...\n",
"4 5 [プライベート, 名詞,一般,*,*, プライベート, ★, の, 助詞,連体化,*,*, ...\n",
"5 13 [ピンク, 名詞,一般,*,*, ピンク, ★, の, 助詞,連体化,*,*, の, ★, ...\n",
"6 3 [ナナメ, 名詞,一般,*,*, ナナメ, ★, の, 助詞,連体化,*,*, の, ★, ...\n",
"7 11 [六本木, 名詞,固有名詞,地域,一般, 六本木, ★, の, 助詞,連体化,*,*, の,...\n",
"8 1 [若槻, 名詞,固有名詞,人名,姓, 若槻, ★, 千夏, 名詞,固有名詞,人名,名, 千夏...\n",
"9 9 [茶々, 名詞,一般,*,*, 茶々, ★, という, 助詞,格助詞,連語,*, という, ...\n",
"10 20 [スベ, 名詞,一般,*,*, スベ, ★, る, 助動詞,*,*,*, り, ★, 芸風,...\n",
"11 16 [バカ, 接頭詞,名詞接続,*,*, バカ, ★, リズム, 名詞,一般,*,*, リズム,...\n",
"12 18 [入船, 名詞,一般,*,*, 入船, ★, 出身, 名詞,一般,*,*, 出身, ★, な...\n",
"13 14 [漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...\n",
"14 8 [mc, 名詞,固有名詞,組織,*, mc, ★, ., 名詞,サ変接続,*,*, ., ★...\n",
"15 12 [ピンク, 名詞,一般,*,*, ピンク, ★, ベスト, 名詞,一般,*,*, ベスト, ...\n",
"16 6 [星野, 名詞,固有名詞,人名,姓, 星野, ★, 源, 名詞,固有名詞,人名,名, 源, ...\n",
"17 10 [結婚, 名詞,サ変接続,*,*, 結婚, ★, 直前, 名詞,一般,*,*, 直前, ★,...\n",
"18 4 [深夜, 名詞,副詞可能,*,*, 深夜, ★, に, 助詞,格助詞,一般,*, に, ★,...\n",
"19 2 [漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は..."
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def mwakati(text):\n",
" t = Tokenizer(\"janomedic1.csv\", udic_enc=\"utf8\")\n",
" results = t.tokenize(text)\n",
" words = []\n",
" for token in results:\n",
" words.append(token.surface) \n",
" words.append(token.part_of_speech)\n",
" words.append(token.base_form)\n",
" words.append(\"★\")\n",
" return words\n",
"\n",
"for i in range(dfa.shape[0]):\n",
" x = dfa[i][0]\n",
" w = mwakati(dfa[i][1].replace(' ', ''))\n",
" df2 = pd.DataFrame({'docid':[x], 'docdesc':[w]})\n",
" df1 = pd.concat([df1,df2],ignore_index=True)\n",
"df1"
]
},
{
"cell_type": "markdown",
"id": "a4aac982",
"metadata": {},
"source": [
"####ユーザー辞書の利用2: 簡略辞書フォーマット(山里亮太を追加)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "301cc99f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>docid</th>\n",
" <th>docdesc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>19</td>\n",
" <td>[ぼる, 動詞,自立,*,*, ぼる, ★, 塾, 名詞,一般,*,*, 塾, ★, の, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>[山里亮太, カスタム名詞,*,*,*, 山里亮太, ★, に, 助詞,格助詞,一般,*, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>[藤井, 名詞,固有名詞,人名,姓, 藤井, ★, 青銅, 名詞,一般,*,*, 青銅, ★...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>15</td>\n",
" <td>[普段, 名詞,副詞可能,*,*, 普段, ★, は, 助詞,係助詞,*,*, は, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>[プライベート, 名詞,一般,*,*, プライベート, ★, の, 助詞,連体化,*,*, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>13</td>\n",
" <td>[ピンク, 名詞,一般,*,*, ピンク, ★, の, 助詞,連体化,*,*, の, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3</td>\n",
" <td>[ナナメ, 名詞,一般,*,*, ナナメ, ★, の, 助詞,連体化,*,*, の, ★, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>11</td>\n",
" <td>[六本木, 名詞,固有名詞,地域,一般, 六本木, ★, の, 助詞,連体化,*,*, の,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1</td>\n",
" <td>[若槻, 名詞,固有名詞,人名,姓, 若槻, ★, 千夏, 名詞,固有名詞,人名,名, 千夏...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>[茶々, 名詞,一般,*,*, 茶々, ★, という, 助詞,格助詞,連語,*, という, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>20</td>\n",
" <td>[スベ, 名詞,一般,*,*, スベ, ★, る, 助動詞,*,*,*, り, ★, 芸風,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>16</td>\n",
" <td>[バカ, 接頭詞,名詞接続,*,*, バカ, ★, リズム, 名詞,一般,*,*, リズム,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>18</td>\n",
" <td>[入船, 名詞,一般,*,*, 入船, ★, 出身, 名詞,一般,*,*, 出身, ★, な...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>14</td>\n",
" <td>[漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>8</td>\n",
" <td>[mc, 名詞,固有名詞,組織,*, mc, ★, ., 名詞,サ変接続,*,*, ., ★...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>12</td>\n",
" <td>[ピンク, 名詞,一般,*,*, ピンク, ★, ベスト, 名詞,一般,*,*, ベスト, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>6</td>\n",
" <td>[星野, 名詞,固有名詞,人名,姓, 星野, ★, 源, 名詞,固有名詞,人名,名, 源, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>10</td>\n",
" <td>[結婚, 名詞,サ変接続,*,*, 結婚, ★, 直前, 名詞,一般,*,*, 直前, ★,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>4</td>\n",
" <td>[深夜, 名詞,副詞可能,*,*, 深夜, ★, に, 助詞,格助詞,一般,*, に, ★,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2</td>\n",
" <td>[漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" docid docdesc\n",
"0 19 [ぼる, 動詞,自立,*,*, ぼる, ★, 塾, 名詞,一般,*,*, 塾, ★, の, ...\n",
"1 17 [山里亮太, カスタム名詞,*,*,*, 山里亮太, ★, に, 助詞,格助詞,一般,*, ...\n",
"2 7 [藤井, 名詞,固有名詞,人名,姓, 藤井, ★, 青銅, 名詞,一般,*,*, 青銅, ★...\n",
"3 15 [普段, 名詞,副詞可能,*,*, 普段, ★, は, 助詞,係助詞,*,*, は, ★, ...\n",
"4 5 [プライベート, 名詞,一般,*,*, プライベート, ★, の, 助詞,連体化,*,*, ...\n",
"5 13 [ピンク, 名詞,一般,*,*, ピンク, ★, の, 助詞,連体化,*,*, の, ★, ...\n",
"6 3 [ナナメ, 名詞,一般,*,*, ナナメ, ★, の, 助詞,連体化,*,*, の, ★, ...\n",
"7 11 [六本木, 名詞,固有名詞,地域,一般, 六本木, ★, の, 助詞,連体化,*,*, の,...\n",
"8 1 [若槻, 名詞,固有名詞,人名,姓, 若槻, ★, 千夏, 名詞,固有名詞,人名,名, 千夏...\n",
"9 9 [茶々, 名詞,一般,*,*, 茶々, ★, という, 助詞,格助詞,連語,*, という, ...\n",
"10 20 [スベ, 名詞,一般,*,*, スベ, ★, る, 助動詞,*,*,*, り, ★, 芸風,...\n",
"11 16 [バカ, 接頭詞,名詞接続,*,*, バカ, ★, リズム, 名詞,一般,*,*, リズム,...\n",
"12 18 [入船, 名詞,一般,*,*, 入船, ★, 出身, 名詞,一般,*,*, 出身, ★, な...\n",
"13 14 [漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は...\n",
"14 8 [mc, 名詞,固有名詞,組織,*, mc, ★, ., 名詞,サ変接続,*,*, ., ★...\n",
"15 12 [ピンク, 名詞,一般,*,*, ピンク, ★, ベスト, 名詞,一般,*,*, ベスト, ...\n",
"16 6 [星野, 名詞,固有名詞,人名,姓, 星野, ★, 源, 名詞,固有名詞,人名,名, 源, ...\n",
"17 10 [結婚, 名詞,サ変接続,*,*, 結婚, ★, 直前, 名詞,一般,*,*, 直前, ★,...\n",
"18 4 [深夜, 名詞,副詞可能,*,*, 深夜, ★, に, 助詞,格助詞,一般,*, に, ★,...\n",
"19 2 [漫才, 名詞,一般,*,*, 漫才, ★, で, 助詞,格助詞,一般,*, で, ★, は..."
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def swakati(text):\n",
" t = Tokenizer(\"janomedic2.csv\", udic_type=\"simpledic\", udic_enc=\"utf8\")\n",
" results = t.tokenize(text)\n",
" words = []\n",
" for token in results:\n",
" words.append(token.surface) \n",
" words.append(token.part_of_speech)\n",
" words.append(token.base_form)\n",
" words.append(\"★\")\n",
" return words\n",
"\n",
"for i in range(dfa.shape[0]):\n",
" x = dfa[i][0]\n",
" w = swakati(dfa[i][1].replace(' ', ''))\n",
" df2 = pd.DataFrame({'docid':[x], 'docdesc':[w]})\n",
" df1 = pd.concat([df1,df2],ignore_index=True)\n",
"df1"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
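
The notebook loads `janomedic1.csv` and `janomedic2.csv` but does not show what they contain. As a rough sketch, the two files could be generated as below. This assumes Janome's documented layouts: 13 comma-separated fields for the MeCab IPADIC format, and three fields (surface, part of speech, reading) for the simplified format. The readings, the context IDs (`1288`) and the cost (`4569`) are illustrative guesses, not the author's actual entries, and `write_dictionary` is a hypothetical helper.

```python
import csv
import os
import tempfile

# Assumed entries: the notebook does not show its dictionary files, so the
# readings, context IDs (1288) and cost (4569) below are illustrative guesses.
IPADIC_ROW = ["ぼる塾", "1288", "1288", "4569", "名詞", "固有名詞", "一般", "*", "*", "*",
              "ぼる塾", "ボルジュク", "ボルジュク"]
SIMPLE_ROW = ["山里亮太", "カスタム名詞", "ヤマサトリョウタ"]


def write_dictionary(path, rows):
    """Write user-dictionary rows as UTF-8 CSV, one entry per line."""
    with open(path, "w", encoding="utf8", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)


# Write both files into a temporary directory so the sketch has no side effects.
workdir = tempfile.mkdtemp()
ipadic_path = os.path.join(workdir, "janomedic1.csv")
simple_path = os.path.join(workdir, "janomedic2.csv")
write_dictionary(ipadic_path, [IPADIC_ROW])
write_dictionary(simple_path, [SIMPLE_ROW])

with open(simple_path, encoding="utf8") as f:
    print(f.read().strip())
```

The part-of-speech values here match what the notebook's output shows for the added words (`名詞,固有名詞,一般,*` for ぼる塾 and `カスタム名詞,*,*,*` for 山里亮太). The remaining fields would need tuning against real data.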
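
The notebook's three tokenizing functions all build a new `Tokenizer` for every document and grow the DataFrame with `pd.concat` inside the loop. A possible alternative, sketched here with only the standard library, is to pass in a tokenize callable that was created once and collect the rows in a plain list. `FakeToken`, `fake_tokenize`, `flatten_tokens` and `build_rows` are hypothetical names for illustration. The stand-in token exposes only the three attributes the notebook reads from Janome tokens.

```python
from collections import namedtuple

# Hypothetical stand-in for a Janome token: only the three attributes the notebook reads.
FakeToken = namedtuple("FakeToken", ["surface", "part_of_speech", "base_form"])

SEPARATOR = "★"  # same separator the notebook appends after each token


def flatten_tokens(tokens):
    """Return [surface, part_of_speech, base_form, SEPARATOR, ...] for all tokens."""
    words = []
    for token in tokens:
        words.extend([token.surface, token.part_of_speech, token.base_form, SEPARATOR])
    return words


def build_rows(docs, tokenize):
    """Tokenize each (docid, text) pair and collect one row dict per document.

    `tokenize` is any callable returning token-like objects, so the caller
    creates the tokenizer once instead of once per document.
    """
    rows = []
    for docid, text in docs:
        cleaned = text.replace(" ", "")  # the notebook strips spaces before tokenizing
        rows.append({"docid": int(docid), "docdesc": flatten_tokens(tokenize(cleaned))})
    return rows


def fake_tokenize(text):
    # Toy tokenizer for the demo: one token per character, with a dummy part of speech.
    return [FakeToken(ch, "名詞,一般,*,*", ch) for ch in text]


rows = build_rows([(10, "結 婚")], fake_tokenize)
print(rows)
```

With Janome installed, `tokenize` would be the `tokenize` method of a single `Tokenizer` instance, and `pd.DataFrame(rows)` would build the frame in one step from the collected rows.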