Created
February 12, 2021 10:27
Create a Sudachi synonym dictionary from e-Gov law search data / Data source: "略称法令名一覧 | e-Gov法令検索" https://elaws.e-gov.go.jp/abb/ / cf. https://zenn.dev/sorami/articles/60131682bfa34f/
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "biblical-accused", | |
"metadata": {}, | |
"source": [ | |
"# Synonym dictionary of abbreviated law names" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "baking-citizenship", | |
"metadata": {}, | |
"source": [ | |
"Build a synonym dictionary for Sudachi from the list of registered abbreviated law names (登録略称法令名一覧) published on e-Gov法令検索 (e-Gov Law Search).\n", | |
"\n", | |
"- Data source: [略称法令名一覧 | e-Gov法令検索](https://elaws.e-gov.go.jp/abb/)\n", | |
"- Terms of use: [利用規約 | e-Gov法令検索](https://elaws.e-gov.go.jp/terms/)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "cellular-jason", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import csv\n", | |
"\n", | |
"import requests\n", | |
"from bs4 import BeautifulSoup" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "actual-slovakia", | |
"metadata": {}, | |
"source": [ | |
"## Fetch the abbreviations" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "retained-constitution", | |
"metadata": {}, | |
"source": [ | |
"[略称法令名一覧 | e-Gov法令検索](https://elaws.e-gov.go.jp/abb/)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "fitted-teach", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 26.7 ms, sys: 7.49 ms, total: 34.2 ms\n", | |
"Wall time: 24.6 s\n" | |
] | |
} | |
], | |
"source": [ | |
"%%time\n", | |
"r = requests.get(\"https://elaws.e-gov.go.jp/abb/\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "knowing-indie", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2451" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"soup = BeautifulSoup(r.content)\n", | |
"abbr_table = soup.find(\"table\", id=\"abbreviationTable\")\n", | |
"tr_list = abbr_table(\"tr\")\n", | |
"len(tr_list)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "needed-pacific", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<tr>\n", | |
"<th class=\"lawNameCol\">正式法令名</th>\n", | |
"<th class=\"lawNoCol\">法令番号</th>\n", | |
"<th class=\"abbrLawNameCol\">略称法令名1</th>\n", | |
"<th class=\"abbrLawNameCol\">略称法令名2</th>\n", | |
"<th class=\"abbrLawNameCol\">略称法令名3</th>\n", | |
"<th class=\"abbrLawNameCol\">略称法令名4</th>\n", | |
"</tr>" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"tr_list[0] # header" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "independent-approach", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"td_classes = [['lawNameCol'], ['lawNoCol'],\n", | |
"              ['abbrLawNameCol'], ['abbrLawNameCol'], ['abbrLawNameCol'], ['abbrLawNameCol']]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "distant-thickness", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Map (formal law name, law number) to its list of abbreviations\n", | |
"abbr_dict = {}\n", | |
"\n", | |
"for tr in tr_list[1:]:  # skip the header row\n", | |
"    td_list = tr(\"td\")\n", | |
"    assert [td.attrs[\"class\"] for td in td_list] == td_classes\n", | |
"    \n", | |
"    law_name = td_list[0].text\n", | |
"    law_no = td_list[1].text\n", | |
"    # Keep only the non-empty abbreviation columns\n", | |
"    abbr_names = [td.text.strip() for td in td_list[2:] if td.text.strip()]\n", | |
"    assert len(abbr_names) > 0\n", | |
"    abbr_dict[law_name, law_no] = abbr_names\n", | |
"    \n", | |
"# Every row must have produced a distinct key\n", | |
"assert len(abbr_dict) == len(tr_list) - 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "double-madrid", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['トラ退治法', '酔っぱらい防止法', '酩酊防止法']" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"abbr_dict[\"酒に酔つて公衆に迷惑をかける行為の防止等に関する法律\", \"昭和三十六年法律第百三号\"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "controversial-chapel", | |
"metadata": {}, | |
"source": [ | |
"## Create a file in the Sudachi synonym dictionary format" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "horizontal-newton", | |
"metadata": {}, | |
"source": [ | |
"Sudachi synonym dictionary details: https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md\n", | |
"\n", | |
"\n", | |
"Columns\n", | |
"```\n", | |
"0 : Group ID\n", | |
"1 : Nominal/predicate flag (optional)\n", | |
"2 : Expansion control flag (optional)\n", | |
"3 : Lexeme number within the group (optional)\n", | |
"4 : Word form type within the same lexeme (optional)\n", | |
"5 : Abbreviation info among words of the same form (optional)\n", | |
"6 : Spelling variation info among words of the same form (optional)\n", | |
"7 : Domain info (optional)\n", | |
"8 : Headword\n", | |
"9 : Reserved\n", | |
"10 : Reserved\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "earlier-county", | |
"metadata": {}, | |
"source": [ | |
"The detail columns are filled in with reference to the data in the official Sudachi synonym dictionary.\n", | |
"\n", | |
"Example\n", | |
"```\n", | |
"001351,1,0,1,0,0,0,(法律),私的独占の禁止及び公正取引の確保に関する法律,,\n", | |
"001351,1,0,1,0,2,0,(法律),独占禁止法,,\n", | |
"001351,1,0,1,0,2,0,(法律),独禁法,,\n", | |
"```" | |
] | |
}, | |
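{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "capable-mayor", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def create_row(group_id, heading, is_repr_form):\n", | |
"    # Column 5: abbreviation info among words of the same form\n", | |
"    # 0 = representative form\n", | |
"    # 2 = alias relative to the representative term (common name, nickname, etc.)\n", | |
"    if is_repr_form:\n", | |
"        col5 = 0\n", | |
"    else:\n", | |
"        col5 = 2\n", | |
"    \n", | |
"    row = (\n", | |
"        group_id,\n", | |
"        1,  # Nominal/predicate flag: 1 = nominal\n", | |
"        0,  # Expansion control flag: 0 = always used for expansion\n", | |
"        1,  # Lexeme number within the group: fixed at 1, since abbreviations and aliases are treated as the same lexeme\n", | |
"        0,  # Word form type within the same lexeme: 0 = representative term\n", | |
"        col5,  # Abbreviation info among words of the same form\n", | |
"        0,  # Spelling variation info among words of the same form: 0 = representative spelling\n", | |
"        \"(法律)\",  # Domain info\n", | |
"        heading,  # Headword\n", | |
"        None,  # (reserved)\n", | |
"        None,  # (reserved)\n", | |
"    )\n", | |
"\n", | |
"    return row" | |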
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "adapted-isolation", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5795" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"synonym_rows = []\n", | |
"for group_id, ((law_name, law_no), abbr_names) in enumerate(abbr_dict.items(), start=900000):\n", | |
"    # Representative form (formal law name)\n", | |
"    synonym_rows.append(create_row(group_id, law_name, True))\n", | |
"    \n", | |
"    # Abbreviations\n", | |
"    for abbr_name in abbr_names:\n", | |
"        synonym_rows.append(create_row(group_id, abbr_name, False))\n", | |
"\n", | |
"len(synonym_rows)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "loose-corpus", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# newline=\"\" lets the csv module control line endings; UTF-8 is set explicitly\n", | |
"with open(\"law_name_synonyms.txt\", \"w\", encoding=\"utf-8\", newline=\"\") as f:\n", | |
"    writer = csv.writer(f)\n", | |
"    writer.writerows(synonym_rows)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "passing-wichita", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" 5795 law_name_synonyms.txt\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!wc -l law_name_synonyms.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "found-musician", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"902446,1,0,1,0,0,0,(法律),アイヌの人々の誇りが尊重される社会を実現するための施策の推進に関する法律施行規則,,\r", | |
"\r\n", | |
"902446,1,0,1,0,2,0,(法律),アイヌ施策推進法施行規則,,\r", | |
"\r\n", | |
"902447,1,0,1,0,0,0,(法律),行政手続における特定の個人を識別するための番号の利用等に関する法律第四十五条の二第一項の法務省令で定める情報を定める省令,,\r", | |
"\r\n", | |
"902447,1,0,1,0,2,0,(法律),マイナンバー法第四十五条の二第一項の法務省令,,\r", | |
"\r\n", | |
"902447,1,0,1,0,2,0,(法律),個人番号法第四十五条の二第一項の法務省令,,\r", | |
"\r\n", | |
"902447,1,0,1,0,2,0,(法律),番号法第四十五条の二第一項の法務省令,,\r", | |
"\r\n", | |
"902448,1,0,1,0,0,0,(法律),表題部所有者不明土地の登記及び管理の適正化に関する法律施行規則,,\r", | |
"\r\n", | |
"902448,1,0,1,0,2,0,(法律),表題部所有者不明土地法施行規則,,\r", | |
"\r\n", | |
"902449,1,0,1,0,0,0,(法律),防衛省関係重要施設の周辺地域の上空における小型無人機等の飛行の禁止に関する法律施行規則,,\r", | |
"\r\n", | |
"902449,1,0,1,0,2,0,(法律),小型無人機等飛行禁止法施行規則,,\r", | |
"\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!tail law_name_synonyms.txt" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
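The row-building and CSV steps of the notebook can be tried without network access. The following is a minimal sketch, assuming a one-entry `abbr_dict` copied from the notebook output for the 酩酊防止法 example. The names `build_rows` and `load_lookup` are hypothetical helpers introduced here for illustration; they are not part of Sudachi or of the original notebook. The sketch writes to an in-memory buffer with `lineterminator="\n"` (the notebook itself uses the `csv` default line ending) and then reads the rows back to map each headword to the formal law name of its group.

```python
import csv
import io

# Sample entry copied from the notebook output; a real run would use the full abbr_dict.
abbr_dict = {
    ("酒に酔つて公衆に迷惑をかける行為の防止等に関する法律", "昭和三十六年法律第百三号"): [
        "トラ退治法",
        "酔っぱらい防止法",
        "酩酊防止法",
    ],
}


def build_rows(abbr_dict, start_id=900000):
    """Hypothetical helper: one representative row plus one row per abbreviation."""
    rows = []
    for group_id, ((law_name, _law_no), abbr_names) in enumerate(abbr_dict.items(), start=start_id):
        # Column 5 is 0 for the formal name and 2 for each abbreviation, as in the notebook.
        entries = [(law_name, 0)] + [(name, 2) for name in abbr_names]
        for heading, abbr_flag in entries:
            rows.append((group_id, 1, 0, 1, 0, abbr_flag, 0, "(法律)", heading, None, None))
    return rows


def load_lookup(text):
    """Hypothetical helper: map every headword to the formal law name of its group."""
    representative = {}  # group ID -> formal law name
    group_of = {}        # headword -> group ID
    for row in csv.reader(io.StringIO(text)):
        group_id, abbr_flag, heading = row[0], row[5], row[8]
        if abbr_flag == "0":
            representative[group_id] = heading
        group_of[heading] = group_id
    return {heading: representative[gid] for heading, gid in group_of.items()}


# csv.writer emits None as an empty field, which yields the two trailing reserved columns.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(build_rows(abbr_dict))
synonym_text = buffer.getvalue()

lookup = load_lookup(synonym_text)
print(synonym_text)
print(lookup["トラ退治法"])
```

Reading the file back this way is only a quick sanity check that every abbreviation resolves to exactly one formal name; it does not exercise Sudachi itself.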