Created February 12, 2021 10:27
Build a Sudachi synonym dictionary from e-Gov Law Search (e-Gov法令検索) data / Data source: "List of Abbreviated Law Names | e-Gov Law Search" / c.f.
"cells": [
"cell_type": "markdown",
"id": "biblical-accused",
"metadata": {},
"source": [
"# 法令名略称同義語辞書"
"cell_type": "markdown",
"id": "baking-citizenship",
"metadata": {},
"source": [
"- データ出典: [略称法令名一覧 | e-Gov法令検索](\n",
"- [利用規約 | e-Gov法令検索]("
"cell_type": "code",
"execution_count": 1,
"id": "cellular-jason",
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import requests\n",
"from bs4 import BeautifulSoup"
"cell_type": "markdown",
"id": "actual-slovakia",
"metadata": {},
"source": [
"## 略称を取得"
"cell_type": "markdown",
"id": "retained-constitution",
"metadata": {},
"source": [
"[略称法令名一覧 | e-Gov法令検索]("
"cell_type": "code",
"execution_count": 2,
"id": "fitted-teach",
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 26.7 ms, sys: 7.49 ms, total: 34.2 ms\n",
"Wall time: 24.6 s\n"
"source": [
"r = requests.get(\"\")"
"cell_type": "code",
"execution_count": 3,
"id": "knowing-indie",
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
"source": [
"soup = BeautifulSoup(r.content)\n",
"abbr_table = soup.find(\"table\", id=\"abbreviationTable\")\n",
"tr_list = abbr_table(\"tr\")\n",
"cell_type": "code",
"execution_count": 4,
"id": "needed-pacific",
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"<th class=\"lawNameCol\">正式法令名</th>\n",
"<th class=\"lawNoCol\">法令番号</th>\n",
"<th class=\"abbrLawNameCol\">略称法令名1</th>\n",
"<th class=\"abbrLawNameCol\">略称法令名2</th>\n",
"<th class=\"abbrLawNameCol\">略称法令名3</th>\n",
"<th class=\"abbrLawNameCol\">略称法令名4</th>\n",
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
"source": [
"tr_list[0] # header"
"cell_type": "code",
"execution_count": 5,
"id": "independent-approach",
"metadata": {},
"outputs": [],
"source": [
"td_classes = [['lawNameCol'], ['lawNoCol'],\n",
" ['abbrLawNameCol'], ['abbrLawNameCol'], ['abbrLawNameCol'], ['abbrLawNameCol']]"
"cell_type": "code",
"execution_count": 6,
"id": "distant-thickness",
"metadata": {},
"outputs": [],
"source": [
"abbr_dict = {}\n",
"for tr in tr_list[1:]:\n",
" td_list = tr(\"td\")\n",
" assert [td.attrs[\"class\"] for td in td_list] == td_classes\n",
" \n",
" law_name = td_list[0].text\n",
" law_no = td_list[1].text\n",
" abbr_names = [td.text.strip() for td in td_list[2:] if td.text.strip()]\n",
" assert len(abbr_names) > 0\n",
" abbr_dict[law_name, law_no] = abbr_names\n",
" \n",
"assert len(abbr_dict) == len(tr_list) - 1"
"cell_type": "code",
"execution_count": 7,
"id": "double-madrid",
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"['トラ退治法', '酔っぱらい防止法', '酩酊防止法']"
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
"source": [
"abbr_dict[\"酒に酔つて公衆に迷惑をかける行為の防止等に関する法律\", \"昭和三十六年法律第百三号\"]"
"cell_type": "markdown",
"id": "controversial-chapel",
"metadata": {},
"source": [
"## Sudachi同義語辞書形式のファイルを作成"
"cell_type": "markdown",
"id": "horizontal-newton",
"metadata": {},
"source": [
"0 : グループ番号\n",
"1 : 体言/用言フラグ (省略可)\n",
"2 : 展開制御フラグ (省略可)\n",
"3 : グループ内の語彙番号 (省略可)\n",
"4 : 同一語彙素内での語形種別 (省略可)\n",
"5 : 同じ語形の語の中での略語情報 (省略可)\n",
"6 : 同じ語形の語の中での表記ゆれ情報 (省略可)\n",
"7 : 分野情報 (省略可)\n",
"8 : 見出し\n",
"9 : 予約\n",
"10 : 予約\n",
"cell_type": "markdown",
"id": "earlier-county",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": 8,
"id": "capable-mayor",
"metadata": {},
"outputs": [],
"source": [
"def create_row(group_id, heading, is_repr_form):\n",
" # 同じ語形の語の中での表記ゆれ情報\n",
" # 0=代表表記\n",
" # 2=(代表語から見て) 別称 (通称・愛称等)\n",
" if is_repr_form:\n",
" col5 = 0\n",
" else:\n",
" col5 = 2\n",
" \n",
" row = (\n",
" group_id,\n",
" 1, # 体言/用言フラグ: 1=体言\n",
" 0, # 展開制御フラグ: 0=常に展開に使用する\n",
" 1, # グループ内の語彙番号: 略称・別称は同じ語彙素と見なすため1で固定\n",
" 0, # 同一語彙素内での語形種別: 0=代表語\n",
" col5, # 同じ語形の語の中での略語・略称情報\n",
" 0, # 同じ語形の語の中での表記揺れ情報: 0=代表表記\n",
" \"(法律)\", # 分野情報\n",
" heading, # 見出し\n",
" None, # (予約)\n",
" None, # (予約)\n",
" )\n",
" return row"
"cell_type": "code",
"execution_count": 9,
"id": "adapted-isolation",
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
"source": [
"synonym_rows = []\n",
"for group_id, ((law_name, law_no), abbr_names) in enumerate(abbr_dict.items(), start=900000):\n",
" # 代表表記\n",
" synonym_rows.append(create_row(group_id, law_name, True))\n",
" \n",
" # 略称\n",
" for abbr_name in abbr_names:\n",
" synonym_rows.append(create_row(group_id, abbr_name, False))\n",
" \n",
"cell_type": "code",
"execution_count": 10,
"id": "loose-corpus",
"metadata": {},
"outputs": [],
"source": [
"with open(\"law_name_synonyms.txt\", \"w\") as f:\n",
" writer = csv.writer(f)\n",
" writer.writerows(synonym_rows)"
"cell_type": "code",
"execution_count": 11,
"id": "passing-wichita",
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
" 5795 law_name_synonyms.txt\r\n"
"source": [
"!wc -l law_name_synonyms.txt"
"cell_type": "code",
"execution_count": 12,
"id": "found-musician",
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"source": [
"!tail law_name_synonyms.txt"
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.7"
"nbformat": 4,
"nbformat_minor": 5