@hideojoho
Last active September 13, 2023 15:53
PyTerrier on NTCIR-1

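NTCIR-1 Test Collection

As a test collection for information retrieval, it contains document data (author abstracts from the Academic Conference Paper Database, 1988-1997: about 330,000 author abstracts of papers presented at 65 Japanese academic societies, more than half of them Japanese-English pairs), 83 search topics (in Japanese), and relevance judgments. It can be used for experiments in Japanese retrieval, Japanese-to-English cross-lingual retrieval, and Japanese-to-Japanese+English retrieval. As a collection for term extraction research, it includes 2,000 Japanese documents extracted from the information retrieval test collection with linguistic tags added. The entire test collection is provided by NII for research purposes.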

{
"cells": [
{
"cell_type": "markdown",
"id": "de8538e8-df2f-43e5-a688-d7589c2f023c",
"metadata": {},
"source": [
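"# Preprocessing the NTCIR-1 Test Collection\n",
"\n",
"- Source: Kando, et al. (1999). [Overview of IR Tasks at the First NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/IR-overview.pdf). In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, August 30 - September 1, 1999, pp.11-44.\n",
"- Overview and how to obtain it: [Test collection usage procedures and memorandum (for research purposes)](http://research.nii.ac.jp/ntcir/permission/perm-ja.html#ntcir-1)\n",
"> As a test collection for information retrieval, it contains document data (author abstracts from the Academic Conference Paper Database, 1988-1997: about 330,000 author abstracts of papers presented at 65 Japanese academic societies, more than half of them Japanese-English pairs), 83 search topics (in Japanese), and relevance judgments. It can be used for experiments in Japanese retrieval, Japanese-to-English cross-lingual retrieval, and Japanese-to-Japanese+English retrieval. As a collection for term extraction research, it includes 2,000 Japanese documents extracted from the information retrieval test collection with linguistic tags added. The entire test collection is provided by NII for research purposes.\n",
"\n",
"## Folder structure"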
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7b70692b-5154-452f-8f1c-76c435e2f80b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 344516\n",
"-rwxr-xr-x 1 1026 users 187244010 Nov 11 06:32 ADHOC.TGZ\n",
"-rwxr-xr-x 1 1026 users 42368 Nov 11 06:32 AGREEM-E.PDF\n",
"-rwxr-xr-x 1 1026 users 148609 Nov 11 06:32 AGREEM-J.PDF\n",
"-rwxr-xr-x 1 1026 users 57816322 Nov 11 06:32 CLIR.TGZ\n",
"-rwxr-xr-x 1 1026 users 95679 Nov 11 06:32 CORRECTION-E-130709.pdf\n",
"-rwxr-xr-x 1 1026 users 88571 Nov 11 06:32 CORRECTION-J-130705.pdf\n",
"-rwxr-xr-x 1 1026 users 551281 Nov 11 06:32 MANUAL-E.PDF\n",
"-rwxr-xr-x 1 1026 users 407929 Nov 11 06:32 MANUAL-J.PDF\n",
"-rwxr-xr-x 1 1026 users 102421659 Nov 11 06:32 MLIR.TGZ\n",
"-rwxr-xr-x 1 1026 users 54821 Nov 11 06:32 README-E-REVISED-130709.pdf\n",
"-rwxr-xr-x 1 1026 users 8641 Nov 11 06:32 README-E.TXT\n",
"-rwxr-xr-x 1 1026 users 160994 Nov 11 06:32 README-J.PDF\n",
"-rwxr-xr-x 1 1026 users 37169 Nov 11 06:32 README-J-REVISED-130705.pdf\n",
"-rwxr-xr-x 1 1026 users 6357 Nov 11 06:32 README-J.TXT\n",
"-rwxr-xr-x 1 1026 users 45211 Nov 11 06:32 TAGREE-E.PDF\n",
"-rwxr-xr-x 1 1026 users 164770 Nov 11 06:32 TAGREE-J.PDF\n",
"-rwxr-xr-x 1 1026 users 3418941 Nov 11 06:32 TMREC.TGZ\n",
"-rwxr-xr-x 1 1026 users 26005 Nov 11 06:32 TOPICS.TGZ\n"
]
}
],
"source": [
"!ls -l /home/jovyan/shared/Datasets/NTCIR/NTCIR-1"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7d85fe6a-9d16-42ef-80cf-1e4070670a10",
"metadata": {},
"outputs": [],
"source": [
"# Set the path of the downloaded dataset below\n",
"DATA_DIR = '/home/jovyan/shared/Datasets/NTCIR/NTCIR-1'"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c6adcca9-aac0-47c2-9a69-86f4067a96e8",
"metadata": {},
"outputs": [],
"source": [
"# Create the working directory\n",
"!mkdir -p ../datasets/ntcir/ntcir-1"
]
},
{
"cell_type": "markdown",
"id": "76c30e80-28e5-4233-9e6d-c6a4d9654756",
"metadata": {},
"source": [
"---\n",
"## Processing the topic files"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6a3fc0e9-7980-4695-b5bc-5d5a55933ba9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"topics/\n",
"topics/topic0001-0030\n",
"topics/topic0031-0083\n"
]
}
],
"source": [
"# Extract the topic (query) files\n",
"!tar xvfz /home/jovyan/shared/Datasets/NTCIR/NTCIR-1/TOPICS.TGZ -C ../datasets/ntcir/ntcir-1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e4227f8-5e3b-4577-aebd-37a466525b63",
"metadata": {},
"outputs": [],
"source": [
"!head ../datasets/ntcir/ntcir-1/topics/topic0001-0030"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "5eba11b2-84b7-4eb1-9c23-bfee43818635",
"metadata": {},
"outputs": [],
"source": [
"# Convert the topic files from EUC-JP to UTF-8 (the -c option omits characters that cannot be converted)\n",
"!iconv -f EUC-JP -t UTF-8 -c ../datasets/ntcir/ntcir-1/topics/topic0001-0030 > ../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8\n",
"!iconv -f EUC-JP -t UTF-8 -c ../datasets/ntcir/ntcir-1/topics/topic0031-0083 > ../datasets/ntcir/ntcir-1/topics/topic0031-0083.utf8"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "270577f8-165b-457d-8895-37773cab2f9b",
"metadata": {},
"outputs": [],
"source": [
"!head -23 ../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "69dd8f45-9129-420e-b943-372f93fa560d",
"metadata": {},
"outputs": [],
"source": [
"# Convert to JSONL (topic ID, title, and description only)\n",
"import json\n",
"import re\n",
"from janome.tokenizer import Tokenizer\n",
"t = Tokenizer(wakati=True)\n",
"def convert_jsonl_topic(in_file):\n",
"    out_file = in_file + '.jsonl'\n",
"    out_file2 = in_file + '.janome.jsonl'\n",
"    # The input was converted to UTF-8 with iconv, so read it as UTF-8 explicitly\n",
"    with open(in_file, 'r', encoding='utf-8') as f:\n",
"        s = f.read()\n",
"    qid = re.findall(r'<TOPIC q=(.*)>', s)\n",
"    titl = re.findall(r'<TITLE>\\n(.*)\\n</TITLE>', s)\n",
"    desc = re.findall(r'<DESCRIPTION>\\n(.*)\\n</DESCRIPTION>', s)\n",
"    with open(out_file, 'w', encoding='utf-8') as f, open(out_file2, 'w', encoding='utf-8') as f2:\n",
"        for i in range(len(qid)):\n",
"            # json.dumps escapes quotes and backslashes; ensure_ascii=False keeps the Japanese text readable\n",
"            f.write(json.dumps({'qid': qid[i], 'title': titl[i], 'desc': desc[i]}, ensure_ascii=False) + '\\n')\n",
"            # Join the Janome tokens with single spaces (wakati-gaki output)\n",
"            titl_atok = ' '.join(t.tokenize(titl[i]))\n",
"            desc_atok = ' '.join(t.tokenize(desc[i]))\n",
"            f2.write(json.dumps({'qid': qid[i], 'title': titl_atok, 'desc': desc_atok}, ensure_ascii=False) + '\\n')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "6fe32c4c-a45f-4215-b998-90b55a65e11a",
"metadata": {},
"outputs": [],
"source": [
"convert_jsonl_topic('../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8')\n",
"convert_jsonl_topic('../datasets/ntcir/ntcir-1/topics/topic0031-0083.utf8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6cb1fc1-e74e-4669-b6a2-ce1b72fa1151",
"metadata": {},
"outputs": [],
"source": [
"!head -3 ../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8.jsonl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81de9098-63d3-4cfb-ae89-3cbb0443aeff",
"metadata": {},
"outputs": [],
"source": [
"!head -3 ../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8.janome.jsonl"
]
},
{
"cell_type": "markdown",
"id": "49b01092-27d5-47db-9d5f-3a5bfcc44493",
"metadata": {},
"source": [
"---\n",
"## Processing the document file"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "2335bb0a-d817-4c71-a072-3c306e22e2d4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mlir/\n",
"mlir/ntc1-j1\n",
"mlir/rel1_ntc1-j1_0001-0030\n",
"mlir/rel2_ntc1-j1_0001-0030\n",
"mlir/rel1_ntc1-j1_0031-0083\n",
"mlir/rel2_ntc1-j1_0031-0083\n"
]
}
],
"source": [
"# Extract the document archive\n",
"!tar xvfz /home/jovyan/shared/Datasets/NTCIR/NTCIR-1/MLIR.TGZ -C ../datasets/ntcir/ntcir-1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "cee8f47d-0f3d-4863-8f1d-4013dc5f0ef0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 326M\n",
"drwxr-xr-x 1 1024 users 190 Nov 1 1999 .\n",
"drwxr-xr-x 1 1024 users 20 Dec 17 05:01 ..\n",
"-rw-r--r-- 1 1024 users 312M Oct 22 1999 ntc1-j1\n",
"-rw-r--r-- 1 1024 users 3.5M Oct 22 1999 rel1_ntc1-j1_0001-0030\n",
"-rw-r--r-- 1 1024 users 3.7M Nov 1 1999 rel1_ntc1-j1_0031-0083\n",
"-rw-r--r-- 1 1024 users 3.5M Oct 22 1999 rel2_ntc1-j1_0001-0030\n",
"-rw-r--r-- 1 1024 users 3.7M Nov 1 1999 rel2_ntc1-j1_0031-0083\n"
]
}
],
"source": [
"!ls -lha ../datasets/ntcir/ntcir-1/mlir"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dcf1ac2-6483-44d6-af9d-fd9d3596fa2f",
"metadata": {},
"outputs": [],
"source": [
"!head ../datasets/ntcir/ntcir-1/mlir/ntc1-j1"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "38318937-52a8-4954-add9-a4ee46df5a36",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"332918\n"
]
}
],
"source": [
"!grep \"^<ACCN\" ../datasets/ntcir/ntcir-1/mlir/ntc1-j1 | wc -l"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "6a9252dc-8b67-49db-8ff2-6f95a2e98fa7",
"metadata": {},
"outputs": [],
"source": [
"# Convert the document file from EUC-JP to UTF-8 (the -c option omits characters that cannot be converted)\n",
"!iconv -f EUC-JP -t UTF-8 -c ../datasets/ntcir/ntcir-1/mlir/ntc1-j1 > ../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "d93edc80-4d6e-4bd6-bd7f-b61ce03081e0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"332918\n"
]
}
],
"source": [
"!grep \"^<ACCN\" ../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8 | wc -l"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46ff2020-a1d8-4d53-b959-147f0d51bc88",
"metadata": {},
"outputs": [],
"source": [
"!head ../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "c542133e-f0c4-4f4e-bd8b-3bc733f7ed4a",
"metadata": {},
"outputs": [],
"source": [
"# Convert to JSONL\n",
"import json\n",
"import re\n",
"from janome.tokenizer import Tokenizer\n",
"t = Tokenizer(wakati=True)\n",
"def convert_jsonl_doc(in_file):\n",
"    out_file = in_file + '.jsonl'\n",
"    out_file2 = in_file + '.janome.jsonl'\n",
"    with open(in_file, 'r', encoding='utf-8') as f:\n",
"        s = f.read()\n",
"    # Strip the paragraph tags inside the abstracts\n",
"    s = re.sub(r'<ABST.P>|</ABST.P>', '', s)\n",
"\n",
"    accn = re.findall(r'<ACCN.*?>(.*)</ACCN>', s)\n",
"    titl = re.findall(r'<TITL.*?>(.*)</TITL>', s)\n",
"    abst = re.findall(r'<ABST.*?>(.*)</ABST>', s)\n",
"\n",
"    with open(out_file, 'w', encoding='utf-8') as f, open(out_file2, 'w', encoding='utf-8') as f2:\n",
"        for i in range(len(accn)):\n",
"            # Escaping is left to json.dumps and happens after tokenization,\n",
"            # so Janome can no longer split a backslash from the quote it escapes\n",
"            f.write(json.dumps({'id': accn[i], 'contents': f'{titl[i]} {abst[i]}'}, ensure_ascii=False) + '\\n')\n",
"            titl_atok = ' '.join(t.tokenize(titl[i]))\n",
"            abst_atok = ' '.join(t.tokenize(abst[i]))\n",
"            f2.write(json.dumps({'id': accn[i], 'contents': f'{titl_atok} {abst_atok}'}, ensure_ascii=False) + '\\n')"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "e94594ad-abb0-404c-a484-3efd9bd46993",
"metadata": {},
"outputs": [],
"source": [
"convert_jsonl_doc('../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcca39fc-aecc-4e69-8368-e5f9edd9aa70",
"metadata": {},
"outputs": [],
"source": [
"!head -3 ../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8.jsonl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e09a4a82-668d-4715-a310-d4b02b42dfba",
"metadata": {},
"outputs": [],
"source": [
"!head -3 ../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8.janome.jsonl"
]
},
{
"cell_type": "markdown",
"id": "dd1b0abb-4861-4cc2-a332-5bb68a465ab5",
"metadata": {},
"source": [
"---\n",
"## Processing the qrels\n",
"- In NTCIR-1, the second column holds a graded relevance code (A: Relevant, B: Partially Relevant, C: Not Relevant)\n",
"- Here the codes are converted to relevance scores: A → 2, B → 1, C → 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70b6507f-4798-4f0b-84c2-761af60ef9dd",
"metadata": {},
"outputs": [],
"source": [
"# Qrel\n",
"!head ../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "53423ad5-3b4a-4530-8ec6-78fe724d3dde",
"metadata": {},
"outputs": [],
"source": [
"# Convert the qrel files from EUC-JP to UTF-8 (the -c option omits characters that cannot be converted)\n",
"!iconv -f EUC-JP -t UTF-8 -c ../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030 > ../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030.utf8\n",
"!iconv -f EUC-JP -t UTF-8 -c ../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0031-0083 > ../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0031-0083.utf8"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "62d4d065-6b0b-4d30-9617-ed965bb50bb0",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"# Map the graded judgment codes to relevance scores (kept as strings, as before)\n",
"REL_SCORE = {'A': '2', 'B': '1', 'C': '0'}\n",
"def convert_jsonl_qrel(in_file):\n",
"    out_file = in_file + '.jsonl'\n",
"    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f2:\n",
"        for line in f:\n",
"            flds = line.rstrip().split('\\t')\n",
"            # Skip blank or short lines and any judgment code other than A, B, or C\n",
"            if len(flds) >= 3 and flds[1] in REL_SCORE:\n",
"                f2.write(json.dumps({'qid': flds[0], 'docno': flds[2], 'rel': REL_SCORE[flds[1]]}) + '\\n')"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "28a86dcc-ac7a-46c7-a281-51b55ed8e593",
"metadata": {},
"outputs": [],
"source": [
"convert_jsonl_qrel('../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030.utf8')\n",
"convert_jsonl_qrel('../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0031-0083.utf8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7edf8f95-84b9-4e80-acbc-d00276a552f2",
"metadata": {},
"outputs": [],
"source": [
"!head ../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030.utf8.jsonl"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
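
The notebook above inspects its JSONL output only with `head`, which shows the first few lines but cannot tell whether every line parses. A minimal sketch of a fuller check follows, using only the standard library. The helper name `check_jsonl`, its `required_keys` parameter, and the demo file are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile


def check_jsonl(path, required_keys):
    """Count the records in a JSONL file, raising ValueError on the first bad line.

    A line is bad if it is not valid JSON, is not a JSON object,
    or lacks one of the required keys.
    """
    count = 0
    with open(path, 'r', encoding='utf-8') as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f'{path}:{lineno}: invalid JSON ({e.msg})') from e
            if not isinstance(record, dict):
                raise ValueError(f'{path}:{lineno}: not a JSON object')
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f'{path}:{lineno}: missing keys {missing}')
            count += 1
    return count


# Demo on a tiny made-up qrel file (the document IDs are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(json.dumps({'qid': '0001', 'docno': 'doc-a', 'rel': '2'}) + '\n')
        f.write(json.dumps({'qid': '0001', 'docno': 'doc-b', 'rel': '0'}) + '\n')
    demo_count = check_jsonl(demo_path, ['qid', 'docno', 'rel'])

print(demo_count)
```

Against the notebook's own output, a call such as `check_jsonl('../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8.jsonl', ['id', 'contents'])` would be expected to return the same number that the `grep "^<ACCN"` count reported, provided every document was converted. A mismatch or a `ValueError` points to the line that needs attention.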