hideojoho/How-to-run-PyTerrier-on-NTCIR1.md

## How-to-run-PyTerrier-on-NTCIR1.md

      
            
          
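PyTerrier on NTCIR-1


Preprocessing
Retrieval and evaluation

NTCIR-1 test collection


Source: Kando, et al. (1999). Overview of IR Tasks at the First NTCIR Workshop. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, August 30 - September 1, 1999, pp.11-44.
Overview and how to obtain it: Test collection usage procedure and memorandum (for research purposes)


As a test collection for information retrieval, it contains document data (author abstracts from the Academic Conference Papers Database, 1988-1997: about 330,000 author abstracts of papers presented at 65 academic societies in Japan, more than half of them Japanese-English pairs), 83 search topics (in Japanese), and relevance judgments. It can be used for experiments on Japanese retrieval, Japanese-to-English cross-lingual retrieval, and Japanese-to-Japanese+English retrieval. As a collection for term extraction research, it also contains 2,000 Japanese documents drawn from the information retrieval test collection and annotated with linguistic tags. The entire test collection is provided by NII for research purposes.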


## NTCIR1_Preprocessing.ipynb
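
The contents of this notebook are not reproduced here, so the following is only a guess at the kind of step that could produce the `*.utf8.janome.jsonl` files that PyTerrier_NTCIR1.ipynb reads. It is a minimal sketch, assuming one JSON object per line with the fields `id` and `contents` (the names that the retrieval notebook later renames to `docno` and `text`). The helper names `to_jsonl_record` and `janome_wakati` and the sample document ID are hypothetical, and janome is a third-party package that is only imported when `janome_wakati` is actually called.

```python
import json
from functools import lru_cache


def to_jsonl_record(doc_id, text, tokenize):
    """Return one JSONL line whose `contents` holds space-separated tokens."""
    tokens = [tok for tok in tokenize(text) if tok.strip()]
    return json.dumps({"id": doc_id, "contents": " ".join(tokens)}, ensure_ascii=False)


@lru_cache(maxsize=1)
def _janome_tokenizer():
    # Imported lazily so the rest of the sketch runs without janome installed.
    from janome.tokenizer import Tokenizer
    return Tokenizer()


def janome_wakati(text):
    """Split Japanese text into surface forms with janome (assumed installed)."""
    return list(_janome_tokenizer().tokenize(text, wakati=True))


# Stand-in whitespace tokenizer for illustration; pass janome_wakati for real data.
print(to_jsonl_record("gakkai-0000000001", "feature dimension reduction", str.split))
```

Passing the tokenizer as an argument keeps the record format separate from the choice of morphological analyzer, so another analyzer could be swapped in without touching the JSONL layout.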

      
      
    
## PyTerrier_NTCIR1.ipynb
{
 "cells": [
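  {
   "cell_type": "markdown",
   "id": "cbf8f009-a632-4e89-a2cc-5ff4ec2c7939",
   "metadata": {},
   "source": [
    "# PyTerrier on NTCIR-1\n",
    "\n",
    "## NTCIR-1 test collection\n",
    "\n",
    "- Source: Kando, et al. (1999). [Overview of IR Tasks at the First NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/IR-overview.pdf). In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, August 30 - September 1, 1999, pp.11-44.\n",
    "- Overview and how to obtain it: [Test collection usage procedure and memorandum (for research purposes)](http://research.nii.ac.jp/ntcir/permission/perm-ja.html#ntcir-1)\n",
    "> As a test collection for information retrieval, it contains document data (author abstracts from the Academic Conference Papers Database, 1988-1997: about 330,000 author abstracts of papers presented at 65 academic societies in Japan, more than half of them Japanese-English pairs), 83 search topics (in Japanese), and relevance judgments. It can be used for experiments on Japanese retrieval, Japanese-to-English cross-lingual retrieval, and Japanese-to-Japanese+English retrieval. As a collection for term extraction research, it also contains 2,000 Japanese documents drawn from the information retrieval test collection and annotated with linguistic tags. The entire test collection is provided by NII for research purposes.\n",
    "\n",
    "## Preprocessing\n",
    "\n",
    "- NTCIR1_Preprocessing.ipynb\n",
    "\n",
    "## Requirements\n",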
    "\n",
    "- JDK\n",
    "- GCC"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73266f7b-4420-4e86-96c0-2b39e3f1493d",
   "metadata": {},
   "source": [
    "---\n",
    "## Folder layout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "36948764-f97e-412b-b3d0-166066ef4c6a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 0\n",
      "drwxr-xr-x 1 1024 users  10 Dec 17 04:50 datasets\n",
      "drwxr-xr-x 1 1024 users  10 Dec 17 05:53 indexes\n",
      "drwxr-xr-x 1 1024 users 326 Dec 29 23:54 notebooks\n",
      "drwxr-xr-x 1 1024 users  92 Dec 17 05:36 vendors\n"
     ]
    }
   ],
   "source": [
    "!ls -l ../"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50d60c0b-19c5-4c27-b2cb-feb855f8cf96",
   "metadata": {},
   "source": [
    "---\n",
    "## Checking the datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "fd800fc3-e3d2-40ce-a11d-1deaa1e620de",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 224\n",
      "-rw-r--r-- 1 1024 users 19925 Oct 13  1999 topic0001-0030\n",
      "-rw-r--r-- 1 1024 users 27132 Dec 17 04:58 topic0001-0030.utf8\n",
      "-rw-r--r-- 1 1024 users  4307 Dec 29 14:49 topic0001-0030.utf8.janome.jsonl\n",
      "-rw-r--r-- 1 1024 users  3987 Dec 29 14:49 topic0001-0030.utf8.jsonl\n",
      "-rw-r--r-- 1 1024 users 59905 Nov  1  1999 topic0031-0083\n",
      "-rw-r--r-- 1 1024 users 80749 Dec 17 04:58 topic0031-0083.utf8\n",
      "-rw-r--r-- 1 1024 users 10090 Dec 29 14:49 topic0031-0083.utf8.janome.jsonl\n",
      "-rw-r--r-- 1 1024 users  9084 Dec 29 14:49 topic0031-0083.utf8.jsonl\n"
     ]
    }
   ],
   "source": [
    "!ls -l ../datasets/ntcir/ntcir-1/topics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "3dff1047-b4bf-49a8-abcf-33bb52bdb9a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 1434324\n",
      "drwxr-xr-x 1 1024 users        86 Dec 29 12:56 jsonl\n",
      "-rw-r--r-- 1 1024 users 326589786 Oct 22  1999 ntc1-j1\n",
      "-rw-r--r-- 1 1024 users 435408365 Dec 17 05:12 ntc1-j1.utf8\n",
      "-rw-r--r-- 1 1024 users 363608976 Dec 29 16:09 ntc1-j1.utf8.janome.jsonl\n",
      "-rw-r--r-- 1 1024 users 305321021 Dec 29 16:09 ntc1-j1.utf8.jsonl\n",
      "-rw-r--r-- 1 1024 users      3603 Dec 29 12:41 ntc1-j1.utf8.sample\n",
      "-rw-r--r-- 1 1024 users   3570750 Oct 22  1999 rel1_ntc1-j1_0001-0030\n",
      "-rw-r--r-- 1 1024 users   3777038 Nov  1  1999 rel1_ntc1-j1_0031-0083\n",
      "-rw-r--r-- 1 1024 users   3570750 Oct 22  1999 rel2_ntc1-j1_0001-0030\n",
      "-rw-r--r-- 1 1024 users   3570750 Dec 29 23:43 rel2_ntc1-j1_0001-0030.utf8\n",
      "-rw-r--r-- 1 1024 users   7935000 Dec 29 23:43 rel2_ntc1-j1_0001-0030.utf8.jsonl\n",
      "-rw-r--r-- 1 1024 users   3777038 Nov  1  1999 rel2_ntc1-j1_0031-0083\n",
      "-rw-r--r-- 1 1024 users   3855273 Dec 29 23:43 rel2_ntc1-j1_0031-0083.utf8\n",
      "-rw-r--r-- 1 1024 users   7735380 Dec 29 23:43 rel2_ntc1-j1_0031-0083.utf8.jsonl\n"
     ]
    }
   ],
   "source": [
    "!ls -l ../datasets/ntcir/ntcir-1/mlir"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f09b0841-c379-4063-9fc8-5ddaeddbafa3",
   "metadata": {},
   "source": [
    "---\n",
    "## Installing PyTerrier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "fcf203a1-d98a-4639-8bb2-c178131de7bc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/usr/lib/jvm/java-11-openjdk-amd64'"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "JAVA_HOME = '/usr/lib/jvm/java-11-openjdk-amd64'\n",
    "os.environ['JAVA_HOME'] = JAVA_HOME\n",
    "os.getenv('JAVA_HOME')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7082012b-bb4c-46db-b792-2708e9ecdc34",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install python-terrier"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97880798-65be-4bfd-ae5c-582e647f2a08",
   "metadata": {},
   "source": [
    "---\n",
    "## Indexing the document corpus\n",
    "\n",
    "- Load the JSONL file, already tokenized (word-segmented) with janome, into a pandas DataFrame (only feasible when the corpus fits in memory)\n",
    "- Rename the columns\n",
    "- Index with PyTerrier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "602d90c2-a46b-4486-ac36-50021969ff05",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# One JSON object per line; the whole corpus is loaded into memory at once.\n",
    "in_file = '../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8.janome.jsonl'\n",
    "df = pd.read_json(in_file, orient='records', lines=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "5d93e9b7-173b-4c0d-9ba5-f9a196f6719f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename the columns to docno/text, the names used when indexing below\n",
    "df = df.rename(columns={'id': 'docno', 'contents': 'text'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e17e019-78e8-4160-be4f-87990412a349",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "f9718054-882c-42d9-ba1b-05b3945dec1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p ../indexes/ntcir/ntcir-1/mlir/pyterrier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "1d0a7ca6-dbc5-4fae-b304-82b3d796a1dd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "00:00:16.413 [main] WARN org.terrier.structures.indexing.Indexer - Adding an empty document to the index (gakkai-0000119075) - further warnings are suppressed\n",
      "00:01:36.979 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents\n"
     ]
    }
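   ],
   "source": [
    "import pyterrier as pt\n",
    "if not pt.started():\n",
    "    pt.init()\n",
    "pd_indexer = pt.DFIndexer(\"../indexes/ntcir/ntcir-1/mlir/pyterrier\")\n",
    "# The text is already split into space-separated tokens by janome, so only a\n",
    "# Unicode-aware tokeniser is needed; an empty term pipeline turns off Terrier's\n",
    "# default stopword removal and English stemming.\n",
    "pd_indexer.setProperty(\"tokeniser\", \"UTFTokeniser\")\n",
    "pd_indexer.setProperty(\"termpipelines\", \"\")\n",
    "indexref = pd_indexer.index(df[\"text\"], df[\"docno\"])"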
   ]
  },
  {
   "cell_type": "markdown",
   "id": "263d9a9f-06ab-4390-8ca3-6f77b1c7a780",
   "metadata": {},
   "source": [
    "---\n",
    "## Retrieval (single query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bcd5c34-47ad-4838-8bb2-c31f4ff4c568",
   "metadata": {},
   "outputs": [],
   "source": [
    "bm25_nostem = pt.BatchRetrieve(indexref, properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})\n",
    "bm25_nostem.setControl(\"wmodel\", \"BM25\")\n",
    "bm25_nostem.search('特徴 次元 リダクション')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adfcb661-ba71-40b6-9fa4-f55677bd0de6",
   "metadata": {},
   "source": [
    "---\n",
    "## Retrieval (topic files)\n",
    "- As with the corpus, load the files that were already tokenized with janome"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "id": "5c064d22-2f95-4606-9884-1310e91d64e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "TOPIC_FILES = [\n",
    "    '../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8.janome.jsonl',\n",
    "    '../datasets/ntcir/ntcir-1/topics/topic0031-0083.utf8.janome.jsonl'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 212,
   "id": "964a3d8f-5165-4fca-be41-063e06368d91",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read each topic file with every field as a string (keeps qids such as 0001 as text)\n",
    "# and stack the files into a single DataFrame.\n",
    "df = pd.concat(\n",
    "    [pd.read_json(file, orient='records', lines=True, dtype=str) for file in TOPIC_FILES],\n",
    "    ignore_index=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 213,
   "id": "a8469bdc-9094-4340-8441-f19ec73cc0d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "83"
      ]
     },
     "execution_count": 213,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82f82ed2-e037-4d2b-ab63-1559724117a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 215,
   "id": "846bb783-78b6-4fa4-9eed-bbb4a3a31889",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use only the title field as the query\n",
    "topics = df[['qid','title']]\n",
    "# Rename the columns to the qid/query names PyTerrier expects\n",
    "topics = topics.rename(columns={'title': 'query'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "362640a7-4a75-4f3d-a1c0-d60a411106c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "topics.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 217,
   "id": "3f2a338a-6e4c-4afd-963b-7cad986f3678",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove symbols that PyTerrier does not accept in queries\n",
    "import re\n",
    "code = re.compile('[!\"#$%&\\'\\\\\\\\()*+,-./:;<=>?@[\\]^_`{|}~「」〔〕“”〈〉『』【】＆＊・（）＄＃＠。、？！｀＋￥％]')\n",
    "topics['query'] = topics['query'].str.replace(code, '', regex=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f28300e-70fb-47d1-bf3a-c3e8399ae85f",
   "metadata": {},
   "outputs": [],
   "source": [
    "topics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 219,
   "id": "0fe031c7-9f8e-4709-a062-6a13f1fc738d",
   "metadata": {},
   "outputs": [],
   "source": [
    "bm25_nostem = pt.BatchRetrieve(indexref, properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})\n",
    "bm25_nostem.setControl(\"wmodel\", \"BM25\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 220,
   "id": "d25f7add-cba7-4e97-8004-2f12ea2b6e07",
   "metadata": {},
   "outputs": [],
   "source": [
    "res = pd.DataFrame(columns=['qid', 'docid', 'docno', 'rank', 'score', 'query'])\n",
    "for index, row in topics.iterrows():\n",
    "    tmp_res = bm25_nostem.search(row.query)\n",
    "    # Set the qid of every result row to this topic's qid\n",
    "    tmp_res['qid'] = row.qid\n",
    "    res = pd.concat([res, tmp_res])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8207d356-efbb-4035-9345-867da5b417ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "res"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23e1e37a-faa4-4648-b3f7-1c0638e69a2f",
   "metadata": {},
   "source": [
    "---\n",
    "## Evaluation\n",
    "- To make nDCG usable, the qrel files map NTCIR's A/B/C judgments to relevance scores 2, 1, and 0 respectively (this mapping still needs review)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 223,
   "id": "41ddba0d-87c3-4d77-8628-9d9bee8b59fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "QREL_FILES = [\n",
    "    '../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030.utf8.jsonl',\n",
    "    '../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0031-0083.utf8.jsonl'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 227,
   "id": "de734979-0acd-4345-9561-c065af7103fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "qrels = pd.DataFrame(columns=['qid', 'docno', 'rel'])\n",
    "for file in QREL_FILES:\n",
    "    tmp_df = pd.read_json(file, orient='records', lines=True, dtype={'qid': str, 'docno': str, 'rel': int})\n",
    "    qrels = pd.concat([qrels, tmp_df])\n",
    "# Rename rel to label, the column name PyTerrier expects in qrels\n",
    "qrels = qrels.rename(columns={'rel': 'label'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcbfe773-5903-4b6b-8533-4d58fc9b0309",
   "metadata": {},
   "outputs": [],
   "source": [
    "qrels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 232,
   "id": "4996d27a-ae87-49ce-9486-7cf7a054e7c5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'map': 0.23468946206807043, 'ndcg': 0.4877188653564257}"
      ]
     },
     "execution_count": 232,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pt.Utils.evaluate(res, qrels, metrics=['map', 'ndcg'])"
   ]
  },
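  {
   "cell_type": "markdown",
   "id": "32eab1f9-6dbb-4217-b6f5-b243e2f66e33",
   "metadata": {},
   "source": [
    "---\n",
    "## Comparative experiments\n",
    "- Given the retrieval models, topics, and qrels, `pt.Experiment` runs everything from retrieval through evaluation in one call\n",
    "- Multiple retrieval models and evaluation metrics can be specified\n",
    "- When several models are compared, it also runs statistical significance tests"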
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 237,
   "id": "533a58f7-caab-4edb-bbe4-f0662f236f2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "tfidf = pt.BatchRetrieve(indexref, wmodel=\"TF_IDF\", properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})\n",
    "bm25 = pt.BatchRetrieve(indexref, wmodel=\"BM25\", properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 236,
   "id": "221b132e-8bfc-4493-89c2-09a8a8fbfc1d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>qid</th>\n",
       "      <th>measure</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>166</th>\n",
       "      <td>BR(BM25)</td>\n",
       "      <td>0001</td>\n",
       "      <td>map</td>\n",
       "      <td>0.068809</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>167</th>\n",
       "      <td>BR(BM25)</td>\n",
       "      <td>0001</td>\n",
       "      <td>ndcg</td>\n",
       "      <td>0.368975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>168</th>\n",
       "      <td>BR(BM25)</td>\n",
       "      <td>0002</td>\n",
       "      <td>map</td>\n",
       "      <td>0.620673</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169</th>\n",
       "      <td>BR(BM25)</td>\n",
       "      <td>0002</td>\n",
       "      <td>ndcg</td>\n",
       "      <td>0.750051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>170</th>\n",
       "      <td>BR(BM25)</td>\n",
       "      <td>0003</td>\n",
       "      <td>map</td>\n",
       "      <td>0.002815</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>161</th>\n",
       "      <td>BR(TF_IDF)</td>\n",
       "      <td>0081</td>\n",
       "      <td>ndcg</td>\n",
       "      <td>0.220244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>162</th>\n",
       "      <td>BR(TF_IDF)</td>\n",
       "      <td>0082</td>\n",
       "      <td>map</td>\n",
       "      <td>0.469421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>163</th>\n",
       "      <td>BR(TF_IDF)</td>\n",
       "      <td>0082</td>\n",
       "      <td>ndcg</td>\n",
       "      <td>0.686337</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>164</th>\n",
       "      <td>BR(TF_IDF)</td>\n",
       "      <td>0083</td>\n",
       "      <td>map</td>\n",
       "      <td>0.185141</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>165</th>\n",
       "      <td>BR(TF_IDF)</td>\n",
       "      <td>0083</td>\n",
       "      <td>ndcg</td>\n",
       "      <td>0.552784</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>332 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           name   qid measure     value\n",
       "166    BR(BM25)  0001     map  0.068809\n",
       "167    BR(BM25)  0001    ndcg  0.368975\n",
       "168    BR(BM25)  0002     map  0.620673\n",
       "169    BR(BM25)  0002    ndcg  0.750051\n",
       "170    BR(BM25)  0003     map  0.002815\n",
       "..          ...   ...     ...       ...\n",
       "161  BR(TF_IDF)  0081    ndcg  0.220244\n",
       "162  BR(TF_IDF)  0082     map  0.469421\n",
       "163  BR(TF_IDF)  0082    ndcg  0.686337\n",
       "164  BR(TF_IDF)  0083     map  0.185141\n",
       "165  BR(TF_IDF)  0083    ndcg  0.552784\n",
       "\n",
       "[332 rows x 4 columns]"
      ]
     },
     "execution_count": 236,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pt.Experiment(\n",
    "    [tfidf, bm25],\n",
    "    topics,\n",
    "    qrels,\n",
    "    eval_metrics=[\"map\", \"ndcg\"],\n",
    "    perquery=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 238,
   "id": "1d80599b-b218-48a3-bdf2-9d871f1927cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>map</th>\n",
       "      <th>ndcg</th>\n",
       "      <th>map +</th>\n",
       "      <th>map -</th>\n",
       "      <th>map p-value</th>\n",
       "      <th>map reject</th>\n",
       "      <th>map p-value corrected</th>\n",
       "      <th>ndcg +</th>\n",
       "      <th>ndcg -</th>\n",
       "      <th>ndcg p-value</th>\n",
       "      <th>ndcg reject</th>\n",
       "      <th>ndcg p-value corrected</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>BR(TF_IDF)</td>\n",
       "      <td>0.260845</td>\n",
       "      <td>0.512424</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>True</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>True</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>BR(BM25)</td>\n",
       "      <td>0.234689</td>\n",
       "      <td>0.487719</td>\n",
       "      <td>23.0</td>\n",
       "      <td>45.0</td>\n",
       "      <td>0.005328</td>\n",
       "      <td>True</td>\n",
       "      <td>0.010657</td>\n",
       "      <td>22.0</td>\n",
       "      <td>46.0</td>\n",
       "      <td>0.002379</td>\n",
       "      <td>True</td>\n",
       "      <td>0.004758</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         name       map      ndcg  map +  map -  map p-value  map reject  \\\n",
       "0  BR(TF_IDF)  0.260845  0.512424    NaN    NaN          NaN        True   \n",
       "1    BR(BM25)  0.234689  0.487719   23.0   45.0     0.005328        True   \n",
       "\n",
       "   map p-value corrected  ndcg +  ndcg -  ndcg p-value  ndcg reject  \\\n",
       "0                    NaN     NaN     NaN           NaN         True   \n",
       "1               0.010657    22.0    46.0      0.002379         True   \n",
       "\n",
       "   ndcg p-value corrected  \n",
       "0                     NaN  \n",
       "1                0.004758  "
      ]
     },
     "execution_count": 238,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pt.Experiment(\n",
    "    [tfidf, bm25],\n",
    "    topics,\n",
    "    qrels,\n",
    "    eval_metrics=[\"map\", \"ndcg\"],\n",
    "    baseline=0,\n",
    "    correction='holm'\n",
    ")"
   ]
  },
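  {
   "cell_type": "markdown",
   "id": "b2bc8acb-af3a-4b99-8a5b-4251d96daba5",
   "metadata": {},
   "source": [
    "---\n",
    "- TF_IDF beats BM25 on this collection: MAP 0.261 vs. 0.235 and nDCG 0.512 vs. 0.488, with both differences significant after Holm correction"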
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
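
The evaluation step in the notebook maps NTCIR's A/B/C judgments to the labels 2, 1, and 0 before computing nDCG and marks that mapping as still under review. To make the effect of the mapping concrete, here is a minimal sketch of nDCG over graded labels. It assumes linear gain and a log2(rank + 1) discount, which may not match every detail of the evaluation backend PyTerrier calls; the names `GRADE_TO_LABEL`, `dcg`, and `ndcg` and the three-document example are illustrative, not taken from the notebook.

```python
import math

# Assumed mapping from NTCIR-1 A/B/C judgments to graded labels (2, 1, 0).
GRADE_TO_LABEL = {"A": 2, "B": 1, "C": 0}


def dcg(labels):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(label / math.log2(rank + 1) for rank, label in enumerate(labels, start=1))


def ndcg(ranked_grades, all_judged_grades):
    """nDCG of a ranking, normalised by the ideal ordering of all judged documents."""
    ranked = [GRADE_TO_LABEL[g] for g in ranked_grades]
    ideal = sorted((GRADE_TO_LABEL[g] for g in all_judged_grades), reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy topic with one document of each grade (made-up data).
judged = ["A", "B", "C"]
print(ndcg(["A", "B", "C"], judged))  # ideal ordering
print(ndcg(["A", "C", "B"], judged))  # B pushed below C
```

Changing the values in `GRADE_TO_LABEL` (for example, giving A a larger gain) changes how strongly a misplaced A or B document is penalised, which is the decision the notebook leaves open.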
	{
	"cells": [
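	{
	"cell_type": "markdown",
	"id": "cbf8f009-a632-4e89-a2cc-5ff4ec2c7939",
	"metadata": {},
	"source": [
	"# PyTerrier on NTCIR-1\n",
	"\n",
	"## NTCIR-1 test collection\n",
	"\n",
	"- Source: Kando, et al. (1999). [Overview of IR Tasks at the First NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/IR-overview.pdf). In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, August 30 - September 1, 1999, pp.11-44.\n",
	"- Overview and how to obtain it: [Test collection usage procedure and memorandum (for research purposes)](http://research.nii.ac.jp/ntcir/permission/perm-ja.html#ntcir-1)\n",
	"> As a test collection for information retrieval, it contains document data (author abstracts from the Academic Conference Papers Database, 1988-1997: about 330,000 author abstracts of papers presented at 65 academic societies in Japan, more than half of them Japanese-English pairs), 83 search topics (in Japanese), and relevance judgments. It can be used for experiments on Japanese retrieval, Japanese-to-English cross-lingual retrieval, and Japanese-to-Japanese+English retrieval. As a collection for term extraction research, it also contains 2,000 Japanese documents drawn from the information retrieval test collection and annotated with linguistic tags. The entire test collection is provided by NII for research purposes.\n",
	"\n",
	"## Preprocessing\n",
	"\n",
	"- NTCIR1_Preprocessing.ipynb\n",
	"\n",
	"## Requirements\n",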
	"\n",
	"- JDK\n",
	"- GCC"
	]
	},
	{
	"cell_type": "markdown",
	"id": "73266f7b-4420-4e86-96c0-2b39e3f1493d",
	"metadata": {},
	"source": [
	"---\n",
	"## Folder layout"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 55,
	"id": "36948764-f97e-412b-b3d0-166066ef4c6a",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"total 0\n",
	"drwxr-xr-x 1 1024 users 10 Dec 17 04:50 datasets\n",
	"drwxr-xr-x 1 1024 users 10 Dec 17 05:53 indexes\n",
	"drwxr-xr-x 1 1024 users 326 Dec 29 23:54 notebooks\n",
	"drwxr-xr-x 1 1024 users 92 Dec 17 05:36 vendors\n"
	]
	}
	],
	"source": [
	"!ls -l ../"
	]
	},
	{
	"cell_type": "markdown",
	"id": "50d60c0b-19c5-4c27-b2cb-feb855f8cf96",
	"metadata": {},
	"source": [
	"---\n",
	"## Checking the datasets"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 56,
	"id": "fd800fc3-e3d2-40ce-a11d-1deaa1e620de",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"total 224\n",
	"-rw-r--r-- 1 1024 users 19925 Oct 13 1999 topic0001-0030\n",
	"-rw-r--r-- 1 1024 users 27132 Dec 17 04:58 topic0001-0030.utf8\n",
	"-rw-r--r-- 1 1024 users 4307 Dec 29 14:49 topic0001-0030.utf8.janome.jsonl\n",
	"-rw-r--r-- 1 1024 users 3987 Dec 29 14:49 topic0001-0030.utf8.jsonl\n",
	"-rw-r--r-- 1 1024 users 59905 Nov 1 1999 topic0031-0083\n",
	"-rw-r--r-- 1 1024 users 80749 Dec 17 04:58 topic0031-0083.utf8\n",
	"-rw-r--r-- 1 1024 users 10090 Dec 29 14:49 topic0031-0083.utf8.janome.jsonl\n",
	"-rw-r--r-- 1 1024 users 9084 Dec 29 14:49 topic0031-0083.utf8.jsonl\n"
	]
	}
	],
	"source": [
	"!ls -l ../datasets/ntcir/ntcir-1/topics"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 57,
	"id": "3dff1047-b4bf-49a8-abcf-33bb52bdb9a2",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"total 1434324\n",
	"drwxr-xr-x 1 1024 users 86 Dec 29 12:56 jsonl\n",
	"-rw-r--r-- 1 1024 users 326589786 Oct 22 1999 ntc1-j1\n",
	"-rw-r--r-- 1 1024 users 435408365 Dec 17 05:12 ntc1-j1.utf8\n",
	"-rw-r--r-- 1 1024 users 363608976 Dec 29 16:09 ntc1-j1.utf8.janome.jsonl\n",
	"-rw-r--r-- 1 1024 users 305321021 Dec 29 16:09 ntc1-j1.utf8.jsonl\n",
	"-rw-r--r-- 1 1024 users 3603 Dec 29 12:41 ntc1-j1.utf8.sample\n",
	"-rw-r--r-- 1 1024 users 3570750 Oct 22 1999 rel1_ntc1-j1_0001-0030\n",
	"-rw-r--r-- 1 1024 users 3777038 Nov 1 1999 rel1_ntc1-j1_0031-0083\n",
	"-rw-r--r-- 1 1024 users 3570750 Oct 22 1999 rel2_ntc1-j1_0001-0030\n",
	"-rw-r--r-- 1 1024 users 3570750 Dec 29 23:43 rel2_ntc1-j1_0001-0030.utf8\n",
	"-rw-r--r-- 1 1024 users 7935000 Dec 29 23:43 rel2_ntc1-j1_0001-0030.utf8.jsonl\n",
	"-rw-r--r-- 1 1024 users 3777038 Nov 1 1999 rel2_ntc1-j1_0031-0083\n",
	"-rw-r--r-- 1 1024 users 3855273 Dec 29 23:43 rel2_ntc1-j1_0031-0083.utf8\n",
	"-rw-r--r-- 1 1024 users 7735380 Dec 29 23:43 rel2_ntc1-j1_0031-0083.utf8.jsonl\n"
	]
	}
	],
	"source": [
	"!ls -l ../datasets/ntcir/ntcir-1/mlir"
	]
	},
	{
	"cell_type": "markdown",
	"id": "f09b0841-c379-4063-9fc8-5ddaeddbafa3",
	"metadata": {},
	"source": [
	"---\n",
	"## Installing PyTerrier"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 58,
	"id": "fcf203a1-d98a-4639-8bb2-c178131de7bc",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"'/usr/lib/jvm/java-11-openjdk-amd64'"
	]
	},
	"execution_count": 58,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import os\n",
	"JAVA_HOME = '/usr/lib/jvm/java-11-openjdk-amd64'\n",
	"os.environ['JAVA_HOME'] = JAVA_HOME\n",
	"os.getenv('JAVA_HOME')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "7082012b-bb4c-46db-b792-2708e9ecdc34",
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"import sys\n",
	"!{sys.executable} -m pip install python-terrier"
	]
	},
	{
	"cell_type": "markdown",
	"id": "97880798-65be-4bfd-ae5c-582e647f2a08",
	"metadata": {},
	"source": [
	"---\n",
	"## Indexing the document corpus\n",
	"\n",
	"- Load the JSONL file, already tokenized (word-segmented) with janome, into a pandas DataFrame (only feasible when the corpus fits in memory)\n",
	"- Rename the columns\n",
	"- Index with PyTerrier"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 61,
	"id": "602d90c2-a46b-4486-ac36-50021969ff05",
	"metadata": {},
	"outputs": [],
	"source": [
	"import pandas as pd\n",
	"\n",
	"# One JSON object per line; the whole corpus is loaded into memory at once.\n",
	"in_file = '../datasets/ntcir/ntcir-1/mlir/ntc1-j1.utf8.janome.jsonl'\n",
	"df = pd.read_json(in_file, orient='records', lines=True)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 62,
	"id": "5d93e9b7-173b-4c0d-9ba5-f9a196f6719f",
	"metadata": {},
	"outputs": [],
	"source": [
	"# Rename the columns to docno/text, the names used when indexing below\n",
	"df = df.rename(columns={'id': 'docno', 'contents': 'text'})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "5e17e019-78e8-4160-be4f-87990412a349",
	"metadata": {},
	"outputs": [],
	"source": [
	"df.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"id": "f9718054-882c-42d9-ba1b-05b3945dec1c",
	"metadata": {},
	"outputs": [],
	"source": [
	"!mkdir -p ../indexes/ntcir/ntcir-1/mlir/pyterrier"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 65,
	"id": "1d0a7ca6-dbc5-4fae-b304-82b3d796a1dd",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"00:00:16.413 [main] WARN org.terrier.structures.indexing.Indexer - Adding an empty document to the index (gakkai-0000119075) - further warnings are suppressed\n",
	"00:01:36.979 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents\n"
	]
	}
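	],
	"source": [
	"import pyterrier as pt\n",
	"if not pt.started():\n",
	"    pt.init()\n",
	"pd_indexer = pt.DFIndexer(\"../indexes/ntcir/ntcir-1/mlir/pyterrier\")\n",
	"# The text is already split into space-separated tokens by janome, so only a\n",
	"# Unicode-aware tokeniser is needed; an empty term pipeline turns off Terrier's\n",
	"# default stopword removal and English stemming.\n",
	"pd_indexer.setProperty(\"tokeniser\", \"UTFTokeniser\")\n",
	"pd_indexer.setProperty(\"termpipelines\", \"\")\n",
	"indexref = pd_indexer.index(df[\"text\"], df[\"docno\"])"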
	]
	},
	{
	"cell_type": "markdown",
	"id": "263d9a9f-06ab-4390-8ca3-6f77b1c7a780",
	"metadata": {},
	"source": [
	"---\n",
	"## Retrieval (single query)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "6bcd5c34-47ad-4838-8bb2-c31f4ff4c568",
	"metadata": {},
	"outputs": [],
	"source": [
	"bm25_nostem = pt.BatchRetrieve(indexref, properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})\n",
	"bm25_nostem.setControl(\"wmodel\", \"BM25\")\n",
	"bm25_nostem.search('特徴 次元 リダクション')"
	]
	},
	{
	"cell_type": "markdown",
	"id": "adfcb661-ba71-40b6-9fa4-f55677bd0de6",
	"metadata": {},
	"source": [
	"---\n",
	"## Retrieval (topic files)\n",
	"- As with the corpus, load the files that were already tokenized with janome"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 204,
	"id": "5c064d22-2f95-4606-9884-1310e91d64e8",
	"metadata": {},
	"outputs": [],
	"source": [
	"TOPIC_FILES = [\n",
	" '../datasets/ntcir/ntcir-1/topics/topic0001-0030.utf8.janome.jsonl',\n",
	" '../datasets/ntcir/ntcir-1/topics/topic0031-0083.utf8.janome.jsonl'\n",
	"]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 212,
	"id": "964a3d8f-5165-4fca-be41-063e06368d91",
	"metadata": {},
	"outputs": [],
	"source": [
	"# Read each topic file with every field as a string (keeps qids such as 0001 as text)\n",
	"# and stack the files into a single DataFrame.\n",
	"df = pd.concat(\n",
	"    [pd.read_json(file, orient='records', lines=True, dtype=str) for file in TOPIC_FILES],\n",
	"    ignore_index=True\n",
	")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 213,
	"id": "a8469bdc-9094-4340-8441-f19ec73cc0d0",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"83"
	]
	},
	"execution_count": 213,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"len(df)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "82f82ed2-e037-4d2b-ab63-1559724117a5",
	"metadata": {},
	"outputs": [],
	"source": [
	"df.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 215,
	"id": "846bb783-78b6-4fa4-9eed-bbb4a3a31889",
	"metadata": {},
	"outputs": [],
	"source": [
	"# Use only the title field as the query\n",
	"topics = df[['qid','title']]\n",
	"# Rename the columns to the qid/query names PyTerrier expects\n",
	"topics = topics.rename(columns={'title': 'query'})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "362640a7-4a75-4f3d-a1c0-d60a411106c3",
	"metadata": {},
	"outputs": [],
	"source": [
	"topics.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 217,
	"id": "3f2a338a-6e4c-4afd-963b-7cad986f3678",
	"metadata": {},
	"outputs": [],
	"source": [
	"# Remove symbols that PyTerrier does not accept in queries\n",
	"import re\n",
	"code = re.compile('[!\"#$%&\\'\\\\\\\\()*+,-./:;<=>?@[\\\\]^_`{|}~「」〔〕“”〈〉『』【】＆＊・（）＄＃＠。、？！｀＋￥％]')\n",
	"topics['query'] = topics['query'].map(lambda q: code.sub('', q))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "5f28300e-70fb-47d1-bf3a-c3e8399ae85f",
	"metadata": {},
	"outputs": [],
	"source": [
	"topics"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 219,
	"id": "0fe031c7-9f8e-4709-a062-6a13f1fc738d",
	"metadata": {},
	"outputs": [],
	"source": [
	"bm25_nostem = pt.BatchRetrieve(indexref, properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})\n",
	"bm25_nostem.setControl(\"wmodel\", \"BM25\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 220,
	"id": "d25f7add-cba7-4e97-8004-2f12ea2b6e07",
	"metadata": {},
	"outputs": [],
	"source": [
	"# Run each topic through BM25 and collect the per-topic result frames\n",
	"results = []\n",
	"for _, row in topics.iterrows():\n",
	"    tmp_res = bm25_nostem.search(row['query'])\n",
	"    # Overwrite the qid column with this topic's qid\n",
	"    tmp_res['qid'] = row['qid']\n",
	"    results.append(tmp_res)\n",
	"res = pd.concat(results, ignore_index=True)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "8207d356-efbb-4035-9345-867da5b417ec",
	"metadata": {},
	"outputs": [],
	"source": [
	"res"
	]
	},
	{
	"cell_type": "markdown",
	"id": "23e1e37a-faa4-4648-b3f7-1c0638e69a2f",
	"metadata": {},
	"source": [
	"---\n",
	"## Evaluation\n",
	"- To make nDCG usable, the qrel files map NTCIR's A/B/C relevance judgments to relevance scores 2, 1, and 0 respectively (this mapping still needs review)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 223,
	"id": "41ddba0d-87c3-4d77-8628-9d9bee8b59fa",
	"metadata": {},
	"outputs": [],
	"source": [
	"QREL_FILES = [\n",
	" '../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0001-0030.utf8.jsonl',\n",
	" '../datasets/ntcir/ntcir-1/mlir/rel2_ntc1-j1_0031-0083.utf8.jsonl'\n",
	"]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 227,
	"id": "de734979-0acd-4345-9561-c065af7103fd",
	"metadata": {},
	"outputs": [],
	"source": [
	"# Load the qrel files, keeping IDs as strings and relevance as integers\n",
	"qrels = pd.concat(\n",
	"    [pd.read_json(file, orient='records', lines=True, dtype={'qid': str, 'docno': str, 'rel': int}) for file in QREL_FILES],\n",
	"    ignore_index=True\n",
	")\n",
	"# Rename the relevance column to 'label', the name PyTerrier expects\n",
	"qrels = qrels.rename(columns={'rel': 'label'})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "bcbfe773-5903-4b6b-8533-4d58fc9b0309",
	"metadata": {},
	"outputs": [],
	"source": [
	"qrels"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 232,
	"id": "4996d27a-ae87-49ce-9486-7cf7a054e7c5",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"{'map': 0.23468946206807043, 'ndcg': 0.4877188653564257}"
	]
	},
	"execution_count": 232,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pt.Utils.evaluate(res, qrels, metrics=['map', 'ndcg'])"
	]
	},
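	{
	"cell_type": "markdown",
	"id": "32eab1f9-6dbb-4217-b6f5-b243e2f66e33",
	"metadata": {},
	"source": [
	"---\n",
	"## Comparative experiments\n",
	"- Given the retrieval models, topics, and qrels, `pt.Experiment` runs everything from retrieval through evaluation in one call\n",
	"- Multiple retrieval models and evaluation metrics can be specified\n",
	"- When several models are compared, it also runs statistical significance tests"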
	]
	},
	{
	"cell_type": "code",
	"execution_count": 237,
	"id": "533a58f7-caab-4edb-bbe4-f0662f236f2a",
	"metadata": {},
	"outputs": [],
	"source": [
	"tfidf = pt.BatchRetrieve(indexref, wmodel=\"TF_IDF\", properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})\n",
	"bm25 = pt.BatchRetrieve(indexref, wmodel=\"BM25\", properties={\"tokeniser\": \"UTFTokeniser\", \"termpipelines\": \"\"})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 236,
	"id": "221b132e-8bfc-4493-89c2-09a8a8fbfc1d",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>name</th>\n",
	" <th>qid</th>\n",
	" <th>measure</th>\n",
	" <th>value</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>166</th>\n",
	" <td>BR(BM25)</td>\n",
	" <td>0001</td>\n",
	" <td>map</td>\n",
	" <td>0.068809</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>167</th>\n",
	" <td>BR(BM25)</td>\n",
	" <td>0001</td>\n",
	" <td>ndcg</td>\n",
	" <td>0.368975</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>168</th>\n",
	" <td>BR(BM25)</td>\n",
	" <td>0002</td>\n",
	" <td>map</td>\n",
	" <td>0.620673</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>169</th>\n",
	" <td>BR(BM25)</td>\n",
	" <td>0002</td>\n",
	" <td>ndcg</td>\n",
	" <td>0.750051</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>170</th>\n",
	" <td>BR(BM25)</td>\n",
	" <td>0003</td>\n",
	" <td>map</td>\n",
	" <td>0.002815</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>...</th>\n",
	" <td>...</td>\n",
	" <td>...</td>\n",
	" <td>...</td>\n",
	" <td>...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>161</th>\n",
	" <td>BR(TF_IDF)</td>\n",
	" <td>0081</td>\n",
	" <td>ndcg</td>\n",
	" <td>0.220244</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>162</th>\n",
	" <td>BR(TF_IDF)</td>\n",
	" <td>0082</td>\n",
	" <td>map</td>\n",
	" <td>0.469421</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>163</th>\n",
	" <td>BR(TF_IDF)</td>\n",
	" <td>0082</td>\n",
	" <td>ndcg</td>\n",
	" <td>0.686337</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>164</th>\n",
	" <td>BR(TF_IDF)</td>\n",
	" <td>0083</td>\n",
	" <td>map</td>\n",
	" <td>0.185141</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>165</th>\n",
	" <td>BR(TF_IDF)</td>\n",
	" <td>0083</td>\n",
	" <td>ndcg</td>\n",
	" <td>0.552784</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"<p>332 rows × 4 columns</p>\n",
	"</div>"
	],
	"text/plain": [
	" name qid measure value\n",
	"166 BR(BM25) 0001 map 0.068809\n",
	"167 BR(BM25) 0001 ndcg 0.368975\n",
	"168 BR(BM25) 0002 map 0.620673\n",
	"169 BR(BM25) 0002 ndcg 0.750051\n",
	"170 BR(BM25) 0003 map 0.002815\n",
	".. ... ... ... ...\n",
	"161 BR(TF_IDF) 0081 ndcg 0.220244\n",
	"162 BR(TF_IDF) 0082 map 0.469421\n",
	"163 BR(TF_IDF) 0082 ndcg 0.686337\n",
	"164 BR(TF_IDF) 0083 map 0.185141\n",
	"165 BR(TF_IDF) 0083 ndcg 0.552784\n",
	"\n",
	"[332 rows x 4 columns]"
	]
	},
	"execution_count": 236,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pt.Experiment(\n",
	" [tfidf, bm25],\n",
	" topics,\n",
	" qrels,\n",
	" eval_metrics=[\"map\", \"ndcg\"],\n",
	" perquery=True\n",
	")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 238,
	"id": "1d80599b-b218-48a3-bdf2-9d871f1927cd",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>name</th>\n",
	" <th>map</th>\n",
	" <th>ndcg</th>\n",
	" <th>map +</th>\n",
	" <th>map -</th>\n",
	" <th>map p-value</th>\n",
	" <th>map reject</th>\n",
	" <th>map p-value corrected</th>\n",
	" <th>ndcg +</th>\n",
	" <th>ndcg -</th>\n",
	" <th>ndcg p-value</th>\n",
	" <th>ndcg reject</th>\n",
	" <th>ndcg p-value corrected</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>BR(TF_IDF)</td>\n",
	" <td>0.260845</td>\n",
	" <td>0.512424</td>\n",
	" <td>NaN</td>\n",
	" <td>NaN</td>\n",
	" <td>NaN</td>\n",
	" <td>True</td>\n",
	" <td>NaN</td>\n",
	" <td>NaN</td>\n",
	" <td>NaN</td>\n",
	" <td>NaN</td>\n",
	" <td>True</td>\n",
	" <td>NaN</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>BR(BM25)</td>\n",
	" <td>0.234689</td>\n",
	" <td>0.487719</td>\n",
	" <td>23.0</td>\n",
	" <td>45.0</td>\n",
	" <td>0.005328</td>\n",
	" <td>True</td>\n",
	" <td>0.010657</td>\n",
	" <td>22.0</td>\n",
	" <td>46.0</td>\n",
	" <td>0.002379</td>\n",
	" <td>True</td>\n",
	" <td>0.004758</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" name map ndcg map + map - map p-value map reject \\\n",
	"0 BR(TF_IDF) 0.260845 0.512424 NaN NaN NaN True \n",
	"1 BR(BM25) 0.234689 0.487719 23.0 45.0 0.005328 True \n",
	"\n",
	" map p-value corrected ndcg + ndcg - ndcg p-value ndcg reject \\\n",
	"0 NaN NaN NaN NaN True \n",
	"1 0.010657 22.0 46.0 0.002379 True \n",
	"\n",
	" ndcg p-value corrected \n",
	"0 NaN \n",
	"1 0.004758 "
	]
	},
	"execution_count": 238,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pt.Experiment(\n",
	" [tfidf, bm25],\n",
	" topics,\n",
	" qrels,\n",
	" eval_metrics=[\"map\", \"ndcg\"],\n",
	" baseline=0,\n",
	" correction='holm'\n",
	")"
	]
	},
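	{
	"cell_type": "markdown",
	"id": "b2bc8acb-af3a-4b99-8a5b-4251d96daba5",
	"metadata": {},
	"source": [
	"---\n",
	"- TF-IDF outperforms BM25 here on both MAP (0.261 vs. 0.235) and nDCG (0.512 vs. 0.488), and both differences remain significant after Holm correction"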
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3 (ipykernel)",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.9.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 5
	}