@ArneBinder
Last active July 8, 2019 19:36
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import json\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def load(fn, getter=None):\n",
"    # Load a JSON file and return (data, first_record); first_record is None on failure.\n",
"    with open(fn) as f:\n",
"        data = json.load(f)\n",
"    try:\n",
"        if getter is not None:\n",
"            data = list(getter(data))\n",
"        if len(data) > 100:\n",
"            print('#record: %i' % len(data))\n",
"        k0 = 0\n",
"        if isinstance(data, dict):\n",
"            # use the first key when the file holds a dict rather than a list\n",
"            k0 = list(data.keys())[0]\n",
"        d0 = data[k0]\n",
"    except Exception:\n",
"        print('error while getting first element')\n",
"        d0 = None\n",
"    return data, d0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sources\n",
"* http://nlpprogress.com/english/dialogue.html and http://nlpprogress.com/english/question_answering.html\n",
"* https://parl.ai/projects/ and https://parl.ai/docs/tasks.html\n",
"\n",
"### interesting:\n",
"* PersonaChat\n",
"* Multi-WOZ\n",
"* Reddit\n",
"* CoQA\n",
"* QuAC\n",
"* ShARC\n",
"\n",
"### further reading\n",
"* [A Repository of Conversational Datasets (Henderson et al. 2019)](https://arxiv.org/pdf/1904.06472.pdf)\n",
"* [What makes a good conversation? How controllable attributes affect human judgment (See et al. 2019)](https://arxiv.org/pdf/1902.08654.pdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PersonaChat (ConvAI2)\n",
"http://convai.io/\n",
"https://arxiv.org/pdf/1801.07243.pdf\n",
"\n",
"~11,000 dialogs (162,064 utterances)\n",
"\n",
"Ratings (`eval_score`) are available for an additional bot-human dataset (~3,500 dialogs)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#record: 3495\n"
]
},
{
"data": {
"text/plain": [
"{'dialog_id': '0xab38710',\n",
" 'dialog': [{'id': 0,\n",
" 'sender': 'participant2',\n",
" 'text': 'I love iphone! i just bought new iphone!',\n",
" 'sender_class': 'Bot'},\n",
" {'id': 1,\n",
" 'sender': 'participant1',\n",
" 'text': \"Thats good for you, i'm not very into new tech\",\n",
" 'sender_class': 'Human'},\n",
" {'id': 2,\n",
" 'sender': 'participant2',\n",
" 'text': 'I am a college student and i am a college student',\n",
" 'sender_class': 'Bot'},\n",
" {'id': 3,\n",
" 'sender': 'participant1',\n",
" 'text': 'I am go to gym and live on donations',\n",
" 'sender_class': 'Human'},\n",
" {'id': 4,\n",
" 'sender': 'participant2',\n",
" 'text': 'I am a vegan and i am in the midwest',\n",
" 'sender_class': 'Bot'},\n",
" {'id': 5,\n",
" 'sender': 'participant1',\n",
" 'text': 'So vegan... i have dogs maybe i should told then that they may eat cheap salads insted of meat',\n",
" 'sender_class': 'Human'},\n",
" {'id': 6,\n",
" 'sender': 'participant2',\n",
" 'text': 'I would not mind having them in the office that would be hard for me',\n",
" 'sender_class': 'Bot'},\n",
" {'id': 7,\n",
" 'sender': 'participant1',\n",
" 'text': 'Dogs or vegan in office?',\n",
" 'sender_class': 'Human'},\n",
" {'id': 8,\n",
" 'sender': 'participant2',\n",
" 'text': 'I am a vegetarian so i am vegan',\n",
" 'sender_class': 'Bot'},\n",
" {'id': 9,\n",
" 'sender': 'participant1',\n",
" 'text': 'Strange answer',\n",
" 'sender_class': 'Human'},\n",
" {'id': 10,\n",
" 'sender': 'participant2',\n",
" 'text': 'I guess i just do not know what to ps4 experience with.',\n",
" 'sender_class': 'Bot'}],\n",
" 'bot_profile': ['i have amazing children and grandchildren.',\n",
" 'i can sew my own clothes.',\n",
" 'i had cancer but its gone now.',\n",
" 'i am retired and living the great life.',\n",
" 'i do not have a smartphone.'],\n",
" 'user_profile': ['two dogs live with me.',\n",
" 'i like doing the treadmill and rowing machine.',\n",
" \"eating is something i do when i'm bored.\",\n",
" 'i have short hair.',\n",
" 'i go to the gym regularly.'],\n",
" 'eval_score': 1,\n",
" 'profile_match': 0}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data, d0 = load('/mnt/DATA/ML/data/corpora/dialog/PersonaCHAT/summer_wild_evaluation_dialogs_indent.json')\n",
"d0"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"eval_scores = pd.DataFrame({'eval_score': [r['eval_score'] for r in data]})"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f3728efa7b8>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"eval_scores.plot.hist()"
]
},
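{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch for looking beyond the raw score histogram: it tabulates dialog length next to `eval_score` and `profile_match`. It assumes every record has the `dialog`, `eval_score` and `profile_match` fields seen in `d0`; `dialog_stats` is a hypothetical helper name, not part of the ConvAI2 tooling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def dialog_stats(records):\n",
"    # one row per dialog: total turns, bot turns, rating and profile match\n",
"    rows = []\n",
"    for r in records:\n",
"        turns = r['dialog']\n",
"        rows.append({\n",
"            'n_turns': len(turns),\n",
"            'n_bot_turns': sum(1 for t in turns if t['sender_class'] == 'Bot'),\n",
"            'eval_score': r['eval_score'],\n",
"            'profile_match': r['profile_match'],\n",
"        })\n",
"    return pd.DataFrame(rows)\n",
"\n",
"# mean dialog length per rating (assumes `data` from the PersonaChat cell above)\n",
"stats = dialog_stats(data)\n",
"stats.groupby('eval_score')['n_turns'].mean()"
]
},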
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-WOZ\n",
"https://www.repository.cam.ac.uk/handle/1810/280608\n",
"\n",
"~10,000 dialogs"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#record: 10438\n",
"error while getting first element\n"
]
}
],
"source": [
"data_mwoz, d0_mwoz = load('/mnt/DATA/ML/data/corpora/dialog/MULTIWOZ2 2/data.json')\n",
"d0_mwoz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reddit\n",
"https://github.com/PolyAI-LDN/conversational-datasets https://arxiv.org/pdf/1904.06472.pdf"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'body': 'Yes. ',\n",
" 'score_hidden': None,\n",
" 'archived': None,\n",
" 'name': None,\n",
" 'author': 'SpiritualAlternative',\n",
" 'author_flair_text': None,\n",
" 'downs': None,\n",
" 'created_utc': '1520550110',\n",
" 'subreddit_id': 't5_2r3gv',\n",
" 'link_id': 't3_82ymmx',\n",
" 'parent_id': 't1_dvedzte',\n",
" 'score': '1',\n",
" 'retrieved_on': '1524963292',\n",
" 'controversiality': '0',\n",
" 'gilded': '0',\n",
" 'id': 'dveh7r5',\n",
" 'subreddit': 'MachineLearning',\n",
" 'ups': None,\n",
" 'distinguished': None,\n",
" 'author_flair_css_class': None}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_reddit, d0_reddit = load('/mnt/DATA/ML/data/corpora/dialog/reddit/thread.json')\n",
"d0_reddit"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f36f02143c8>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# THIS IS AN EXTRACT!\n",
"eval_scores_reddit = pd.DataFrame({'eval_score': [float(r['score']) for r in data_reddit]})\n",
"eval_scores_reddit.plot.hist()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#records: 13\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>archived</th>\n",
" <th>author</th>\n",
" <th>author_flair_css_class</th>\n",
" <th>author_flair_text</th>\n",
" <th>body</th>\n",
" <th>controversiality</th>\n",
" <th>created_utc</th>\n",
" <th>distinguished</th>\n",
" <th>downs</th>\n",
" <th>gilded</th>\n",
" <th>id</th>\n",
" <th>link_id</th>\n",
" <th>name</th>\n",
" <th>parent_id</th>\n",
" <th>retrieved_on</th>\n",
" <th>score</th>\n",
" <th>score_hidden</th>\n",
" <th>subreddit</th>\n",
" <th>subreddit_id</th>\n",
" <th>ups</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>None</td>\n",
" <td>SpiritualAlternative</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Yes.</td>\n",
" <td>0</td>\n",
" <td>1520550110</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dveh7r5</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t1_dvedzte</td>\n",
" <td>1524963292</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>None</td>\n",
" <td>Phylliida</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>If you’d just like to make a chatbot similar t...</td>\n",
" <td>0</td>\n",
" <td>1520581389</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvf6th1</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524976202</td>\n",
" <td>2</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>None</td>\n",
" <td>Ha7den</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Im new to the field but looking at experimenti...</td>\n",
" <td>0</td>\n",
" <td>1520538367</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dve4ghe</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524956856</td>\n",
" <td>3</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>r4and0muser9482</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>From historical perspective, I can recommend w...</td>\n",
" <td>0</td>\n",
" <td>1520526881</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvdrdkt</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524950020</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>None</td>\n",
" <td>JosephLChu</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>I'm not sure what Amazon plans to do with the ...</td>\n",
" <td>0</td>\n",
" <td>1520713572</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvhvg4m</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t1_dvhn7hh</td>\n",
" <td>1525024681</td>\n",
" <td>2</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>None</td>\n",
" <td>Speech_xyz</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Chatbots are not dialogue systems.</td>\n",
" <td>0</td>\n",
" <td>1520596358</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvfdfd4</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t1_dvedzte</td>\n",
" <td>1524979453</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>None</td>\n",
" <td>millenniumpianist</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>A lot of [Sordoni et al. papers] (https://pdfs...</td>\n",
" <td>0</td>\n",
" <td>1520545138</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dveby2j</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524960640</td>\n",
" <td>2</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>None</td>\n",
" <td>JosephLChu</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>This handbook for the Dialogue State Tracking ...</td>\n",
" <td>0</td>\n",
" <td>1520537842</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dve3v95</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524956559</td>\n",
" <td>3</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>None</td>\n",
" <td>lee_more_touchy</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>http://www.opendial-toolkit.net/ is worth a look.</td>\n",
" <td>0</td>\n",
" <td>1520529426</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvdu93a</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524951568</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>None</td>\n",
" <td>Pieranha</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Very helpful, thanks! You say 'yet', does that...</td>\n",
" <td>0</td>\n",
" <td>1520704245</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvhn7hh</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t1_dvhktmd</td>\n",
" <td>1525020075</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>None</td>\n",
" <td>Pieranha</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Do you know of other workshops/conferences tha...</td>\n",
" <td>0</td>\n",
" <td>1520700352</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvhjrkc</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t1_dve3v95</td>\n",
" <td>1525018234</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>None</td>\n",
" <td>JosephLChu</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>The only other thing that comes to mind right ...</td>\n",
" <td>0</td>\n",
" <td>1520701560</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvhktmd</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t1_dvhjrkc</td>\n",
" <td>1525018805</td>\n",
" <td>2</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>None</td>\n",
" <td>SEND_ME_NIPS_PAPERS</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>You mean chatbots?</td>\n",
" <td>0</td>\n",
" <td>1520547080</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" <td>dvedzte</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>None</td>\n",
" <td>t3_82ymmx</td>\n",
" <td>1524961682</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>MachineLearning</td>\n",
" <td>t5_2r3gv</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" archived author author_flair_css_class author_flair_text \\\n",
"0 None SpiritualAlternative None None \n",
"1 None Phylliida None None \n",
"2 None Ha7den None None \n",
"3 None r4and0muser9482 None None \n",
"4 None JosephLChu None None \n",
"5 None Speech_xyz None None \n",
"6 None millenniumpianist None None \n",
"7 None JosephLChu None None \n",
"8 None lee_more_touchy None None \n",
"9 None Pieranha None None \n",
"10 None Pieranha None None \n",
"11 None JosephLChu None None \n",
"12 None SEND_ME_NIPS_PAPERS None None \n",
"\n",
" body controversiality \\\n",
"0 Yes. 0 \n",
"1 If you’d just like to make a chatbot similar t... 0 \n",
"2 Im new to the field but looking at experimenti... 0 \n",
"3 From historical perspective, I can recommend w... 0 \n",
"4 I'm not sure what Amazon plans to do with the ... 0 \n",
"5 Chatbots are not dialogue systems. 0 \n",
"6 A lot of [Sordoni et al. papers] (https://pdfs... 0 \n",
"7 This handbook for the Dialogue State Tracking ... 0 \n",
"8 http://www.opendial-toolkit.net/ is worth a look. 0 \n",
"9 Very helpful, thanks! You say 'yet', does that... 0 \n",
"10 Do you know of other workshops/conferences tha... 0 \n",
"11 The only other thing that comes to mind right ... 0 \n",
"12 You mean chatbots? 0 \n",
"\n",
" created_utc distinguished downs gilded id link_id name \\\n",
"0 1520550110 None None 0 dveh7r5 t3_82ymmx None \n",
"1 1520581389 None None 0 dvf6th1 t3_82ymmx None \n",
"2 1520538367 None None 0 dve4ghe t3_82ymmx None \n",
"3 1520526881 None None 0 dvdrdkt t3_82ymmx None \n",
"4 1520713572 None None 0 dvhvg4m t3_82ymmx None \n",
"5 1520596358 None None 0 dvfdfd4 t3_82ymmx None \n",
"6 1520545138 None None 0 dveby2j t3_82ymmx None \n",
"7 1520537842 None None 0 dve3v95 t3_82ymmx None \n",
"8 1520529426 None None 0 dvdu93a t3_82ymmx None \n",
"9 1520704245 None None 0 dvhn7hh t3_82ymmx None \n",
"10 1520700352 None None 0 dvhjrkc t3_82ymmx None \n",
"11 1520701560 None None 0 dvhktmd t3_82ymmx None \n",
"12 1520547080 None None 0 dvedzte t3_82ymmx None \n",
"\n",
" parent_id retrieved_on score score_hidden subreddit subreddit_id \\\n",
"0 t1_dvedzte 1524963292 1 None MachineLearning t5_2r3gv \n",
"1 t3_82ymmx 1524976202 2 None MachineLearning t5_2r3gv \n",
"2 t3_82ymmx 1524956856 3 None MachineLearning t5_2r3gv \n",
"3 t3_82ymmx 1524950020 1 None MachineLearning t5_2r3gv \n",
"4 t1_dvhn7hh 1525024681 2 None MachineLearning t5_2r3gv \n",
"5 t1_dvedzte 1524979453 1 None MachineLearning t5_2r3gv \n",
"6 t3_82ymmx 1524960640 2 None MachineLearning t5_2r3gv \n",
"7 t3_82ymmx 1524956559 3 None MachineLearning t5_2r3gv \n",
"8 t3_82ymmx 1524951568 1 None MachineLearning t5_2r3gv \n",
"9 t1_dvhktmd 1525020075 1 None MachineLearning t5_2r3gv \n",
"10 t1_dve3v95 1525018234 1 None MachineLearning t5_2r3gv \n",
"11 t1_dvhjrkc 1525018805 2 None MachineLearning t5_2r3gv \n",
"12 t3_82ymmx 1524961682 1 None MachineLearning t5_2r3gv \n",
"\n",
" ups \n",
"0 None \n",
"1 None \n",
"2 None \n",
"3 None \n",
"4 None \n",
"5 None \n",
"6 None \n",
"7 None \n",
"8 None \n",
"9 None \n",
"10 None \n",
"11 None \n",
"12 None "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# THIS IS AN EXTRACT!\n",
"pd_reddit = pd.DataFrame(data_reddit)\n",
"print('#records: %i' % len(pd_reddit))\n",
"pd_reddit"
]
},
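{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `parent_id` column encodes the reply structure: a `t1_` prefix points at another comment, a `t3_` prefix at the submission itself. A minimal sketch, assuming `data_reddit` is the list of comment dicts loaded above, that extracts (parent, reply) text pairs; `reply_pairs` is a hypothetical helper, not part of the PolyAI scripts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def reply_pairs(comments):\n",
"    # index comments by id, then pair each reply with its parent comment\n",
"    by_id = {c['id']: c for c in comments}\n",
"    pairs = []\n",
"    for c in comments:\n",
"        parent = c['parent_id']\n",
"        # skip top-level comments (parent is the submission) and parents missing from the extract\n",
"        if parent.startswith('t1_') and parent[3:] in by_id:\n",
"            pairs.append((by_id[parent[3:]]['body'], c['body']))\n",
"    return pairs\n",
"\n",
"pairs_reddit = reply_pairs(data_reddit)\n",
"print('#pairs: %i' % len(pairs_reddit))\n",
"pairs_reddit[:3]"
]
},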
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CoQA\n",
"https://stanfordnlp.github.io/coqa/\n",
"https://arxiv.org/pdf/1808.07042.pdf\n",
"\n",
"~8,000 conversations; ~127,000 questions"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#record: 7199\n"
]
},
{
"data": {
"text/plain": [
"{'source': 'wikipedia',\n",
" 'id': '3zotghdk5ibi9cex97fepx7jetpso7',\n",
" 'filename': 'Vatican_Library.txt',\n",
" 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \\n\\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \\n\\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \\n\\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \\n\\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \\n\\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.',\n",
" 'questions': [{'input_text': 'When was the Vat formally opened?',\n",
" 'turn_id': 1},\n",
" {'input_text': 'what is the library for?', 'turn_id': 2},\n",
" {'input_text': 'for what subjects?', 'turn_id': 3},\n",
" {'input_text': 'and?', 'turn_id': 4},\n",
" {'input_text': 'what was started in 2014?', 'turn_id': 5},\n",
" {'input_text': 'how do scholars divide the library?', 'turn_id': 6},\n",
" {'input_text': 'how many?', 'turn_id': 7},\n",
" {'input_text': 'what is the official name of the Vat?', 'turn_id': 8},\n",
" {'input_text': 'where is it?', 'turn_id': 9},\n",
" {'input_text': 'how many printed books does it contain?', 'turn_id': 10},\n",
" {'input_text': 'when were the Secret Archives moved from the rest of the library?',\n",
" 'turn_id': 11},\n",
" {'input_text': 'how many items are in this secret collection?',\n",
" 'turn_id': 12},\n",
" {'input_text': 'Can anyone use this library?', 'turn_id': 13},\n",
" {'input_text': 'what must be requested to view?', 'turn_id': 14},\n",
" {'input_text': 'what must be requested in person or by mail?',\n",
" 'turn_id': 15},\n",
" {'input_text': 'of what books?', 'turn_id': 16},\n",
" {'input_text': 'What is the Vat the library of?', 'turn_id': 17},\n",
" {'input_text': 'How many books survived the Pre Lateran period?',\n",
" 'turn_id': 18},\n",
" {'input_text': 'what is the point of the project started in 2014?',\n",
" 'turn_id': 19},\n",
" {'input_text': 'what will this allow?', 'turn_id': 20}],\n",
" 'answers': [{'span_start': 151,\n",
" 'span_end': 179,\n",
" 'span_text': 'Formally established in 1475',\n",
" 'input_text': 'It was formally established in 1475',\n",
" 'turn_id': 1},\n",
" {'span_start': 454,\n",
" 'span_end': 494,\n",
" 'span_text': 'he Vatican Library is a research library',\n",
" 'input_text': 'research',\n",
" 'turn_id': 2},\n",
" {'span_start': 457,\n",
" 'span_end': 511,\n",
" 'span_text': 'Vatican Library is a research library for history, law',\n",
" 'input_text': 'history, and law',\n",
" 'turn_id': 3},\n",
" {'span_start': 457,\n",
" 'span_end': 545,\n",
" 'span_text': 'Vatican Library is a research library for history, law, philosophy, science and theology',\n",
" 'input_text': 'philosophy, science and theology',\n",
" 'turn_id': 4},\n",
" {'span_start': 769,\n",
" 'span_end': 879,\n",
" 'span_text': 'March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts',\n",
" 'input_text': 'a project',\n",
" 'turn_id': 5},\n",
" {'span_start': 1048,\n",
" 'span_end': 1127,\n",
" 'span_text': 'Scholars have traditionally divided the history of the library into five period',\n",
" 'input_text': 'into periods',\n",
" 'turn_id': 6},\n",
" {'span_start': 1048,\n",
" 'span_end': 1128,\n",
" 'span_text': 'Scholars have traditionally divided the history of the library into five periods',\n",
" 'input_text': 'five',\n",
" 'turn_id': 7},\n",
" {'span_start': 4,\n",
" 'span_end': 94,\n",
" 'span_text': 'Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, ',\n",
" 'input_text': 'The Vatican Apostolic Library',\n",
" 'turn_id': 8},\n",
" {'span_start': 94,\n",
" 'span_end': 150,\n",
" 'span_text': 'is the library of the Holy See, located in Vatican City.',\n",
" 'input_text': 'in Vatican City',\n",
" 'turn_id': 9},\n",
" {'span_start': 328,\n",
" 'span_end': 412,\n",
" 'span_text': ' It has 75,000 codices from throughout history, as well as 1.1 million printed books',\n",
" 'input_text': '1.1 million',\n",
" 'turn_id': 10},\n",
" {'span_start': 917,\n",
" 'span_end': 1009,\n",
" 'span_text': 'atican Secret Archives were separated from the library at the beginning of the 17th century;',\n",
" 'input_text': 'at the beginning of the 17th century;',\n",
" 'turn_id': 11},\n",
" {'span_start': 915,\n",
" 'span_end': 1046,\n",
" 'span_text': ' Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. ',\n",
" 'input_text': '150,000',\n",
" 'turn_id': 12},\n",
" {'span_start': 546,\n",
" 'span_end': 643,\n",
" 'span_text': ' The Vatican Library is open to anyone who can document their qualifications and research needs. ',\n",
" 'input_text': 'anyone who can document their qualifications and research needs.',\n",
" 'turn_id': 13},\n",
" {'span_start': -1,\n",
" 'span_end': -1,\n",
" 'span_text': 'unknown',\n",
" 'input_text': 'unknown',\n",
" 'turn_id': 14,\n",
" 'bad_turn': 'true'},\n",
" {'span_start': 643,\n",
" 'span_end': 764,\n",
" 'span_text': 'Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. ',\n",
" 'input_text': 'Photocopies',\n",
" 'turn_id': 15},\n",
" {'span_start': 644,\n",
" 'span_end': 724,\n",
" 'span_text': 'hotocopies for private study of pages from books published between 1801 and 1990',\n",
" 'input_text': 'only books published between 1801 and 1990',\n",
" 'turn_id': 16},\n",
" {'span_start': 78,\n",
" 'span_end': 125,\n",
" 'span_text': 'simply the Vat, is the library of the Holy See,',\n",
" 'input_text': 'the Holy See',\n",
" 'turn_id': 17},\n",
" {'span_start': 1192,\n",
" 'span_end': 1384,\n",
" 'span_text': 'Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant',\n",
" 'input_text': 'a handful of volumes',\n",
" 'turn_id': 18},\n",
" {'span_start': 785,\n",
" 'span_end': 881,\n",
" 'span_text': 'Vatican Library began an initial four-year project of digitising its collection of manuscripts, ',\n",
" 'input_text': 'digitising manuscripts',\n",
" 'turn_id': 19},\n",
" {'span_start': 868,\n",
" 'span_end': 910,\n",
" 'span_text': 'manuscripts, to be made available online. ',\n",
" 'input_text': 'them to be viewed online.',\n",
" 'turn_id': 20}],\n",
" 'name': 'Vatican_Library.txt'}"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_coqa, d0_coqa = load('/mnt/DATA/ML/data/corpora/QA/CoQA/coqa-train-v1.0.json', getter=lambda d: d['data'])\n",
"d0_coqa"
]
},
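{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QuAC\n",
"http://quac.ai/\n",
"http://quac.ai/datasheet.pdf\n",
"\n",
"conversational, but extractive: every answer is a span of `context` (`text` plus `answer_start`), and unanswerable questions point at the trailing `CANNOTANSWER` token"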
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#record: 11567\n"
]
},
{
"data": {
"text/plain": [
"{'context': 'According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India, and 96.7% of the total population of the state. There were a further 701,673 (2.1% of the total number) in Karnataka, 557,705 (1.7%) in Tamil Nadu and 406,358 (1.2%) in Maharashtra. The number of Malayalam speakers in Lakshadweep is 51,100, which is only 0.15% of the total number, but is as much as about 84% of the population of Lakshadweep. In all, Malayalis made up 3.22% of the total Indian population in 2001. Of the total 33,066,392 Malayalam speakers in India in 2001, 33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan. As per the 1991 census data, 28.85% of all Malayalam speakers in India spoke a second language and 19.64% of the total knew three or more languages. Large numbers of Malayalis have settled in Bangalore, Mangalore, Delhi, Coimbatore, Hyderabad, Mumbai (Bombay), Ahmedabad, Pune, and Chennai (Madras). A large number of Malayalis have also emigrated to the Middle East, the United States, and Europe. Accessed November 22, 2014.</ref> including a large number of professionals. There were 7,093 Malayalam speakers in Australia in 2006. The 2001 Canadian census reported 7,070 people who listed Malayalam as their mother tongue, mostly in the Greater Toronto Area and Southern Ontario. In 2010, the Census of Population of Singapore reported that there were 26,348 Malayalees in Singapore. The 2006 New Zealand census reported 2,139 speakers. 134 Malayalam speaking households were reported in 1956 in Fiji. There is also a considerable Malayali population in the Persian Gulf regions, especially in Bahrain, Muscat, Doha, Dubai, Abu Dhabi, Kuwait and European region mainly in London. 
World Malayalee Council, the organisation working with the Malayali diaspora across the Globe has embarked upon a project for making a data bank of the diaspora. CANNOTANSWER',\n",
" 'qas': [{'followup': 'm',\n",
" 'yesno': 'x',\n",
" 'question': 'Where is Malayali located?',\n",
" 'answers': [{'text': '30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India,',\n",
" 'answer_start': 51}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#0',\n",
" 'orig_answer': {'text': '30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India,',\n",
" 'answer_start': 51}},\n",
" {'followup': 'n',\n",
" 'yesno': 'x',\n",
" 'question': 'What other languages are spoken there?',\n",
" 'answers': [{'text': '33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan.',\n",
" 'answer_start': 640}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#1',\n",
" 'orig_answer': {'text': '33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan.',\n",
" 'answer_start': 640}},\n",
" {'followup': 'n',\n",
" 'yesno': 'x',\n",
" 'question': 'What else is this place known for?',\n",
" 'answers': [{'text': 'World Malayalee Council, the organisation working with the Malayali diaspora across the Globe has embarked upon a project for making a data bank of the diaspora.',\n",
" 'answer_start': 1862}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#2',\n",
" 'orig_answer': {'text': 'World Malayalee Council, the organisation working with the Malayali diaspora across the Globe has embarked upon a project for making a data bank of the diaspora.',\n",
" 'answer_start': 1862}},\n",
" {'followup': 'n',\n",
" 'yesno': 'x',\n",
" 'question': 'Were they ever successful in doing this?',\n",
" 'answers': [{'text': 'CANNOTANSWER', 'answer_start': 2024}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#3',\n",
" 'orig_answer': {'text': 'CANNOTANSWER', 'answer_start': 2024}},\n",
" {'followup': 'n',\n",
" 'yesno': 'x',\n",
" 'question': 'Do they produce anything from here?',\n",
" 'answers': [{'text': 'CANNOTANSWER', 'answer_start': 2024}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#4',\n",
" 'orig_answer': {'text': 'CANNOTANSWER', 'answer_start': 2024}},\n",
" {'followup': 'n',\n",
" 'yesno': 'y',\n",
" 'question': 'Is this population still growing?',\n",
" 'answers': [{'text': 'In 2010, the Census of Population of Singapore reported that there were 26,348 Malayalees in Singapore.',\n",
" 'answer_start': 1461}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#5',\n",
" 'orig_answer': {'text': 'In 2010, the Census of Population of Singapore reported that there were 26,348 Malayalees in Singapore.',\n",
" 'answer_start': 1461}},\n",
" {'followup': 'n',\n",
" 'yesno': 'x',\n",
" 'question': 'Is the country thriving?',\n",
" 'answers': [{'text': 'CANNOTANSWER', 'answer_start': 2024}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#6',\n",
" 'orig_answer': {'text': 'CANNOTANSWER', 'answer_start': 2024}}],\n",
" 'id': 'C_69758fcdfc1f46baba0e92c0f3b0919c_1'}"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def getter_quac(_data):\n",
"    # yield every paragraph (context plus its qas) of every article\n",
"    for a in _data['data']:\n",
"        for p in a['paragraphs']:\n",
"            yield p\n",
"\n",
"data_quac, d0_quac = load('/mnt/DATA/ML/data/corpora/QA/QuAC/train_v0.2.json', getter=getter_quac)\n",
"d0_quac"
]
},
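{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ShARC\n",
"https://sharc-data.github.io/\n",
"https://arxiv.org/pdf/1809.01494.pdf\n",
"\n",
"conversational: a `question` about a rule `snippet` is answered given a `scenario` and follow-up question/answer pairs (`history`, `evidence`)"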
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#record: 21890\n"
]
},
{
"data": {
"text/plain": [
"{'utterance_id': '0006838de2f66899440d9ba6df275603b6cdc5c4',\n",
" 'tree_id': 'dd4cb41e8c180a800a000c796ac71f7fc58eb66d',\n",
" 'source_url': 'https://www.irs.gov/businesses/small-businesses-self-employed/declaring-bankruptcy',\n",
" 'snippet': '# Bankruptcy\\n\\nOther types of bankruptcy include Chapters 9, 12 and 15. Cases under these chapters of the bankruptcy code involve municipalities, family farmers and fisherman, and international cases.',\n",
" 'question': 'Can I file bankruptcy under Chapter 9, 12 or 15?',\n",
" 'scenario': 'I am a doctor. It does not involve a municipality. It involves an international case.',\n",
" 'answer': 'Yes',\n",
" 'history': [],\n",
" 'evidence': [{'follow_up_question': 'Are you a family farmer or fisherman?',\n",
" 'follow_up_answer': 'No'},\n",
" {'follow_up_question': 'Does it involve a municipality?',\n",
" 'follow_up_answer': 'No'},\n",
" {'follow_up_question': 'Does it involve an international case?',\n",
" 'follow_up_answer': 'Yes'}]}"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_sharc, d0_sharc = load('/mnt/DATA/ML/data/corpora/QA/ShARC/sharc1-official/json/sharc_train.json')\n",
"d0_sharc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural Questions\n",
"https://ai.google.com/research/NaturalQuestions\n",
"\n",
"~300,000 questions"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"data_nq, d0_nq = load('/mnt/DATA/ML/data/corpora/QA/NQ/sample/v1.0_sample_nq-dev-sample_extract10.json')\n",
"#d0_nq"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:notebook]",
"language": "python",
"name": "conda-env-notebook-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
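
The QuAC record in the notebook stores each answer as a `text` plus an `answer_start` offset into `context`, which is what "extractive" amounts to in practice. A minimal sketch of how that property could be checked is given below. The function name `non_extractive_answers` and the toy paragraph are made up for illustration and are not part of the notebook or the dataset; only the field names follow the record shown above.

```python
def non_extractive_answers(paragraph):
    """Collect (question id, answer text) pairs whose text is not the
    slice of the context that starts at answer_start."""
    context = paragraph['context']
    mismatches = []
    for qa in paragraph['qas']:
        for answer in qa['answers']:
            start = answer['answer_start']
            end = start + len(answer['text'])
            if context[start:end] != answer['text']:
                mismatches.append((qa['id'], answer['text']))
    return mismatches


# Toy paragraph, invented for illustration; offsets are computed with
# str.index so they cannot drift from the context string.
toy_context = 'Malayalam is spoken in Kerala. CANNOTANSWER'
toy_paragraph = {
    'context': toy_context,
    'qas': [
        {'id': 'toy_q#0',
         'question': 'Where is it spoken?',
         'answers': [{'text': 'spoken in Kerala',
                      'answer_start': toy_context.index('spoken in Kerala')}]},
        {'id': 'toy_q#1',
         'question': 'Who wrote it?',
         'answers': [{'text': 'CANNOTANSWER',
                      'answer_start': toy_context.index('CANNOTANSWER')}]},
    ],
}

# An empty list means every answer is an exact span of the context.
print(non_extractive_answers(toy_paragraph))
```

Assuming `data_quac` has been loaded as in the notebook, the same check could be run over the real data with `[non_extractive_answers(p) for p in data_quac]`; whether every record passes is not verified here.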