@xiaoouwang
Last active May 9, 2021 17:00
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Does CamemBERT/FlauBERT/Bert understand negation?\n\nShort answer: no.\n\nThis notebook replicates a section of the paper of `Ettinger2020` on negation using a French corpus. The idea is to add negation on propositions and test if Bert `without fine-tuning` is able to switch the answer from one to another.\n\nActually this principle resembles a lot the Winograd Schema Challenge.\n\nHere is an example in French:\n\n```\nid;masked;tgt;item;cond;right_answer;options\n0;La truite est un <mask>;poisson;0;TA;poisson;poisson outil\n1;La truite n'est pas un <mask>;poisson;0;FN;outil;poisson outil\n```\nThe `<mask>` token should be `poisson` in the first example and `outil` in the second.\n\nHowever, in a corpus of 18 pairs of sentences Bert was unable to switch response on either pair.\n\nThere's no reason for CamemBERT/FlauBERT to behave differently on this task.\n\nThis notebook shows that French counterparts of Bert (CamemBERT and FlauBERT) are indeed incapable of performing better on a French corpus.\n\nPlease note that I write some wrapper functions to faciliate use of French Berts published via a packge named `frenchnlp`. Be sure to install it before running this notebook on your computer.\n\n`!pip install frenchnlp`"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# Some utility functions\n\nfrom frenchnlp import *\n\ndef change_verb(sent,new_verb):\n return sent.replace(\"n'est\",\"ne \"+ new_verb).replace(\"est\",new_verb)\n\ndef change_header(sent,new_term):\n if \"La\" in sent:\n return sent.replace(\"La\",\"Le terme\")\n else:\n return sent.replace(\"Le\",\"Le terme\")",
"execution_count": 1,
"outputs": []
},
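{
"metadata": {},
"cell_type": "markdown",
"source": "A quick sanity check (added for illustration) of what the string helpers above produce on a sample item:"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# Illustrative only: plain string replacements, no model involved\nprint(change_verb(\"La truite est un <mask>\", \"représente\"))\nprint(change_verb(\"La truite n'est pas un <mask>\", \"représente\"))\nprint(change_header(\"La truite est un <mask>\", \"Le terme\"))",
"execution_count": null,
"outputs": []
},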
{
"metadata": {},
"cell_type": "markdown",
"source": "## Corpus and variations\n\nI translated the items of Ettinger2020 into French. Besides, I add some variations to test the effect of minor modifications on the original utterance.\n\nFor a sentence like `La truite est un <mask>` (`<original>`), you have:\n\n* La truite représente un `<mask>`. `<variation1>`\n* La truite représente un `<mask>` en français. `<variation2>`\n* Le terme truite désigne un `<mask>`. `<variation3>`\n* Le terme truite désigne un `<mask>` en français. `<variation4>`\n\nI append `en français` to each sentence because it is possible that Bert works better when the left and right contexts are provided since it was not trained like GPT3 on a traditional language modeling task.\n\nAs you can see from the dataframe, `<mask>` was also replaced with `<special1>` to comply with FlauBERT's annotation."
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df = xo_load_data(\"negation_french.csv\")\ndf[\"masked_cam\"] = df[\"masked\"].apply(lambda x:x+\".\")\ndf[\"masked_flau\"] = df[\"masked_cam\"].apply(lambda x:x.replace(\"<mask>\",\"<special1>\")) \ndf[\"ch_verb\"] = df[\"masked\"].apply(lambda x: change_verb(x,\"représente\")+\".\")\ndf[\"ch_verb_flau\"] = df[\"masked\"].apply(lambda x:change_verb(x,\"représente\").replace(\"<mask>\",\"<special1>\")+\".\")\ndf[\"ch_verb_add_right\"] = df[\"ch_verb\"].apply(lambda x:x.replace(\".\",\"\")+\" en français.\")\ndf[\"ch_verb_add_right_flau\"] = df[\"ch_verb_flau\"].apply(lambda x:x.replace(\".\",\"\")+\" en français.\")\ndf[\"ch_whole\"] = df[\"masked_cam\"].apply(lambda x:change_verb(change_header(x,\"Le terme\"),\"désigne\"))\ndf[\"ch_whole_flau\"] = df[\"masked_flau\"].apply(lambda x:change_verb(change_header(x,\"Le terme\"),\"désigne\"))\ndf[\"ch_whole_add_right\"] = df[\"ch_whole\"].apply(lambda x:x.replace(\".\",\"\")+\" en français.\")\ndf[\"ch_whole_add_right_flau\"] = df[\"ch_whole_flau\"].apply(lambda x:x.replace(\".\",\"\")+\" en français.\")\ncols = list(df.columns)\ndf[cols[:7]].head(1)",
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " id masked tgt item cond right_answer options\n0 0 La truite est un <mask> poisson 0 TA poisson poisson outil",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>masked</th>\n <th>tgt</th>\n <th>item</th>\n <th>cond</th>\n <th>right_answer</th>\n <th>options</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>La truite est un &lt;mask&gt;</td>\n <td>poisson</td>\n <td>0</td>\n <td>TA</td>\n <td>poisson</td>\n <td>poisson outil</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 10
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df[cols[7:14]].head(1)",
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " response1 prob1 response2 prob2 masked_cam \\\n0 0 0 0 0 La truite est un <mask>. \n\n masked_flau ch_verb \n0 La truite est un <special1>. La truite représente un <mask>. ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>response1</th>\n <th>prob1</th>\n <th>response2</th>\n <th>prob2</th>\n <th>masked_cam</th>\n <th>masked_flau</th>\n <th>ch_verb</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>La truite est un &lt;mask&gt;.</td>\n <td>La truite est un &lt;special1&gt;.</td>\n <td>La truite représente un &lt;mask&gt;.</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 11
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df[cols[14:]].head(1)",
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " ch_verb_flau \\\n0 La truite représente un <special1>. \n\n ch_verb_add_right \\\n0 La truite représente un <mask> en français. \n\n ch_verb_add_right_flau \\\n0 La truite représente un <special1> en français. \n\n ch_whole ch_whole_flau \\\n0 Le terme truite désigne un <mask>. Le terme truite désigne un <special1>. \n\n ch_whole_add_right \\\n0 Le terme truite désigne un <mask> en français. \n\n ch_whole_add_right_flau \n0 Le terme truite désigne un <special1> en franç... ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ch_verb_flau</th>\n <th>ch_verb_add_right</th>\n <th>ch_verb_add_right_flau</th>\n <th>ch_whole</th>\n <th>ch_whole_flau</th>\n <th>ch_whole_add_right</th>\n <th>ch_whole_add_right_flau</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>La truite représente un &lt;special1&gt;.</td>\n <td>La truite représente un &lt;mask&gt; en français.</td>\n <td>La truite représente un &lt;special1&gt; en français.</td>\n <td>Le terme truite désigne un &lt;mask&gt;.</td>\n <td>Le terme truite désigne un &lt;special1&gt;.</td>\n <td>Le terme truite désigne un &lt;mask&gt; en français.</td>\n <td>Le terme truite désigne un &lt;special1&gt; en franç...</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 12
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# Camembert\npipeline = xo_fillin(\"camembert-base\",1000)\ncam_results = xo_produce_answers(pipeline,\"masked_cam\",df)\n\n# Flaubert\npipeline_flau = xo_fillin(\"flaubert/flaubert_base_cased\",1000)\nflau_results = xo_produce_answers(pipeline_flau,\"masked_flau\",df)",
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
}
]
},
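{
"metadata": {},
"cell_type": "markdown",
"source": "For readers who prefer not to install `frenchnlp`: the wrappers above presumably sit on top of a standard Hugging Face fill-mask pipeline, so a minimal sketch of the same step with plain `transformers` would look like the cell below (the `xo_*` functions are specific to `frenchnlp`; this sketch assumes only the `transformers` library)."
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# Sketch: plain transformers equivalent of the wrapper calls above\nfrom transformers import pipeline as hf_pipeline\n\nfill = hf_pipeline(\"fill-mask\", model=\"camembert-base\", top_k=10)\n# Print the three most probable fillers for the masked slot\nfor pred in fill(\"La truite est un <mask>.\")[:3]:\n    print(pred[\"token_str\"], round(pred[\"score\"], 3))",
"execution_count": null,
"outputs": []
},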
{
"metadata": {},
"cell_type": "markdown",
"source": "## Results on the standard version with no variation\n\nFor examples like `La truite est un <mask>`, note that the accuracy is 50% with equal number of right and wrong responses.\n\nInterestingly, both models behaved the same way.\n\nThis is somewhat expected because both models were trained on similar corpus using similar method. Some differences exist, however. CamemBERT was trained using `whole word masking` and FlauBERT `token masking`. For details, please read the original papers.\n\nSuppose that no switch ever happens, the accuracy should be exactly 50%. However upon closer investigation, a successful switch took place for CamemBERT.\n\n```\nLa fourmi est un <mask>. insecte\nLa fourmi n'est pas un <mask>. légume\n```\n\nThe 50% accuracy is due to another switch with both wrong answers.\n\n```\nLe petit pois est un <mask>.\t bâtiment\t\nLe petit pois n'est pas un <mask>.\t légume\n```"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# 50% in both cases\nprint(xo_compute_score(\"right_answer\",cam_results))\nprint(xo_compute_score(\"right_answer\",flau_results))",
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": " correct no_response bad_response total_responses exactitude qualite \\\n0 9 20 7 36 25.0 56.25 \n\n reussite \n0 52.78 \n correct no_response bad_response total_responses exactitude qualite \\\n0 9 20 7 36 25.0 56.25 \n\n reussite \n0 52.78 \n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Results on the version with être replaced by représente `<variation1>`\n\nThe accuracy decreases to 41.67% mainly because of the asymmetry between responses for affirmative items and those for negative items. Put in simpler terms, in some cases CamemBERT/FlauBERT gave only one answer to a pair of sentences and this one answer was wrong."
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "pipeline = xo_fillin(\"camembert-base\",1000)\ncam_results_ch_verb = xo_produce_answers(pipeline,\"ch_verb\",df)\npipeline_flau = xo_fillin(\"flaubert/flaubert_base_cased\",1000)\nflau_results_ch_verb = xo_produce_answers(pipeline_flau,\"ch_verb_flau\",df)\nprint(xo_compute_score(\"right_answer\",cam_results_ch_verb))\nprint(xo_compute_score(\"right_answer\",flau_results_ch_verb))",
"execution_count": 208,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n correct no_response bad_response total_responses exactitude qualite \\\n0 15 3 18 36 41.67 45.45 \n\n reussite \n0 45.83 \n correct no_response bad_response total_responses exactitude qualite \\\n0 15 3 18 36 41.67 45.45 \n\n reussite \n0 45.83 \n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Results on the version with être replaced by désigner `<variation3>`\n\nThe accuracy remains the same (50%) but note that there is 1 case where FlauBERT/CamemBERT fails to give a response, suggesting that how the phrase is worded has an effect on the results."
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# camembert\npipeline = xo_fillin(\"camembert-base\",1000)\ncam_results_ch_whole = xo_produce_answers(pipeline,\"ch_whole\",df)\n\n# flaubert\npipeline_flau = xo_fillin(\"flaubert/flaubert_base_cased\",1000)\nflau_results_ch_whole_flau = xo_produce_answers(pipeline_flau,\"ch_whole_flau\",df)\n# results\nprint(xo_compute_score(\"right_answer\",cam_results_ch_whole))\nprint(xo_compute_score(\"right_answer\",flau_results_ch_whole_flau))",
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n correct no_response bad_response total_responses exactitude qualite \\\n0 18 1 17 36 50.0 51.43 \n\n reussite \n0 51.39 \n correct no_response bad_response total_responses exactitude qualite \\\n0 18 1 17 36 50.0 51.43 \n\n reussite \n0 51.39 \n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Results on the version with être replaced by représenter/désigner and a right context `<variation2> and <variation4>`\n\nThe accuracy decreases to 31% and 25% for `<variation2>` and `<variation4>`, counterparts of `<variation1>` and `<variation3>` with a right context. Adding `en français` to the right impacts the model's ability to give a response out of the two options.\n\nNote that given the large number of non responses, the `qualite` measure considering only the answered sentences is more informative. It's 48% for `<variation2>` and 56% for `<variation4>`. However, the asymmetry makes it difficult to assess the models' sensitivity to negation."
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "# camembert\n\npipeline = xo_fillin(\"camembert-base\",1000)\ncam_results_ch_verb_add_right = xo_produce_answers(pipeline,\"ch_verb_add_right\",df)\n\n# falubert\npipeline_flau = xo_fillin(\"flaubert/flaubert_base_cased\",1000)\nflau_results_ch_verb_add_right_flau = xo_produce_answers(pipeline_flau,\"ch_verb_add_right_flau\",df)\n\nprint(xo_compute_score(\"right_answer\",cam_results_ch_verb_add_right))\nprint(xo_compute_score(\"right_answer\",flau_results_ch_verb_add_right_flau))",
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n correct no_response bad_response total_responses exactitude qualite \\\n0 11 13 12 36 30.56 47.83 \n\n reussite \n0 48.61 \n correct no_response bad_response total_responses exactitude qualite \\\n0 11 13 12 36 30.56 47.83 \n\n reussite \n0 48.61 \n"
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "pipeline = xo_fillin(\"camembert-base\",1000)\ncam_results_ch_whole_add_right = xo_produce_answers(pipeline,\"ch_whole_add_right\",df)\npipeline_flau = xo_fillin(\"flaubert/flaubert_base_cased\",1000)\nflau_results_ch_whole_add_right_flau = xo_produce_answers(pipeline_flau,\"ch_whole_add_right_flau\",df)\n# 50% in both cases\nprint(xo_compute_score(\"right_answer\",cam_results_ch_whole_add_right))\nprint(xo_compute_score(\"right_answer\",flau_results_ch_whole_add_right_flau))",
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']\nYou should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n correct no_response bad_response total_responses exactitude qualite \\\n0 9 20 7 36 25.0 56.25 \n\n reussite \n0 52.78 \n correct no_response bad_response total_responses exactitude qualite \\\n0 9 20 7 36 25.0 56.25 \n\n reussite \n0 52.78 \n"
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "cam_results_ch_whole_add_right[[\"ch_whole_add_right\",\"options\",\"response1\",\"prob1\",\"response2\",\"prob2\"]].to_csv(\"res_cam4.csv\",index=False)\nflau_results_ch_whole_add_right_flau[[\"ch_whole_add_right_flau\",\"options\",\"response1\",\"prob1\",\"response2\",\"prob2\"]].to_csv(\"res_flau4.csv\",index=False)",
"execution_count": 213,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Conclusion\n\nThis simple experiment shows the insensitivity of Bert-like language models to negation. The same observations have proven to be valid on a similar French corpus.\n\nWhat improvements can be made?\n\nThe addition of self-supervised tasks requiring more sophisticated linguistic information than the simple linear order (such as the addition of syntactic information by (Xu et al., 2020)) could be helpful.\n\n## Reference\n\nDevlin, J., Chang M.-W., Lee K. and Toutanova K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” In NAACL-HLT 2019.\n\nEttinger, A. (2019). What bert is not : Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.\n\nLe, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., ... & Schwab, D. (2019). Flaubert: Unsupervised language model pre-training for french. arXiv preprint arXiv:1912.05372.\n\nMartin, L., Muller, B., Suárez, P. J. O., Dupont, Y., Romary, L., de la Clergerie, É. V., ... & Sagot, B. (2019). Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894.\n\nXu Z., Guo D., Tang D., Su Q., Shou L., Gong M., Zhong W., Quan X., Duan N., and Jiang D. (2020) “Syntax-Enhanced Pre-trained Model”. arXiv:2012.14116\n"
}
],
"metadata": {
"_draft": {
"nbviewer_url": "https://gist.github.com/c3a0968fbad83f884d6db95b7b4c96d0"
},
"gist": {
"id": "c3a0968fbad83f884d6db95b7b4c96d0",
"data": {
"description": "bert_negation.ipynb",
"public": true
}
},
"kernelspec": {
"name": "base",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.6",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"nbTranslate": {
"hotkey": "alt-t",
"sourceLang": "en",
"targetLang": "fr",
"displayLangs": [
"*"
],
"langInMainMenu": true,
"useGoogleTranslate": true
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": false,
"skip_h1_title": true,
"base_numbering": 1,
"title_cell": "Table des matières",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"varInspector": {
"window_display": false,
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"library": "var_list.py",
"delete_cmd_prefix": "del ",
"delete_cmd_postfix": "",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"library": "var_list.r",
"delete_cmd_prefix": "rm(",
"delete_cmd_postfix": ") ",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
]
}
},
"nbformat": 4,
"nbformat_minor": 2
}