artreven/wsid.ipynb

## wsid.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 0. Preparations\n",
    "## Virtual Environment\n",
    "Create a virtual environment.\n",
    "Run the following code in terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "virtualenv --python=python3 ~/<venv folder>\n",
    "source ~/<venv folder>/bin/activate\n",
    "pip install git+git://github.com/semantic-web-company/wsid.git#egg=wsid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the dataset\n",
    "The dataset is located at https://github.com/artreven/thesaural_wsi. We will only use the \"cocktails\" part. Clone the dataset and save the data folder path to env variable `COCKTAIL_PATH`.\n",
    "Run the following code in terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "git clone https://github.com/artreven/thesaural_wsi.git\n",
    "export COCKTAILS_PATH=$(pwd)/thesaural_wsi/cocktails/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## iPython\n",
    "Next we install [iPython](https://ipython.org/) and we execute all the subsequent commands in iPython shell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install ipython\n",
    "ipython"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After this command you should see iPython prompt. Next we download some data for `nltk` that we make use of later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "nltk.download('wordnet')\n",
    "nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Helpful functions\n",
    "Next we define several helpful functions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `preprocess`\n",
    "Clean text corpus: eliminate short tokens, numbers, lemmatize, remove too frequent and too infrequent tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "import re\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "def preprocess(corpus_in, entity_str, lemmatize=True, min_tf=3, max_df=0.67):\n",
    "    corpus = []\n",
    "    vocabulary = Counter()\n",
    "    df = Counter()\n",
    "    max_df *= len(corpus_in)\n",
    "    for title, txt in corpus_in:\n",
    "        txt = txt.lower()\n",
    "        # remove short words, but not the entity\n",
    "        txt = ' '.join(re.findall(r'\\w{{3,}}|{}'.format(entity_str), txt))  \n",
    "        txt = re.sub(r'(?<=[\\s.!?])\\d+(?=[\\s.!?])', 'DIGIT', txt)  # remove numbers\n",
    "        if lemmatize:\n",
    "            tokens = [\n",
    "                lemmatizer.lemmatize(token)\n",
    "                        if token not in [entity_str, 'DIGIT'] else token\n",
    "                for token in txt.split()]\n",
    "        else:\n",
    "            tokens = txt.split()\n",
    "        vocabulary.update(tokens)\n",
    "        df.update(set(tokens))\n",
    "        corpus.append([title, tokens])\n",
    "    for token in [entity_str, 'DIGIT']:\n",
    "        df[token] = max_df - 1\n",
    "    # Filtering based on frequencies\n",
    "    for i in range(len(corpus)):\n",
    "        tokens = [w for w in corpus[i][1]\n",
    "                  if df[w] >= min_tf\n",
    "                  if df[w] < max_df]\n",
    "        txt = ' '.join(tokens)\n",
    "        corpus[i][1] = txt\n",
    "    return corpus, vocabulary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `get_senses_str`\n",
    "Print senses nicely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_senses_str(senses, broader_senses_inds):\n",
    "    out = []\n",
    "    for i, sense in enumerate(senses):\n",
    "        sorted_sense = sorted(\n",
    "            sense.items(),\n",
    "            key=lambda x: x[1],\n",
    "            reverse=True)\n",
    "        sense_out = 'Sense #{}\\n'.format(i)\n",
    "        sense_out += ', '.join('{}: {:10.5f}'.format(t, w)\n",
    "                               for t, w in sorted_sense[:15])\n",
    "        if i in broader_senses_inds:\n",
    "            sense_out += '\\nTarget sense'\n",
    "        out.append(sense_out)\n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### run the test\n",
    "This is the main function. It executes the following workflow: read and preprocess the corpus, induce the senses of the target entity, disambiguate the entity in the texts of the corpus using the induced senses, report the quality. In the report we use a baseline that is classifying all the entity occurrences into the most popular sense. This way we guarantee that the baseline is challenging: since we take the most popular sense we make only as many mistakes as many occurrences of unpopular senses we have. For an entity with 2 senses the baseline is always above 50%.\n",
    "We report [precision, recall and f1](https://en.wikipedia.org/wiki/Precision_and_recall)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score\n",
    "\n",
    "from wsid import induce, disambiguate\n",
    "\n",
    "def run_test(entity_str, entity_corpus_path,\n",
    "             target_sense_identifiers=('cocktail', 'mix')):\n",
    "    entity_str = entity_str.lower()\n",
    "    corpus = []\n",
    "    # Read and prepare the corpus\n",
    "    for text_name in os.listdir(entity_corpus_path):\n",
    "        if not text_name.endswith('txt'): continue\n",
    "        text_path = os.path.join(entity_corpus_path, text_name)\n",
    "        with open(text_path) as f:\n",
    "            txt = f.read()\n",
    "        corpus.append((text_name, txt))\n",
    "    corpus, vocabulary = preprocess(corpus, entity_str)\n",
    "\n",
    "    # Do the sense induction\n",
    "    cos, t2i, i2t = induce.cooc.get_co(\n",
    "        [x[1] for x in corpus], entity=entity_str, w=10)\n",
    "    hubs, clusters, e_cos = induce.induce(\n",
    "        cos, t2i, i2t, entity_str=entity_str,\n",
    "        broader_groups=target_sense_identifiers)\n",
    "    target_senses_inds = [i for i, x in enumerate(hubs) \n",
    "                          if x[-1] == target_sense_identifiers]\n",
    "    for sense_str in get_senses_str(clusters, target_senses_inds):\n",
    "        print()\n",
    "        print(sense_str)\n",
    "\n",
    "    # Disambiguate the sense of the entity in the documents\n",
    "    doc_clusters = []\n",
    "    for i in range(len(corpus)):\n",
    "        txt = corpus[i][1]\n",
    "        categ_ind, conf, distr, evidence = disambiguate.cluster_text(\n",
    "            txt, hubs, clusters, entity=entity_str)\n",
    "        categ = 'this' if (categ_ind in target_senses_inds) else 'other'\n",
    "        doc_clusters.append([categ, conf])\n",
    "        print('{}: {}, {}, {}'.format(corpus[i][0], categ, categ_ind, conf))\n",
    "    prediction = [x[0] for x in doc_clusters]\n",
    "\n",
    "    # Report\n",
    "    y_true = [\n",
    "        'this' if corpus[i][0].endswith('(cocktail).txt') else 'other'\n",
    "        for i in range(len(corpus))]\n",
    "    this_count = y_true.count('this')\n",
    "    other_count = y_true.count('other')\n",
    "    baseline_pos_label = ['this', 'other'][this_count < other_count]\n",
    "    baseline = ([baseline_pos_label] * len(corpus))\n",
    "    print(\n",
    "        'Baseline accuracy: ' +\n",
    "        'precision = {:0.3f}\\t'.format(precision_score(y_true, baseline,\n",
    "                                                       pos_label=baseline_pos_label)) +\n",
    "        'recall = {:0.3f}\\t'.format(recall_score(y_true, baseline,\n",
    "                                                 pos_label=baseline_pos_label)) +\n",
    "        'accuracy = {:0.3f}'.format(accuracy_score(y_true, baseline)))\n",
    "    print(\n",
    "        'Method accuracy: ' +\n",
    "        'precision = {:0.3f}\\t'.format(precision_score(y_true, prediction,\n",
    "                                                       pos_label='this')) +\n",
    "        'recall = {:0.3f}\\t'.format(recall_score(y_true, prediction,\n",
    "                                                 pos_label='this')) +\n",
    "        'accuracy = {:0.3f}'.format(accuracy_score(y_true, prediction)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_____\n",
    "# 2. Execute"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "entity_str = 'Americano'\n",
    "cocktails_path = os.getenv('COCKTAILS_PATH')\n",
    "entity_corpus_path = os.path.join(cocktails_path, entity_str)\n",
    "\n",
    "run_test(entity_str, entity_corpus_path, target_sense_identifiers=('cocktail', 'mix'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After you get the results you can change the name of the cocktail (`entity_str` value) to other cocktails in the folder. Try `Grasshopper` and `Tequila Sunrise` for instance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# 0. Preparations\n",
	"## Virtual Environment\n",
	"Create a virtual environment.\n",
	"Run the following code in terminal:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"virtualenv --python=python3 ~/<venv folder>\n",
	"source ~/<venv folder>/bin/activate\n",
	"pip install git+git://github.com/semantic-web-company/wsid.git#egg=wsid"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Download the dataset\n",
	"The dataset is located at https://github.com/artreven/thesaural_wsi. We will only use the \"cocktails\" part. Clone the dataset and save the data folder path to env variable `COCKTAIL_PATH`.\n",
	"Run the following code in terminal:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"git clone https://github.com/artreven/thesaural_wsi.git\n",
	"export COCKTAILS_PATH=$(pwd)/thesaural_wsi/cocktails/"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## iPython\n",
	"Next we install [iPython](https://ipython.org/) and we execute all the subsequent commands in iPython shell."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"pip install ipython\n",
	"ipython"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"After this command you should see iPython prompt. Next we download some data for `nltk` that we make use of later."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import nltk\n",
	"nltk.download('wordnet')\n",
	"nltk.download('punkt')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# 1. Helpful functions\n",
	"Next we define several helpful functions."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### `preprocess`\n",
	"Clean text corpus: eliminate short tokens, numbers, lemmatize, remove too frequent and too infrequent tokens."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"from collections import Counter\n",
	"from nltk.stem import WordNetLemmatizer\n",
	"import re\n",
	"\n",
	"lemmatizer = WordNetLemmatizer()\n",
	"\n",
	"def preprocess(corpus_in, entity_str, lemmatize=True, min_tf=3, max_df=0.67):\n",
	" corpus = []\n",
	" vocabulary = Counter()\n",
	" df = Counter()\n",
	" max_df *= len(corpus_in)\n",
	" for title, txt in corpus_in:\n",
	" txt = txt.lower()\n",
	" # remove short words, but not the entity\n",
	" txt = ' '.join(re.findall(r'\\w{{3,}}\|{}'.format(entity_str), txt)) \n",
	" txt = re.sub(r'(?<=[\\s.!?])\\d+(?=[\\s.!?])', 'DIGIT', txt) # remove numbers\n",
	" if lemmatize:\n",
	" tokens = [\n",
	" lemmatizer.lemmatize(token)\n",
	" if token not in [entity_str, 'DIGIT'] else token\n",
	" for token in txt.split()]\n",
	" else:\n",
	" tokens = txt.split()\n",
	" vocabulary.update(tokens)\n",
	" df.update(set(tokens))\n",
	" corpus.append([title, tokens])\n",
	" for token in [entity_str, 'DIGIT']:\n",
	" df[token] = max_df - 1\n",
	" # Filtering based on frequencies\n",
	" for i in range(len(corpus)):\n",
	" tokens = [w for w in corpus[i][1]\n",
	" if df[w] >= min_tf\n",
	" if df[w] < max_df]\n",
	" txt = ' '.join(tokens)\n",
	" corpus[i][1] = txt\n",
	" return corpus, vocabulary"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### `get_senses_str`\n",
	"Print senses nicely."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"def get_senses_str(senses, broader_senses_inds):\n",
	" out = []\n",
	" for i, sense in enumerate(senses):\n",
	" sorted_sense = sorted(\n",
	" sense.items(),\n",
	" key=lambda x: x[1],\n",
	" reverse=True)\n",
	" sense_out = 'Sense #{}\\n'.format(i)\n",
	" sense_out += ', '.join('{}: {:10.5f}'.format(t, w)\n",
	" for t, w in sorted_sense[:15])\n",
	" if i in broader_senses_inds:\n",
	" sense_out += '\\nTarget sense'\n",
	" out.append(sense_out)\n",
	" return out"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### run the test\n",
	"This is the main function. It executes the following workflow: read and preprocess the corpus, induce the senses of the target entity, disambiguate the entity in the texts of the corpus using the induced senses, report the quality. In the report we use a baseline that is classifying all the entity occurrences into the most popular sense. This way we guarantee that the baseline is challenging: since we take the most popular sense we make only as many mistakes as many occurrences of unpopular senses we have. For an entity with 2 senses the baseline is always above 50%.\n",
	"We report [precision, recall and f1](https://en.wikipedia.org/wiki/Precision_and_recall)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import os\n",
	"\n",
	"from sklearn.metrics import accuracy_score, precision_score, recall_score\n",
	"\n",
	"from wsid import induce, disambiguate\n",
	"\n",
	"def run_test(entity_str, entity_corpus_path,\n",
	" target_sense_identifiers=('cocktail', 'mix')):\n",
	" entity_str = entity_str.lower()\n",
	" corpus = []\n",
	" # Read and prepare the corpus\n",
	" for text_name in os.listdir(entity_corpus_path):\n",
	" if not text_name.endswith('txt'): continue\n",
	" text_path = os.path.join(entity_corpus_path, text_name)\n",
	" with open(text_path) as f:\n",
	" txt = f.read()\n",
	" corpus.append((text_name, txt))\n",
	" corpus, vocabulary = preprocess(corpus, entity_str)\n",
	"\n",
	" # Do the sense induction\n",
	" cos, t2i, i2t = induce.cooc.get_co(\n",
	" [x[1] for x in corpus], entity=entity_str, w=10)\n",
	" hubs, clusters, e_cos = induce.induce(\n",
	" cos, t2i, i2t, entity_str=entity_str,\n",
	" broader_groups=target_sense_identifiers)\n",
	" target_senses_inds = [i for i, x in enumerate(hubs) \n",
	" if x[-1] == target_sense_identifiers]\n",
	" for sense_str in get_senses_str(clusters, target_senses_inds):\n",
	" print()\n",
	" print(sense_str)\n",
	"\n",
	" # Disambiguate the sense of the entity in the documents\n",
	" doc_clusters = []\n",
	" for i in range(len(corpus)):\n",
	" txt = corpus[i][1]\n",
	" categ_ind, conf, distr, evidence = disambiguate.cluster_text(\n",
	" txt, hubs, clusters, entity=entity_str)\n",
	" categ = 'this' if (categ_ind in target_senses_inds) else 'other'\n",
	" doc_clusters.append([categ, conf])\n",
	" print('{}: {}, {}, {}'.format(corpus[i][0], categ, categ_ind, conf))\n",
	" prediction = [x[0] for x in doc_clusters]\n",
	"\n",
	" # Report\n",
	" y_true = [\n",
	" 'this' if corpus[i][0].endswith('(cocktail).txt') else 'other'\n",
	" for i in range(len(corpus))]\n",
	" this_count = y_true.count('this')\n",
	" other_count = y_true.count('other')\n",
	" baseline_pos_label = ['this', 'other'][this_count < other_count]\n",
	" baseline = ([baseline_pos_label] * len(corpus))\n",
	" print(\n",
	" 'Baseline accuracy: ' +\n",
	" 'precision = {:0.3f}\\t'.format(precision_score(y_true, baseline,\n",
	" pos_label=baseline_pos_label)) +\n",
	" 'recall = {:0.3f}\\t'.format(recall_score(y_true, baseline,\n",
	" pos_label=baseline_pos_label)) +\n",
	" 'accuracy = {:0.3f}'.format(accuracy_score(y_true, baseline)))\n",
	" print(\n",
	" 'Method accuracy: ' +\n",
	" 'precision = {:0.3f}\\t'.format(precision_score(y_true, prediction,\n",
	" pos_label='this')) +\n",
	" 'recall = {:0.3f}\\t'.format(recall_score(y_true, prediction,\n",
	" pos_label='this')) +\n",
	" 'accuracy = {:0.3f}'.format(accuracy_score(y_true, prediction)))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"_____\n",
	"# 2. Execute"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"entity_str = 'Americano'\n",
	"cocktails_path = os.getenv('COCKTAILS_PATH')\n",
	"entity_corpus_path = os.path.join(cocktails_path, entity_str)\n",
	"\n",
	"run_test(entity_str, entity_corpus_path, target_sense_identifiers=('cocktail', 'mix'))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"After you get the results you can change the name of the cocktail (`entity_str` value) to other cocktails in the folder. Try `Grasshopper` and `Tequila Sunrise` for instance."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.8"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}