artreven/ptlm_wsid.ipynb

## ptlm_wsid.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 0. Preparations\n",
    "\n",
    "## Virtual Environment\n",
    "Create a virtual environment and install the package by running the following commands in a terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "virtualenv --python=python3 ~/<venv folder>\n",
    "source ~/<venv folder>/bin/activate\n",
    "pip install -e git+https://github.com/semantic-web-company/ptlm_wsid.git#egg=ptlm_wsid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NLTK and spaCy\n",
    "Download the required NLTK and spaCy data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "python -m nltk.downloader punkt stopwords averaged_perceptron_tagger wordnet\n",
    "python -m spacy download en_core_web_sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IPython\n",
    "Next, install IPython and start an IPython shell. All subsequent commands are executed in that shell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install ipython\n",
    "ipython"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Execution\n",
    "First, import the required modules and define a helper function that locates the target word in each context:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import List\n",
    "\n",
    "import ptlm_wsid.target_context as tc\n",
    "import ptlm_wsid.generative_factors as gf\n",
    "\n",
    "\n",
    "def prepare_target_contexts(cxt_strs: List[str],\n",
    "                            target_word: str,\n",
    "                            verbose: bool = True) -> List[tc.TargetContext]:\n",
    "    \"\"\"\n",
    "    Build a case-insensitive regex from the target word and search for it in\n",
    "    each context string. The start and end indices of the first match are\n",
    "    used to produce a TargetContext.\n",
    "\n",
    "    :param cxt_strs: list of context strings\n",
    "    :param target_word: the target word\n",
    "    :param verbose: if True, also print the top predictions for each context\n",
    "    :return: one TargetContext per context string\n",
    "    :raises ValueError: if the target word is not found in a context\n",
    "    \"\"\"\n",
    "    tcs = []\n",
    "    for cxt_str in cxt_strs:\n",
    "        # Escape the target so it is matched literally, not as a regex.\n",
    "        re_match = re.search(re.escape(target_word), cxt_str, re.IGNORECASE)\n",
    "        if re_match is None:\n",
    "            raise ValueError(f'In \"{cxt_str}\" the target '\n",
    "                             f'\"{target_word}\" was not found')\n",
    "        start_ind, end_ind = re_match.start(), re_match.end()\n",
    "        new_tc = tc.TargetContext(\n",
    "            context=cxt_str, target_start_end_inds=(start_ind, end_ind))\n",
    "        if verbose:\n",
    "            top_predictions = new_tc.get_topn_predictions()\n",
    "            print(f'Predictions for {target_word} in {cxt_str}: '\n",
    "                  f'{top_predictions}')\n",
    "        tcs.append(new_tc)\n",
    "    return tcs\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, induce the senses and print them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cxts_dicts = {\n",
    "        1: \"The jaguar's present range extends from Southwestern United States and Mexico in North America, across much of Central America, and south to Paraguay and northern Argentina in South America.\",\n",
    "        2: \"Overall, the jaguar is the largest native cat species of the New World and the third largest in the world.\",\n",
    "        3: \"Given its historical distribution, the jaguar has featured prominently in the mythology of numerous indigenous American cultures, including those of the Maya and Aztec.\",\n",
    "        4: \"The jaguar is a compact and well-muscled animal.\",\n",
    "        5: \"Melanistic jaguars are informally known as black panthers, but as with all forms of polymorphism they do not form a separate species.\",\n",
    "        6: \"The jaguar uses scrape marks, urine, and feces to mark its territory.\",\n",
    "        7: \"The word 'jaguar' is thought to derive from the Tupian word yaguara, meaning 'beast of prey'.\",\n",
    "        8: \"Jaguar's business was founded as the Swallow Sidecar Company in 1922, originally making motorcycle sidecars before developing bodies for passenger cars.\",\n",
    "        9: \"In 1990 Ford acquired Jaguar Cars and it remained in their ownership, joined in 2000 by Land Rover, till 2008.\",\n",
    "        10: \"Two of the proudest moments in Jaguar's long history in motor sport involved winning the Le Mans 24 hours race, firstly in 1951 and again in 1953.\",\n",
    "        11: \"He therefore accepted BMC's offer to merge with Jaguar to form British Motor (Holdings) Limited.\",\n",
    "        12: \"The Jaguar E-Pace is a compact SUV, officially revealed on 13 July 2017.\"}\n",
    "\n",
    "titles, cxts = zip(*cxts_dicts.items())  # split into titles and contexts\n",
    "tcs = prepare_target_contexts(cxt_strs=cxts, target_word='jaguar')\n",
    "senses = gf.induce(\n",
    "    contexts=[t.context for t in tcs],\n",
    "    target_start_end_tuples=[t.target_start_end_inds for t in tcs],\n",
    "    titles=titles,\n",
    "    target_pos='N',  # we want only nouns\n",
    "    n_sense_indicators=5,  # how many substitutes for each sense in the output\n",
    "    top_n_pred=25)  # the number of substitutes for each context\n",
    "for i, sense in enumerate(senses):\n",
    "    print(f'Sense #{i+1}')\n",
    "    print(f'Sense indicators: {\", \".join(str(x) for x in sense.intent)}')\n",
    "    print(f'Found in contexts: {\", \".join(str(x) for x in sense.extent)}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then disambiguate each context by selecting the sense with the highest score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sense_indicators = [list(sense.intent) for sense in senses]\n",
    "# Use a loop variable other than `tc`, which is the imported module alias.\n",
    "for target_cxt, title in zip(tcs, titles):\n",
    "    scores = target_cxt.disambiguate(sense_clusters=sense_indicators)\n",
    "    print(f'For context: \"{str(title).upper()}. {target_cxt.context}\" '\n",
    "          f'the sense: {sense_indicators[scores.index(max(scores))]} '\n",
    "          f'is chosen.')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}