@artreven
Last active May 19, 2020 12:42
WSID with pre-trained Language Model
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 0. Preparations\n",
"\n",
"## Virtual Environment\n",
"Create a virtual environment and install the package by running the following commands in a terminal:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"virtualenv --python=python3 ~/<venv folder>\n",
"source ~/<venv folder>/bin/activate\n",
"pip install -e git+https://github.com/semantic-web-company/ptlm_wsid.git#egg=ptlm_wsid"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLTK and spaCy\n",
"Download the required NLTK and spaCy data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"python -m nltk.downloader punkt stopwords averaged_perceptron_tagger wordnet\n",
"python -m spacy download en_core_web_sm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IPython\n",
"Next, install IPython and start its shell; all subsequent commands are executed there:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install ipython\n",
"ipython"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Execution\n",
"First, import the required modules and define a helper function that builds a `TargetContext` for each context string:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from typing import List\n",
"\n",
"import ptlm_wsid.target_context as tc\n",
"import ptlm_wsid.generative_factors as gf\n",
"\n",
"\n",
"def prepare_target_contexts(cxt_strs: List[str],\n",
"                            target_word: str,\n",
"                            verbose: bool = True) -> List[tc.TargetContext]:\n",
"    \"\"\"\n",
"    Create a simple regex from the target word and search for this pattern\n",
"    (case-insensitively) in each context string. The start and end indices\n",
"    of the first match are used to produce a TargetContext.\n",
"\n",
"    :param cxt_strs: list of context strings\n",
"    :param target_word: the target word\n",
"    :param verbose: also print the top predictions for each context\n",
"    :raises ValueError: if the target word is not found in a context\n",
"    \"\"\"\n",
"    tcs = []\n",
"    for cxt_str in cxt_strs:\n",
"        re_match = re.search(target_word, cxt_str, re.IGNORECASE)\n",
"        if re_match is None:\n",
"            raise ValueError(f'In \"{cxt_str}\" the target '\n",
"                             f'\"{target_word}\" was not found')\n",
"        start_ind, end_ind = re_match.start(), re_match.end()\n",
"        new_tc = tc.TargetContext(\n",
"            context=cxt_str, target_start_end_inds=(start_ind, end_ind))\n",
"        if verbose:\n",
"            top_predictions = new_tc.get_topn_predictions()\n",
"            print(f'Predictions for {target_word} in {cxt_str}: '\n",
"                  f'{top_predictions}')\n",
"        tcs.append(new_tc)\n",
"    return tcs\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now induce the senses of the target word and print them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cxts_dicts = {\n",
"    1: \"The jaguar's present range extends from Southwestern United States and Mexico in North America, across much of Central America, and south to Paraguay and northern Argentina in South America.\",\n",
"    2: \"Overall, the jaguar is the largest native cat species of the New World and the third largest in the world.\",\n",
"    3: \"Given its historical distribution, the jaguar has featured prominently in the mythology of numerous indigenous American cultures, including those of the Maya and Aztec.\",\n",
"    4: \"The jaguar is a compact and well-muscled animal.\",\n",
"    5: \"Melanistic jaguars are informally known as black panthers, but as with all forms of polymorphism they do not form a separate species.\",\n",
"    6: \"The jaguar uses scrape marks, urine, and feces to mark its territory.\",\n",
"    7: \"The word 'jaguar' is thought to derive from the Tupian word yaguara, meaning 'beast of prey'.\",\n",
"    8: \"Jaguar's business was founded as the Swallow Sidecar Company in 1922, originally making motorcycle sidecars before developing bodies for passenger cars.\",\n",
"    9: \"In 1990 Ford acquired Jaguar Cars and it remained in their ownership, joined in 2000 by Land Rover, till 2008.\",\n",
"    10: \"Two of the proudest moments in Jaguar's long history in motor sport involved winning the Le Mans 24 hours race, firstly in 1951 and again in 1953.\",\n",
"    11: \"He therefore accepted BMC's offer to merge with Jaguar to form British Motor (Holdings) Limited.\",\n",
"    12: \"The Jaguar E-Pace is a compact SUV, officially revealed on 13 July 2017.\"}\n",
"\n",
"titles, cxts = list(zip(*cxts_dicts.items()))  # split into titles and contexts\n",
"tcs = prepare_target_contexts(cxt_strs=cxts, target_word='jaguar')\n",
"senses = gf.induce(\n",
"    contexts=[tc.context for tc in tcs],\n",
"    target_start_end_tuples=[tc.target_start_end_inds for tc in tcs],\n",
"    titles=titles,\n",
"    target_pos='N',  # we want only nouns\n",
"    n_sense_indicators=5,  # how many substitutes for each sense in the output\n",
"    top_n_pred=25)  # the number of substitutes for each context\n",
"for i, sense in enumerate(senses):\n",
"    print(f'Sense #{i+1}')\n",
"    print(f'Sense indicators: {\", \".join(str(x) for x in sense.intent)}')\n",
"    print(f'Found in contexts: {\", \".join(str(x) for x in sense.extent)}')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then disambiguate each context: score it against the induced senses and pick the sense with the highest score:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sense_indicators = [list(sense.intent) for sense in senses]\n",
"# The loop variable is not named `tc`, so the imported module `tc` stays usable.\n",
"for target_cxt, title in zip(tcs, titles):\n",
"    scores = target_cxt.disambiguate(sense_clusters=sense_indicators)\n",
"    print(f'For context: \"{str(title).upper()}. {target_cxt.context}\" '\n",
"          f'the sense: {sense_indicators[scores.index(max(scores))]} '\n",
"          f'is chosen.')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}