@bbzzzz
Created March 2, 2015 16:21
Natural Language Processing - Word Meaning and Word Similarity
{
"metadata": {
"name": "Wordnet Interface"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Starting NLTK"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import nltk\nfrom nltk.corpus import wordnet as wn",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Synsets and lemmas"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A word such as \u2018dog\u2019 may have several senses. WordNet represents each sense as a synset, and wn.synsets returns all of them."
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.synsets('dog')",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": "[Synset('dog.n.01'),\n Synset('frump.n.01'),\n Synset('dog.n.03'),\n Synset('cad.n.01'),\n Synset('frank.n.02'),\n Synset('pawl.n.01'),\n Synset('andiron.n.01'),\n Synset('chase.v.01')]"
}
],
"prompt_number": 18
},
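{
"cell_type": "markdown",
"metadata": {},
"source": "As a small sketch, the lookup can be narrowed to one part of speech: `wn.synsets` also accepts a `pos` argument. The variable names below are illustrative only. Judging from the listing above, the eight senses should split into seven nouns and one verb."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Restrict the lookup to a single part of speech\nnoun_senses = wn.synsets('dog', pos=wn.NOUN)\nverb_senses = wn.synsets('dog', pos=wn.VERB)\nprint(len(noun_senses))\nprint(len(verb_senses))",
"language": "python",
"metadata": {},
"outputs": []
},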
{
"cell_type": "markdown",
"metadata": {},
"source": "Once you have a synset, several methods return information about it. We will start with \u201clemma_names\u201d, \u201clemmas\u201d, \u201cdefinition\u201d and \u201cexamples\u201d."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The first synset, 'dog.n.01', is the first noun sense of \u2018dog\u2019. Its lemma names are all the words that are synonyms for this sense of \u2018dog\u2019."
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.synset('dog.n.01').lemma_names()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": "[u'dog', u'domestic_dog', u'Canis_familiaris']"
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Given a synset, find all its lemmas, where a lemma is the pairing of a word with a synset. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.synset('dog.n.01').lemmas()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": "[Lemma('dog.n.01.dog'),\n Lemma('dog.n.01.domestic_dog'),\n Lemma('dog.n.01.Canis_familiaris')]"
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Given a lemma, find its synset"
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.lemma('dog.n.01.domestic_dog').synset()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": "Synset('dog.n.01')"
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Given a word, list the lemma names of each synset it belongs to:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for synset in wn.synsets('dog'):\n    print('%s : %s' % (synset, synset.lemma_names()))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Synset('dog.n.01') : [u'dog', u'domestic_dog', u'Canis_familiaris']\nSynset('frump.n.01') : [u'frump', u'dog']\nSynset('dog.n.03') : [u'dog']\nSynset('cad.n.01') : [u'cad', u'bounder', u'blackguard', u'dog', u'hound', u'heel']\nSynset('frank.n.02') : [u'frank', u'frankfurter', u'hotdog', u'hot_dog', u'dog', u'wiener', u'wienerwurst', u'weenie']\nSynset('pawl.n.01') : [u'pawl', u'detent', u'click', u'dog']\nSynset('andiron.n.01') : [u'andiron', u'firedog', u'dog', u'dog-iron']\nSynset('chase.v.01') : [u'chase', u'chase_after', u'trail', u'tail', u'tag', u'give_chase', u'dog', u'go_after', u'track']\n"
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Given a word, find all lemmas involving the word. These correspond one-to-one to the synsets of \u2018dog\u2019 listed above; each lemma pairs \u2018dog\u2019 with one of those synsets."
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.lemmas('dog')",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": "[Lemma('dog.n.01.dog'),\n Lemma('frump.n.01.dog'),\n Lemma('dog.n.03.dog'),\n Lemma('cad.n.01.dog'),\n Lemma('frank.n.02.dog'),\n Lemma('pawl.n.01.dog'),\n Lemma('andiron.n.01.dog'),\n Lemma('chase.v.01.dog')]"
}
],
"prompt_number": 17
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Definitions and examples"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Synsets also provide definitions and examples. Find the definition and the examples of the synset for the first sense of the word \u2018dog\u2019:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.synset('dog.n.01').definition()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": "u'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'"
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.synset('dog.n.01').examples()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": "[u'the dog barked all night']"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Or we can show all the synsets and their definitions: "
},
{
"cell_type": "code",
"collapsed": false,
"input": "for synset in wn.synsets('dog'):\n    print('%s : %s' % (synset, synset.definition()))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Synset('dog.n.01') : a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds\nSynset('frump.n.01') : a dull unattractive unpleasant girl or woman\nSynset('dog.n.03') : informal term for a man\nSynset('cad.n.01') : someone who is morally reprehensible\nSynset('frank.n.02') : a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll\nSynset('pawl.n.01') : a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward\nSynset('andiron.n.01') : metal supports for logs in a fireplace\nSynset('chase.v.01') : go after with the intent to catch\n"
}
],
"prompt_number": 15
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "The WordNet Hierarchy"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "WordNet contains many relations between synsets. The most commonly explored is the hierarchy induced by the hypernym (more general) and hyponym (more specific) relations. (These are often called \u201cis-a\u201d relations: a dog is a canine.)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Find hypernyms of a synset of \u2018dog\u2019:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog1 = wn.synset('dog.n.01')\ndog1.hypernyms()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": "[Synset('canine.n.02'), Synset('domestic_animal.n.01')]"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Find hyponyms:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog1.hyponyms()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": "[Synset('basenji.n.01'),\n Synset('corgi.n.01'),\n Synset('cur.n.01'),\n Synset('dalmatian.n.02'),\n Synset('great_pyrenees.n.01'),\n Synset('griffon.n.02'),\n Synset('hunting_dog.n.01'),\n Synset('lapdog.n.01'),\n Synset('leonberg.n.01'),\n Synset('mexican_hairless.n.01'),\n Synset('newfoundland.n.01'),\n Synset('pooch.n.01'),\n Synset('poodle.n.01'),\n Synset('pug.n.01'),\n Synset('puppy.n.01'),\n Synset('spitz.n.01'),\n Synset('toy_dog.n.01'),\n Synset('working_dog.n.01')]"
}
],
"prompt_number": 20
},
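{
"cell_type": "markdown",
"metadata": {},
"source": "The list above contains only the direct hyponyms. As a sketch, the `closure` method of a synset follows a relation transitively, so it can collect every descendant of `dog1`. The names `hypo` and `all_hyponyms` are illustrative choices, not part of NLTK."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Relation to follow: from a synset to its direct hyponyms\nhypo = lambda s: s.hyponyms()\n\n# closure() applies the relation repeatedly and yields every synset reached\nall_hyponyms = list(dog1.closure(hypo))\nprint(len(all_hyponyms))\nprint(all_hyponyms[:5])",
"language": "python",
"metadata": {},
"outputs": []
},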
{
"cell_type": "markdown",
"metadata": {},
"source": "We can find the most general hypernym as the root hypernym:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog1.root_hypernyms()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": "[Synset('entity.n.01')]"
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The function hypernym_paths returns the paths from the top of the hierarchy down to the synset.\n\nIn this example, there are two paths from entity to the first sense of dog."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pathsdog = dog1.hypernym_paths()\nprint(len(pathsdog))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "2\n"
}
],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": "[synset.name() for synset in pathsdog[0]]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 37,
"text": "[u'entity.n.01',\n u'physical_entity.n.01',\n u'object.n.01',\n u'whole.n.02',\n u'living_thing.n.01',\n u'organism.n.01',\n u'animal.n.01',\n u'chordate.n.01',\n u'vertebrate.n.01',\n u'mammal.n.01',\n u'placental.n.01',\n u'carnivore.n.01',\n u'canine.n.02',\n u'dog.n.01']"
}
],
"prompt_number": 37
},
{
"cell_type": "code",
"collapsed": false,
"input": "[synset.name() for synset in pathsdog[1]]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 38,
"text": "[u'entity.n.01',\n u'physical_entity.n.01',\n u'object.n.01',\n u'whole.n.02',\n u'living_thing.n.01',\n u'organism.n.01',\n u'animal.n.01',\n u'domestic_animal.n.01',\n u'dog.n.01']"
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The min_depth function returns the number of edges on the shortest hypernym path between a synset and the top of the hierarchy."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog1.min_depth()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 39,
"text": "8"
}
],
"prompt_number": 39
},
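{
"cell_type": "markdown",
"metadata": {},
"source": "As a cross-check (a sketch reusing `pathsdog` from above; the name `shortest` is illustrative), the minimum depth should equal the number of synsets on the shortest hypernym path minus one. The second path listed above has 9 synsets, which is consistent with a depth of 8."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Number of synsets on the shortest root-to-synset path\nshortest = min(len(path) for path in pathsdog)\n\n# Edges on that path = synsets on the path - 1\nprint(shortest - 1)\nprint(dog1.min_depth() == shortest - 1)",
"language": "python",
"metadata": {},
"outputs": []
},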
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Word Similarity"
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog = wn.synset('dog.n.01')\ncat = wn.synset('cat.n.01')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": "hit = wn.synset('hit.v.01')\nslap = wn.synset('slap.v.01')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One way to assess semantic similarity is to find the lowest common hypernyms of two synsets, that is, the most specific ancestors they share."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.lowest_common_hypernyms(cat)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": "[Synset('carnivore.n.01')]"
}
],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": "pathscat=cat.hypernym_paths()\n[synset.name() for synset in pathscat[0]]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 42,
"text": "[u'entity.n.01',\n u'physical_entity.n.01',\n u'object.n.01',\n u'whole.n.02',\n u'living_thing.n.01',\n u'organism.n.01',\n u'animal.n.01',\n u'chordate.n.01',\n u'vertebrate.n.01',\n u'mammal.n.01',\n u'placental.n.01',\n u'carnivore.n.01',\n u'feline.n.01',\n u'cat.n.01']"
}
],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": "synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The score is in the range 0 to 1. A score of 1 represents identity, i.e. comparing a sense with itself returns 1."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.path_similarity(cat)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": "0.2"
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.path_similarity(dog)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": "1.0"
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": "hit.path_similarity(slap) ",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": "0.14285714285714285"
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": "wn.path_similarity(hit, slap)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": "0.14285714285714285"
}
],
"prompt_number": 26
},
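{
"cell_type": "markdown",
"metadata": {},
"source": "For reference, NLTK derives the path similarity from the shortest path distance d between the two synsets as 1 / (d + 1). The sketch below recomputes the score by hand for `dog` and `cat` as defined above; the name `d` is illustrative. Only the noun pair is used here, since verb hierarchies may involve a simulated root."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Number of edges on the shortest path between the two synsets\nd = dog.shortest_path_distance(cat)\nprint(d)\n\n# Recompute the score by hand and compare with the built-in method\nprint(1.0 / (d + 1))\nprint(dog.path_similarity(cat))",
"language": "python",
"metadata": {},
"outputs": []
},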
{
"cell_type": "markdown",
"metadata": {},
"source": "wordnet_ic Information Content: Load an information content file from the wordnet_ic corpus."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from nltk.corpus import wordnet_ic\nbrown_ic = wordnet_ic.ic('ic-brown.dat')\nsemcor_ic = wordnet_ic.ic('ic-semcor.dat')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). \n\nNote that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.res_similarity(cat, brown_ic)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 45,
"text": "7.911666509036577"
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.res_similarity(cat, semcor_ic) ",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 46,
"text": "7.2549003421277245"
}
],
"prompt_number": 46
},
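{
"cell_type": "markdown",
"metadata": {},
"source": "To make the definition concrete, the sketch below looks up the information content of the lowest common hypernym directly, using the module-level helper `information_content` from `nltk.corpus.reader.wordnet`. It assumes that the lowest common hypernym is also the common subsumer with the highest information content, which is the one the Resnik score uses. The names `lcs` and `ic_lcs` are illustrative."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from nltk.corpus.reader.wordnet import information_content\n\n# Most specific ancestor shared by the two synsets\nlcs = dog.lowest_common_hypernyms(cat)[0]\nprint(lcs)\n\n# Its information content, next to the Resnik score for comparison\nic_lcs = information_content(lcs, brown_ic)\nprint(ic_lcs)\nprint(dog.res_similarity(cat, brown_ic))",
"language": "python",
"metadata": {},
"outputs": []
},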
{
"cell_type": "markdown",
"metadata": {},
"source": "Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.jcn_similarity(cat, brown_ic) ",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 47,
"text": "0.4497755285516739"
}
],
"prompt_number": 47
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.jcn_similarity(cat, semcor_ic) ",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 48,
"text": "0.537382154955756"
}
],
"prompt_number": 48
},
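{
"cell_type": "markdown",
"metadata": {},
"source": "A minimal sketch that recomputes the Jiang-Conrath score from the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)), assuming `dog`, `cat` and `brown_ic` as defined above. IC(lcs) is taken from `res_similarity`, which returns the information content of the least common subsumer. The names `ic_dog`, `ic_cat` and `ic_lcs` are illustrative."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from nltk.corpus.reader.wordnet import information_content\n\n# Information content of each input synset and of their least common subsumer\nic_dog = information_content(dog, brown_ic)\nic_cat = information_content(cat, brown_ic)\nic_lcs = dog.res_similarity(cat, brown_ic)\n\n# Apply the equation by hand and compare with the built-in method\nprint(1.0 / (ic_dog + ic_cat - 2 * ic_lcs))\nprint(dog.jcn_similarity(cat, brown_ic))",
"language": "python",
"metadata": {},
"outputs": []
},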
{
"cell_type": "markdown",
"metadata": {},
"source": "Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2))."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.lin_similarity(cat, brown_ic) ",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 50,
"text": "0.8768009843733973"
}
],
"prompt_number": 50
},
{
"cell_type": "code",
"collapsed": false,
"input": "dog.lin_similarity(cat, semcor_ic) ",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 56,
"text": "0.8863288628086228"
}
],
"prompt_number": 56
}
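,
{
"cell_type": "markdown",
"metadata": {},
"source": "In the same spirit, a sketch that recomputes the Lin score from the equation 2 * IC(lcs) / (IC(s1) + IC(s2)), assuming `dog`, `cat` and `brown_ic` as defined above. IC(lcs) is again taken from `res_similarity`; the names `ic_dog`, `ic_cat` and `ic_lcs` are illustrative."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from nltk.corpus.reader.wordnet import information_content\n\n# Information content of each input synset and of their least common subsumer\nic_dog = information_content(dog, brown_ic)\nic_cat = information_content(cat, brown_ic)\nic_lcs = dog.res_similarity(cat, brown_ic)\n\n# Apply the equation by hand and compare with the built-in method\nprint(2.0 * ic_lcs / (ic_dog + ic_cat))\nprint(dog.lin_similarity(cat, brown_ic))",
"language": "python",
"metadata": {},
"outputs": []
}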
],
"metadata": {}
}
]
}