cmgerber/Common Objects of Verbs; Thesaurus For Main Topics.ipynb

## Common Objects of Verbs; Thesaurus For Main Topics.ipynb
{
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## 1. Common Objects of Verbs ##\nThe Church and Hanks reading shows how interesting semantics can be found by looking at very simple patterns. For instance, if we look at what gets drunk (the object of the verb drink) we can automatically acquire a list of beverages. Similarly, if we find an informative verb in a text about mythology, and look at the subjects of certain verbs, we might be able to group all the gods' names together by seeing who does the blessing and smoting.\nMore generally, looking at common objects of verbs, or in some cases, subjects of verbs, we have another piece of evidence for grouping similar words together.\n\n**Find frequent verbs:** Using your tagged collection from the previous assignment, first pull out verbs and then rank by frequency (if you like, you might use WordNet's morphy() to normalize them into their lemma form, but this is not required). Print out the top 40 most frequent verbs and take a look at them:"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import sys\nsys.path.append(\"/Users/colingerber/Documents/Programming/Personal/Python_modules\")",
     "prompt_number": 2,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import nltk\nimport cPickle as pickle\nfrom pyUtilities import flattenList as flat",
     "prompt_number": 3,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "with open('tagged_clinical_corpus.pkl', 'rb') as input:\n        tagged_corpus = pickle.load(input)",
     "prompt_number": 4,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "flat_tagged_corpus = flat.flatten(tagged_corpus)",
     "prompt_number": 5,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "flat_tagged_corpus[:10]",
     "prompt_number": 6,
     "outputs": [
      {
       "text": "[(u'Inclusion', u'NN'),\n (u'Criteria', u'NP'),\n (u':', u':'),\n (u'Healthy', u'JJ'),\n (u'male', u'NN'),\n (u'18-50', 'CD'),\n (u'years', u'NNS'),\n (u'of', u'IN'),\n (u'age', u'NN'),\n (u'Non-smoker', u'NN')]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 6
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#get the verbs out\nverbs = [word for word in flat_tagged_corpus if word[1].startswith('V')]",
     "prompt_number": 14,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "verbs[:10]",
     "prompt_number": 15,
     "outputs": [
      {
       "text": "[(u'taking', u'VBG'),\n (u'use', u'VB'),\n (u'accepted', u'VBN'),\n (u'Known', u'VBN'),\n (u'sleep', u'VB'),\n (u'bleeding', u'VBG'),\n (u'need', u'VB'),\n (u'utilizing', u'VBG'),\n (u'proven', u'VBN'),\n (u'following', u'VBG')]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 15
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "fd = nltk.FreqDist(verbs)",
     "prompt_number": 16,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "fd.most_common(40)",
     "prompt_number": 17,
     "outputs": [
      {
       "text": "[((u'informed', u'VBN'), 1018),\n ((u'following', u'VBG'), 884),\n ((u'defined', u'VBN'), 717),\n ((u'known', u'VBN'), 691),\n ((u'treated', u'VBN'), 554),\n ((u'use', u'VB'), 506),\n ((u'Known', u'VBN'), 455),\n ((u'received', u'VBN'), 452),\n ((u'written', u'VBN'), 415),\n ((u'requiring', u'VBG'), 401),\n ((u'interfere', u'VB'), 379),\n ((u'childbearing', u'VBG'), 378),\n ((u'Use', u'VB'), 378),\n ((u'using', u'VBG'), 352),\n ((u'confirmed', u'VBN'), 351),\n ((u'participate', u'VB'), 350),\n ((u'receiving', u'VBG'), 339),\n ((u'specified', u'VBN'), 323),\n ((u'excluded', u'VBN'), 308),\n ((u'determined', u'VBN'), 303),\n ((u'allowed', u'VBN'), 296),\n ((u'taking', u'VBG'), 280),\n ((u'comply', u'VB'), 280),\n ((u'Screening', u'VBG'), 275),\n ((u'provide', u'VB'), 269),\n ((u'diagnosed', u'VBN'), 249),\n ((u'required', u'VBN'), 247),\n ((u'documented', u'VBN'), 241),\n ((u'based', u'VBN'), 237),\n ((u'lactating', u'VBG'), 224),\n ((u'used', u'VBN'), 221),\n ((u'agree', u'VB'), 218),\n ((u'nursing', u'VBG'), 213),\n ((u'related', u'VBN'), 207),\n ((u'considered', u'VBN'), 204),\n ((u'understand', u'VB'), 199),\n ((u'give', u'VB'), 189),\n ((u'enrolled', u'VBN'), 187),\n ((u'aged', u'VBN'), 176),\n ((u'Uncontrolled', u'VBN'), 171)]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 17
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "**Pick 2 out interesting verbs:** Next manually pick out two verbs to look at in detail that look interesting to you. Try to pick some for which the objects will be interesting and will form a pattern of some kind.  Find all the sentences in your corpus that contain these verbs.\n"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# verbs chosen: known diagnosed\nverb_sents = []\n\nfor sent in tagged_corpus:\n    for word in sent:\n        if word[0] == 'known' or word[0] == 'diagnosed':\n            verb_sents.append(sent)\n            break",
     "prompt_number": 23,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "verb_sents[15]",
     "prompt_number": 27,
     "outputs": [
      {
       "text": "[(u'The', u'AT'),\n (u'patient', u'NN'),\n (u'has', u'HVZ'),\n (u'been', u'BEN'),\n (u'diagnosed', u'VBN'),\n (u'with', u'IN'),\n (u'a', u'AT'),\n (u'psychiatric', u'JJ'),\n (u'disorder', u'NN'),\n (u'other', u'AP'),\n (u'than', u'CS'),\n (u'MDD', 'NN'),\n (u'during', u'IN'),\n (u'the', u'AT'),\n (u'lead-in', u'JJ'),\n (u'studies', u'NNS'),\n (u'NCT01838681', 'NN'),\n (u'14570A', 'NN'),\n (u'or', u'CC'),\n (u'NCT01837797', 'NN'),\n (u'14571A', 'NN'),\n (u'.', u'.')]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 27
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "**Find common objects:** Now write a chunker to find the simple noun phrase objects of these four verbs and see if they tell you anything interesting about your collection.  Don't worry about making the noun phrases perfect; you can use the chunker from the first part of this homework if you like.  Print out the common noun phrases and take a look.  Write the code below, show some of the output, and then reflect on that output in a few sentences.  \n"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def chunker(tagged_corpus):\n    \n    grammar = r\"\"\"\n      CHUNK: {<VBN>.*<JJ|NN>+<(NN|CD)|NN>}\n    \"\"\"\n    cp = nltk.RegexpParser(grammar)\n    \n    results = []\n    \n    for sents in tagged_corpus:\n        tree = cp.parse(sents)\n        for subtree in tree.subtrees():\n            if subtree.label() == 'CHUNK':\n                results.append(subtree)\n    return results",
     "prompt_number": 36,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "chunks = chunker(verb_sents)",
     "prompt_number": 37,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "chunks[50:70]",
     "prompt_number": 38,
     "outputs": [
      {
       "text": "[Tree('CHUNK', [(u'implanted', u'VBN'), (u'cardiac', u'JJ'), (u'pacemaker', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'chronic', u'JJ'), (u'respiratory', u'JJ'), (u'infection', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'chronic', u'JJ'), (u'respirator', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'hereditary', u'JJ'), (u'methaemoglobinaemia', u'NN')]),\n Tree('CHUNK', [(u'completed', u'VBN'), (u'local', u'JJ'), (u'therapy', u'NN')]),\n Tree('CHUNK', [(u'documented', u'VBN'), (u'metastatic', u'JJ'), (u'disease', u'NN')]),\n Tree('CHUNK', [(u'Untreated', u'VBN'), (u'primary', u'JJ'), (u'uveal', u'NN')]),\n Tree('CHUNK', [(u'suspected', u'VBN'), (u'tissue', u'NN'), (u'hypoxia', 'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'liver', u'NN'), (u'function', u'NN'), (u'abnormality', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'viral', u'JJ'), (u'infection', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'HIV', 'NN'), (u'positivity', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'CNS', 'NN'), (u'involvement', u'NN'), (u'NOTE', 'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'curative', u'JJ'), (u'therapy', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'bone', 'NN'), (u'marrow', u'NN'), (u'involvement', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'breast', u'NN'), (u'lesion', u'NN')]),\n Tree('CHUNK', [(u'known', u'VBN'), (u'hepatic', u'JJ'), (u'disease', u'NN')]),\n Tree('CHUNK', [(u'acquired', u'VBN'), (u'uterine', u'NN'), (u'anomaly', u'NN')]),\n Tree('CHUNK', [(u'impaired', u'VBN'), (u'pulmonary', u'JJ'), (u'function', u'NN')]),\n Tree('CHUNK', [(u'Estimated', u'VBN'), (u'Creatinine', u'NN'), (u'Clearance', u'NN'), (u'50', u'CD')]),\n Tree('CHUNK', [(u'diagnosed', u'VBN'), (u'acute', u'JJ'), (u'myeloid', u'JJ'), (u'leukemia', u'NN')])]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 38
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "###Reflection\n\nThe first thing I realized is the my sentence splitting is not perfect yet. There are still some sentence chunks that need to be split into sentences.\n\nLooking at known and diagnosed allows you to find the diseases the studies are looking for. They generally are saying they want a known or diagnosed disease. This would be very useful for tagging studdies with the required diseases the patients must have."
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "\n## 2. Identify Main Topics from WordNet Hypernms ##\nFirst read about the code supplied below; at the end you'll be asked to do an exercise."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "from nltk.corpus import wordnet as wn\nfrom nltk.corpus import brown\nfrom nltk.corpus import stopwords",
     "prompt_number": 40,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "This code first pulls out the most frequent words from a section of the brown corpus after removing stop words.  It lowercases everything, but should really be doing much smarter things with tokenization and phrases and so on. "
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def preprocess_terms():\n    # select a subcorpus of brown to experiment with\n    words = [word.lower() for word in brown.words(categories=\"science_fiction\") if word.lower() not in stopwords.words('english')]\n    # count up the words\n    fd = nltk.FreqDist(words)\n    # show some sample words\n    print ' '.join(fd.keys()[100:150])\n    return fd\nfd = preprocess_terms()",
     "prompt_number": 41,
     "outputs": [
      {
       "output_type": "stream",
       "text": "eyebrows yancy-6 campaign explained highly brought patience moral stern ekstrohm glance total landscape experimentation spoke would commissioned hospital m. arms program call calm recommend tell separated beyond holy hurt glass warm hurl ward inhabited must pontifical join room setup work already gabriel's shook indicated give household flung organized surveyor want\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Then makes a *very naive* guess at which are the most important words.  This is where some term weighting should take place."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def find_important_terms(fd):\n    important_words = []\n    #if the frequency is less than the top 50 words and more than 5 occurances\n    for word in fd.keys():\n        if fd[word] < fd.most_common(50)[-1][1] and fd[word] > 5:\n            important_words.append(word)\n    return important_words\n\nimportant_terms = find_important_terms(fd)",
     "prompt_number": 83,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "The code below is a very crude way to see what the most common \"topics\" are among the \"important\" words, according to WordNet.  It does this by looking at the immediate hypernym of every sense of a wordform for those wordforms that are found to be nouns in WordNet.  This is problematic because many of these senses will be incorrect and also often the hypernym elides the specific meaning of the word, but if you compare, say *romance* to *science fiction* in brown, you do see differences in the results. "
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# Count the direct hypernyms for every sense of each wordform.\n# This is very crude.  It should convert the wordform to a lemma, and should\n# be smarter about selecting important words and finding two-word phrases, etc.\n\n# Nonetheless, you get intersting differences between, say, scifi and romance.\ndef categories_from_hypernyms(termlist):\n    hypterms = []                        \n    for term in termlist:                  # for each term\n        s = wn.synsets(term.lower(), 'n')  # get its nominal synsets\n        for syn in s:                      # for each synset\n            for hyp in syn.hypernyms():    # It has a list of hypernyms\n                hypterms = hypterms + [hyp.name]  # Extract the hypernym name and add to list\n\n    hypfd = nltk.FreqDist(hypterms)\n    print \"Show most frequent hypernym results\"\n    return [(count, name, wn.synset(name).definition) for (name, count) in hypfd.items()[:25]] \n    \ncategories_from_hypernyms(important_terms)",
     "prompt_number": 84,
     "outputs": [
      {
       "output_type": "stream",
       "text": "Show most frequent hypernym results\n",
       "stream": "stdout"
      },
      {
       "output_type": "pyerr",
       "ename": "AttributeError",
       "evalue": "'function' object has no attribute 'lower'",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-84-9802bd9a4da3>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     16\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcount\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msynset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdefinition\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcount\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mhypfd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m25\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 18\u001b[0;31m \u001b[0mcategories_from_hypernyms\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mimportant_terms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
        "\u001b[0;32m<ipython-input-84-9802bd9a4da3>\u001b[0m in \u001b[0;36mcategories_from_hypernyms\u001b[0;34m(termlist)\u001b[0m\n\u001b[1;32m     14\u001b[0m     \u001b[0mhypfd\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnltk\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mFreqDist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhypterms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m     \u001b[0;32mprint\u001b[0m \u001b[0;34m\"Show most frequent hypernym results\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcount\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msynset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdefinition\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcount\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mhypfd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m25\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m \u001b[0mcategories_from_hypernyms\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mimportant_terms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;32m/Users/colingerber/anaconda/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.pyc\u001b[0m in \u001b[0;36msynset\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   1215\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0msynset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1216\u001b[0m         \u001b[0;31m# split name into lemma, part of speech and synset number\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1217\u001b[0;31m         \u001b[0mlemma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpos\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msynset_index_str\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlower\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrsplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'.'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1218\u001b[0m         \u001b[0msynset_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msynset_index_str\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1219\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;31mAttributeError\u001b[0m: 'function' object has no attribute 'lower'"
       ]
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "**Here is the question** Modify this code in some way to do a better job of using WordNet to summarize terms.  You can trim senses in a better way, or traverse hypernyms differently.  You don't have to use hypernyms; you can use any WordNet relations you like, or chose your terms in another way.  You can also use other parts of speech if you like.  "
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "wn.synsets(term.lower())[0].hypernyms()",
     "prompt_number": 72,
     "outputs": [
      {
       "text": "[Synset('good_nature.n.01')]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 72
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# Count the direct hypernyms for every sense of each wordform.\n# This is very crude.  It should convert the wordform to a lemma, and should\n# be smarter about selecting important words and finding two-word phrases, etc.\n\n# Nonetheless, you get intersting differences between, say, scifi and romance.\ndef categories_from_hypernyms(termlist):\n    hypterms = []                        \n    for term in termlist:                  # for each term\n        s = wn.synsets(term.lower(), 'n')  # get its nominal synsets\n        for syn in s:                      # for each synset\n            for hyp in syn.hypernyms():    # It has a list of hypernyms\n                hypterms = hypterms + [hyp.name()]  # Extract the hypernym name and add to list\n\n    hypfd = nltk.FreqDist(hypterms)\n    print \"Show most frequent hypernym results\"\n    return [(count, name, wn.synset(name).definition()) for (name, count) in hypfd.most_common(25)] \n    \ncategories_from_hypernyms(important_terms)",
     "prompt_number": 85,
     "outputs": [
      {
       "output_type": "stream",
       "text": "Show most frequent hypernym results\n",
       "stream": "stdout"
      },
      {
       "text": "[(7, u'time_period.n.01', u'an amount of time'),\n (7,\n  u'natural_object.n.01',\n  u'an object occurring naturally; not made by man'),\n (4, u'time_unit.n.01', u'a unit for measuring time periods'),\n (4,\n  u'concern.n.01',\n  u'something that interests you because it is important or affects you'),\n (3, u'location.n.01', u'a point or extent in space'),\n (3, u'appearance.n.01', u'outward or visible aspect of a person or thing'),\n (3,\n  u'area.n.06',\n  u'the extent of a 2-dimensional surface enclosed within a boundary'),\n (3,\n  u'surface.n.01',\n  u'the outer boundary of an artifact or a material layer constituting or resembling such a boundary'),\n (3,\n  u'surface.n.02',\n  u'the extended two-dimensional outer boundary of a three-dimensional object'),\n (3, u'bed.n.01', u'a piece of furniture that provides a place to sleep'),\n (3, u'covering.n.01', u'a natural object that covers or envelops'),\n (3,\n  u'statement.n.01',\n  u'a message that is stated or declared; a communication (oral or written) setting forth particulars or facts etc'),\n (3, u'speech.n.02', u'(language) communication by word of mouth'),\n (3, u'necessity.n.02', u'anything indispensable'),\n (3, u'activity.n.01', u'any specific behavior'),\n (3,\n  u'object.n.01',\n  u'a tangible and visible entity; an entity that can cast a shadow'),\n (3,\n  u'cognition.n.01',\n  u'the psychological result of perception and learning and reasoning'),\n (3,\n  u'message.n.02',\n  u'what a communication that is about something is about'),\n (3, u'person.n.01', u'a human being'),\n (3,\n  u'people.n.01',\n  u'(plural) any group of human beings (men or women or children) collectively'),\n (3,\n  u'body_part.n.01',\n  u'any part of an organism such as an organ or extremity'),\n (2,\n  u'structure.n.01',\n  u'a thing constructed; a complex entity constructed of many parts'),\n (2, u'tune.n.01', u'a succession of notes forming a distinctive sequence'),\n (2, u'thing.n.12', u'a separate and self-contained entity'),\n (2,\n  u'communication.n.02',\n  u'something that is communicated by or to or between people or groups')]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 85
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:6c6475e961c68aaae3ac6262bd87515c84c6a3b94ae5e81571565651c51434e7"
 },
 "nbformat": 3
}