{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## 1. Common Objects of Verbs ##\nThe Church and Hanks reading shows how interesting semantics can be found by looking at very simple patterns. For instance, if we look at what gets drunk (the object of the verb drink) we can automatically acquire a list of beverages. Similarly, if we find an informative verb in a text about mythology, and look at the subjects of certain verbs, we might be able to group all the gods' names together by seeing who does the blessing and smoting.\nMore generally, looking at common objects of verbs, or in some cases, subjects of verbs, we have another piece of evidence for grouping similar words together.\n\n**Find frequent verbs:** Using your tagged collection from the previous assignment, first pull out verbs and then rank by frequency (if you like, you might use WordNet's morphy() to normalize them into their lemma form, but this is not required). Print out the top 40 most frequent verbs and take a look at them:"
},
{
"metadata": {},
"cell_type": "code",
"input": "#This cell has all methods I need, covering all the different operations I might need to do with my text.\nimport nltk\nimport re\nimport string\nfrom urllib import urlopen\n\n#Get Feynman\ndef ReadFeynman():\n url = 'https://archive.org/stream/RichardFeynman/Richard_P_Feynman-Surely_Youre_Joking_Mr_Feynman_v5_djvu.txt'\n raw = urlopen(url).read()\n modified = raw[10150:]\n modified = string.replace(modified,'\\n','')\n modified = string.replace(modified,'\\xe2\\x80\\x94','')\n return modified\n\n#Get sentences from Feynman\ndef GetSentencesFromFeynman():\n sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')\n sents = sent_tokenizer.tokenize(ReadFeynman())\n return sents\n \n\n#Tag the Feynman\ndef GetTaggedSentences(all_sentences):\n pattern = r\"([A-Z]\\.)+|\\w+([-']\\w+)*|\\$?\\d+(\\.\\d+)?%?|\\.\\.\\.|[][.,;\\\"'?():-_`]\"\n sentences = all_sentences\n split_sents = [nltk.regexp_tokenize(sent, pattern) for sent in sentences]\n tagged_sents = []\n ngram_tagger = GetTrainedNGramTagger()\n for sent in split_sents:\n tagged_sents.append(ngram_tagger.tag(sent))\n return tagged_sents\n\n#First define a tagger\ndef build_backoff_tagger (train_sents):\n t0 = nltk.DefaultTagger('NN')\n t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n t2 = nltk.BigramTagger(train_sents, backoff=t1)\n return t2\n\n#Training and buildng the ngram tagger\ndef GetTrainedNGramTagger():\n #Build training set\n train_sents = []\n for categ in nltk.corpus.brown.categories():\n for sent in nltk.corpus.brown.tagged_sents(categories=categ):\n train_sents.append(sent)\n train_sents.append([('Feynman', 'NP')])\n train_sents.append([('Los', 'NP'),('Alamos', 'NP')])\n train_sents.append([('MIT', 'NP')])\n #training the tagger\n ngram_tagger = build_backoff_tagger(train_sents)\n return ngram_tagger\n \n#Gets you the verbs\ndef VerbChunker(sent):\n grammar = \"VERB: {<V.*>}\"\n cp = nltk.RegexpParser(grammar)\n result = cp.parse(sent)\n return result\n\n#Gets you the noun phrases\ndef SlightlyModifiedChuangChunker(sent):\n grammar = \"NP: {<CD>*(((<JJ>|<N.*>)+(<N.*>|<CD>))|<N.*>)}\"\n cp = nltk.RegexpParser(grammar)\n result = cp.parse(sent)\n return result\n\n#Run the chunker against a Chunker with a specific grammar\ndef ChunkASection(sents,Chunker):\n chunkedlist = []\n for sent in sents:\n chunks = Chunker(sent)\n for chunk in chunks:\n if(type(chunk)==type(chunks)):\n temp =''\n for leaf in chunk.leaves():\n temp += leaf[0]+' '\n chunkedlist.append(temp)\n return chunkedlist\n\n#Getting sentences which contain a specific word\ndef GetSentencesWith(word):\n allSentences = []\n pattern = r\"([A-Z]\\.)+|\\w+([-']\\w+)*|\\$?\\d+(\\.\\d+)?%?|\\.\\.\\.|[][.,;\\\"'?():-_`]\"\n for sent in GetSentencesFromFeynman():\n if word in nltk.regexp_tokenize(sent, pattern):\n allSentences.append(sent)\n return allSentences",
"prompt_number": 107,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "tagged_sents = GetTaggedSentences(GetSentencesFromFeynman())",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "verbs = ChunkASection(tagged_sents,VerbChunker)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "verb_fd = nltk.FreqDist(verbs)\nprint verb_fd.items()[0:40]",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
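{
"metadata": {},
"cell_type": "markdown",
"source": "The next cell is a minimal sketch of the optional morphy() normalization mentioned in the instructions: each extracted verb is mapped to its WordNet lemma (falling back to the surface form when WordNet has no entry) before counting. It assumes the `verbs` list produced above; the helper name `NormalizeVerbs` is mine, not part of the assignment code."
},
{
"metadata": {},
"cell_type": "code",
"input": "#Hedged sketch: normalize verbs with WordNet's morphy() before counting.\n#Assumes the `verbs` list from ChunkASection above; falls back to the raw form.\nfrom nltk.corpus import wordnet as wn\n\ndef NormalizeVerbs(verblist):\n    lemmas = []\n    for v in verblist:\n        lemma = wn.morphy(v.strip().lower(), wn.VERB)\n        lemmas.append(lemma if lemma else v.strip().lower())\n    return lemmas\n\nlemma_fd = nltk.FreqDist(NormalizeVerbs(verbs))\nprint lemma_fd.items()[0:40]",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},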
{
"metadata": {},
"cell_type": "markdown",
"source": "**Pick 2 out interesting verbs:** Next manually pick out two verbs to look at in detail that look interesting to you. Try to pick some for which the objects will be interesting and will form a pattern of some kind. Find all the sentences in your corpus that contain these verbs."
},
{
"metadata": {},
"cell_type": "code",
"input": "#made and worked - Feynman was an awesome scientist after all\n\nallMadeSents = GetSentencesWith(\"made\")\nallWorkedSents = GetSentencesWith(\"worked\")",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "allMadeSents_tagged = GetTaggedSentences(allMadeSents)",
"prompt_number": 100,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "allWorkedSents_tagged = GetTaggedSentences(allWorkedSents)",
"prompt_number": 104,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Find common objects:** Now write a chunker to find the simple noun phrase objects of these four verbs and see if they tell you anything interesting about your collection. Don't worry about making the noun phrases perfect; you can use the chunker from the first part of this homework if you like. Print out the common noun phrases and take a look. Write the code below, show some of the output, and then reflect on that output in a few sentences. \n"
},
{
"metadata": {},
"cell_type": "code",
"input": "nps_made = ChunkASection(allMadeSents_tagged,SlightlyModifiedChuangChunker)",
"prompt_number": 101,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "nps_made_fd = nltk.FreqDist(nps_made)\nprint nps_made[0:60]",
"prompt_number": 103,
"outputs": [
{
"output_type": "stream",
"text": "['fuse ', 'fuse ', 'house ', 'own fuses ', 'tin foil ', 'old burnt-out fuse ', '\" ', 'little song ', 'music ', '\" Deedle leet deet ', 'doodle doodle loot doot ', 'deedle deedle leet ', 'doodle loot doot ', 'whole deal ', '\" ', '\" ', 'fifty seconds ', 'people ', 'timings ', 'problems ', 'trifle ', 'tau ', 'top ', 'tau ', 'cosine ', 'kind ', 'gamma ', 'little bit ', 'square root sign ', 'f ', 'x ', 'f times x ', 'dy dx-you ', 'tendency ', \"d's\\xe2 \", 'different sign ', 'amp ', 'sign ', 'day ', 'job ', 'pantry lady ', 'ham sandwich ', 'guy ', 'late shift ', 'Inspired ', 'Leonardo book ', 'gadget ', 'system ', 'strings ', 'weights\\xe2 Coke bottles ', 'water\\xe2 ', 'door ', 'pull-chain light ', 'way ', 'desk ', 'order ', 'switchboard ', 'distance ', 'call ', 'from-it ']\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "I found this output so interesting. It shows such a variety of the things he made in his initial days,also the beginning of the book, and at the same time gives a feel of the pulse of the book. From math to fuses and even sandwiches. "
},
{
"metadata": {},
"cell_type": "code",
"input": "nps_worked = ChunkASection(allWorkedSents_tagged,SlightlyModifiedChuangChunker)",
"prompt_number": 105,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "nps_worked_fd = nltk.FreqDist(nps_worked)\nprint nps_worked[0:60]",
"prompt_number": 106,
"outputs": [
{
"output_type": "stream",
"text": "['bed ', '\" ', 'cosine ', 'five degrees ', 'five degrees ', 'addition ', 'angle formulas ', 'Beans ', 'one summer ', 'hotel run ', 'aunt ', 'way ', 'world ', 'long hours ', 'day ', 'intellectual characters ', 'sorts ', 'ways ', 'sophomores ', 'tricks ', '\" ', 'case ', 'problem ', 'assistant ', 'Einstein ', 'bit ', 'answer ', 'real motion ', 'matter ', 'guys ', 'door ', 'pain ', 'difficulty ', 'tormentors ', 'one door ', 'white ball ', 'sneakily ', 'woman ', 'time ', 'cashier ', 'cafeteria ', 'white uniform ', 'sleep ', 'night ', 'lot ', 'formulas ', 'Riemann-Zeta function ', 'plastics ', 'new kinds ', 'plastics ', 'time ', 'methyl methacrylate ', 'plexiglass ', 'plate ', 'mathematics ', 'lack ', 'hard work ', 'time ', 'year ', 'tea ']\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "I wanted to check if worked and made would return similar results but the output is slightly different. While \"made\" gave the actual products he built or used, \"worked\" seems to have returned results related to the environment in which he worked.\nThis looks like a good way of summarizing things,if you can pick the right verbs."
},
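{
"metadata": {},
"cell_type": "markdown",
"source": "The ChunkASection approach above collects every simple noun phrase in a sentence containing the verb, not just the verb's object. As a rough, hedged sketch of a stricter variant, the cell below keeps only the NP chunk that immediately follows the verb token, reusing the same NP grammar; `ObjectsOfVerb` is an illustrative helper added here, not part of the original assignment code."
},
{
"metadata": {},
"cell_type": "code",
"input": "#Hedged sketch: keep only the NP chunk that directly follows the target verb,\n#as a rough approximation of the verb's object. Reuses the grammar from\n#SlightlyModifiedChuangChunker and the existing allMadeSents_tagged list.\ndef ObjectsOfVerb(tagged_sents, verb):\n    objects = []\n    grammar = \"NP: {<CD>*(((<JJ>|<N.*>)+(<N.*>|<CD>))|<N.*>)}\"\n    cp = nltk.RegexpParser(grammar)\n    for sent in tagged_sents:\n        tree = cp.parse(sent)\n        seen_verb = False\n        for node in tree:\n            if isinstance(node, nltk.Tree):\n                #an NP chunk; keep it only if the previous token was the verb\n                if seen_verb:\n                    objects.append(' '.join(leaf[0] for leaf in node.leaves()))\n                seen_verb = False\n            else:\n                #a plain (word, tag) leaf\n                seen_verb = (node[0].lower() == verb)\n    return objects\n\nprint ObjectsOfVerb(allMadeSents_tagged, \"made\")[0:30]",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},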
{
"metadata": {},
"cell_type": "markdown",
"source": "## 2. Identify Main Topics from WordNet Hypernms ##\nFirst read about the code supplied below; at the end you'll be asked to do an exercise."
},
{
"metadata": {},
"cell_type": "code",
"input": "from nltk.corpus import wordnet as wn\nfrom nltk.corpus import brown\nfrom nltk.corpus import stopwords",
"prompt_number": 108,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This code first pulls out the most frequent words from a section of the brown corpus after removing stop words. It lowercases everything, but should really be doing much smarter things with tokenization and phrases and so on. "
},
{
"metadata": {},
"cell_type": "code",
"input": "def preprocess_terms():\n # select a subcorpus of brown to experiment with\n words = [word.lower() for word in brown.words(categories=\"science_fiction\") if word.lower() not in stopwords.words('english')]\n # count up the words\n fd = nltk.FreqDist(words)\n # show some sample words\n print ' '.join(fd.keys()[100:150])\n return fd\nfd = preprocess_terms()",
"prompt_number": 109,
"outputs": [
{
"output_type": "stream",
"text": "angel around came captain couldn't day face help helva's kind longer look lost must nogol oh outside place saw something words another away called can't come da dead digby gapt give hands however isn't live looked macneff maybe pain part power problem siddo smiled space there's took water yes ago\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
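{
"metadata": {},
"cell_type": "markdown",
"source": "As a small, hedged illustration of the \"smarter things\" the note above alludes to, the variant below keeps only alphabetic tokens, so punctuation and numbers do not crowd the frequency distribution. `preprocess_terms_alpha` is an illustrative name added here; it is not part of the supplied code."
},
{
"metadata": {},
"cell_type": "code",
"input": "#Hedged sketch: one small improvement over the raw lowercased token stream -\n#keep only alphabetic tokens before counting.\ndef preprocess_terms_alpha():\n    stop = set(stopwords.words('english'))\n    words = [w.lower() for w in brown.words(categories=\"science_fiction\")\n             if w.isalpha() and w.lower() not in stop]\n    return nltk.FreqDist(words)\n\n#fd = preprocess_terms_alpha()",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},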
{
"metadata": {},
"cell_type": "markdown",
"source": "Then makes a *very naive* guess at which are the most important words. This is where some term weighting should take place."
},
{
"metadata": {},
"cell_type": "code",
"input": "def find_important_terms(fd):\n important_words = fd.keys()[100:500]\n \n \n return important_words\n\nimportant_terms = find_important_terms(fd)",
"prompt_number": 231,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
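{
"metadata": {},
"cell_type": "markdown",
"source": "As a hedged illustration of the term weighting hinted at above, one option is a rough tf-idf: weight each word's corpus frequency against the number of Brown categories it appears in. The helper below is a sketch under that assumption (the name `tfidf_important_terms` and the cutoffs are illustrative), not part of the supplied code."
},
{
"metadata": {},
"cell_type": "code",
"input": "#Hedged sketch: crude tf-idf over Brown categories as an alternative to simply\n#slicing fd.keys(). Document frequency = number of Brown categories containing the word.\nimport math\n\ndef tfidf_important_terms(fd, top_n=400):\n    categories = brown.categories()\n    category_words = [set(w.lower() for w in brown.words(categories=c)) for c in categories]\n    scored = []\n    for word in fd.keys()[:2000]:   #limit to the most frequent words for speed\n        df = sum(1 for ws in category_words if word in ws)\n        idf = math.log(float(len(categories)) / (1 + df))\n        scored.append((fd[word] * idf, word))\n    scored.sort(reverse=True)\n    return [word for (score, word) in scored[:top_n]]\n\n#important_terms = tfidf_important_terms(fd)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},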
{
"metadata": {},
"cell_type": "markdown",
"source": "The code below is a very crude way to see what the most common \"topics\" are among the \"important\" words, according to WordNet. It does this by looking at the immediate hypernym of every sense of a wordform for those wordforms that are found to be nouns in WordNet. This is problematic because many of these senses will be incorrect and also often the hypernym elides the specific meaning of the word, but if you compare, say *romance* to *science fiction* in brown, you do see differences in the results. "
},
{
"metadata": {},
"cell_type": "code",
"input": "important_words_tagged = nltk.pos_tag(important_terms)",
"prompt_number": 232,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "important_nouns = [word for (word,tag) in important_words_tagged if tag.startswith('N')]",
"prompt_number": 233,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "important_verbs = [word for (word,tag) in important_words_tagged if tag.startswith('V')]",
"prompt_number": 234,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "important_others = [word for (word,tag) in important_words_tagged if word not in important_nouns and word not in important_verbs]",
"prompt_number": 235,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "#I search for words whose hypernyms that have a high similarity score and remove those words. Therefore we are left with \n#words which have more or less different hypernyms and hence covering broadly all the aspects with a esser word count.\n\n#I wanted to follow that procedure for nouns and use entailments instead of for verbs but wordnet didn't seem to \n#have entailments for any of the verbs\n\ndef RemoveTermsWithSimilarLineage(word,wordlist,hyp):\n for w in wordlist:\n if(len(wn.synsets(w))>0):\n s = wn.synsets(w)[0]\n if len(s.hypernyms()) > 0:\n if(hyp.path_similarity(s.hypernyms()[0]) >= 0.2):\n wordlist.remove(w)\n \ndef SummarizeCategory(ImportantWords):\n #ImportantWords = category\n #print ImportantWords\n #print \"___________________________________________________________\"\n hypterms = []\n for word in ImportantWords:\n if(len(wn.synsets(word,'n'))==0):\n continue\n sn = wn.synsets(word,'n')[0]\n if(len(sn.hypernyms())==0):\n continue\n hyp = sn.hypernyms()[0]\n hypterms.append(hyp)\n RemoveTermsWithSimilarLineage(syn,ImportantWords,hyp)\n\n #print ImportantWords\n #print \"___________________________________________________________\"\n return hypterms\n \n \n#SummarizeCategory(important_nouns)\n#SummarizeVerbs(important_verbs)",
"prompt_number": 237,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "\n# def RemoveTermsWithSimilarEntailments(word,wordlist,ent):\n# for w in wordlist:\n# if(len(wn.synsets(w))>0):\n# s = wn.synsets(w)[0]\n# if len(s.entailments()) > 0:\n# if(ent.path_similarity(s.entailments()[0]) >= 0.2):\n# wordlist.remove(w)\n\n\n# def SummarizeVerbs(verblist):\n# ImportantWords = verblist[0:30]\n# print ImportantWords\n# print \"___________________________________________________________\"\n# entailments = []\n# for word in ImportantWords:\n# if(len(wn.synsets(word,'n'))==0):\n# continue\n# sn = wn.synsets(word,'n')[0]\n# if(len(sn.entailments())==0):\n# continue\n# print sn.entailments()\n# ent = sn.entailments()[0]\n# entailments.append(ent)\n# RemoveTermsWithSimilarEntailments(syn,ImportantWords,ent)\n \n# print ImportantWords\n# print \"___________________________________________________________\"\n# print entailments\n",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "# Count the direct hypernyms for every sense of each wordform.\n# This is very crude. It should convert the wordform to a lemma, and should\n# be smarter about selecting important words and finding two-word phrases, etc.\n\n# Nonetheless, you get intersting differences between, say, scifi and romance.\ndef categories_from_hypernyms(termlist):\n# hypterms = [] \n# for term in termlist: # for each term\n# s = wn.synsets(term.lower(), 'n') # get its nominal synsets\n# for syn in s: # for each synset\n# for hyp in syn.hypernyms(): # It has a list of hypernyms\n# hypterms = hypterms + [hyp.name] # Extract the hypernym name and add to list\n\n# hypfd = nltk.FreqDist(hypterms)\n hypfd = nltk.FreqDist(SummarizeCategory(termlist))\n print \"Show most frequent hypernym results\"\n print termlist\n print SummarizeCategory(termlist)\n print hypfd.items()[:25]\n return [(count, name, wn.synset(name).definition) for (name, count) in hypfd.items()[:25]] \n \ncategories_from_hypernyms(important_nouns)",
"prompt_number": 238,
"outputs": [
{
"output_type": "stream",
"text": "Show most frequent hypernym results\n['came', 'oh', 'something', 'away', 'digby', 'gapt', 'macneff', 'problem', 'siddo', \"there's\", 'yes', 'anything', 'ask', 'believe', 'gone', 'matter', 'skiff', 'sure', 'america', \"didn't\", 'dromozoa', 'foster', 'god', 'grok', 'grow', 'happen', 'human', \"i'll\", \"it's\", 'mary', 'nice', 'okay', 'others', 'stood', 'strange', 'warm', 'beside', 'camp', 'fast', 'find', 'fine', 'grokking', 'heard', 'india', 'kept', 'martian', 'ones', 'ozagen', 'proper', 'remember', 'seem', 'shayol', 'sigmen', 'speak', 'supreme', 'towards', 'yarrow', 'alone', 'anne', 'asia', 'asleep', 'cavity', 'complete', 'develop', 'eat', 'forever', 'gave', 'grown', 'homesick', \"man's\", 'martians', 'number', 'pressure', 'radiation', 'resting']\n[Synset('difficulty.n.03'), Synset('affirmative.n.01'), Synset('concern.n.01'), Synset('hominid.n.01'), Synset('sanction.n.01'), Synset('military_quarters.n.01'), Synset('insight.n.03'), Synset('digit.n.01'), Synset('achillea.n.01'), Synset('hole.n.05'), Synset('amount.n.02'), Synset('energy.n.01')]",
"stream": "stdout"
},
{
"output_type": "stream",
"text": "\n[(Synset('middle_english.n.01'), 1), (Synset('command.n.01'), 1), (Synset('hour.n.02'), 1), (Synset('transportation.n.02'), 1), (Synset('life.n.07'), 1), (Synset('point.n.02'), 1), (Synset('chromatic_color.n.01'), 1), (Synset('union.n.09'), 1), (Synset('commissioned_military_officer.n.01'), 1), (Synset('instrumentality.n.03'), 1), (Synset('external_body_part.n.01'), 1), (Synset('possession.n.02'), 1), (Synset('spiritual_being.n.01'), 1), (Synset('guardianship.n.02'), 1), (Synset('force.n.04'), 1), (Synset('emotional_state.n.01'), 1), (Synset('person.n.01'), 1), (Synset('difficulty.n.03'), 1), (Synset('awareness.n.01'), 1), (Synset('liquid_body_substance.n.01'), 1), (Synset('teaching.n.01'), 1), (Synset('male_sibling.n.01'), 1), (Synset('tense.n.01'), 1), (Synset('formation.n.01'), 1), (Synset('speech.n.02'), 1)]\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Here is the question** Modify this code in some way to do a better job of using WordNet to summarize terms. You can trim senses in a better way, or traverse hypernyms differently. You don't have to use hypernyms; you can use any WordNet relations you like, or chose your terms in another way. You can also use other parts of speech if you like. "
},
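{
"metadata": {},
"cell_type": "markdown",
"source": "One possible direction, sketched here purely as an illustration (this is not the submitted answer, which is the SummarizeCategory approach above): take only the most frequent noun sense of each term and count hypernyms one and two levels up, so very specific parents collapse into broader topics."
},
{
"metadata": {},
"cell_type": "code",
"input": "#Illustrative sketch only: count hypernyms one and two levels above the most\n#frequent noun sense of each important word.\ndef two_level_hypernym_topics(termlist, top_n=25):\n    hypterms = []\n    for term in termlist:\n        synsets = wn.synsets(term.lower(), 'n')\n        if not synsets:\n            continue\n        sense = synsets[0]   #most frequent sense heuristic\n        for hyp in sense.hypernyms():\n            hypterms.append(hyp.name)\n            for hyp2 in hyp.hypernyms():\n                hypterms.append(hyp2.name)\n    hypfd = nltk.FreqDist(hypterms)\n    return hypfd.items()[:top_n]\n\n#two_level_hypernym_topics(important_nouns)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},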
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:710f3fcf200fbb9207afb315093cef682a51d1179233b8300d268f6092663a9c"
},
"nbformat": 3
}