{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## 1. Syntactic Patterns for Technical Terms ##"
},
{
"metadata": {},
"cell_type": "code",
"input": "import nltk\nfrom nltk.corpus import brown",
"prompt_number": 42,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As seen in the Chuang et al. paper and in the Manning and Schuetze chapter,\nthere is a well-known part-of-speech based pattern defined by Justeson and Katz\nfor identifying simple noun phrases that often words well for pulling out keyphrases.\n\nChuang et al use this pattern: Technical Term T = (A | N)+ (N | C) | N\n\nBelow, please write a function to define a chunker using the RegexpParser as illustrated in the section *Chunking with Regular Expressions*. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by *N* here. Also, C refers to cardinal number, which is CD in the brown corpus.\n\n"
},
{
"metadata": {},
"cell_type": "code",
"input": "def technicalchunker(sentence):\n grammar = r'T:{<JJ|N.*>+<N.*|CD>|<N.*>}'\n cp = nltk.RegexpParser(grammar)\n return cp.parse(sentence)\n\n# Testing\ntree = technicalchunker([(\"Rapunzel\", \"NNP\"), (\"let\", \"VBD\"), (\"down\", \"RP\"),\n (\"her\", \"PP$\"), (\"long\", \"JJ\"), (\"golden\", \"JJ\"), (\"hair\", \"NN\")])\nfor subtree in tree.subtrees():\n if subtree.node == 'T': \n leaf = [leaf[0] for leaf in subtree.leaves()]\n print ' '.join(leaf)",
"prompt_number": 43,
"outputs": [
{
"output_type": "stream",
"text": "Rapunzel\nlong golden hair\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Below, please write a function to call the chunker, run it on some sentences, and then print out the results for those sentences.\n\nFor uniformity, please run it on sentences 100 through 104 from the tagged brown corpus news category.\n\nThen extract out the phrases themselves using the subtree extraction technique shown in the \n*Exploring Text Corpora* category. (Note: Section 7.4 shows how to get to the actual words in the phrase by using the tree.leaves() command.)"
},
{
"metadata": {},
"cell_type": "code",
"input": "def call_chunker(chunker):\n brown = nltk.corpus.brown\n for sent in brown.tagged_sents(categories='news')[100:105]:\n tree = chunker(sent)\n #print tree\n #print('Extractions:')\n for subtree in tree.subtrees():\n if subtree.node == 'T': \n leaf = [leaf[0] for leaf in subtree.leaves()]\n print '\\t'+' '.join(leaf)\n #print\ncall_chunker(technicalchunker)",
"prompt_number": 44,
"outputs": [
{
"output_type": "stream",
"text": "\tDaniel\n\tfight\n\tmeasure\n\trejection\n\tprevious Legislatures\n\tpublic hearing\n\tHouse Committee\n\tRevenue\n\tTaxation\n\tcommittee rules\n\tsubcommittee\n\tweek\n\tquestions\n\tcommittee members\n\tbankers\n\twitnesses\n\tdoubt\n\tpassage\n\tDaniel\n\testimate\n\tdollars\n\tdeficit\n\tdollars\n\tend\n\tcurrent fiscal year\n\tAug. 31\n\tcommittee\n\tmeasure\n\tmeans\n\tescheat law\n\tbooks\n\tTexas\n\trepublic\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## 2. Identify Proper Nouns ##\nFor this next task, write a new version of the chunker, but this time change it in two ways:\n 1. Make it recognize proper nouns\n 2. Make it work on your personal text collection which means that you need to run a tagger over your personal text collection.\n\nNote that the second requirements means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank)\n\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Tagger:** Your code for optionally training tagger, and for definitely running tagger on your personal collection goes here:"
},
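{
"metadata": {},
"cell_type": "markdown",
"source": "An optional sketch of the 'train your own tagger' route mentioned above, assuming the brown news category as training data. The cells below use the pre-trained nltk.pos_tag instead."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Optional sketch: train a unigram/bigram backoff tagger on the brown news category.\n# This is just the 'train your own' alternative; the rest of this notebook uses the pre-trained nltk.pos_tag.\nbrown_tagged = nltk.corpus.brown.tagged_sents(categories='news')\nsize = int(len(brown_tagged) * 0.9)\ntrain_sents, test_sents = brown_tagged[:size], brown_tagged[size:]\nt0 = nltk.DefaultTagger('NN')\nt1 = nltk.UnigramTagger(train_sents, backoff=t0)\nt2 = nltk.BigramTagger(train_sents, backoff=t1)\nprint t2.evaluate(test_sents)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},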
{
"metadata": {},
"cell_type": "code",
"input": "# Preparing and tokenizing personal collection\nimport re\nfin = open('911all.txt', 'r')\ntext = fin.read()",
"prompt_number": 45,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "def nineeleventokenize(sent):\n \"\"\"Tokenize calibrated for 9-11 commission report\"\"\"\n # Tokenize\n pattern = r'''(?x) # set flag to allow verbose regexps\n (http://[\\w\\.,-_/]+) # Full URL\n |(www.[\\w\\.,-_/]+) # partial URL\n |(^/[\\w\\.,-_]+) # broken URL\n |([A-Z]\\.)+ # abbreviations, e.g. U.S.A.\n | \\w+([-']+\\w+)* # words with optional internal hyphens\n |\\$?\\d+(\\.\\d+)?%? # currency and percentages, e.g. $12.40, 82%\n | \\.\\.\\. # ellipsis\n | [][.,;\"'?():-_`] # these are separate tokens\n '''\n words = nltk.regexp_tokenize(sent, pattern)\n\n # Remove URLs\n url_pattern = r'''(?x) # set flag to allow verbose regexps\n (http://[\\w\\.,-_/]+) # Full URL\n |(www.[\\w\\.,-_/]+) # partial URL\n |(^/[\\w\\.,-_]+) # broken URL\n '''\n urls = nltk.regexp_tokenize(sent, url_pattern)\n words = [w for w in words if w not in urls] # -62\n\n # Split CamelCase\n for list_i, w in enumerate(words):\n for i in range(len(w)-1):\n if w[0].islower() and w[i+1].isupper():\n if w[i] == '-': \n # This is to ignore the traditional ('mid-January') and nontraditional \n pass \n elif '-' in w[:i]:\n # This is to ignore things such as traditional 'non-CIA' and 'activity-KSM' \n pass\n elif [x for x in w[:i+1] if x.isupper()]:\n # This is to deal with 'headCOUNTERTERRORISM' and 'theTwinTowers' appearing in the list once per capital letter.\n pass\n else:\n words.insert(list_i+1, w[:i+1])\n words.insert(list_i+2, w[i+1:])\n words.pop(list_i)\n return words",
"prompt_number": 46,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
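{
"metadata": {},
"cell_type": "markdown",
"source": "A quick sanity check of the tokenizer on a made-up sentence (not from the collection), to exercise the URL removal and CamelCase splitting rules above:"
},
{
"metadata": {},
"cell_type": "code",
"input": "# Made-up example (hypothetical, not from the report): a URL, a CamelCase run-on,\n# a hyphenated word, and an abbreviation, to see how the rules above handle them.\nexample = 'Officials discussed theTwinTowers attack and non-CIA sources, see http://www.example.com/report for U.S. details.'\nprint nineeleventokenize(example)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},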
{
"metadata": {},
"cell_type": "code",
"input": "text = text[:1278939] # Avoiding ill-formatted endnotes (destroys pickle)\nsent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')\nsents = sent_tokenizer.tokenize(text)\nsents = [sent.replace('\\n', ' ') for sent in sents] # to deal with hard wordwrapping in collection\nsentsPOS = []\nfor sent in sents:\n sent = nineeleventokenize(sent)\n tagged = nltk.pos_tag(sent)\n sentsPOS.append(tagged)",
"prompt_number": 47,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "# I need to correct this in the POS tagger:\n# Quotation marks were showing as results for the propernoun chunking\n# Problem Discovery\nquotemarkpn = []\nfor sent in sentsPOS:\n for tpl in sent:\n if tpl[0] == '\"':\n quotemarkpn.append(tpl)\nprint len(quotemarkpn)\nprint len(set(quotemarkpn))\nqfd = nltk.FreqDist(quotemarkpn)\nfor item in qfd.items(): print item\n \n# Temporary Fix\nfor sent in sentsPOS:\n for i, tpl in enumerate(sent):\n if tpl[0] == '\"':\n newtpl = ('\"', ':')\n sent[i] = newtpl",
"prompt_number": 48,
"outputs": [
{
"output_type": "stream",
"text": "2311\n10\n(('\"', 'NN'), 811)\n(('\"', ':'), 699)\n(('\"', 'NNP'), 470)\n(('\"', 'CD'), 165)\n(('\"', '-NONE-'), 61)\n(('\"', 'VB'), 56)\n(('\"', 'VBP'), 23)\n(('\"', '``'), 17)\n(('\"', '.'), 6)\n(('\"', 'JJ'), 3)\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": true
},
{
"metadata": {},
"cell_type": "code",
"input": "# A reverse of the above case, in this instance I need to change the POS tagger to make 'al' signify a proper noun\n# Problem Discovery\nalpn = []\nfor sent in sentsPOS:\n for tpl in sent:\n if tpl[0] == 'al':\n alpn.append(tpl)\nprint len(alpn)\nprint len(set(alpn))\nalfd = nltk.FreqDist(alpn)\nfor item in alfd.items(): print item\n \n# Temporary Fix\nfor sent in sentsPOS:\n for i, tpl in enumerate(sent):\n if tpl[0] == 'al':\n newtpl = ('al', 'NNP')\n sent[i] = newtpl",
"prompt_number": 55,
"outputs": [
{
"output_type": "stream",
"text": "844\n2\n(('al', 'JJ'), 587)\n(('al', 'NN'), 257)\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Chunker:** Code for the proper noun chunker goes here:"
},
{
"metadata": {},
"cell_type": "code",
"input": "def propernounchunker(sentence):\n grammar = r'''\n ProperNoun:\n {<NNP.*>+<CD>?} # Proper noun(s) followed by cardinal numbers (for \"American 77\", etc.)\n '''\n cp = nltk.RegexpParser(grammar)\n return cp.parse(sentence)",
"prompt_number": 57,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Test the Chunker:** Test your proper noun recognizer on a lot of sentences to see how well it is working. You might want to add prepositions in order to improve your results. \n"
},
{
"metadata": {},
"cell_type": "code",
"input": "def testchunker(sentence):\n tree = propernounchunker(sentence)\n propernouns = []\n for subtree in tree.subtrees():\n if subtree.node == 'ProperNoun': \n leaf = [leaf[0] for leaf in subtree.leaves()]\n propernouns.append(' '.join(leaf))\n return propernouns\nfor sent in sentsPOS[5200:5205]:\n cleansentw = [w[0] for w in sent]\n cleansent_spaces = ' '.join(cleansentw)\n cleansent = cleansent_spaces.replace(' .', '.').replace(' ,', ',').replace(' :', ':')\n print cleansent\n print sent\n print testchunker(sent)\n print",
"prompt_number": 58,
"outputs": [
{
"output_type": "stream",
"text": "The President and the DCI both told us that these daily sessions provided a useful opportunity for exchanges on intelligence issues.\n[('The', 'DT'), ('President', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('DCI', 'NNP'), ('both', 'DT'), ('told', 'NN'), ('us', 'PRP'), ('that', 'IN'), ('these', 'DT'), ('daily', 'JJ'), ('sessions', 'NNS'), ('provided', 'VBN'), ('a', 'DT'), ('useful', 'JJ'), ('opportunity', 'NN'), ('for', 'IN'), ('exchanges', 'NNS'), ('on', 'IN'), ('intelligence', 'NN'), ('issues', 'NNS'), ('.', '.')]\n['President', 'DCI']\n\nThe President talked with Rice every day, and she in turn talked by phone at least daily with Powell and Rumsfeld.\n[('The', 'DT'), ('President', 'NNP'), ('talked', 'VBD'), ('with', 'IN'), ('Rice', 'NNP'), ('every', 'DT'), ('day', 'NN'), (',', ','), ('and', 'CC'), ('she', 'PRP'), ('in', 'IN'), ('turn', 'NN'), ('talked', 'VBD'), ('by', 'IN'), ('phone', 'NN'), ('at', 'IN'), ('least', 'JJS'), ('daily', 'JJ'), ('with', 'IN'), ('Powell', 'NNP'), ('and', 'CC'), ('Rumsfeld', 'NNP'), ('.', '.')]\n['President', 'Rice', 'Powell', 'Rumsfeld']\n\nAs a result, the President often felt less need for formal meetings.\n[('As', 'IN'), ('a', 'DT'), ('result', 'NN'), (',', ','), ('the', 'DT'), ('President', 'NNP'), ('often', 'RB'), ('felt', 'VBD'), ('less', 'JJR'), ('need', 'NN'), ('for', 'IN'), ('formal', 'JJ'), ('meetings', 'NNS'), ('.', '.')]\n['President']\n\nIf, however, he decided that an event or an issue called for action, Rice would typically call on Hadley to have the Deputies Committee develop and review options.\n[('If', 'IN'), (',', ','), ('however', 'RB'), (',', ','), ('he', 'PRP'), ('decided', 'VBD'), ('that', 'IN'), ('an', 'DT'), ('event', 'NN'), ('or', 'CC'), ('an', 'DT'), ('issue', 'NN'), ('called', 'VBN'), ('for', 'IN'), ('action', 'NN'), (',', ','), ('Rice', 'NNP'), ('would', 'MD'), ('typically', 'RB'), ('call', 'VB'), ('on', 'IN'), ('Hadley', 'NNP'), ('to', 'TO'), ('have', 'VB'), ('the', 'DT'), ('Deputies', 'NNPS'), ('Committee', 'NNP'), ('develop', 'VB'), ('and', 'CC'), ('review', 'VB'), ('options', 'NNS'), ('.', '.')]\n['Rice', 'Hadley', 'Deputies Committee']\n\nThe President said that this process often tried his patience but that he understood the necessity for coordination.\n[('The', 'DT'), ('President', 'NNP'), ('said', 'VBD'), ('that', 'IN'), ('this', 'DT'), ('process', 'NN'), ('often', 'RB'), ('tried', 'VBD'), ('his', 'PRP$'), ('patience', 'NN'), ('but', 'CC'), ('that', 'IN'), ('he', 'PRP'), ('understood', 'VBD'), ('the', 'DT'), ('necessity', 'NN'), ('for', 'IN'), ('coordination', 'NN'), ('.', '.')]\n['President']\n\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
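{
"metadata": {},
"cell_type": "markdown",
"source": "Following the suggestion above about prepositions, here is a sketch (not the chunker used elsewhere in this notebook) of a variant that also allows a preposition between proper-noun groups, so names like 'Department of Defense' come out as a single phrase. The test sentence and its tags are hand-made for illustration."
},
{
"metadata": {},
"cell_type": "code",
"input": "def propernounchunker_prep(sentence):\n    # Sketch of a preposition-aware variant: proper noun(s), optionally continued by a\n    # preposition plus more proper nouns, with an optional trailing cardinal number.\n    grammar = r'''\n    ProperNoun:\n        {<NNP.*>+(<IN><NNP.*>+)*<CD>?}\n    '''\n    cp = nltk.RegexpParser(grammar)\n    return cp.parse(sentence)\n\n# Hand-tagged example (tags are illustrative, not from the collection)\ntree = propernounchunker_prep([('The', 'DT'), ('Department', 'NNP'), ('of', 'IN'),\n                               ('Defense', 'NNP'), ('briefed', 'VBD'), ('President', 'NNP'), ('Bush', 'NNP'), ('.', '.')])\nfor subtree in tree.subtrees():\n    if subtree.node == 'ProperNoun':\n        print ' '.join(leaf[0] for leaf in subtree.leaves())",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},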
{
"metadata": {},
"cell_type": "markdown",
"source": "**FreqDist Results:** After you have your proper noun recognizer working to your satisfaction, below run it over your entire collection, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency. That code goes here, along with the output:\n"
},
{
"metadata": {},
"cell_type": "code",
"input": "propernouns = []\nfor sent in sentsPOS:\n pn = testchunker(sent)\n if pn:\n for w in pn:\n propernouns.append(w)\n\nfd_propernouns = nltk.FreqDist(propernouns)\n\n# Total words in collection:\nwords = []\nfor sent in sents:\n words = words + nineeleventokenize(sent)\n\nfor x in ['%s (%0.2f%%)' % (word, cnt / float(len(words)) * 100) for word, cnt in fd_propernouns.items()][:20]:\n print x",
"prompt_number": 59,
"outputs": [
{
"output_type": "stream",
"text": "Bin Ladin (0.37%)\nUnited States (0.23%)\nal Qaeda (0.22%)\nU.S. (0.19%)\nCIA (0.19%)\nAfghanistan (0.15%)\nFBI (0.13%)\nTaliban (0.10%)\nKSM (0.09%)\nClarke (0.09%)\nAtta (0.08%)\nPakistan (0.07%)\nFAA (0.07%)\nHazmi (0.07%)\nPresident (0.07%)\nMihdhar (0.06%)\nBinalshibh (0.06%)\nBerger (0.06%)\nBin Ladin's (0.06%)\nFDNY (0.05%)\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### For Wednesday ###\nJust FYI, in Wednesday's October 8's assignment, you'll be asked to extend this code a bit more to discover interesting patterns using objects or subjects of verbs, and do a bit of Wordnet grouping. This will be posted soon. Note that these exercises are intended to provide you with functions to use directly in your larger assignment. "
},
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:bca1ec14114b577f82a41faa8769a50688e099162f95ecd87e4de38fc1cd21ea"
},
"nbformat": 3
}