cmgerber/Keyphrase_Identification_Assignment.ipynb

## Keyphrase_Identification_Assignment.ipynb
{
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# -*- coding: utf-8 -*-",
     "prompt_number": 1,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import pandas as pd\nimport nltk\nfrom nltk.corpus import brown\nimport codecs\nimport unicodedata\nimport re\nfrom copy import deepcopy\nfrom pyUtil import easyPickle as pickle\nfrom pyUtil import flattenList as flatten",
     "prompt_number": 3,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "pyUtil contains modules that I wrote\n\nunicodedata is a library for normalizing unicode in ascii"
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Overview",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "I created three algorithms for creating gists of documents of collections of documents.\n\n**Algorithm 1** is a basic algorithm that uses a series of ngrams (from 5 to 1). It then pulls out up to 25 ngram from each ngram type (i.e. 5gram or 4 gram) if the frequency of that ngram is between a changing lower bound (based on the type of ngram) and and upper bound of 50. It also makes sure that the smaller ngram does not exist in the ngram above it (a 4gram is not contained in a 5gram for example).\n\n**Algorithm 2** is specialized for my clinical criteria corpus. Since the clinical criteria corpus has ~2000 separate documents this was taken into account when creating the algorithm. This algorithm finds useful verbs (handpicked for my clinical corpus and based on frequency for other corpus's). Once it has the verbs it pulls out all the sentences that contain at least one of the verbs. It then chunk the sentences pulling out the verbs and associated adjectives and nouns. For my corpus it also pulls out sentences that have age related data in them as that is an important aspect to the criteria. If there is only a single document it combines the chunks and uses them as a gist. If there are more than one documents input then it combines the gists for all the individual documents and then uses algorithm 1 to create an overall gist for all the documents.\n\n**Algorithm 3** is specialized for the brown news corpus. This algorithm uses WordNet and POS to create a gist. It first pulls out important terms based on the words being in the middle range of frequency in the document. It then matches those terms with a tagged version of the corpus to add POS tags to the important terms. It then finds the hypernyms of all the important terms (based on term and POS) and creates a dictionary that has the hypernyms for keys and the associated terms as values. The gist is created out of the top 50 most frequent hypernyms and 5 associated terms for each hypernym.\n\n**Best algorithms for corpus:**\n\nMystery: Algorithm 3\n\nBrown News: Algorithm 3\n\nClinical Criteria: Algorithm 2"
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Data Imports",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "import text corpuses"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "brown_tech = brown.words(categories='learned')\ncriteria_text = codecs.open('../Data/ct_criteria_colin.txt',\n                            encoding=\"utf-8\")\nmystery_text = codecs.open('../Data/mystery.txt',\n                            encoding=\"utf-8\")",
     "prompt_number": 530,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "convert criteria_text to tokens, sentences and documents"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text = criteria_text.readlines()",
     "prompt_number": 244,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "normalize all unicode characters to ascii"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def norm_unicode(text):\n    '''this function takes in a list of strings, and \n    normalizes each word in each string from unicode\n    characters to equivalent (or closest) ascii \n    characters'''\n    text_ascii = []\n    for doc in text:\n        re_combine = []\n        for word in doc.split():\n            word = unicodedata.normalize('NFKD', word).encode('ascii','ignore')\n            re_combine.append(word)\n        text_ascii.append(' '.join(re_combine))\n    return text_ascii",
     "prompt_number": 245,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_ascii = norm_unicode(criteria_text)",
     "prompt_number": 246,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get criteria_text sentences"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_ascii_sent = [re.split(' - ', line) for line in criteria_text_ascii]",
     "prompt_number": 247,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')",
     "prompt_number": 248,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Run the punkt sentence tokenizer over the aproximated sentences"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def sent_token(text):\n    sentence_groups = []\n    for sent_group in text:\n        group_holder = []\n        for sent in sent_group:\n            group_holder.append(sent_tokenizer.tokenize(sent))\n        sentence_groups.append(group_holder)\n        del group_holder\n    return sentence_groups",
     "prompt_number": 249,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_ascii_sent = sent_token(criteria_text_ascii_sent)",
     "prompt_number": 250,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Flatten the documents to contain just a list of strings where each string is a sentence"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def flatten_docs(text):\n    result = []\n    for doc in text:\n        result.append(flatten.flatten(doc))\n    return result",
     "prompt_number": 251,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_docs = flatten_docs(criteria_text_ascii_sent)",
     "prompt_number": 252,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Create list of sentences"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_sents = flatten.flatten(criteria_text_docs)",
     "prompt_number": 253,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Tokens for sents"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Tokenize the two lists"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#patter for tokenizing\npattern = r'''(?x)    # set flag to allow verbose regexps\n        ([A-Z]\\.)+        # abbreviations, e.g. U.S.A\n        | \\w+([-‘]\\w+)*        # words with optional internal hyphens\n        | \\$?\\d+(\\.\\d+)?%?  # currency and percentages, e.g. $12.40, 82%\n        | \\.\\.\\.            # ellipsis...   \n        | [][.,;\"'?():\\-_`]+  # these are separate tokens\n        '''",
     "prompt_number": 30,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_sents = [nltk.regexp_tokenize(sent, pattern) for sent\n                         in criteria_text_sents]",
     "prompt_number": 559,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "For docs"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def doc_token(text):\n    result = []\n    for doc in text:\n        doc_text = []\n        for sent in doc:\n            doc_text.append(nltk.regexp_tokenize(sent, pattern))\n        result.append(doc_text)\n    return result",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_docs_token = doc_token(criteria_text_docs)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get all tokens"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_tokens = [nltk.regexp_tokenize(sent, pattern) for sent\n                         in criteria_text_sents]",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Comparing taggers",
     "level": 4
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "First I am going to try using a backoff tagger with the brown techical corpus for the training corpus"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Pattern for regex fallback tagger"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "word_patterns = [\n    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),\n    (r'.*ould$', 'MD'),\n    (r'.*ing$', 'VBG'),\n    (r'.*ed$', 'VBD'),\n    (r'.*ness$', 'NN'),\n    (r'.*ment$', 'NN'),\n    (r'.*ful$', 'JJ'),\n    (r'.*ious$', 'JJ'),\n    (r'.*ble$', 'JJ'),\n    (r'.*ic$', 'JJ'),\n    (r'.*ive$', 'JJ'),\n    (r'.*ic$', 'JJ'),\n    (r'.*est$', 'JJ'),\n    (r'^a$', 'PREP'),\n    (r'.*', 'NN')\n]",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Backoff tagger function"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def backoff_tagger(tagged_sents, tagger_classes, backoff=None):\n    if not backoff:\n        backoff = tagger_classes[0](tagged_sents)\n        del tagger_classes[0]\n\n    for cls in tagger_classes:\n        tagger = cls(tagged_sents, backoff=backoff)\n        backoff = tagger\n\n    return backoff",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "raubt will be the basic tagger\n\nIt will be trained on a subset of the technical category of the brown corpus"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "brown_tech_tagged = brown.tagged_sents(categories='learned')",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "get the first 90% of the corpus"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "brown_tech_train = brown_tech_tagged[:int(len(brown_tech_tagged)*.9)]",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "raubt_tagger = backoff_tagger(brown_tech_tagged, [nltk.tag.AffixTagger,\n    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],\n    backoff=nltk.tag.RegexpTagger(word_patterns))",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Tag corpus with backoff tagger"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def doc_tagger(text):\n    result = []\n    for doc in text:\n        doc_text = []\n        for sent in doc:\n            doc_text.append(raubt_tagger.tag(sent))\n        result.append(doc_text)\n    return result",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_docs_tagged = doc_tagger(criteria_text_docs_token)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Test for null tag values"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "for doc in criteria_text_docs_tagged:\n    for sent in doc:\n        for word in sent:\n            if word[1] is None:\n                print word",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Try pos_tag"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def doc_tagger_pos(text):\n    result = []\n    for doc in text:\n        doc_text = []\n        for sent in doc:\n            doc_text.append(nltk.pos_tag(sent))\n        result.append(doc_text)\n    return result",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_docs_tagged_pos = doc_tagger_pos(criteria_text_docs_token)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_docs_tagged_pos[:10]",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Save tagged corpuses"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "pickle.save_object(criteria_text_docs_tagged_pos,\n                   'criteria_corpus_pos_tagged.pkl')",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "pickle.save_object(criteria_text_docs_tagged,\n                   'criteria_corpus_backoff_tagged.pkl')",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "df = pd.DataFrame(criteria_text_docs_tagged_pos)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Read in tagged corpus"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "criteria_text_docs_tagged_pos =pickle.open_object('criteria_corpus_pos_tagged.pkl')\ncriteria_text_docs_tagged =pickle.open_object('criteria_corpus_backoff_tagged.pkl')",
     "prompt_number": 5,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Tag brown text",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "brown_news = brown.sents(categories='news')",
     "prompt_number": 13,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "brown_news_tag = []\nfor sent in brown_news:\n    brown_news_tag.append(nltk.pos_tag(sent))",
     "prompt_number": 15,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "pickle.save_object(brown_news_tag, 'brown_new_tag_pos.pkl')",
     "prompt_number": 17,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Prepare Mystery Text",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "mystery_text = mystery_text.readlines()",
     "prompt_number": 531,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Remove newlines"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "mystery_text_combined = []\nfor sent in mystery_text:\n    mystery_text_combined.append(sent.strip())\nmystery_text_combined = ' '.join(mystery_text_combined)",
     "prompt_number": 535,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Convert text into sentences and tokens"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "mystery_text_sents = sent_tokenizer.tokenize(mystery_text_combined)",
     "prompt_number": 536,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get Tokens"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#patter for tokenizing\npattern = r'''(?x)    # set flag to allow verbose regexps\n        ([A-Z]\\.)+        # abbreviations, e.g. U.S.A\n        | \\w+([-‘]\\w+)*        # words with optional internal hyphens\n        | \\$?\\d+(\\.\\d+)?%?  # currency and percentages, e.g. $12.40, 82%\n        | \\.\\.\\.            # ellipsis...   \n        | [][.,;\"'?():\\-_`]+  # these are separate tokens\n        '''",
     "prompt_number": 545,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "mystery_text_sents = [nltk.regexp_tokenize(sent, pattern) for sent\n                         in mystery_text_sents]",
     "prompt_number": 546,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Tag text"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "mystery_text_tag = []\nfor sent in mystery_text_sents:\n    mystery_text_tag.append(nltk.pos_tag(sent))",
     "prompt_number": 548,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "pickle.save_object(mystery_text_tag, 'mystery_text_tag_pos.pkl')",
     "prompt_number": 549,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Algorithms",
     "level": 2
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "There will be 3 different algorithms, a basic one, one customized to the brown corpus and one customized to the criteria text"
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "**Algorithm 1 - Basic**",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "from nltk.util import ngrams\nfrom nltk import FreqDist\nimport string\nfrom nltk.corpus import stopwords",
     "prompt_number": 120,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Remove basic punctiation and stopwords"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def remove_punct(text):\n    return [[word for word in sent if word[0] not in string.punctuation] for sent in text]",
     "prompt_number": 153,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def remove_stop(text):\n    return [[word for word in sent if word.lower() not in stopwords.words('english')] for sent in text]",
     "prompt_number": 167,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "brown_news_no_punct = remove_punct(brown_news)\nbrown_news_no_punct = remove_stop(brown_news_no_punct)",
     "prompt_number": 168,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def multinNgram(n, text):\n    '''This funciton loops through ngrams of length 1 to n.\n    It only keeps the ngram if it is not contained in a larger\n    ngram.'''\n    result = {}\n    flat_list = flatten.flatten(text)\n    for num in range(n, 0, -1):\n        result[num] = []\n        ngram = ngrams(flat_list, num)\n        result[num] = [' '.join(gram) for gram in ngram]\n    return result",
     "prompt_number": 169,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "multiGrams = multinNgram(4, brown_news_no_punct)",
     "prompt_number": 188,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get ngrams where the frequency is between a changing lower bound (based on the type of ngram) and upper bound of 50. It also checks to make sure that a lower ngram does not exist inside the ngram above it."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_gist(multiGrams):\n    low_base_cuttoff = 1\n    high_cutoff = 50\n    ngram_dict = {}\n\n    for ngram in range(len(multiGrams), 0, -1):\n        fd = FreqDist(multiGrams[ngram])\n        try:\n            if ngram == len(multiGrams):\n                word, count = zip(*[(gram[0], gram[1]) for gram in fd.items() if gram[1]> low_base_cuttoff and gram[1]<high_cutoff][:25])\n            else:\n                word, count = zip(*[(gram[0], gram[1]) for gram in fd.items() if gram[1]> low_base_cuttoff and gram[1]<high_cutoff\n                                    and any(gram[0] not in g for g in multiGrams[ngram+1])][:25])\n            ngram_dict['%dgram' % (ngram)] = word\n        except:\n            continue\n        low_base_cuttoff += 2\n        #ngram_dict['%dgram_count' % (ngram)] = count\n    return ngram_dict",
     "prompt_number": 256,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Combine into gist"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "gist = []\nfor gram in ngram_dict:\n    gist.append(ngram_dict[gram])",
     "prompt_number": 219,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "gist = flatten.flatten(gist)",
     "prompt_number": 220,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "len(' '.join(gist))",
     "prompt_number": 221,
     "outputs": [
      {
       "text": "1418",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 221
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def alg_1(text):\n    text_no_punct = remove_punct(text)\n    text_no_punct = remove_stop(text_no_punct)\n    multiGrams = multinNgram(4, text_no_punct)\n    ngram_dict = get_gist(multiGrams)\n    gist = []\n    for gram in ngram_dict:\n        gist.append(ngram_dict[gram])\n    gist = flatten.flatten(gist)\n    print gist\n    print 'length', len(' '.join(gist))",
     "prompt_number": 239,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_1(brown_news)",
     "prompt_number": 241,
     "outputs": [
      {
       "output_type": "stream",
       "text": "[u'Dow Jones industrial average', u'full amount performance bond', u'Sen. George Parkhouse Dallas', u\"I'm sorry don't book\", u'12 months ended February', u'Mrs. J. Clinton Bowman', u'entertain members Book Club', u'organization actually owns university', u'President John F. Kennedy', u'victory Chicago White Sox', u'hit home runs hope', u'late 1960 early 1961', u'jury failed reach verdict', u'said Annapolis Jan. 7', u'prospective students others seeking', u'civil defense program Mr.', u\"St. Patrick's Day Purse\", u'American Catholic higher education', u'announce engagement daughter Miss', u'given marriage father wore', u'58th precinct 23d ward', u'group orientation group identification', u'first three months 1961', u'debentures due June 1', u'cent 3 per cent', u'stock', u'Belgians', u'charter', u'second', u'errors', u'whose', u'Church', u'reported', u'reports', u'military', u'explained', u'brought', u'unit', u'music', u'strike', u'successful', u'hole', u'hold', u'example', u'want', u'hot', u'L.', u'effective', u'headquarters', u'Harris', u'La Dolce Vita', u'potato chip industry', u'aged care plan', u'Catholic higher education', u'10 per cent', u'American Catholic higher', u'per cent interest', u'four home runs', u'New York Yankees', u'4 per cent', u'New York City', u'National Football League', u'home rule charter', u'Mr. Hawksley said', u'first time', u'home run', u'real estate', u'Mrs. James', u'tax bill', u'May 1', u'said yesterday', u'Rhode Island', u'collective bargaining', u'Palmer Player', u'Junior Achievement', u'vice president', u'higher education', u'Mr. Hawksley', u'White House', u'Mrs. William', u'12 months', u'Mr. Martinelli', u'years ago', u'sales tax', u'State College', u'million dollars', u'Country Club', u'two years', u'St. Louis']\nlength 1418\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_1(mystery_text_sents)",
     "prompt_number": 554,
     "outputs": [
      {
       "output_type": "stream",
       "text": "[u'February rising 3 5', u'said re going put', u'350 last month Exports', u'world cereals trade revised', u'coarse grain output likely', u'46 mln bushels corn', u'March beginning April trade', u'695 000 tonnes 1986', u'board also adjusted export', u'80 1 84 marks', u'23 8 29 3', u'0 05 pct weight', u'capacity 80 4 pct', u'prospects 1987 88 many', u'58 7 billion Thailand', u'last month Stocks Sept', u'dlr profit end year', u'non-communist countertrade partners help', u'higher foodgrain production signifying', u'OUTPUT 1987 U.N. Food', u'around 145 30 yen', u'000 km north told', u'75 European currency units', u'soybean oil 300 290', u'000 tonnes asbestos fibre', u'four', u'looking', u'SPR', u'grains', u'Paul', u'feasibility', u'second', u'Pampa', u'increasing', u'reported', u'reports', u'unit', u'strike', u'brings', u'99', u'98', u'90', u'hold', u'95', u'94', u'97', u'96', u'Imports', u'Colombia', u'La', u'March 20 Analysts', u'nine months ended', u'Trading Corp MMTC', u'March almost nothing', u'International Coffee Organization', u'1 mln dlrs', u'mln tonnes target', u'previous brackets follows', u'87 season 6', u'However figures exclude', u'78 10 mln', u'tonnes free market', u'tonnes last month', u'3 92 octane', u'Agriculture Secretary Richard', u'2 5 pct', u'000 tonnes wheat', u'nations committed Paris', u'500 000 tonnes', u'dealers said central', u'private sources 1985', u'0 7 pct', u'paid non-convertible Indian', u'residual fuel demand', u'143 55 yen', u'government would', u'April 1', u'April 8', u'interest rates', u'Paris Accord', u'31 1987', u'31 1986', u'failure monsoon', u'50 dlrs', u'500 mln', u'goods services', u'1 pct', u'00 mln', u'33 mln', u'Ltd said', u'gallon effective', u'1 mln', u'said Total', u'current levels', u'000 tons', u'Cape Spencer', u'trade data', u'pct February', u'97 octane', u'May 5']\nlength 1441\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_1(criteria_text_sents)",
     "prompt_number": 568,
     "outputs": [
      {
       "output_type": "stream",
       "text": "['disease DISEASE CHARACTERISTICS Histologically', 'therapy prior biologic therapy', '70 years age 2', 'RLS symptoms present least', '6 months entry Group', 'Criteria Pregnant lactating females', 'blood test twice upper', '3 months stable dose', 'Female patients pregnant breastfeeding', 'AST ALT 2 times', 'years old Exclusion Criteria', 'study referring clinician indicating', 'opinion investigator would pose', 'surgery within 1 month', '3 weeks since prior', 'parental authority registered French', 'radiation therapy antibody based', 'patients must pathologically confirmed', 'potential pregnant breast feeding', 'calendar months prior Enrollment', 'P450 3A4 inhibitors within', 'week gestation birth Follow-up', 'liver metastases Serum bilirubin', 'scheduled efficacy evaluations Weeks', 'proven invasive breast cancer', 'anaemia', 'localized', 'Western', 'cytochrome', 'Hamilton', 'College', 'pericardial', 'circumstances', 'RLS', 'travel', 'aPTT', 'L.', 'anaphylaxis', 'fit', 'subcutaneous', 'cDGA', 'service', 'needed', 'mouth', 'Pediatric', 'kinase', 'plate', 'rectal', 'patch', 'precluding', 'prior screening 5', 'prior screening 6', 'significant cardiac disease', 'years male female', 'prior screening 2', 'surgery within last', '1 000 copies', 'epithelial ovarian primary', 'gastrointestinal endocrine metabolic', 'chronic alcohol consumption', 'planned use study', 'informed consent 9', 'informed consent 2', 'informed consent 3', 'informed consent 4', 'informed consent 5', 'informed consent 6', 'informed consent 7', 'participate study 2', 'prior screening Exclusion', 'method contraception avoid', 'informed consent comply', 'Inclusion Criteria Eligible', 'Female participants childbearing', 'using effective method', 'test HIV', 'Screening laboratory', 'Criteria age', '1 Presence', 'subject unable', 'immunoglobulins blood', 'study vaccines', 'Subjects use', 'kg body', 'expected require', 'CNS involvement', 'Bank criteria', 'systemic chemotherapy', '1 Subjects', 'viral infection', 'absolute neutrophil', 'Presence active', 'GSK Medical', 'may increase', 'cardiovascular renal', 'Consent form', '4 days', 'cancer head', 'old inclusive', 'disease diabetes']\nlength 1876\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Algorith 2 - Custom Criteria Corpus",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Flatten the corpus to get out all the verbs"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "flat_list = flatten.flatten(flatten.flatten(criteria_text_docs_tagged))",
     "prompt_number": 230,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#get the verbs out\nverbs = [word for word in flat_list if word[1].startswith('V')]",
     "prompt_number": 232,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "fd_verbs = FreqDist(verbs)",
     "prompt_number": 234,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "[word[0][0] for word in fd_verbs.most_common(5)]",
     "prompt_number": 359,
     "outputs": [
      {
       "text": "['informed', 'following', 'defined', 'known', 'treated']",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 359
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Will use the following verbs to try to grab the relevant diseases that are required for each document criteria:\n\n* known\n* diagnosed\n* syndrome\n* specified\n* documented"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get sentences in documents with these verbs"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_specific_sent(text, spec_words):\n    specific_sents = {}\n    \n    \n    for num, doc in enumerate(text):\n        specific_sents[num] = []\n        for sent in doc:\n            for word in sent:\n                if word[0].lower() in spec_words:\n                    specific_sents[num].append(sent)\n                    break\n    return specific_sents",
     "prompt_number": 363,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Same function for single doc"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_specific_sent_signle_doc(text, spec_words):\n\n    specific_sents = []\n    for sent in text:\n        for word in sent:\n            if word[0].lower() in spec_words:\n                specific_sents.append(sent)\n                break\n    return specific_sents",
     "prompt_number": 362,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "verbs = ['known', 'diagnosed', 'syndrome', 'specified', 'documented']\nverb_sents = get_specific_sent(criteria_text_docs_tagged, verbs)",
     "prompt_number": 302,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Function to get chunks"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def chunker(tagged_corpus, chunk_reg):\n    \n    cp = nltk.RegexpParser(chunk_reg)\n    \n    results = []\n    \n    for sents in tagged_corpus:\n        tree = cp.parse(sents)\n        for subtree in tree.subtrees():\n            if subtree.label() == 'CHUNK':\n                results.append(subtree[:])\n    return results",
     "prompt_number": 293,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_chunks(text, chunk_reg):\n    chunks_dict = {}\n    for doc in text:\n        chunks_dict[doc] = chunker(text[doc], chunk_reg)\n    return chunks_dict",
     "prompt_number": 292,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get adjective/nouns that are in the selected verb sentences"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "chunk_reg = r\"\"\"\n      CHUNK: {<VBN>.*<JJ|N.*>+<(N.*|CD)|N.*>?}\n    \"\"\"",
     "prompt_number": 295,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "chunks_dict_verb_noun = get_chunks(verb_sents, chunk_reg)",
     "prompt_number": 339,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Pull out required ages"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "age_sents = get_specific_sent(criteria_text_docs_tagged, ['age', 'ages'])",
     "prompt_number": 340,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Flatten docs"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def flat_doc(text):\n    for doc in text:\n        text[doc] = flatten.flatten(text[doc])\n    return text",
     "prompt_number": 310,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "chunks_dict_verb_noun = flat_doc(chunks_dict_verb_noun)\nage_sents = flat_doc(age_sents)",
     "prompt_number": 341,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_doc_desc(num):\n    print ' '.join([word[0] for word in chunks_dict_verb_noun[num]])\n    print ' '.join([word[0] for word in age_sents[num]])",
     "prompt_number": 322,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "get_doc_desc(4)",
     "prompt_number": 323,
     "outputs": [
      {
       "output_type": "stream",
       "text": "suspected immune dysfunction Known true hypersensitivity Known personal history\nHealthy child 11 to 13 years of age previously vaccinated in Study F05-TdI-301 Exclusion Criteria :\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Add the gists together"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def combine_docs(main_text, add_text):\n    for doc in add_text:\n        for word in add_text[doc]:\n            main_text[doc].append(word)\n    return main_text",
     "prompt_number": 342,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "combined_docs_list = combine_docs(chunks_dict_verb_noun, age_sents)",
     "prompt_number": 343,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "all_doc_gists = flatten.flatten(combined_docs_list.values())",
     "prompt_number": 348,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "fd = FreqDist(all_doc_gists)",
     "prompt_number": 350,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Strip POS so that algorithm one can be run on in"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "all_doc_gists_no_pos = [[word[0] for word in sent] for sent in combined_docs_list.values()]",
     "prompt_number": 356,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_1(all_doc_gists_no_pos)",
     "prompt_number": 358,
     "outputs": [
      {
       "output_type": "stream",
       "text": "['age pregnant women excluded', 'cervix PRIOR CONCURRENT THERAPY', 'cell carcinoma supraglottic larynx', 'second malignancy within 5', 'months age antithrombin less', '3 months age protein', 'specified Hematopoietic See Disease', 'least 45 years age', 'disease Age 18 years', 'known HIV negative partner', 'older meets exclusion criterion', 'Known positive test Known', 'chronic liver disease known', 'age older 18 years', 'documented history known history', 'dL patient 3 months', 'Surgery specified PATIENT CHARACTERISTICS', 'confirmed supratentorial malignant primary', 'excluded study eligible future', 'risk following reasons Age', 'transformed lymphoma known posttransplantation', 'age Age 6 months', '18 years age Known', 'contraception required fertile patients', '50 years age inclusive', 'eligible', 'hormone', 'risk', 'progression', 'Less', '000', 'THERAPY', 'Negative', 'selected', 'men', 'active', '100', 'obtained', 'total', 'negative', 'm2', 'mL', 'adult', '90', 'ml', 'Cardiovascular', 'give', 'assent', 'provide', 'lesions', 'greater 2 times', 'limit normal age', '18 45 years', 'Known human immunodeficiency', 'Male female subject', 'years age Male', 'older 18 years', 'Male female 18', 'age 18 older', 'years age Known', 'Known brain metastases', 'min 1 73', 'age Known history', 'inclusive Male female', 'childbearing age must', 'CHARACTERISTICS Age known', 'known hypersensitivity Male', 'therapy See Disease', 'sensitivity Male female', 'expectancy specified Hematopoietic', 'Platelet count least', 'least 1 500', 'Endocrine therapy specified', 'Age greater equal', 'Patients age 18', 'pregnancy test', 'years older', 'Creatinine less', 'patients age', 'female age', 'signing informed', 'count least', '6 years', 'Visit 1', 'known contraindications', 'specified Surgery', 'years suspected', '5 times', 'women age', '18-65 years', 'metastases PATIENT', 'specified Hepatic', 'g dL', 'hypersensitivity reaction', '60 years', 'age suspected', 'hypersensitivity Male', 'maximum SC', 'effective contraception', 'standard therapy']\nlength 1745\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def alg_2(text, multi_doc_switch=2):\n    '''Enter 1 for multi_doc_switch if there is only 1 document\n    otherwise enter 2\n    This funciton takes tagged text as input\n    When only 1 document is used alg_1 is not run. Alg_1 is used to \n    combine results from many documents.'''\n    if multi_doc_switch == 1:\n        #get top verbs\n        flat_list = flatten.flatten(text)\n        verbs = [word for word in flat_list if word[1].startswith('V')]\n        fd_verbs = FreqDist(verbs)\n        verb_list = [word[0][0] for word in fd_verbs.most_common(10)]\n        #get the sentences that have those verbs in them\n        verb_sents = get_specific_sent_signle_doc(text, verb_list)\n        #get chunks\n        chunk_reg = r\"\"\"\n          CHUNK: {<VBN>.*<JJ|N.*>+<(N.*|CD)|N.*>?}\n        \"\"\"\n        chunks_dict_verb_noun = chunker(verb_sents, chunk_reg)\n        #strip POS so it can be run through first algorithm\n        all_doc_gists_no_pos = [[word[0] for word in sent] for sent in chunks_dict_verb_noun]\n        #print flatten.flatten(all_doc_gists_no_pos)\n        print [' '.join(l) for l in all_doc_gists_no_pos][:100]\n        print 'length', len(' '.join([' '.join(l) for l in all_doc_gists_no_pos][:100]))\n        \n    else:\n        #get verb sentences for criteria docs\n        verbs = ['known', 'diagnosed', 'syndrome', 'specified', 'documented']\n        verb_sents = get_specific_sent(text, verbs)\n        #get chunks (verb/nouns)\n        chunk_reg = r\"\"\"\n          CHUNK: {<VBN>.*<JJ|N.*>+<(N.*|CD)|N.*>?}\n        \"\"\"\n        chunks_dict_verb_noun = get_chunks(verb_sents, chunk_reg)\n        #get age sentnces\n        age_sents = get_specific_sent(text, ['age', 'ages'])\n        #flatten the docs\n        chunks_dict_verb_noun = flat_doc(chunks_dict_verb_noun)\n        age_sents = flat_doc(age_sents)\n        #combine the results\n        combined_docs_list = combine_docs(chunks_dict_verb_noun, age_sents)\n        #strip the POS so it can be run through the first algorith\n        all_doc_gists_no_pos = [[word[0] for word in sent] for sent in combined_docs_list.values()]\n        #run first algorithm on sentences\n        alg_1(all_doc_gists_no_pos)",
     "prompt_number": 528,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_2(brown_news_tag, 1)",
     "prompt_number": 529,
     "outputs": [
      {
       "output_type": "stream",
       "text": "[u'accepted practices', u'seen fit', u'elected servants', u'married Aug. 2', u'been mayor', u'held Tuesday night', u'held Sept. 8', u'featured speaker', u'registered voters', u'repealed outright', u'increased federal support', u'defeated Felix Bush', u'veiled threats', u'reduced cost', u'proposed constitutional amendment', u'sought later', u'gotten Chairman Bill Hollowell', u'veiled effort', u'taken Education courses', u'named Dr. Clarence Charles Clark', u'been re-elected', u'limited skills', u'stolen property', u'involved composition', u'dismissed yesterday', u'aged care plan', u'socialized medicine', u'qualified young people', u'expected lines', u'enacted last year', u'denied immediate action', u'been conspicuous', u'rumored ready', u'left basic positions unchanged', u'spokesmen insist', u'been time', u'served notice', u'held responsible', u'prepared contingency plans', u'determined neighbors', u'qualified city residents', u'made available', u'proposed committee', u'advised local police', u'known labor-management expert', u'consulted several Superior Court justices', u'required non-partisan ballot', u'heard yesterday', u'staggered intervals', u'made Sunday', u'discredited carcass', u'tattered remains', u'proposed law', u'supported today', u'appointed state warden', u'introduced Monday', u'honored yesterday', u'reported Wagner plan', u'wanted Mr. Screvane', u'alienated voters', u'publicized scandals', u'attracted new attention', u'passed last Monday', u'uncommitted nations', u'disclosed today', u'eighteen months', u'announced Saturday', u'elected leader', u'given commitments', u'divided camps', u'caused legislators', u'token integration', u'needed funds', u'anticipated revenues', u'paid anyway', u'declared bankrupt', u'filed later', u'increased license fee', u'disclosed Tuesday', u'injured today', u'outspoken critic', u'been active', u'contested indecisive elections', u'ruled Turkey', u'demanded pledges', u'appointed temporary assistant district attorney', u'announced Monday', u'become necessary', u'asked Monday', u'organized labor', u'taken Friday', u'felt necessary', u'been willing', u'called witnesses', u'scattered hits', u'belated spring training', u'famed Yankee Clipper', u'expected late next summer', u'featured race', u'ten days']\nlength 1877\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_2(mystery_text_tag, 1)",
     "prompt_number": 552,
     "outputs": [
      {
       "output_type": "stream",
       "text": "[u'posted cargo prices', u'said wholesale prices', u'been empty', u'started today', u'posted prices', u'assigned quota', u'posted prices', u'posted cargo prices', u'undisclosed terms', u'refined product', u'announced indefinite suspension', u'centred near', u'refined products', u'imported oil', u'imported oil', u'based tax', u'posted prices', u'posted prices', u'delivered prices', u'refined product', u'unleaded gasoline', u'said wholesale prices', u'seen likely', u'reduced stocks', u'increased runs', u'held refinery', u'posted prices', u'achieved strong sales', u'prompted other oil firms', u'unleaded gasoline', u'leaded gasoline', u'leaded gaosline', u'made available', u'leaded fuel', u'shut twice', u'unleaded gasoline', u'remained unchanged', u'unleaded gasoline', u'unleaded gasoline', u'unleaded gasoline', u'unleaded gasoline', u'said Roger Hemminghaus', u'escalated Mideast fighting', u'been slow', u'provided support', u'supported pork bellies', u'sold contracts', u'leaded gasoline', u'unleaded fuel', u'unleaded fuel', u'undisclosed terms', u'refined product', u'affected Brazilian exports', u'rejected offers', u'helped bolster ethanol production', u'made suggestions', u'given consideration', u'finished energy goods', u'finished energy goods', u'finished energy goods', u'televised address', u'included rice', u'continued use', u'called unchanged', u'updated o reflect 1982-84', u'remained unchanged', u'posted cargo prices', u'escalated Mideast fighting', u'been slow', u'provided support', u'supported pork bellies', u'sold contracts', u'traded options facility', u'Traded options', u'finished energy goods', u'finished energy goods', u'tigtened prompt deliveries', u'posted prices', u'targetted countries', u'estimated record 6', u'targetted countries', u'affected Kenya', u'elevated levels', u'discontinued laboratory testing', u'shown abnormal readings', u'increased production', u'stored water volumes', u'reduced hydro-electric power', u'recorded temperatures three', u'affected areas', u'increased rains', u'experienced record', u'refined palm oil', u'traded FOB', u'needed first', u'been disappointing', u'unlimited amount', u'held steady', u'stopped offering coffee', u'had trouble finding']\nlength 1821\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Algorithm 3 - Brown News",
     "level": 3
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "This algorithm will use WordNet to help create the gist"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "from nltk.corpus import wordnet as wn",
     "prompt_number": 380,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Remove punctuation and stop words from text"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def remove_punct(text):\n    return [[word for word in sent if word[0] not in string.punctuation] for sent in text]\ndef remove_stop(text):\n    return [[word for word in sent if word.lower() not in stopwords.words('english')] for sent in text]",
     "prompt_number": 366,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_terms(text):\n    #remove punctuation and stop words\n    text = remove_punct(text)\n    text = remove_stop(text)\n    #convert to FreqDist\n    return FreqDist(flatten.flatten(text)), text",
     "prompt_number": 369,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "fd, brown_news = get_terms(brown_news)",
     "prompt_number": 370,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Find words that are in the middle range of frequency in the document"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def find_important_terms(fd):\n    important_words = []\n    #if the frequency is less than the top 50 words and more than 5 occurances\n    for word in fd.keys():\n        if fd[word] < fd.most_common(50)[-1][1] and fd[word] > 5:\n            important_words.append(word)\n    return important_words",
     "prompt_number": 374,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "important_terms = find_important_terms(fd)",
     "prompt_number": 375,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "len(important_terms)",
     "prompt_number": 377,
     "outputs": [
      {
       "text": "1904",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 377
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Match the important terms with the most common version in the tagged corpus to get an associated POS for the wordnet."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def match_tagged(important_terms, tagged_text):\n    #get FreqDist of tagged text\n    fd = FreqDist(flatten.flatten(tagged_text))\n    #create sorted list of fd\n    sort_fd = sorted(fd.items(), key= lambda x: x[1], reverse=True)\n    #iterate through important terms and match to most common tagged text\n    tagged_important_terms = []\n    for word in important_terms:\n        for tag_word in sort_fd:\n            if word == tag_word[0][0]:\n                tagged_important_terms.append(tag_word[0])\n                break\n    return tagged_important_terms",
     "prompt_number": 394,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "tagged_important_terms = match_tagged(important_terms, brown_news_tag)",
     "prompt_number": 395,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "len(tagged_important_terms)",
     "prompt_number": 397,
     "outputs": [
      {
       "text": "1904",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 397
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Get hypernyms of the important terms"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def categories_from_hypernyms(termlist):\n    hypterms = {}\n    for term in termlist:\n        #check POS\n        if term[1][0] == 'V':\n            POS = wn.VERB\n        elif term[1][0] == 'J':\n            POS = wn.ADJ\n        elif term[1][0] == 'R':\n            POS = wn.ADV\n        else:\n            POS = wn.NOUN\n        # get its nominal synsets\n        s = wn.synsets(term[0].lower(), POS)\n        for syn in s:\n            #list of hypernyms\n            for hyp in syn.hypernyms():\n                # Extract the hypernym name and add to dict along with the original term\n                if hyp.name() not in hypterms:\n                    hypterms[hyp.name()] = []\n                hypterms[hyp.name()].append(term[0])\n    return hypterms\n    hypfd = nltk.FreqDist(hypterms.keys())\n    print \"Show most frequent hypernym results\"\n    return [(count, name, wn.synset(name).definition()) for (name, count) in hypfd.most_common(25)] ",
     "prompt_number": 445,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "hypterms = categories_from_hypernyms(tagged_important_terms)",
     "prompt_number": 446,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Sort the dictionary based on how many terms a hypernym has"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "hypterms = sorted(hypterms.items(), key= lambda x: len(x[1]), reverse=True)",
     "prompt_number": 447,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Pair down the terms, take the top 50 most common and for each hypernym take only 5 associated words from a set of the associated words"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "gist = [(term[0], ' '.join(list(set(term[1][:5])))) for term in hypterms][:50]",
     "prompt_number": 470,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "df = pd.DataFrame(gist, columns=['Hypernym', 'Terms'])",
     "prompt_number": 474,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "print df.to_string(justify='right' , columns=[0,1], index=False)",
     "prompt_number": 477,
     "outputs": [
      {
       "output_type": "stream",
       "text": "                     Hypernym                                              Terms\n                      be.v.01                            needed worked hold want\n                  change.v.02                                   gone turned take\n                  travel.v.01                          following continue played\n                    make.v.03                    brought worked organized played\n                     act.v.01                                    continue played\n                  person.n.01                    player life men relations child\n                activity.n.01              training support negotiations service\n                      be.v.03                            following gone continue\n                  change.v.01                         brought recommended turned\n                     get.v.01                           found turned take accept\n             time_period.n.01                                 schools hours life\n             communicate.v.02                         paying informed asked sign\n                    move.v.02                              worked working played\n                   state.n.02                Union union feeling power situation\n           large_integer.n.01                         100 million 17 yards grand\n                evaluate.v.02                 valued hold pass consider believes\n            organization.n.01                   Union union company unit parties\n              collection.n.01                                 books Library laws\n            change_state.v.01                         gone worked turned working\n                    move.v.03                        turned listed pitching take\n                    body.n.02                schools vote University jury Church\n                  inform.v.01                        reported explained learning\n                building.n.01                              schools theaters Hall\n                    give.v.03                  paying allowed let returned offer\n                     use.v.01                              worked working played\n               statement.n.01                          things bid result formula\n                    have.v.02                      read saying say take combined\n              experience.v.01                            found see knew saw came\n                 message.n.02              petitions opinion statements petition\n               happening.n.01                Union union errors example accident\n                  happen.v.01                           resulted gone break came\n                    work.n.01                    project operations jobs service\n               condition.n.01        situation climate Place order participation\n                     see.v.05                       receive valued hold consider\n administrative_district.n.01                         County country states city\n                 content.n.05               food issue belief experience centers\n                   tract.n.01                                  Field right yards\n                   digit.n.01                                     One 3 2 five 6\n                    unit.n.03                          company crew Company team\n                  permit.v.01             permit suffered pass including allowed\n                    room.n.01                                    door Hall floor\n                 succeed.v.01                        hitting worked working pass\n                    have.v.01                                          hold kept\n                     act.v.02                     following play followed played\n               structure.n.01          building establishment housing door floor\n                 concept.n.01                      section fact possibility laws\n               gathering.n.01                            course crew crowd floor\n                location.n.01                    Southeast line base bases South\n                 perform.v.03                                        play played\n                   state.v.01  introduced explained representing announced pr...\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "gist_len = list(df.Hypernym) + list(df.Terms)\nlen(' '.join(gist_len))",
     "prompt_number": 478,
     "outputs": [
      {
       "text": "2035",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 478
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def alg_3(text, tagged_text):\n    '''This function requires both untagged and tagged text as input'''\n    #get the frequency distribution of the terms\n    fd, text = get_terms(text)\n    #get important terms\n    important_terms = find_important_terms(fd)\n    #match important terms to tagged text to get POS\n    tagged_important_terms = match_tagged(important_terms, tagged_text)\n    #get hypernyms of the important terms and the terms that are associated with them\n    hypterms = categories_from_hypernyms(tagged_important_terms)\n    #Sort the result based on how many terms a hypernym has\n    hypterms = sorted(hypterms.items(), key= lambda x: len(x[1]), reverse=True)\n    #Pair down the terms, take the top 50 most common and for each hypernym take \n    #only 5 associated words from a set of the associated words\n    gist = [(term[0], ' '.join(list(set(term[1][:5])))) for term in hypterms][:50]\n    #create dataframe for formatting\n    df = pd.DataFrame(gist, columns=['Hypernym', 'Terms'])\n    print df.to_string(justify='right' , columns=[0,1], index=False)",
     "prompt_number": 479,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_3(brown_news, brown_news_tag)",
     "prompt_number": 480,
     "outputs": [
      {
       "output_type": "stream",
       "text": "                     Hypernym                                              Terms\n                      be.v.01                            needed worked hold want\n                  change.v.02                                   gone turned take\n                  travel.v.01                          following continue played\n                    make.v.03                    brought worked organized played\n                     act.v.01                                    continue played\n                  person.n.01                    player life men relations child\n                activity.n.01              training support negotiations service\n                      be.v.03                            following gone continue\n                  change.v.01                         brought recommended turned\n                     get.v.01                           found turned take accept\n             time_period.n.01                                 schools hours life\n             communicate.v.02                         paying informed asked sign\n                    move.v.02                              worked working played\n                   state.n.02                Union union feeling power situation\n           large_integer.n.01                         100 million 17 yards grand\n                evaluate.v.02                 valued hold pass consider believes\n            organization.n.01                   Union union company unit parties\n              collection.n.01                                 books Library laws\n            change_state.v.01                         gone worked turned working\n                    move.v.03                        turned listed pitching take\n                    body.n.02                schools vote University jury Church\n                  inform.v.01                        reported explained learning\n                building.n.01                              schools theaters Hall\n                    give.v.03                  paying allowed let returned offer\n                     use.v.01                              worked working played\n               statement.n.01                          things bid result formula\n                    have.v.02                      read saying say take combined\n              experience.v.01                            found see knew saw came\n                 message.n.02              petitions opinion statements petition\n               happening.n.01                Union union errors example accident\n                  happen.v.01                           resulted gone break came\n                    work.n.01                    project operations jobs service\n               condition.n.01        situation climate Place order participation\n                     see.v.05                       receive valued hold consider\n administrative_district.n.01                         County country states city\n                 content.n.05               food issue belief experience centers\n                   tract.n.01                                  Field right yards\n                   digit.n.01                                     One 3 2 five 6\n                    unit.n.03                          company crew Company team\n                  permit.v.01             permit suffered pass including allowed\n                    room.n.01                                    door Hall floor\n                 succeed.v.01                        hitting worked working pass\n                    have.v.01                                          hold kept\n                     act.v.02                     following play followed played\n               structure.n.01          building establishment housing door floor\n                 concept.n.01                      section fact possibility laws\n               gathering.n.01                            course crew crowd floor\n                location.n.01                    Southeast line base bases South\n                 perform.v.03                                        play played\n                   state.v.01  introduced explained representing announced pr...\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_3(mystery_text_sents, mystery_text_tag)",
     "prompt_number": 553,
     "outputs": [
      {
       "output_type": "stream",
       "text": "                      Hypernym                                           Terms\n                       be.v.01                               looking hold want\n                   change.v.02                         falling going take came\n                   change.v.01                       brought brings increasing\n                     make.v.03                          brought brings causing\n              time_period.n.01                             season drought life\n                   travel.v.01                          push ease ranging came\n            large_integer.n.01                               120 90 100 30 144\n                 activity.n.01                support line negotiations DEMAND\n                      act.v.01                            going continued take\n                      get.v.01                               found take accept\n                statement.n.01               formula estimate ESTIMATES result\n      indefinite_quantity.n.01       increases Production yields worth Reserve\n                       be.v.03                             going followed came\n                 evaluate.v.02           believed hold ranging accept consider\n                     move.v.02                       push shipped raised drive\n              communicate.v.02                               asked signed give\n                   inform.v.01                                 reported showed\n                   cereal.n.01                    GRAIN CORN Rice Grain grains\n                    state.n.02             Union situation Agency action power\n             change_state.v.01                   improved falling going reduce\n                  message.n.02              Imports direction statements offer\n                     give.v.03                      provide give allow allowed\n                   happen.v.01                     resulted falling going came\n gregorian_calendar_month.n.01                   MAY February MARCH August Aug\n                      see.v.05              received hold held consider expect\n                   person.n.01         prospects life quarter owners Authority\n  administrative_district.n.01                  states country reserve Reserve\n                 commerce.n.01         marketing distribution Trading BUSINESS\n               collection.n.01                CROPS planting data crop package\n                     move.v.03                    taken started runs take give\n                condition.n.01          situation improvement order way DEMAND\n     executive_department.n.01             states commerce ENERGY Energy Labor\n                commodity.n.01         product Imports EXPORTS FUTURES Exports\n         avoirdupois_unit.n.01                GRAIN quarter ounce Grain grains\n                    slope.n.01                         banks rises RAISES bank\n              information.n.01             REPORT Program fact reports details\n                    value.n.02                               Prices costs cost\n                   symbol.n.01             dollars pound number crowns DOLLARS\n                 location.n.01                     point line parts South part\n                   change.n.03                                      move moves\n                   assets.n.01  securities credits amount investment Resources\n                  express.v.02                               saying raised say\n                   direct.v.04                         held runs hold led give\n                    digit.n.01                         four Seven five TWO One\n                 decrease.v.02                                      cut reduce\n                   handle.n.01                          CROPS STOCKS CROP crop\n                    point.n.02                         position spot positions\n                    class.n.03      world agriculture MARKET Agriculture Labor\n               speech_act.n.01     Agreement speech statements agreement offer\n                foodstuff.n.02                 cocoa GRAIN Grain cereal grains\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "alg_3(criteria_text_sents, flatten.flatten(criteria_text_docs_tagged))",
     "prompt_number": 567,
     "outputs": [
      {
       "output_type": "stream",
       "text": "                  Hypernym                                              Terms\n                   be.v.01                 needed entering working fit tested\n               change.v.01                        corrected complicated makes\n               change.v.02           obtained develop undergone produce makes\n          time_period.n.01                     past yrs hospitalization night\n               person.n.01              self Males subjects antagonist Cancer\n                  act.v.01                attacks using foreseen Forced start\n                 make.v.03            initiate makes initiated working caused\n             activity.n.01                     operation measurements service\n                state.n.02                situation death integrity disorders\n        large_integer.n.01                                   G twelve 60 l MS\n                  get.v.01            obtained making express makes inherited\n               travel.v.01                           travel followed Advanced\n            condition.n.01  situation safety hospitalization participation...\n             evaluate.v.02             Expected tested assessed placed Rating\n          communicate.v.02                              visit speaking render\n               inform.v.01                           Rating staging indicated\n     metallic_element.n.01                                  PR Calcium E U PD\n               letter.n.02                                      PS PI E MS PE\n              symptom.n.01        hyperlipidemia atrophy Pain anaemia effects\n                 move.v.02                            placed working Advanced\n                   be.v.03                          extended followed sitting\n                 have.v.02                       bearing Taking take combined\n            happening.n.01  Modification episodes beginning alterations ex...\n             medicine.n.02       remedies azathioprine placebo Dose acyclovir\n           collection.n.01                 content pack Mass combination hand\n            body_part.n.01               process Chest tissue stomach systems\n                 give.v.03            administer provide administering render\n               permit.v.01                     Allowed allowing admitted gave\n              content.n.05     culture subjects Centers experience substances\n   physical_condition.n.01  abnormalities sterility gestation disorders se...\n                group.n.01                         races men Mass systems Men\n                 move.v.03                                start beats reached\n     chemical_element.n.01                       boron PS phosphorus Oxygen N\n       bodily_process.n.01       reactions intake Intake Consumption Response\n  indefinite_quantity.n.01               limitations output production limits\n           experience.v.01                            meet see live See meets\n            pathology.n.02   stenosis stricture ascites fibrosis osteoporosis\n                digit.n.01                                9 seven One Two one\n              consume.v.02                     using trying Used used smoking\n             property.n.02           strength levels composition degree level\n             relation.n.01              Function functions parts relationship\n                think.v.03                   evaluate given studied give gave\n              provide.v.02                treating meet gave fulfilling meets\n                 show.v.04  demonstrates screened presenting demonstrated ...\n                  use.v.01                      recurred put extended working\n               change.n.01                           death mutations mutation\n                point.n.02                beginning nodes root Centers source\n create_by_mental_act.v.01                drawn conceive develop makes making\n         change_state.v.01         become relapsed worsening conceive working\n                  see.v.05             Expected making include makes Received\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:7febc996d3dc06dbe3244767e9f131133b2d2f049cfe687c9c9ce1d6d630a7cf",
  "gist_id": "04953e74b5ab1158e523"
 },
 "nbformat": 3
}