skystrife/KDD 2017.ipynb

## KDD 2017.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 1: Feature Engineering for Text Data with MeTA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In this part of the tutorial, we'll explore how to go from raw text data to feature representations for documents using MeTA. Everything downstream depends on this representation, so it's important that we spend some time talking about the many different ways you can analyze documents into feature representations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "First, we'll import the `metapy` python bindings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import metapy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "For reference, this tutorial was written agains the following metapy version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.2.6'"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "metapy.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "If you'd like, you can tell MeTA to log to stderr so you can get progress output when running long-running function calls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "metapy.log_to_stderr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's create a document with some content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "doc = metapy.index.Document()\n",
    "doc.content(\"I said that I can't believe that it only costs $19.95!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "MeTA provides a stream-based interface for performing document tokenization. Each stream starts off with a Tokenizer object, and in most cases you should use the [Unicode standard aware](http://site.icu-project.org) `ICUTokenizer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "tok = metapy.analyzers.ICUTokenizer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Tokenizers operate on raw text and provide an Iterable that spits out the individual text tokens. Let's try running just the `ICUTokenizer` to see what it does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<s>',\n",
       " 'I',\n",
       " 'said',\n",
       " 'that',\n",
       " 'I',\n",
       " \"can't\",\n",
       " 'believe',\n",
       " 'that',\n",
       " 'it',\n",
       " 'only',\n",
       " 'costs',\n",
       " '$',\n",
       " '19.95',\n",
       " '!',\n",
       " '</s>']"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.set_content(doc.content()) # this could be any string\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "One thing that you likely immediately notice is the insertion of these pseudo-XML looking `<s>` and `</s>` tags. These are called \"sentence boundary tags\". As a side-effect, a default-construted `ICUTokenizer` discovers the sentences in a document by delimiting them with the sentence boundary tags. Let's try tokenizing a multi-sentence document to see what that looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<s>',\n",
       " 'I',\n",
       " 'said',\n",
       " 'that',\n",
       " 'I',\n",
       " \"can't\",\n",
       " 'believe',\n",
       " 'that',\n",
       " 'it',\n",
       " 'only',\n",
       " 'costs',\n",
       " '$',\n",
       " '19.95',\n",
       " '!',\n",
       " '</s>',\n",
       " '<s>',\n",
       " 'I',\n",
       " 'could',\n",
       " 'only',\n",
       " 'find',\n",
       " 'it',\n",
       " 'for',\n",
       " 'more',\n",
       " 'than',\n",
       " '$',\n",
       " '30',\n",
       " 'before',\n",
       " '.',\n",
       " '</s>']"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc.content(\"I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.\")\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Most of the information retrieval techniques you have likely been learning about in this class don't need to concern themselves with finding the boundaries between separate sentences in a document, but later today we'll explore a scenario where this might matter more.\n",
    "\n",
    "Let's pass a flag to the `ICUTokenizer` constructor to disable sentence boundary tags for now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['I',\n",
       " 'said',\n",
       " 'that',\n",
       " 'I',\n",
       " \"can't\",\n",
       " 'believe',\n",
       " 'that',\n",
       " 'it',\n",
       " 'only',\n",
       " 'costs',\n",
       " '$',\n",
       " '19.95',\n",
       " '!',\n",
       " 'I',\n",
       " 'could',\n",
       " 'only',\n",
       " 'find',\n",
       " 'it',\n",
       " 'for',\n",
       " 'more',\n",
       " 'than',\n",
       " '$',\n",
       " '30',\n",
       " 'before',\n",
       " '.']"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "I mentioned earlier that MeTA treats tokenization as a *streaming* process, and that it *starts* with a tokenizer. As you've learned, for optimal search performance it's often beneficial to modify the raw underlying tokens of a document, and thus change its representation, before adding it to an inverted index structure for searching.\n",
    "\n",
    "The \"intermediate\" steps in the tokenization stream are represented with objects called Filters. Each filter consumes the content of a previous filter (or a tokenizer) and modifies the tokens coming out of the stream in some way.\n",
    "\n",
    "Let's start by using a simple filter that can help eliminate a lot of noise that we might encounter when tokenizing web documents: a `LengthFilter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['said',\n",
       " 'that',\n",
       " \"can't\",\n",
       " 'believe',\n",
       " 'that',\n",
       " 'it',\n",
       " 'only',\n",
       " 'costs',\n",
       " '19.95',\n",
       " 'could',\n",
       " 'only',\n",
       " 'find',\n",
       " 'it',\n",
       " 'for',\n",
       " 'more',\n",
       " 'than',\n",
       " '30',\n",
       " 'before']"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.LengthFilter(tok, min=2, max=30)\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Here, we can see that the `LengthFilter` is consuming our original `ICUTokenizer`. It modifies the token stream by only emitting tokens that are of a minimum length of 2 and a maximum length of 30. This can get rid of a lot of punctuation tokens, but also excessively long tokens such as URLs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Another common trick is to remove stopwords. (Can anyone tell me what a stopword is?) In MeTA, this is done using a `ListFilter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File ‘lemur-stopwords.txt’ already there; not retrieving.\r\n",
      "\r\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[\"can't\", 'believe', 'costs', '19.95', 'find', '30']"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "!wget -nc https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt\n",
    "\n",
    "tok = metapy.analyzers.ListFilter(tok, \"lemur-stopwords.txt\", metapy.analyzers.ListFilter.Type.Reject)\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Here we've downloaded a common list of stopwords obtained from the [Lemur project](http://lemurproject.org) and created a `ListFilter` to reject any tokens that occur in that list of words.\n",
    "\n",
    "You can see how much of a difference removing stopwords can make on the size of a document's token stream! This translates to a lot of space savings in the inverted index as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Another common filter that people use is called a stemmer, or lemmatizer. This kind of filter tries to modify individual tokens in such a way that different inflected forms of a word all reduce to the same representation. This lets you, for example, find documents about a \"run\" when you search \"running\" or \"runs\". A common stemmer is the [Porter2 Stemmer](http://snowball.tartarus.org/algorithms/english/stemmer.html), which MeTA has an implementation of. Let's try it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"can't\", 'believ', 'cost', '19.95', 'find', '30']"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.Porter2Filter(tok)\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Notice how \"believe\" becomes \"believ\" and \"costs\" becomes \"cost\". Stemming can help search by allowing queries to return more matched documents by relaxing what it means for a document to match a query term. Note that it's important to ensure that queries are tokenized in the *exact same way* as your documents were before indexing them. If you ignore this, your query is unlikely to contain the raw token \"believ\" and you'll miss a lot of results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Finally, after you've got the token stream configured the way you'd like, it's time to analyze the document by consuming each token from its token stream and performing some actions based on these tokens. In the simplest case, which often is enough for \"good enough\" search results, our action can simply be counting how many times these tokens occur.\n",
    "\n",
    "For clarity, let's switch back to a simpler token stream first. Write me a token stream that tokenizes using the Unicode standard, and then lowercases each token. (Hint: `help(metapy.analyzers)`.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['i',\n",
       " 'said',\n",
       " 'that',\n",
       " 'i',\n",
       " \"can't\",\n",
       " 'believe',\n",
       " 'that',\n",
       " 'it',\n",
       " 'only',\n",
       " 'costs',\n",
       " '$',\n",
       " '19.95',\n",
       " '!',\n",
       " 'i',\n",
       " 'could',\n",
       " 'only',\n",
       " 'find',\n",
       " 'it',\n",
       " 'for',\n",
       " 'more',\n",
       " 'than',\n",
       " '$',\n",
       " '30',\n",
       " 'before',\n",
       " '.']"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)\n",
    "tok = metapy.analyzers.LowercaseFilter(tok)\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's count how often each individual token appears in the stream. You might have called this representation the \"bag of words\" representation, but it is also often called \"unigram word counts\". In MeTA, classes that consume a token stream and emit a document representation are called Analyzers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'!': 1,\n",
       " '$': 2,\n",
       " '.': 1,\n",
       " '19.95': 1,\n",
       " '30': 1,\n",
       " 'before': 1,\n",
       " 'believe': 1,\n",
       " \"can't\": 1,\n",
       " 'costs': 1,\n",
       " 'could': 1,\n",
       " 'find': 1,\n",
       " 'for': 1,\n",
       " 'i': 3,\n",
       " 'it': 2,\n",
       " 'more': 1,\n",
       " 'only': 2,\n",
       " 'said': 1,\n",
       " 'than': 1,\n",
       " 'that': 2}"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ana = metapy.analyzers.NGramWordAnalyzer(1, tok)\n",
    "print(doc.content())\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "If you noticed the name of the analyzer, you might have realized that you can count not just individual tokens, but groups of them. \"Unigram\" means \"1-gram\", and we count individual tokens. \"Bigram\" means \"2-gram\", and we count adjacent tokens together as a group. Let's try that now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{('!', 'i'): 1,\n",
       " ('$', '19.95'): 1,\n",
       " ('$', '30'): 1,\n",
       " ('19.95', '!'): 1,\n",
       " ('30', 'before'): 1,\n",
       " ('before', '.'): 1,\n",
       " ('believe', 'that'): 1,\n",
       " (\"can't\", 'believe'): 1,\n",
       " ('costs', '$'): 1,\n",
       " ('could', 'only'): 1,\n",
       " ('find', 'it'): 1,\n",
       " ('for', 'more'): 1,\n",
       " ('i', \"can't\"): 1,\n",
       " ('i', 'could'): 1,\n",
       " ('i', 'said'): 1,\n",
       " ('it', 'for'): 1,\n",
       " ('it', 'only'): 1,\n",
       " ('more', 'than'): 1,\n",
       " ('only', 'costs'): 1,\n",
       " ('only', 'find'): 1,\n",
       " ('said', 'that'): 1,\n",
       " ('than', '$'): 1,\n",
       " ('that', 'i'): 1,\n",
       " ('that', 'it'): 1}"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ana = metapy.analyzers.NGramWordAnalyzer(2, tok)\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now the individual \"tokens\" we're counting are pairs of tokens. You can analyze any n-gram of tokens you would like to in this way (and this is a simple way to attempt to support phrase search). Note, however, that as you increase the size of the n-grams you are counting, you are also increasing (exponentially!) the number of possible n-grams you could observe, so there's no free lunch here.\n",
    "\n",
    "This analysis pipeline feeds both the creation of the `InvertedIndex`, which is used for search applications, and the `ForwardIndex`, which is used for topic modeling and classification applications. For classification, sometimes looking at n-grams of characters is useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{(' ', '$', '1', '9'): 1,\n",
       " (' ', '$', '3', '0'): 1,\n",
       " (' ', 'I', ' ', 'c'): 2,\n",
       " (' ', 'b', 'e', 'f'): 1,\n",
       " (' ', 'b', 'e', 'l'): 1,\n",
       " (' ', 'c', 'a', 'n'): 1,\n",
       " (' ', 'c', 'o', 's'): 1,\n",
       " (' ', 'c', 'o', 'u'): 1,\n",
       " (' ', 'f', 'i', 'n'): 1,\n",
       " (' ', 'f', 'o', 'r'): 1,\n",
       " (' ', 'i', 't', ' '): 2,\n",
       " (' ', 'm', 'o', 'r'): 1,\n",
       " (' ', 'o', 'n', 'l'): 2,\n",
       " (' ', 's', 'a', 'i'): 1,\n",
       " (' ', 't', 'h', 'a'): 3,\n",
       " ('!', ' ', 'I', ' '): 1,\n",
       " ('$', '1', '9', '.'): 1,\n",
       " ('$', '3', '0', ' '): 1,\n",
       " (\"'\", 't', ' ', 'b'): 1,\n",
       " ('.', '9', '5', '!'): 1,\n",
       " ('0', ' ', 'b', 'e'): 1,\n",
       " ('1', '9', '.', '9'): 1,\n",
       " ('3', '0', ' ', 'b'): 1,\n",
       " ('5', '!', ' ', 'I'): 1,\n",
       " ('9', '.', '9', '5'): 1,\n",
       " ('9', '5', '!', ' '): 1,\n",
       " ('I', ' ', 'c', 'a'): 1,\n",
       " ('I', ' ', 'c', 'o'): 1,\n",
       " ('I', ' ', 's', 'a'): 1,\n",
       " ('a', 'i', 'd', ' '): 1,\n",
       " ('a', 'n', ' ', '$'): 1,\n",
       " ('a', 'n', \"'\", 't'): 1,\n",
       " ('a', 't', ' ', 'I'): 1,\n",
       " ('a', 't', ' ', 'i'): 1,\n",
       " ('b', 'e', 'f', 'o'): 1,\n",
       " ('b', 'e', 'l', 'i'): 1,\n",
       " ('c', 'a', 'n', \"'\"): 1,\n",
       " ('c', 'o', 's', 't'): 1,\n",
       " ('c', 'o', 'u', 'l'): 1,\n",
       " ('d', ' ', 'i', 't'): 1,\n",
       " ('d', ' ', 'o', 'n'): 1,\n",
       " ('d', ' ', 't', 'h'): 1,\n",
       " ('e', ' ', 't', 'h'): 2,\n",
       " ('e', 'f', 'o', 'r'): 1,\n",
       " ('e', 'l', 'i', 'e'): 1,\n",
       " ('e', 'v', 'e', ' '): 1,\n",
       " ('f', 'i', 'n', 'd'): 1,\n",
       " ('f', 'o', 'r', ' '): 1,\n",
       " ('f', 'o', 'r', 'e'): 1,\n",
       " ('h', 'a', 'n', ' '): 1,\n",
       " ('h', 'a', 't', ' '): 2,\n",
       " ('i', 'd', ' ', 't'): 1,\n",
       " ('i', 'e', 'v', 'e'): 1,\n",
       " ('i', 'n', 'd', ' '): 1,\n",
       " ('i', 't', ' ', 'f'): 1,\n",
       " ('i', 't', ' ', 'o'): 1,\n",
       " ('l', 'd', ' ', 'o'): 1,\n",
       " ('l', 'i', 'e', 'v'): 1,\n",
       " ('l', 'y', ' ', 'c'): 1,\n",
       " ('l', 'y', ' ', 'f'): 1,\n",
       " ('m', 'o', 'r', 'e'): 1,\n",
       " ('n', ' ', '$', '3'): 1,\n",
       " ('n', \"'\", 't', ' '): 1,\n",
       " ('n', 'd', ' ', 'i'): 1,\n",
       " ('n', 'l', 'y', ' '): 2,\n",
       " ('o', 'n', 'l', 'y'): 2,\n",
       " ('o', 'r', ' ', 'm'): 1,\n",
       " ('o', 'r', 'e', ' '): 1,\n",
       " ('o', 'r', 'e', '.'): 1,\n",
       " ('o', 's', 't', 's'): 1,\n",
       " ('o', 'u', 'l', 'd'): 1,\n",
       " ('r', ' ', 'm', 'o'): 1,\n",
       " ('r', 'e', ' ', 't'): 1,\n",
       " ('s', ' ', '$', '1'): 1,\n",
       " ('s', 'a', 'i', 'd'): 1,\n",
       " ('s', 't', 's', ' '): 1,\n",
       " ('t', ' ', 'I', ' '): 1,\n",
       " ('t', ' ', 'b', 'e'): 1,\n",
       " ('t', ' ', 'f', 'o'): 1,\n",
       " ('t', ' ', 'i', 't'): 1,\n",
       " ('t', ' ', 'o', 'n'): 1,\n",
       " ('t', 'h', 'a', 'n'): 1,\n",
       " ('t', 'h', 'a', 't'): 2,\n",
       " ('t', 's', ' ', '$'): 1,\n",
       " ('u', 'l', 'd', ' '): 1,\n",
       " ('v', 'e', ' ', 't'): 1,\n",
       " ('y', ' ', 'c', 'o'): 1,\n",
       " ('y', ' ', 'f', 'i'): 1}"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.CharacterTokenizer()\n",
    "ana = metapy.analyzers.NGramWordAnalyzer(4, tok)\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Different analyzers can be combined together to create document representations that have many unique perspectives. Once things start to get more complicated, we recommend using a configuration file to specify each of the analyzers you wish to combine for your document representation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's explore something a little bit different. MeTA also has a natural language processing (NLP) component, which currently supports two major NLP tasks: part-of-speech tagging and syntactic parsing.\n",
    "\n",
    "(Does anyone know what part-of-speech tagging is?) POS tagging is a task in NLP that involves identifying a type for each word in a sentence. For example, POS tagging can be used to identify all of the nouns in a sentence, or all of the verbs, or adjectives, or... This is useful as first step towards developing an understanding of the meaning of a particular sentence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "MeTA places its POS tagging component in its \"sequences\" library. Let's play with some sequences first to get an idea of how they work. We'll start of by creating a sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "seq = metapy.sequence.Sequence()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, we can add individual words to this sequence. Sequences consist of a list of `Observation`s, which are essentially (word, tag) pairs. If we don't yet know the tags for a `Sequence`, we can just add individual words and leave the tags unset. Words are called \"symbols\" in the library terminology."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(The, ???), (dog, ???), (ran, ???), (across, ???), (the, ???), (park, ???), (., ???)\n"
     ]
    }
   ],
   "source": [
    "for word in [\"The\", \"dog\", \"ran\", \"across\", \"the\", \"park\", \".\"]:\n",
    "    seq.add_symbol(word)\n",
    "print(seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The printed form of the sequence shows that we do not yet know the tags for each word. Let's fill them in by using a pre-trained POS-tagger model that's distributed with MeTA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File ‘greedy-perceptron-tagger.tar.gz’ already there; not retrieving.\n",
      "\n",
      "perceptron-tagger/\n",
      "perceptron-tagger/feature.mapping.gz\n",
      "perceptron-tagger/label.mapping\n",
      "perceptron-tagger/tagger.model.gz\n"
     ]
    }
   ],
   "source": [
    "!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz\n",
    "!tar xvf greedy-perceptron-tagger.tar.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " > Loading feature mapping: [================================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    }
   ],
   "source": [
    "tagger = metapy.sequence.PerceptronTagger(\"perceptron-tagger/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now let's fill in the missing tags in our sentence based on the best guess this model has."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(The, DT), (dog, NN), (ran, VBD), (across, IN), (the, DT), (park, NN), (., .)\n"
     ]
    }
   ],
   "source": [
    "tagger.tag(seq)\n",
    "print(seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Each tag indicates the type of a word, and this particular tagger was trained to output the tags present in the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).\n",
    "\n",
    "But what if we want to POS-tag a document?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.\n"
     ]
    }
   ],
   "source": [
    "print(doc.content())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We need a way of going from a document to a list of `Sequence`s, each representing an individual sentence. I'll get you started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<s>',\n",
       " 'I',\n",
       " 'said',\n",
       " 'that',\n",
       " 'I',\n",
       " 'ca',\n",
       " \"n't\",\n",
       " 'believe',\n",
       " 'that',\n",
       " 'it',\n",
       " 'only',\n",
       " 'costs',\n",
       " '$',\n",
       " '19.95',\n",
       " '!',\n",
       " '</s>',\n",
       " '<s>',\n",
       " 'I',\n",
       " 'could',\n",
       " 'only',\n",
       " 'find',\n",
       " 'it',\n",
       " 'for',\n",
       " 'more',\n",
       " 'than',\n",
       " '$',\n",
       " '30',\n",
       " 'before',\n",
       " '.',\n",
       " '</s>']"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.ICUTokenizer() # keep sentence boundaries!\n",
    "tok = metapy.analyzers.PennTreebankNormalizer(tok)\n",
    "tok.set_content(doc.content())\n",
    "[token for token in tok]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "(Notice that the `PennTreebankNormalizer` modifies some tokens to better match the conventions of the Penn Treebank training data. This should help improve performance a little.)\n",
    "\n",
    "Now, write me a function that can take a token stream that contains sentence boundary tags and returns a list of `Sequence` objects. Don't include the sentence boundary tags in the actual `Sequence` objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def extract_sequences(tok):\n",
    "    sequences = []\n",
    "    for token in tok:\n",
    "        if token == '<s>':\n",
    "            sequences.append(metapy.sequence.Sequence())\n",
    "        elif token != '</s>':\n",
    "            sequences[-1].add_symbol(token)            \n",
    "    return sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(I, PRP), (said, VBD), (that, IN), (I, PRP), (ca, MD), (n't, RB), (believe, VB), (that, IN), (it, PRP), (only, RB), (costs, VBZ), ($, $), (19.95, CD), (!, .)\n",
      "(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)\n"
     ]
    }
   ],
   "source": [
    "tok.set_content(doc.content())\n",
    "for seq in extract_sequences(tok):\n",
    "    tagger.tag(seq)\n",
    "    print(seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This is still a rather shallow understanding of these sentences. The next major leap is to parse these sequences of POS-tagged words to obtain a tree for each sentence. These trees, in our case, will represent the hierarchical phrase structure of a single sentence by grouping together tokens that belong to one phrase together, and showing how small phrases combine into larger phrases, and eventually a sentence.\n",
    "\n",
    "Let's try parsing the sentences in our document using a pre-tranned constituency parser that's distributed with MeTA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File ‘greedy-constituency-parser.tar.gz’ already there; not retrieving.\n",
      "\n",
      "parser/\n",
      "parser/parser.trans.gz\n",
      "parser/parser.model.gz\n"
     ]
    }
   ],
   "source": [
    "!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-constituency-parser.tar.gz\n",
    "!tar xvf greedy-constituency-parser.tar.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "parser = metapy.parser.Parser(\"parser/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I could only find it for more than $ 30 before .\n",
      "(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)\n",
      "(ROOT\n",
      "  (S\n",
      "    (NP (PRP I))\n",
      "    (VP\n",
      "      (MD could)\n",
      "      (ADVP (RB only))\n",
      "      (VP\n",
      "        (VB find)\n",
      "        (NP (PRP it))\n",
      "        (PP\n",
      "          (IN for)\n",
      "          (NP\n",
      "            (QP\n",
      "              (JJR more)\n",
      "              (IN than)\n",
      "              ($ $)\n",
      "              (CD 30))))\n",
      "        (ADVP (IN before))))\n",
      "    (. .)))\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(' '.join([obs.symbol for obs in seq]))\n",
    "print(seq)\n",
    "tree = parser.parse(seq)\n",
    "print(tree.pretty_str())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "(You can also play with this with a [prettier online demo](https://meta-toolkit.org/nlp-demo.html).)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can now parse all of the sentences in our document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(ROOT\n",
      "  (S\n",
      "    (NP (PRP I))\n",
      "    (VP\n",
      "      (VBD said)\n",
      "      (SBAR\n",
      "        (IN that)\n",
      "        (S\n",
      "          (NP (PRP I))\n",
      "          (VP\n",
      "            (MD ca)\n",
      "            (RB n't)\n",
      "            (VP\n",
      "              (VB believe)\n",
      "              (SBAR\n",
      "                (IN that)\n",
      "                (S\n",
      "                  (NP (PRP it))\n",
      "                  (ADVP (RB only))\n",
      "                  (VP\n",
      "                    (VBZ costs)\n",
      "                    (NP\n",
      "                      ($ $)\n",
      "                      (CD 19.95))))))))))\n",
      "    (. !)))\n",
      "\n",
      "(ROOT\n",
      "  (S\n",
      "    (NP (PRP I))\n",
      "    (VP\n",
      "      (MD could)\n",
      "      (ADVP (RB only))\n",
      "      (VP\n",
      "        (VB find)\n",
      "        (NP (PRP it))\n",
      "        (PP\n",
      "          (IN for)\n",
      "          (NP\n",
      "            (QP\n",
      "              (JJR more)\n",
      "              (IN than)\n",
      "              ($ $)\n",
      "              (CD 30))))\n",
      "        (ADVP (IN before))))\n",
      "    (. .)))\n",
      "\n"
     ]
    }
   ],
   "source": [
    "tok.set_content(doc.content())\n",
    "for seq in extract_sequences(tok):\n",
    "    tagger.tag(seq)\n",
    "    print(parser.parse(seq).pretty_str())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now that we know how POS-tagging and syntactic parsing works in MeTA, let's explore some features that we can add to our document representations using these techniques.\n",
    "\n",
    "The simplest feature we can imagine that uses the POS-taggged sequences might be n-grams of POS tags. (As a quick detour, we'll need to download and extract a CRF-based POS tagging model.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File ‘crf.tar.gz’ already there; not retrieving.\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/crf.tar.gz\n",
    "!tar xf crf.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, we can use the following analysis pipeline to get n-gram POS tag features by using the `NGRamPOSAnalyzer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " > Loading feature mapping: [================================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{('$', 'CD'): 2,\n",
       " ('CD', '.'): 1,\n",
       " ('CD', 'RB'): 1,\n",
       " ('IN', '$'): 1,\n",
       " ('IN', 'JJR'): 1,\n",
       " ('IN', 'PRP'): 2,\n",
       " ('JJR', 'IN'): 1,\n",
       " ('MD', 'RB'): 2,\n",
       " ('PRP', 'IN'): 1,\n",
       " ('PRP', 'MD'): 2,\n",
       " ('PRP', 'RB'): 1,\n",
       " ('PRP', 'VBD'): 1,\n",
       " ('RB', '.'): 1,\n",
       " ('RB', 'VB'): 2,\n",
       " ('RB', 'VBZ'): 1,\n",
       " ('VB', 'IN'): 1,\n",
       " ('VB', 'PRP'): 1,\n",
       " ('VBD', 'IN'): 1,\n",
       " ('VBZ', '$'): 1}"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = metapy.analyzers.ICUTokenizer()\n",
    "tok = metapy.analyzers.PennTreebankNormalizer(tok)\n",
    "ana = metapy.analyzers.NGramPOSAnalyzer(2, tok, 'crf')\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can also parse the sentences in the document and extract a number of different structural features from the parse trees using a `TreeAnalyzer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " > Loading feature mapping: [================================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    }
   ],
   "source": [
    "ana = metapy.analyzers.TreeAnalyzer(tok, 'perceptron-tagger', 'parser')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The `TreeAnalyzer` has a function `add()` that takes `TreeFeaturizer` subclasses. Conceptually, the extraction of structural features from parse trees looks something like this:\n",
    "\n",
    "1. The tokenizer is run until a full sentence is read.\n",
    "2. The greedy perceptron tagger is run to tag the words in the sentence.\n",
    "3. The shift-reduce constituency parser is run to produce a parse tree.\n",
    "4. Each `TreeFeaturizer` that is part of the `TreeAnalayzer` is run over the parse tree to produce features.\n",
    "\n",
    "This process is repeated for each sentence found in the document.\n",
    "\n",
    "Let's try adding just one `TreeFeaturizer` to the analyzer for now and see what features we get."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'depth-12': 1, 'depth-8': 1}"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ana.add(metapy.analyzers.DepthFeaturizer())\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The featurizer we used here simply extracts the depth of each subtree and creates a new feature for each depth encountered.\n",
    "\n",
    "We can also see some features that utilize the structure of the trees if we use some different `TreeFeaturizer`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " > Loading feature mapping: [================================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'subtree-($)': 2,\n",
       " 'subtree-(.)': 2,\n",
       " 'subtree-(ADVP (IN))': 1,\n",
       " 'subtree-(ADVP (RB))': 2,\n",
       " 'subtree-(CD)': 2,\n",
       " 'subtree-(IN)': 5,\n",
       " 'subtree-(JJR)': 1,\n",
       " 'subtree-(MD)': 2,\n",
       " 'subtree-(NP ($) (CD))': 1,\n",
       " 'subtree-(NP (PRP))': 5,\n",
       " 'subtree-(NP (QP))': 1,\n",
       " 'subtree-(PP (IN) (NP))': 1,\n",
       " 'subtree-(PRP)': 5,\n",
       " 'subtree-(QP (JJR) (IN) ($) (CD))': 1,\n",
       " 'subtree-(RB)': 3,\n",
       " 'subtree-(ROOT (S))': 2,\n",
       " 'subtree-(S (NP) (ADVP) (VP))': 1,\n",
       " 'subtree-(S (NP) (VP) (.))': 2,\n",
       " 'subtree-(S (NP) (VP))': 1,\n",
       " 'subtree-(SBAR (IN) (S))': 2,\n",
       " 'subtree-(VB)': 2,\n",
       " 'subtree-(VBD)': 1,\n",
       " 'subtree-(VBZ)': 1,\n",
       " 'subtree-(VP (MD) (ADVP) (VP))': 1,\n",
       " 'subtree-(VP (MD) (RB) (VP))': 1,\n",
       " 'subtree-(VP (VB) (NP) (PP) (ADVP))': 1,\n",
       " 'subtree-(VP (VB) (SBAR))': 1,\n",
       " 'subtree-(VP (VBD) (SBAR))': 1,\n",
       " 'subtree-(VP (VBZ) (NP))': 1}"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ana = metapy.analyzers.TreeAnalyzer(tok, 'perceptron-tagger', 'parser')\n",
    "ana.add(metapy.analyzers.SubtreeFeaturizer())\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The `SubtreeFeaturizer` creates a new feature for each unique subtree seen in the data, to a depth of 1. This can create quite a lot of features, but describes how the sentence is decomposed structureally. This kind of feature is also known as a \"rewrite rule\" feature.\n",
    "\n",
    "We can also ignore the labels of the subtrees entirely and just extract their structure if we use a `SkeletonFeaturizer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " > Loading feature mapping: [================================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'(((())(()(()((())(()()(()(()((())(())(()(()())))))))))()))': 1,\n",
       " '(((())(()(())(()(())(()((()()()())))(())))()))': 1,\n",
       " '((()()()()))': 1,\n",
       " '((())(()(()((())(()()(()(()((())(())(()(()())))))))))())': 1,\n",
       " '((())(()(())(()(())(()((()()()())))(())))())': 1,\n",
       " '((())(()()(()(()((())(())(()(()())))))))': 1,\n",
       " '((())(())(()(()())))': 1,\n",
       " '(()((()()()())))': 1,\n",
       " '(()((())(()()(()(()((())(())(()(()()))))))))': 1,\n",
       " '(()((())(())(()(()()))))': 1,\n",
       " '(()(()((())(()()(()(()((())(())(()(()())))))))))': 1,\n",
       " '(()(()((())(())(()(()())))))': 1,\n",
       " '(()(()()))': 1,\n",
       " '(()(())(()((()()()())))(()))': 1,\n",
       " '(()(())(()(())(()((()()()())))(())))': 1,\n",
       " '(()()(()(()((())(())(()(()()))))))': 1,\n",
       " '(()()()())': 1,\n",
       " '(()())': 1,\n",
       " '(())': 8,\n",
       " '()': 26}"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ana = metapy.analyzers.TreeAnalyzer(tok, 'perceptron-tagger', 'parser')\n",
    "ana.add(metapy.analyzers.SkeletonFeaturizer())\n",
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Play with the other featurizers to see what they do!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In practice, it is often beneficial to combine multiple feature sets together. We can do this with a `MultiAnalyzer`. Let's combine unigram words, bigram POS tags, and rewrite rules for our document feature representation.\n",
    "\n",
    "We can certainly do this programmatically, but doing so can become tedious quite quickly. Instead, let's use MeTA's configuration file format to specify our analyzer, which we can then load in one line of code. MeTA uses [TOML](https://en.wikipedia.org/wiki/TOML) configuration files for all of its configuration. If you haven't heard of TOML before, don't panic! It's a very simple, readable format that looks like old school INI files.\n",
    "\n",
    "Let's create a simple configuration file now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "config = \"\"\"stop-words = \"lemur-stopwords.txt\"\n",
    "\n",
    "[[analyzers]]\n",
    "method = \"ngram-word\"\n",
    "ngram = 1\n",
    "filter = \"default-unigram-chain\"\n",
    "\n",
    "[[analyzers]]\n",
    "method = \"ngram-pos\"\n",
    "ngram = 2\n",
    "filter = [{type = \"icu-tokenizer\"}, {type = \"ptb-normalizer\"}]\n",
    "crf-prefix = \"crf\"\n",
    "\n",
    "[[analyzers]]\n",
    "method = \"tree\"\n",
    "filter = [{type = \"icu-tokenizer\"}, {type = \"ptb-normalizer\"}]\n",
    "features = [\"subtree\"]\n",
    "tagger = \"perceptron-tagger/\"\n",
    "parser = \"parser/\"\n",
    "\"\"\"\n",
    "with open('config.toml', 'w') as f:\n",
    "    f.write(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Each `[[analyzers]]` block defines another analyzer to combine for our feature representation. Since \"ngram-word\" is such a common analyzer, we have defined some default filter chains that can be used with shortcuts. \"default-unigram-chain\" is a filter chain suitable for unigram words; \"default-chain\" is a filter chain suitable for bigram words and above.\n",
    "\n",
    "We can now load an analyzer from this configuration file like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " > Loading feature mapping: [================================] 100% ETA 00:00:00 \n",
      " \n",
      " > Loading feature mapping: [================================] 100% ETA 00:00:00  \n",
      " \n"
     ]
    }
   ],
   "source": [
    "ana = metapy.analyzers.load('config.toml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now let's see what we get!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'$_CD': 2,\n",
       " 'CD_.': 1,\n",
       " 'CD_RB': 1,\n",
       " 'IN_$': 1,\n",
       " 'IN_JJR': 1,\n",
       " 'IN_PRP': 2,\n",
       " 'JJR_IN': 1,\n",
       " 'MD_RB': 2,\n",
       " 'PRP_IN': 1,\n",
       " 'PRP_MD': 2,\n",
       " 'PRP_RB': 1,\n",
       " 'PRP_VBD': 1,\n",
       " 'RB_.': 1,\n",
       " 'RB_VB': 2,\n",
       " 'RB_VBZ': 1,\n",
       " 'VBD_IN': 1,\n",
       " 'VBZ_$': 1,\n",
       " 'VB_IN': 1,\n",
       " 'VB_PRP': 1,\n",
       " 'believ': 1,\n",
       " \"can't\": 1,\n",
       " 'cost': 1,\n",
       " 'find': 1,\n",
       " 'subtree-($)': 2,\n",
       " 'subtree-(.)': 2,\n",
       " 'subtree-(ADVP (IN))': 1,\n",
       " 'subtree-(ADVP (RB))': 2,\n",
       " 'subtree-(CD)': 2,\n",
       " 'subtree-(IN)': 5,\n",
       " 'subtree-(JJR)': 1,\n",
       " 'subtree-(MD)': 2,\n",
       " 'subtree-(NP ($) (CD))': 1,\n",
       " 'subtree-(NP (PRP))': 5,\n",
       " 'subtree-(NP (QP))': 1,\n",
       " 'subtree-(PP (IN) (NP))': 1,\n",
       " 'subtree-(PRP)': 5,\n",
       " 'subtree-(QP (JJR) (IN) ($) (CD))': 1,\n",
       " 'subtree-(RB)': 3,\n",
       " 'subtree-(ROOT (S))': 2,\n",
       " 'subtree-(S (NP) (ADVP) (VP))': 1,\n",
       " 'subtree-(S (NP) (VP) (.))': 2,\n",
       " 'subtree-(S (NP) (VP))': 1,\n",
       " 'subtree-(SBAR (IN) (S))': 2,\n",
       " 'subtree-(VB)': 2,\n",
       " 'subtree-(VBD)': 1,\n",
       " 'subtree-(VBZ)': 1,\n",
       " 'subtree-(VP (MD) (ADVP) (VP))': 1,\n",
       " 'subtree-(VP (MD) (RB) (VP))': 1,\n",
       " 'subtree-(VP (VB) (NP) (PP) (ADVP))': 1,\n",
       " 'subtree-(VP (VB) (SBAR))': 1,\n",
       " 'subtree-(VP (VBD) (SBAR))': 1,\n",
       " 'subtree-(VP (VBZ) (NP))': 1}"
      ]
     },
     "execution_count": 121,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ana.analyze(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 2: Information Retrieval with MeTA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In this part of the tutorial, we'll play with the first major application of MeTA: search engines. We will be having the first contest in this part! Once we finish going through how to create an inverted index, search it, and evaluate retrieval algorithms, I will give you instructions on how to participate in the competition. There will be a leader board to keep track of the best submissions, and I intend on leaving it running until the end of the conference for people to play around with."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's get a publicly available retrieval dataset with relevance judgments first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-08-16 19:19:41--  https://meta-toolkit.org/data/2016-11-10/cranfield.tar.gz\n",
      "Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'\n",
      "Resolving meta-toolkit.org... 50.116.41.177, 2600:3c02::f03c:91ff:feae:b777\n",
      "Connecting to meta-toolkit.org|50.116.41.177|:443... connected.\n",
      "HTTP request sent, awaiting response... 304 Not Modified\n",
      "File ‘cranfield.tar.gz’ not modified on server. Omitting download.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget -N https://meta-toolkit.org/data/2016-11-10/cranfield.tar.gz\n",
    "!tar xf cranfield.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We're going to add a flag to our corpus' configuration file to force it to store full text for later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "with open('cranfield/tutorial.toml', 'w') as f:\n",
    "    f.write('type = \"line-corpus\"\\n')\n",
    "    f.write('store-full-text = true\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's set up a MeTA configuration file up to index the `cranfield` dataset we just downloaded using the default unigram words filter chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "config = \"\"\"prefix = \".\" # tells MeTA where to search for datasets\n",
    "\n",
    "dataset = \"cranfield\" # a subfolder under the prefix directory\n",
    "corpus = \"tutorial.toml\" # a configuration file for the corpus specifying its format & additional args\n",
    "\n",
    "index = \"cranfield-idx\" # subfolder of the current working directory to place index files\n",
    "\n",
    "query-judgements = \"cranfield/cranfield-qrels.txt\" # file containing the relevance judgments for this dataset\n",
    "\n",
    "stop-words = \"lemur-stopwords.txt\"\n",
    "\n",
    "[[analyzers]]\n",
    "method = \"ngram-word\"\n",
    "ngram = 1\n",
    "filter = \"default-unigram-chain\"\n",
    "\"\"\"\n",
    "with open('cranfield-config.toml', 'w') as f:\n",
    "    f.write(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's index our data using the `InvertedIndex` format. In a search engine, we want to quickly determine what documents mention a specific query term, so the `InvertedIndex` stores a mapping from term to a list of documents that contain that term (along with how many times they do)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1502921982: [info]     Loading index from disk: cranfield-idx/inv (/tmp/pip-bneszy3v-build/deps/meta/src/index/inverted_index.cpp:171)\n",
      "1502921982: [info]     Loading index from disk: cranfield-idx/inv (/tmp/pip-bneszy3v-build/deps/meta/src/index/inverted_index.cpp:171)\n"
     ]
    }
   ],
   "source": [
    "inv_idx = metapy.index.make_inverted_index('cranfield-config.toml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This may take a minute at first, since the index needs to be built. Subsequent calls to `make_inverted_index` with this config file will simply load the index, which will not take any time.\n",
    "\n",
    "Here's how we can interact with the index object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1400"
      ]
     },
     "execution_count": 126,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inv_idx.num_docs()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4137"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inv_idx.unique_terms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "87.17857360839844"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inv_idx.avg_doc_length()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "122050"
      ]
     },
     "execution_count": 129,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inv_idx.total_corpus_terms()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's search our index. We'll start by creating a ranker:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "ranker = metapy.index.OkapiBM25()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now we need a query. Let's create an example query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "query = metapy.index.Document()\n",
    "query.content(\"flow equilibrium\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now we can use this to search our index like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(235, 6.424363136291504),\n",
       " (1009, 6.096038818359375),\n",
       " (1229, 5.877272129058838),\n",
       " (1251, 5.866937160491943),\n",
       " (316, 5.859640121459961)]"
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_docs = ranker.score(inv_idx, query, num_results=5)\n",
    "top_docs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We are returned a ranked list of *(doc_id, score)* pairs. The scores are from the ranker, which in this case was Okapi BM25. Since the `tutorial.toml` file we created for the cranfield dataset has `store-full-text = true`, we can verify the content of our top documents by inspecting the document metadata field \"content\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. criteria for thermodynamic equilibrium in gas flow . when gases flow at high velocity, the rates of internal processes may not be fast enough to maintain thermodynamic equilibrium .  by defining quasi-equilibrium in flow as the condition in which the...\n",
      "\n",
      "2. free-flight measurements of the static and dynamic . air-flow properties in nozzles were calculated and charted for equilibrium flow and two types of frozen flows .  in one type of frozen flow, air was assumed to be in equilibrium from the nozzle res...\n",
      "\n",
      "3. hypersonic nozzle expansion of air with atom recombination present . an experimental investigation on the expansion of high- temperature, high-pressure air to hypersonic flow mach numbers in a conical nozzle of a hypersonic shock tunnel has been carr...\n",
      "\n",
      "4. on the approach to chemical and vibrational equilibrium behind a strong normal shock wave . the concurrent approach to chemical and vibrational equilibrium of a pure diatomic gas passing through a strong normal shock wave is investigated .  it is dem...\n",
      "\n",
      "5. non-equilibrium flow of an ideal dissociating gas . the theory of an'ideal dissociating'gas developed by lighthill/1957/for conditions of thermodynamic equilibrium is extended to non-equilibrium conditions by postulating a simple rate equation for th...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for num, (d_id, _) in enumerate(top_docs):\n",
    "    content = inv_idx.metadata(d_id).get('content')\n",
    "    print(\"{}. {}...\\n\".format(num + 1, content[0:250]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Since we have the queries file and relevance judgements, we can do an IR evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "ev = metapy.index.IREval('cranfield-config.toml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We will loop over the queries file and add each result to the `IREval` object `ev`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query 1 average precision: 0.24166666666666664\n",
      "Query 2 average precision: 0.4196428571428571\n",
      "Query 3 average precision: 0.6383928571428572\n",
      "Query 4 average precision: 0.25\n",
      "Query 5 average precision: 0.3333333333333333\n",
      "Query 6 average precision: 0.125\n",
      "Query 7 average precision: 0.11666666666666665\n",
      "Query 8 average precision: 0.1\n",
      "Query 9 average precision: 0.6388888888888888\n",
      "Query 10 average precision: 0.0625\n",
      "Query 11 average precision: 0.09285714285714286\n",
      "Query 12 average precision: 0.18\n",
      "Query 13 average precision: 0.0\n",
      "Query 14 average precision: 0.5\n",
      "Query 15 average precision: 1.0\n",
      "Query 16 average precision: 0.16666666666666666\n",
      "Query 17 average precision: 0.08333333333333333\n",
      "Query 18 average precision: 0.3333333333333333\n",
      "Query 19 average precision: 0.0\n",
      "Query 20 average precision: 0.4302469135802469\n",
      "Query 21 average precision: 0.0\n",
      "Query 22 average precision: 0.0\n",
      "Query 23 average precision: 0.19952380952380952\n",
      "Query 24 average precision: 0.3333333333333333\n",
      "Query 25 average precision: 0.6507936507936507\n",
      "Query 26 average precision: 0.19444444444444442\n",
      "Query 27 average precision: 0.12962962962962962\n",
      "Query 28 average precision: 0.0\n",
      "Query 29 average precision: 0.35\n",
      "Query 30 average precision: 0.023809523809523808\n",
      "Query 31 average precision: 0.0\n",
      "Query 32 average precision: 0.1111111111111111\n",
      "Query 33 average precision: 0.6388888888888888\n",
      "Query 34 average precision: 0.1111111111111111\n",
      "Query 35 average precision: 0.0\n",
      "Query 36 average precision: 0.5\n",
      "Query 37 average precision: 0.0\n",
      "Query 38 average precision: 0.0\n",
      "Query 39 average precision: 0.1\n",
      "Query 40 average precision: 0.045\n",
      "Query 41 average precision: 0.6666666666666666\n",
      "Query 42 average precision: 0.16714285714285712\n",
      "Query 43 average precision: 0.4583333333333333\n",
      "Query 44 average precision: 0.0\n",
      "Query 45 average precision: 0.1\n",
      "Query 46 average precision: 0.4058333333333334\n",
      "Query 47 average precision: 0.27341269841269844\n",
      "Query 48 average precision: 0.17666666666666667\n",
      "Query 49 average precision: 0.1\n",
      "Query 50 average precision: 0.05555555555555555\n",
      "Query 51 average precision: 0.4730952380952381\n",
      "Query 52 average precision: 0.47916666666666663\n",
      "Query 53 average precision: 0.15222222222222223\n",
      "Query 54 average precision: 0.05555555555555555\n",
      "Query 55 average precision: 0.19444444444444445\n",
      "Query 56 average precision: 0.1\n",
      "Query 57 average precision: 0.03333333333333333\n",
      "Query 58 average precision: 0.0380952380952381\n",
      "Query 59 average precision: 0.027777777777777776\n",
      "Query 60 average precision: 0.42000000000000004\n",
      "Query 61 average precision: 0.5638888888888889\n",
      "Query 62 average precision: 0.0\n",
      "Query 63 average precision: 0.0\n",
      "Query 64 average precision: 0.5\n",
      "Query 65 average precision: 0.24\n",
      "Query 66 average precision: 0.02857142857142857\n",
      "Query 67 average precision: 0.575\n",
      "Query 68 average precision: 0.04\n",
      "Query 69 average precision: 0.02857142857142857\n",
      "Query 70 average precision: 0.05\n",
      "Query 71 average precision: 0.017857142857142856\n",
      "Query 72 average precision: 0.12\n",
      "Query 73 average precision: 0.4680952380952381\n",
      "Query 74 average precision: 0.020833333333333332\n",
      "Query 75 average precision: 0.0\n",
      "Query 76 average precision: 0.07142857142857142\n",
      "Query 77 average precision: 0.2333333333333333\n",
      "Query 78 average precision: 0.7222222222222222\n",
      "Query 79 average precision: 0.0\n",
      "Query 80 average precision: 0.0\n",
      "Query 81 average precision: 0.75\n",
      "Query 82 average precision: 0.24\n",
      "Query 83 average precision: 0.0625\n",
      "Query 84 average precision: 0.3\n",
      "Query 85 average precision: 0.25\n",
      "Query 86 average precision: 0.5833333333333333\n",
      "Query 87 average precision: 0.0\n",
      "Query 88 average precision: 0.6496031746031746\n",
      "Query 89 average precision: 0.05555555555555555\n",
      "Query 90 average precision: 0.15607142857142856\n",
      "Query 91 average precision: 0.2577160493827161\n",
      "Query 92 average precision: 0.5014285714285714\n",
      "Query 93 average precision: 0.5\n",
      "Query 94 average precision: 0.5264285714285715\n",
      "Query 95 average precision: 0.5\n",
      "Query 96 average precision: 0.38976190476190475\n",
      "Query 97 average precision: 0.15416666666666665\n",
      "Query 98 average precision: 0.0\n",
      "Query 99 average precision: 0.18333333333333335\n",
      "Query 100 average precision: 0.16666666666666663\n",
      "Query 101 average precision: 0.6958333333333333\n",
      "Query 102 average precision: 0.3214285714285714\n",
      "Query 103 average precision: 0.0\n",
      "Query 104 average precision: 0.06666666666666667\n",
      "Query 105 average precision: 0.3833333333333333\n",
      "Query 106 average precision: 0.38571428571428573\n",
      "Query 107 average precision: 0.17261904761904762\n",
      "Query 108 average precision: 0.5901360544217686\n",
      "Query 109 average precision: 0.0\n",
      "Query 110 average precision: 0.125\n",
      "Query 111 average precision: 0.08333333333333333\n",
      "Query 112 average precision: 0.25\n",
      "Query 113 average precision: 0.08333333333333333\n",
      "Query 114 average precision: 0.0\n",
      "Query 115 average precision: 0.05\n",
      "Query 116 average precision: 0.05\n",
      "Query 117 average precision: 0.0\n",
      "Query 118 average precision: 0.21666666666666667\n",
      "Query 119 average precision: 1.0\n",
      "Query 120 average precision: 0.39589947089947086\n",
      "Query 121 average precision: 0.369047619047619\n",
      "Query 122 average precision: 0.21164021164021163\n",
      "Query 123 average precision: 0.0\n",
      "Query 124 average precision: 0.0\n",
      "Query 125 average precision: 0.2095238095238095\n",
      "Query 126 average precision: 0.20833333333333331\n",
      "Query 127 average precision: 0.05\n",
      "Query 128 average precision: 0.0\n",
      "Query 129 average precision: 0.369047619047619\n",
      "Query 130 average precision: 0.5\n",
      "Query 131 average precision: 0.10238095238095238\n",
      "Query 132 average precision: 0.48476190476190484\n",
      "Query 133 average precision: 0.05215419501133787\n",
      "Query 134 average precision: 0.25\n",
      "Query 135 average precision: 0.3839285714285714\n",
      "Query 136 average precision: 0.3333333333333333\n",
      "Query 137 average precision: 0.225\n",
      "Query 138 average precision: 0.1\n",
      "Query 139 average precision: 0.0\n",
      "Query 140 average precision: 0.13888888888888887\n",
      "Query 141 average precision: 0.075\n",
      "Query 142 average precision: 0.0\n",
      "Query 143 average precision: 0.7\n",
      "Query 144 average precision: 0.28439153439153436\n",
      "Query 145 average precision: 0.21995464852607707\n",
      "Query 146 average precision: 0.5833333333333333\n",
      "Query 147 average precision: 0.22666666666666666\n",
      "Query 148 average precision: 0.16666666666666666\n",
      "Query 149 average precision: 0.24861111111111106\n",
      "Query 150 average precision: 0.8333333333333333\n",
      "Query 151 average precision: 0.0\n",
      "Query 152 average precision: 0.0\n",
      "Query 153 average precision: 0.2738095238095238\n",
      "Query 154 average precision: 0.8333333333333333\n",
      "Query 155 average precision: 0.125\n",
      "Query 156 average precision: 0.5607142857142857\n",
      "Query 157 average precision: 0.29861111111111105\n",
      "Query 158 average precision: 0.3625\n",
      "Query 159 average precision: 0.043402777777777776\n",
      "Query 160 average precision: 0.1\n",
      "Query 161 average precision: 0.5\n",
      "Query 162 average precision: 0.10416666666666666\n",
      "Query 163 average precision: 0.24444444444444446\n",
      "Query 164 average precision: 0.31805555555555554\n",
      "Query 165 average precision: 0.5833333333333333\n",
      "Query 166 average precision: 0.013888888888888888\n",
      "Query 167 average precision: 0.5\n",
      "Query 168 average precision: 0.08333333333333333\n",
      "Query 169 average precision: 0.25\n",
      "Query 170 average precision: 0.5694444444444444\n",
      "Query 171 average precision: 0.6388888888888888\n",
      "Query 172 average precision: 0.6791666666666667\n",
      "Query 173 average precision: 1.0\n",
      "Query 174 average precision: 0.03333333333333333\n",
      "Query 175 average precision: 0.02\n",
      "Query 176 average precision: 0.0\n",
      "Query 177 average precision: 0.5888888888888889\n",
      "Query 178 average precision: 0.3333333333333333\n",
      "Query 179 average precision: 0.29166666666666663\n",
      "Query 180 average precision: 0.33095238095238094\n",
      "Query 181 average precision: 0.2\n",
      "Query 182 average precision: 0.5833333333333333\n",
      "Query 183 average precision: 0.5580952380952381\n",
      "Query 184 average precision: 0.21428571428571427\n",
      "Query 185 average precision: 0.6388888888888888\n",
      "Query 186 average precision: 0.16619047619047617\n",
      "Query 187 average precision: 0.13888888888888887\n",
      "Query 188 average precision: 0.3196428571428571\n",
      "Query 189 average precision: 0.05952380952380952\n",
      "Query 190 average precision: 0.3\n",
      "Query 191 average precision: 0.05333333333333333\n",
      "Query 192 average precision: 0.5666666666666667\n",
      "Query 193 average precision: 0.8282627865961198\n",
      "Query 194 average precision: 0.041666666666666664\n",
      "Query 195 average precision: 0.05555555555555555\n",
      "Query 196 average precision: 0.11666666666666665\n",
      "Query 197 average precision: 0.5555555555555555\n",
      "Query 198 average precision: 0.44375\n",
      "Query 199 average precision: 0.025\n",
      "Query 200 average precision: 0.1851851851851852\n",
      "Query 201 average precision: 0.3583333333333333\n",
      "Query 202 average precision: 0.19166666666666665\n",
      "Query 203 average precision: 0.1595238095238095\n",
      "Query 204 average precision: 0.0\n",
      "Query 205 average precision: 0.75\n",
      "Query 206 average precision: 0.2619047619047619\n",
      "Query 207 average precision: 0.14444444444444443\n",
      "Query 208 average precision: 0.6916666666666668\n",
      "Query 209 average precision: 0.05555555555555556\n",
      "Query 210 average precision: 0.27777777777777773\n",
      "Query 211 average precision: 0.0125\n",
      "Query 212 average precision: 0.4891666666666666\n",
      "Query 213 average precision: 0.4625\n",
      "Query 214 average precision: 0.125\n",
      "Query 215 average precision: 0.0\n",
      "Query 216 average precision: 0.0\n",
      "Query 217 average precision: 0.2\n",
      "Query 218 average precision: 0.24285714285714283\n",
      "Query 219 average precision: 0.0\n",
      "Query 220 average precision: 0.13333333333333333\n",
      "Query 221 average precision: 0.275\n",
      "Query 222 average precision: 0.5978835978835979\n",
      "Query 223 average precision: 0.44166666666666665\n",
      "Query 224 average precision: 0.04285714285714286\n",
      "Query 225 average precision: 0.1775\n"
     ]
    }
   ],
   "source": [
    "num_results = 10\n",
    "with open('cranfield/cranfield-queries.txt') as query_file:\n",
    "    for query_num, line in enumerate(query_file):\n",
    "        query.content(line.strip())\n",
    "        results = ranker.score(inv_idx, query, num_results)                            \n",
    "        avg_p = ev.avg_p(results, query_num + 1, num_results)\n",
    "        print(\"Query {} average precision: {}\".format(query_num + 1, avg_p))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Afterwards, we can get the mean average precision of all the queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.25511867318944054"
      ]
     },
     "execution_count": 136,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ev.map()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In the competition, you should try experimenting with different rankers, ranker parameters, tokenization, and filters. What combination can give you the best results?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Lastly, it's possible to define your own ranking function in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "class SimpleRanker(metapy.index.RankingFunction):                                            \n",
    "    \"\"\"                                                                          \n",
    "    Create a new ranking function in Python that can be used in MeTA.             \n",
    "    \"\"\"                                                                          \n",
    "    def __init__(self, some_param=1.0):                                             \n",
    "        self.param = some_param\n",
    "        # You *must* invoke the base class __init__() here!\n",
    "        super(SimpleRanker, self).__init__()                                        \n",
    "                                                                                 \n",
    "    def score_one(self, sd):\n",
    "        \"\"\"\n",
    "        You need to override this function to return a score for a single term.\n",
    "        For fields available in the score_data sd object,\n",
    "        @see https://meta-toolkit.org/doxygen/structmeta_1_1index_1_1score__data.html\n",
    "        \"\"\"\n",
    "        return (self.param + sd.doc_term_count) / (self.param * sd.doc_unique_terms + sd.doc_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "**COMPETITION TIME**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 3: Document Classification with MeTA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In this part of the tutorial, we'll play with the next major application for MeTA: creating classifiers. We will be having the second contest in this part! Once we finish going through how to create a forward index, train classifiers on top of it, and perform classifier evaluation and cross validation, I will give you instructions on how to participate in the competition (it will be similar to the first competition). Again, there will be another leader board to keep track of the best submissions, and I intend on leaving it running until the end of the conference for people to play around with."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's switch back to using the `ceeaus` dataset we downloaded before. If you're just joining us, grab it now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-08-16 19:19:42--  https://meta-toolkit.org/data/2016-01-26/ceeaus.tar.gz\n",
      "Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'\n",
      "Resolving meta-toolkit.org... 50.116.41.177, 2600:3c02::f03c:91ff:feae:b777\n",
      "Connecting to meta-toolkit.org|50.116.41.177|:443... connected.\n",
      "HTTP request sent, awaiting response... 304 Not Modified\n",
      "File ‘ceeaus.tar.gz’ not modified on server. Omitting download.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget -N https://meta-toolkit.org/data/2016-01-26/ceeaus.tar.gz\n",
    "!tar xf ceeaus.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We'll also need our standard stopword list. Grab it now if you don't already have it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-08-16 19:19:43--  https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt\n",
      "Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'\n",
      "Resolving raw.githubusercontent.com... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n",
      "Connecting to raw.githubusercontent.com|151.101.0.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2747 (2.7K) [text/plain]\n",
      "Saving to: ‘lemur-stopwords.txt’\n",
      "\n",
      "lemur-stopwords.txt 100%[===================>]   2.68K  --.-KB/s    in 0s      \n",
      "\n",
      "Last-modified header missing -- time-stamps turned off.\n",
      "2017-08-16 19:19:43 (63.8 MB/s) - ‘lemur-stopwords.txt’ saved [2747/2747]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget -N https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's create our MeTA configuration file for this part of the tutorial. We'll be using standard unigram words for now, but you're strongly encouraged to play with different features for the competition!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "config = \"\"\"prefix = \".\"\n",
    "dataset = \"ceeaus\"\n",
    "corpus = \"line.toml\"\n",
    "index = \"ceeaus-idx\"\n",
    "stop-words = \"lemur-stopwords.txt\"\n",
    "\n",
    "[[analyzers]]\n",
    "method = \"ngram-word\"\n",
    "ngram = 1\n",
    "filter = \"default-unigram-chain\"\n",
    "\"\"\"\n",
    "with open('ceeaus-config.toml', 'w') as f:\n",
    "    f.write(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's index this dataset. Since we are doing classification experiments, we will most likely be concerning ourselves with a `ForwardIndex`, since we want to map document ids to their feature vector representations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1502921983: [info]     Loading index from disk: ceeaus-idx/fwd (/tmp/pip-bneszy3v-build/deps/meta/src/index/forward_index.cpp:171)\n",
      "1502921983: [info]     Loading index from disk: ceeaus-idx/fwd (/tmp/pip-bneszy3v-build/deps/meta/src/index/forward_index.cpp:171)\n"
     ]
    }
   ],
   "source": [
    "fidx = metapy.index.make_forward_index('ceeaus-config.toml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Note that the feature set used for classification depends on your settings in the configuration file _at the time of indexing_. If you want to play with different feature sets, remember to change your `analyzer` pipeline in the configuration file, and also to **reindex** your documents!\n",
    "\n",
    "Here, we've just chosen simple unigram words. This is actually a surprisingly good baseline feature set for many text classification problems.\n",
    "\n",
    "Now that we have a `ForwardIndex` on disk, we need to load the documents we want to start playing with into memory. Since this is a small enough dataset, let's load the whole thing into memory at once.\n",
    "\n",
    "We need to decide what kind of dataset we're using. MeTA has classes for binary classification (`BinaryDataset`) and multi-class classification (`MulticlassDataset`), which you should choose from depending on the kind of classification problem you're dealing with. Let's see how many labels we have in our corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 142,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fidx.num_labels()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Since this is more than 2, we likely want a `MulticlassDataset` so we can learn a classifier that can predict which of these three labels a document should have. (But we might be interested in only determining one particular class from the rest, in which case we might actually want a `BinaryDataset`.)\n",
    "\n",
    "For now, let's focus on the multi-class case, as that likely makes the most sense for this kind of data. Let's load or documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " \r",
      " > Loading instances into memory: [>                         ]   0% ETA 00:00:00 \r",
      " > Loading instances into memory: [>                         ]   0% ETA 00:00:00 \r",
      " > Loading instances into memory: [==========================] 100% ETA 00:00:00 \r",
      " > Loading instances into memory: [==========================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "1008"
      ]
     },
     "execution_count": 143,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dset = metapy.classify.MulticlassDataset(fidx)\n",
    "len(dset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We have 1008 documents, split across three labels. What are our labels?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'chinese', 'english', 'japanese'}"
      ]
     },
     "execution_count": 144,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set([dset.label(instance) for instance in dset])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This dataset is a small collection of essays written by a bunch of students with different first languages. Our goal will be to try to identify whether an essay was written by a native-Chinese speaker, a native-English speaker, or a native-Japanese speaker.\n",
    "\n",
    "Now, because these in-memory datasets can potentially be quite large, it's beneficial to not make unnecessary copies of them to, for example, create a new list that's shuffled that contains the same documents. In most cases, you'll be operating with a `DatasetView` (either `MulticlassDatasetView` or `BinaryDatasetView`) so that you can do things like shuffle or rotate the contents of a dataset without having to actually modify it. Doing so is pretty easy: you can use Python's slicing API, or you can just construct one directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "view = dset[0:len(dset)+1]\n",
    "# or\n",
    "view = metapy.classify.MulticlassDatasetView(dset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now we can, for example, shuffle this view without changing the underlying datsaet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "869 vs 0\n"
     ]
    }
   ],
   "source": [
    "view.shuffle()\n",
    "print(\"{} vs {}\".format(view[0].id, dset[0].id))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The view has been shuffled and now has documents in random order (useful in many cases to make sure that you don't have clumps of the same-labeled documents together, or to just permute the documents in a stochastic learning algorithm), but the underlying dataset is still sorted by id.\n",
    "\n",
    "We can also use this slicing API to create a random training and testing set from our shuffled views (views also support slicing). Let's make a 75-25 split of training-testing data. (Note that's really important that we already shuffled the view!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "training = view[0:int(0.75*len(view))]\n",
    "testing = view[int(0.75*len(view)):]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, we're ready to train a classifier! Let's start with very simple one: [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).\n",
    "\n",
    "In MeTA, construction of a classifier implies training of that model. Let's train a Naive Bayes classifier on our training view now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "nb = metapy.classify.NaiveBayes(training)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can now classify individual documents like so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'japanese'"
      ]
     },
     "execution_count": 149,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nb.classify(testing[0].weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We might be more interested in how well we classify the testing set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "            chinese   english   japanese  \n",
      "          ------------------------------\n",
      "  chinese | \u001b[1m0.96\u001b[22m      -         0.04      \n",
      "  english | -         \u001b[1m0.909\u001b[22m     0.0909    \n",
      " japanese | 0.0155    0.0155    \u001b[1m0.969\u001b[22m     \n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "mtrx = nb.test(testing)\n",
    "print(mtrx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The `test()` method of MeTA's classifiers returns to you a `ConfusionMatrix`, which contains useful information about what kinds of mistakes your classifier is making.\n",
    "\n",
    "(Note that, due to the random shuffling, you might see different results than we do here.)\n",
    "\n",
    "For example, we can see that this classifier seems to have some trouble with confusing native-Chinese students' essays with those of native-Japanese students. We can tell that by looking at the rows of the confusion matrix. Each row tells you what fraction of documents with that _true_ label were assigned the label for each column by the classifier. In the case of the native-Chinese label, we can see that 25% of the time they were miscategorized as being native-Japanese.\n",
    "\n",
    "The `ConfusionMatrix` also computes a lot of metrics that are commonly used in classifier evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------------------------------------------\n",
      "\u001b[1mClass\u001b[22m       \u001b[1mF1 Score\u001b[22m    \u001b[1mPrecision\u001b[22m   \u001b[1mRecall\u001b[22m      \u001b[1mClass Dist\u001b[22m  \n",
      "------------------------------------------------------------\n",
      "chinese     0.923       0.889       0.96        0.0992      \n",
      "english     0.909       0.909       0.909       0.131       \n",
      "japanese    0.974       0.979       0.969       0.77        \n",
      "------------------------------------------------------------\n",
      "\u001b[1mTotal\u001b[22m       \u001b[1m0.961\u001b[22m       \u001b[1m0.961\u001b[22m       \u001b[1m0.96\u001b[22m        \n",
      "------------------------------------------------------------\n",
      "252 predictions attempted, overall accuracy: 0.96\n",
      "\n"
     ]
    }
   ],
   "source": [
    "mtrx.print_stats()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "If we want to make sure that the classifier isn't overfitting to our training data, a common approach is to do [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Let's run CV for our Naive Bayes classifier across the whole dataset, using 5-folds, to get an idea of how well we might generalize to new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1502921983: [info]     Cross-validating fold 1/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 1/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 2/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 2/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 3/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 3/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 4/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 4/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 5/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 5/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n"
     ]
    }
   ],
   "source": [
    "mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.NaiveBayes(fold), view, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "`cross_validate()` returns a `ConfusionMatrix` just like `test()` does. We give it a function to use to create the trained classifiers for each fold, and then pass in the dataset view containing all of our documents, and the number of folds we want to use.\n",
    "\n",
    "Let's see how we did."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "            chinese   english   japanese  \n",
      "          ------------------------------\n",
      "  chinese | \u001b[1m0.868\u001b[22m     0.011     0.121     \n",
      "  english | 0.0342    \u001b[1m0.918\u001b[22m     0.0479    \n",
      " japanese | 0.0195    0.00911   \u001b[1m0.971\u001b[22m     \n",
      "\n",
      "\n",
      "------------------------------------------------------------\n",
      "\u001b[1mClass\u001b[22m       \u001b[1mF1 Score\u001b[22m    \u001b[1mPrecision\u001b[22m   \u001b[1mRecall\u001b[22m      \u001b[1mClass Dist\u001b[22m  \n",
      "------------------------------------------------------------\n",
      "chinese     0.832       0.798       0.868       0.0905      \n",
      "english     0.931       0.944       0.918       0.145       \n",
      "japanese    0.974       0.976       0.971       0.764       \n",
      "------------------------------------------------------------\n",
      "\u001b[1mTotal\u001b[22m       \u001b[1m0.955\u001b[22m       \u001b[1m0.956\u001b[22m       \u001b[1m0.954\u001b[22m       \n",
      "------------------------------------------------------------\n",
      "1005 predictions attempted, overall accuracy: 0.954\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(mtrx)\n",
    "mtrx.print_stats()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now let's do the same thing, but for an arguably stronger baseline: [SVM](https://en.wikipedia.org/wiki/Support_vector_machine).\n",
    "\n",
    "MeTA's implementation of SVM is actually an approximation using [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) on the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss). It's implemented as a `BinaryClassifier`, so we will need to adapt it before it can be used to solve our multi-class clasification problem.\n",
    "\n",
    "MeTA provides two different adapters for this scenario: [One-vs-All](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) and [One-vs-One](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-one)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "ova = metapy.classify.OneVsAll(training, metapy.classify.SGD, loss_id='hinge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We construct the `OneVsAll` reduction by providing it the training documents, the name of a binary classifier, and then (as keyword arguments) any additional arguments to that chosen classifier. In this case, we use `loss_id` to specify the loss function to use.\n",
    "\n",
    "We can now use `OneVsAll` just like any other classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "            chinese   english   japanese  \n",
      "          ------------------------------\n",
      "  chinese | \u001b[1m0.72\u001b[22m      -         0.28      \n",
      "  english | -         \u001b[1m0.909\u001b[22m     0.0909    \n",
      " japanese | -         0.0103    \u001b[1m0.99\u001b[22m      \n",
      "\n",
      "\n",
      "------------------------------------------------------------\n",
      "\u001b[1mClass\u001b[22m       \u001b[1mF1 Score\u001b[22m    \u001b[1mPrecision\u001b[22m   \u001b[1mRecall\u001b[22m      \u001b[1mClass Dist\u001b[22m  \n",
      "------------------------------------------------------------\n",
      "chinese     0.837       1           0.72        0.0992      \n",
      "english     0.923       0.938       0.909       0.131       \n",
      "japanese    0.97        0.95        0.99        0.77        \n",
      "------------------------------------------------------------\n",
      "\u001b[1mTotal\u001b[22m       \u001b[1m0.953\u001b[22m       \u001b[1m0.954\u001b[22m       \u001b[1m0.952\u001b[22m       \n",
      "------------------------------------------------------------\n",
      "252 predictions attempted, overall accuracy: 0.952\n",
      "\n"
     ]
    }
   ],
   "source": [
    "mtrx = ova.test(testing)\n",
    "print(mtrx)\n",
    "mtrx.print_stats()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "            chinese   english   japanese  \n",
      "          ------------------------------\n",
      "  chinese | \u001b[1m0.835\u001b[22m     0.022     0.143     \n",
      "  english | -         \u001b[1m0.911\u001b[22m     0.089     \n",
      " japanese | 0.00391   0.00651   \u001b[1m0.99\u001b[22m      \n",
      "\n",
      "\n",
      "------------------------------------------------------------\n",
      "\u001b[1mClass\u001b[22m       \u001b[1mF1 Score\u001b[22m    \u001b[1mPrecision\u001b[22m   \u001b[1mRecall\u001b[22m      \u001b[1mClass Dist\u001b[22m  \n",
      "------------------------------------------------------------\n",
      "chinese     0.894       0.962       0.835       0.0905      \n",
      "english     0.93        0.95        0.911       0.145       \n",
      "japanese    0.978       0.967       0.99        0.764       \n",
      "------------------------------------------------------------\n",
      "\u001b[1mTotal\u001b[22m       \u001b[1m0.964\u001b[22m       \u001b[1m0.964\u001b[22m       \u001b[1m0.964\u001b[22m       \n",
      "------------------------------------------------------------\n",
      "1005 predictions attempted, overall accuracy: 0.964\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1502921983: [info]     Cross-validating fold 1/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 1/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 2/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 2/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 3/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 3/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 4/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 4/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 5/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n",
      "1502921983: [info]     Cross-validating fold 5/5 (/tmp/pip-bneszy3v-build/deps/meta/include/meta/classify/classifier/classifier.h:103)\n"
     ]
    }
   ],
   "source": [
    "mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.OneVsAll(fold, metapy.classify.SGD, loss_id='hinge'), view, 5)\n",
    "print(mtrx)\n",
    "mtrx.print_stats()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "That should be enough to get you started! Try looking at `help(metapy.classify)` for a list of what's included in the bindings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    "**COMPETITION TIME**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 4: Topic Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In this part of the tutorial we will discuss how to run a topic model over data indexed as a `ForwardIndex`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We will need to index our data to proceed. We eventually want to be able to extract the bag-of-words representation for our individual documents, so we will want a `ForwardIndex` in this case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1502921983: [info]     Loading index from disk: ceeaus-idx/fwd (/tmp/pip-bneszy3v-build/deps/meta/src/index/forward_index.cpp:171)\n",
      "1502921983: [info]     Loading index from disk: ceeaus-idx/fwd (/tmp/pip-bneszy3v-build/deps/meta/src/index/forward_index.cpp:171)\n"
     ]
    }
   ],
   "source": [
    "fidx = metapy.index.make_forward_index('ceeaus-config.toml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Just like in classification, the feature set used for the topic modeling will be the feature set used at the time of indexing, so if you want to play with a different set of features (like bigram words), you will need to re-index your data.\n",
    "\n",
    "For now, we've just stuck with the default filter chain for unigram words, so we're operating in the traditional bag-of-words space.\n",
    "\n",
    "Let's load our documents into memory to run the topic model inference now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " \r",
      " > Loading instances into memory: [>                         ]   0% ETA 00:00:00 \r",
      " > Loading instances into memory: [>                         ]   0% ETA 00:00:00 \r",
      " > Loading instances into memory: [==========================] 100% ETA 00:00:00 \r",
      " > Loading instances into memory: [==========================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    }
   ],
   "source": [
    "dset = metapy.learn.Dataset(fidx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's try to find some topics for this dataset. To do so, we're going to use a generative model called a topic model.\n",
    "\n",
    "There are many different topic models in the literature, but the most commonly used topic model is Latent Dirichlet Allocation. Here, we propose that there are K topics (represented with a categorical distribution over words) $\\phi_k$ from which all of our documents are genereated. These K topics are modeled as being sampled from a Dirichlet distribution with parameter $\\vec{\\alpha}$. Then, to generate a document $d$, we first sample a distribution over the K topics $\\theta_d$ from another Dirichlet distribution with parameter $\\vec{\\beta}$. Then, for each word in this document, we first sample a topic identifier $z \\sim \\theta_d$ and then the word by drawing from the topic we selected ($w \\sim \\phi_z$). Refer to the [Wikipedia article on LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for more information.\n",
    "\n",
    "The goal of running inference for an LDA model is to infer the latent variables $\\phi_k$ and $\\theta_d$ for all of the $K$ topics and $D$ documents, respectively. MeTA provides a number of different inference algorithms for LDA, as each one entails a different set of trade-offs (inference in LDA is intractable, so all inference algorithms are approximations; different algorithms entail different approximation guarantees, running times, and required memroy consumption). For now, let's run a Variational Infernce algorithm called CVB0 to find two topics. (In practice you will likely be finding many more topics than just two, but this is a very small toy dataset.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Initialization: [============================================] 100% ETA 00:00:00 \n",
      " \n",
      "Iteration 1 maximum change in gamma: 1.94892                                     \n",
      "Iteration 1 maximum change in gamma: 1.94892                                    \n",
      "Iteration 2 maximum change in gamma: 0.489304                                    \n",
      "Iteration 2 maximum change in gamma: 0.489304                                   \n",
      "Iteration 3 maximum change in gamma: 0.353439                                     \n",
      "Iteration 3 maximum change in gamma: 0.353439                                   \n",
      "Iteration 4 maximum change in gamma: 0.437895                                     \n",
      "Iteration 4 maximum change in gamma: 0.437895                                   \n",
      "Iteration 5 maximum change in gamma: 0.646495                                     \n",
      "Iteration 5 maximum change in gamma: 0.646495                                   \n",
      "Iteration 6 maximum change in gamma: 1.08145                                      \n",
      "Iteration 6 maximum change in gamma: 1.08145                                    \n",
      "Iteration 7 maximum change in gamma: 1.37714                                     \n",
      "Iteration 7 maximum change in gamma: 1.37714                                    \n",
      "Iteration 8 maximum change in gamma: 1.5234                                      \n",
      "Iteration 8 maximum change in gamma: 1.5234                                     \n",
      "Iteration 9 maximum change in gamma: 1.41999                                      \n",
      "Iteration 9 maximum change in gamma: 1.41999                                    \n",
      "Iteration 10 maximum change in gamma: 1.35315                                     \n",
      "Iteration 10 maximum change in gamma: 1.35315                                   \n",
      "Iteration 11 maximum change in gamma: 1.16084                                     \n",
      "Iteration 11 maximum change in gamma: 1.16084                                   \n",
      "Iteration 12 maximum change in gamma: 0.985118                                    \n",
      "Iteration 12 maximum change in gamma: 0.985118                                  \n",
      "Iteration 13 maximum change in gamma: 0.529624                                   \n",
      "Iteration 13 maximum change in gamma: 0.529624                                  \n",
      "Iteration 14 maximum change in gamma: 0.505101                                   \n",
      "Iteration 14 maximum change in gamma: 0.505101                                  \n",
      "Iteration 15 maximum change in gamma: 0.357362                                    \n",
      "Iteration 15 maximum change in gamma: 0.357362                                   \n",
      "Iteration 16 maximum change in gamma: 0.33901                                    \n",
      "Iteration 16 maximum change in gamma: 0.33901                                   \n",
      "Iteration 17 maximum change in gamma: 0.277928                                    \n",
      "Iteration 17 maximum change in gamma: 0.277928                                  \n",
      "Iteration 18 maximum change in gamma: 0.242615                                    \n",
      "Iteration 18 maximum change in gamma: 0.242615                                  \n",
      "Iteration 19 maximum change in gamma: 0.237642                                   \n",
      "Iteration 19 maximum change in gamma: 0.237642                                  \n",
      "Iteration 20 maximum change in gamma: 0.198027                                   \n",
      "Iteration 20 maximum change in gamma: 0.198027                                  \n",
      "Iteration 21 maximum change in gamma: 0.192858                                    \n",
      "Iteration 21 maximum change in gamma: 0.192858                                  \n",
      "Iteration 22 maximum change in gamma: 0.185038                                    \n",
      "Iteration 22 maximum change in gamma: 0.185038                                  \n",
      "Iteration 23 maximum change in gamma: 0.168724                                    \n",
      "Iteration 23 maximum change in gamma: 0.168724                                  \n",
      "Iteration 24 maximum change in gamma: 0.157681                                    \n",
      "Iteration 24 maximum change in gamma: 0.157681                                  \n",
      "Iteration 25 maximum change in gamma: 0.13898                                    \n",
      "Iteration 25 maximum change in gamma: 0.13898                                    \n",
      "Iteration 26 maximum change in gamma: 0.131065                                   \n",
      "Iteration 26 maximum change in gamma: 0.131065                                  \n",
      "Iteration 27 maximum change in gamma: 0.126334                                    \n",
      "Iteration 27 maximum change in gamma: 0.126334                                  \n",
      "Iteration 28 maximum change in gamma: 0.148569                                    \n",
      "Iteration 28 maximum change in gamma: 0.148569                                  \n",
      "Iteration 29 maximum change in gamma: 0.177806                                    \n",
      "Iteration 29 maximum change in gamma: 0.177806                                  \n",
      "Iteration 30 maximum change in gamma: 0.19599                                     \n",
      "Iteration 30 maximum change in gamma: 0.19599                                   \n",
      "Iteration 31 maximum change in gamma: 0.195208                                   \n",
      "Iteration 31 maximum change in gamma: 0.195208                                  \n",
      "Iteration 32 maximum change in gamma: 0.207592                                   \n",
      "Iteration 32 maximum change in gamma: 0.207592                                  \n",
      "Iteration 33 maximum change in gamma: 0.222097                                    \n",
      "Iteration 33 maximum change in gamma: 0.222097                                  \n",
      "Iteration 34 maximum change in gamma: 0.209845                                    \n",
      "Iteration 34 maximum change in gamma: 0.209845                                   \n",
      "Iteration 35 maximum change in gamma: 0.211747                                   \n",
      "Iteration 35 maximum change in gamma: 0.211747                                  \n",
      "Iteration 36 maximum change in gamma: 0.185753                                    \n",
      "Iteration 36 maximum change in gamma: 0.185753                                  \n",
      "Iteration 37 maximum change in gamma: 0.142088                                   \n",
      "Iteration 37 maximum change in gamma: 0.142088                                  \n",
      "Iteration 38 maximum change in gamma: 0.096915                                   \n",
      "Iteration 38 maximum change in gamma: 0.096915                                  \n",
      "Iteration 39 maximum change in gamma: 0.0608104                                   \n",
      "Iteration 39 maximum change in gamma: 0.0608104                                 \n",
      "Iteration 40 maximum change in gamma: 0.0361621                                   \n",
      "Iteration 40 maximum change in gamma: 0.0361621                                 \n",
      "Iteration 41 maximum change in gamma: 0.0208601                                   \n",
      "Iteration 41 maximum change in gamma: 0.0208601                                  \n",
      "Iteration 42 maximum change in gamma: 0.0192793                                  \n",
      "Iteration 42 maximum change in gamma: 0.0192793                                 \n",
      "Iteration 43 maximum change in gamma: 0.0184543                                  \n",
      "Iteration 43 maximum change in gamma: 0.0184543                                 \n",
      "Iteration 44 maximum change in gamma: 0.0176882                                  \n",
      "Iteration 44 maximum change in gamma: 0.0176882                                 \n",
      "Iteration 45 maximum change in gamma: 0.0169772                                   \n",
      "Iteration 45 maximum change in gamma: 0.0169772                                 \n",
      "Iteration 46 maximum change in gamma: 0.0163172                                   \n",
      "Iteration 46 maximum change in gamma: 0.0163172                                 \n",
      "Iteration 47 maximum change in gamma: 0.0157038                                   \n",
      "Iteration 47 maximum change in gamma: 0.0157038                                 \n",
      "Iteration 48 maximum change in gamma: 0.0151331                                   \n",
      "Iteration 48 maximum change in gamma: 0.0151331                                 \n",
      "Iteration 49 maximum change in gamma: 0.0146011                                  \n",
      "Iteration 49 maximum change in gamma: 0.0146011                                 \n",
      "Iteration 50 maximum change in gamma: 0.0141041                                  \n",
      "Iteration 50 maximum change in gamma: 0.0141041                                 \n",
      "Iteration 51 maximum change in gamma: 0.0136389                                   \n",
      "Iteration 51 maximum change in gamma: 0.0136389                                 \n",
      "Iteration 52 maximum change in gamma: 0.0132024                                   \n",
      "Iteration 52 maximum change in gamma: 0.0132024                                  \n",
      "Iteration 53 maximum change in gamma: 0.0127917                                  \n",
      "Iteration 53 maximum change in gamma: 0.0127917                                 \n",
      "Iteration 54 maximum change in gamma: 0.0124045                                   \n",
      "Iteration 54 maximum change in gamma: 0.0124045                                 \n",
      "Iteration 55 maximum change in gamma: 0.0120384                                  \n",
      "Iteration 55 maximum change in gamma: 0.0120384                                 \n",
      "Iteration 56 maximum change in gamma: 0.0116915                                  \n",
      "Iteration 56 maximum change in gamma: 0.0116915                                 \n",
      "Iteration 57 maximum change in gamma: 0.0113617                                   \n",
      "Iteration 57 maximum change in gamma: 0.0113617                                 \n",
      "Iteration 58 maximum change in gamma: 0.0110477                                   \n",
      "Iteration 58 maximum change in gamma: 0.0110477                                 \n",
      "Iteration 59 maximum change in gamma: 0.0107477                                   \n",
      "Iteration 59 maximum change in gamma: 0.0107477                                 \n",
      "Iteration 60 maximum change in gamma: 0.0104606                                   \n",
      "Iteration 60 maximum change in gamma: 0.0104606                                 \n",
      "Iteration 61 maximum change in gamma: 0.0101851                                  \n",
      "Iteration 61 maximum change in gamma: 0.0101851                                 \n",
      "Iteration 62 maximum change in gamma: 0.00992002                                 \n",
      "Iteration 62 maximum change in gamma: 0.00992002                                \n",
      "Iteration 63 maximum change in gamma: 0.00966452                                  \n",
      "Iteration 63 maximum change in gamma: 0.00966452                                \n",
      "Iteration 64 maximum change in gamma: 0.00941766                                  \n",
      "Iteration 64 maximum change in gamma: 0.00941766                                 \n",
      "Iteration 65 maximum change in gamma: 0.00917865                                 \n",
      "Iteration 65 maximum change in gamma: 0.00917865                                \n",
      "Iteration 66 maximum change in gamma: 0.00908822                                  \n",
      "Iteration 66 maximum change in gamma: 0.00908822                                \n",
      "Iteration 67 maximum change in gamma: 0.0091286                                  \n",
      "Iteration 67 maximum change in gamma: 0.0091286                                 \n",
      "Iteration 68 maximum change in gamma: 0.00916622                                 \n",
      "Iteration 68 maximum change in gamma: 0.00916622                                \n",
      "Iteration 69 maximum change in gamma: 0.00920064                                  \n",
      "Iteration 69 maximum change in gamma: 0.00920064                                \n",
      "Iteration 70 maximum change in gamma: 0.00923141                                  \n",
      "Iteration 70 maximum change in gamma: 0.00923141                                \n",
      "Iteration 71 maximum change in gamma: 0.00925807                                  \n",
      "Iteration 71 maximum change in gamma: 0.00925807                                \n",
      "Iteration 72 maximum change in gamma: 0.00928021                                  \n",
      "Iteration 72 maximum change in gamma: 0.00928021                                 \n",
      "Iteration 73 maximum change in gamma: 0.00929737                                 \n",
      "Iteration 73 maximum change in gamma: 0.00929737                                \n",
      "Iteration 74 maximum change in gamma: 0.00930916                                 \n",
      "Iteration 74 maximum change in gamma: 0.00930916                                \n",
      "Iteration 75 maximum change in gamma: 0.00931517                                  \n",
      "Iteration 75 maximum change in gamma: 0.00931517                                \n",
      "Iteration 76 maximum change in gamma: 0.00931502                                  \n",
      "Iteration 76 maximum change in gamma: 0.00931502                                \n",
      "Iteration 77 maximum change in gamma: 0.00930835                                  \n",
      "Iteration 77 maximum change in gamma: 0.00930835                                \n",
      "Iteration 78 maximum change in gamma: 0.00929481                                  \n",
      "Iteration 78 maximum change in gamma: 0.00929481                                \n",
      "Iteration 79 maximum change in gamma: 0.00927412                                 \n",
      "Iteration 79 maximum change in gamma: 0.00927412                                \n",
      "Iteration 80 maximum change in gamma: 0.00924599                                 \n",
      "Iteration 80 maximum change in gamma: 0.00924599                                \n",
      "Iteration 81 maximum change in gamma: 0.00921019                                  \n",
      "Iteration 81 maximum change in gamma: 0.00921019                                \n",
      "Iteration 82 maximum change in gamma: 0.00916651                                  \n",
      "Iteration 82 maximum change in gamma: 0.00916651                                \n",
      "Iteration 83 maximum change in gamma: 0.00911479                                  \n",
      "Iteration 83 maximum change in gamma: 0.00911479                                \n",
      "Iteration 84 maximum change in gamma: 0.00905492                                  \n",
      "Iteration 84 maximum change in gamma: 0.00905492                                \n",
      "Iteration 85 maximum change in gamma: 0.00898683                                 \n",
      "Iteration 85 maximum change in gamma: 0.00898683                                \n",
      "Iteration 86 maximum change in gamma: 0.00891048                                 \n",
      "Iteration 86 maximum change in gamma: 0.00891048                                \n",
      "Iteration 87 maximum change in gamma: 0.00882591                                  \n",
      "Iteration 87 maximum change in gamma: 0.00882591                                \n",
      "Iteration 88 maximum change in gamma: 0.00873318                                  \n",
      "Iteration 88 maximum change in gamma: 0.00873318                                \n",
      "Iteration 89 maximum change in gamma: 0.0086324                                   \n",
      "Iteration 89 maximum change in gamma: 0.0086324                                 \n",
      "Iteration 90 maximum change in gamma: 0.00852376                                  \n",
      "Iteration 90 maximum change in gamma: 0.00852376                                \n",
      "Iteration 91 maximum change in gamma: 0.00840745                                 \n",
      "Iteration 91 maximum change in gamma: 0.00840745                                \n",
      "Iteration 92 maximum change in gamma: 0.00828374                                 \n",
      "Iteration 92 maximum change in gamma: 0.00828374                                \n",
      "Iteration 93 maximum change in gamma: 0.00815293                                  \n",
      "Iteration 93 maximum change in gamma: 0.00815293                                \n",
      "Iteration 94 maximum change in gamma: 0.00801536                                  \n",
      "Iteration 94 maximum change in gamma: 0.00801536                                \n",
      "Iteration 95 maximum change in gamma: 0.00787141                                  \n",
      "Iteration 95 maximum change in gamma: 0.00787141                                 \n",
      "Iteration 96 maximum change in gamma: 0.00772149                                 \n",
      "Iteration 96 maximum change in gamma: 0.00772149                                \n",
      "Iteration 97 maximum change in gamma: 0.00756605                                 \n",
      "Iteration 97 maximum change in gamma: 0.00756605                                \n",
      "Iteration 98 maximum change in gamma: 0.00740556                                 \n",
      "Iteration 98 maximum change in gamma: 0.00740556                                \n",
      "Iteration 99 maximum change in gamma: 0.0072405                                   \n",
      "Iteration 99 maximum change in gamma: 0.0072405                                 \n",
      "Iteration 100 maximum change in gamma: 0.00707137                                 \n",
      "Iteration 100 maximum change in gamma: 0.00707137                               \n",
      "Iteration 101 maximum change in gamma: 0.0068987                                  \n",
      "Iteration 101 maximum change in gamma: 0.0068987                                \n",
      "Iteration 102 maximum change in gamma: 0.00672302                                 \n",
      "Iteration 102 maximum change in gamma: 0.00672302                               \n",
      "Iteration 103 maximum change in gamma: 0.00654484                                \n",
      "Iteration 103 maximum change in gamma: 0.00654484                               \n",
      "Iteration 104 maximum change in gamma: 0.00636471                                \n",
      "Iteration 104 maximum change in gamma: 0.00636471                               \n",
      "Iteration 105 maximum change in gamma: 0.00618313                                 \n",
      "Iteration 105 maximum change in gamma: 0.00618313                               \n",
      "Iteration 106 maximum change in gamma: 0.00600062                                 \n",
      "Iteration 106 maximum change in gamma: 0.00600062                                \n",
      "Iteration 107 maximum change in gamma: 0.00581768                                \n",
      "Iteration 107 maximum change in gamma: 0.00581768                               \n",
      "Iteration 108 maximum change in gamma: 0.00563479                                 \n",
      "Iteration 108 maximum change in gamma: 0.00563479                               \n",
      "Iteration 109 maximum change in gamma: 0.00545242                                \n",
      "Iteration 109 maximum change in gamma: 0.00545242                               \n",
      "Iteration 110 maximum change in gamma: 0.00527099                                \n",
      "Iteration 110 maximum change in gamma: 0.00527099                               \n",
      "Iteration 111 maximum change in gamma: 0.00509093                                 \n",
      "Iteration 111 maximum change in gamma: 0.00509093                               \n",
      "Iteration 112 maximum change in gamma: 0.00491263                                 \n",
      "Iteration 112 maximum change in gamma: 0.00491263                               \n",
      "Iteration 113 maximum change in gamma: 0.00473645                                 \n",
      "Iteration 113 maximum change in gamma: 0.00473645                               \n",
      "Iteration 114 maximum change in gamma: 0.00456272                                 \n",
      "Iteration 114 maximum change in gamma: 0.00456272                               \n",
      "Iteration 115 maximum change in gamma: 0.00439176                                \n",
      "Iteration 115 maximum change in gamma: 0.00439176                               \n",
      "Iteration 116 maximum change in gamma: 0.00422383                                \n",
      "Iteration 116 maximum change in gamma: 0.00422383                               \n",
      "Iteration 117 maximum change in gamma: 0.00405918                                 \n",
      "Iteration 117 maximum change in gamma: 0.00405918                               \n",
      "Iteration 118 maximum change in gamma: 0.00389803                                 \n",
      "Iteration 118 maximum change in gamma: 0.00389803                               \n",
      "Iteration 119 maximum change in gamma: 0.00374058                                 \n",
      "Iteration 119 maximum change in gamma: 0.00374058                               \n",
      "Iteration 120 maximum change in gamma: 0.00358697                                 \n",
      "Iteration 120 maximum change in gamma: 0.00358697                               \n",
      "Iteration 121 maximum change in gamma: 0.00343736                                \n",
      "Iteration 121 maximum change in gamma: 0.00343736                               \n",
      "Iteration 122 maximum change in gamma: 0.00329185                                \n",
      "Iteration 122 maximum change in gamma: 0.00329185                               \n",
      "Iteration 123 maximum change in gamma: 0.00315053                                 \n",
      "Iteration 123 maximum change in gamma: 0.00315053                               \n",
      "Iteration 124 maximum change in gamma: 0.00301346                                 \n",
      "Iteration 124 maximum change in gamma: 0.00301346                                \n",
      "Iteration 125 maximum change in gamma: 0.00288069                                \n",
      "Iteration 125 maximum change in gamma: 0.00288069                               \n",
      "Iteration 126 maximum change in gamma: 0.00275224                                 \n",
      "Iteration 126 maximum change in gamma: 0.00275224                               \n",
      "Iteration 127 maximum change in gamma: 0.00262812                                \n",
      "Iteration 127 maximum change in gamma: 0.00262812                               \n",
      "Iteration 128 maximum change in gamma: 0.00250831                                \n",
      "Iteration 128 maximum change in gamma: 0.00250831                               \n",
      "Iteration 129 maximum change in gamma: 0.00239279                                 \n",
      "Iteration 129 maximum change in gamma: 0.00239279                               \n",
      "Iteration 130 maximum change in gamma: 0.00228152                                 \n",
      "Iteration 130 maximum change in gamma: 0.00228152                               \n",
      "Iteration 131 maximum change in gamma: 0.00217445                                 \n",
      "Iteration 131 maximum change in gamma: 0.00217445                               \n",
      "Iteration 132 maximum change in gamma: 0.00207151                                 \n",
      "Iteration 132 maximum change in gamma: 0.00207151                               \n",
      "Iteration 133 maximum change in gamma: 0.00197264                                \n",
      "Iteration 133 maximum change in gamma: 0.00197264                               \n",
      "Iteration 134 maximum change in gamma: 0.00187775                                \n",
      "Iteration 134 maximum change in gamma: 0.00187775                               \n",
      "Iteration 135 maximum change in gamma: 0.00178675                                 \n",
      "Iteration 135 maximum change in gamma: 0.00178675                               \n",
      "Iteration 136 maximum change in gamma: 0.00169956                                 \n",
      "Iteration 136 maximum change in gamma: 0.00169956                               \n",
      "Iteration 137 maximum change in gamma: 0.00161608                                 \n",
      "Iteration 137 maximum change in gamma: 0.00161608                               \n",
      "Iteration 138 maximum change in gamma: 0.00153619                                 \n",
      "Iteration 138 maximum change in gamma: 0.00153619                               \n",
      "Iteration 139 maximum change in gamma: 0.00145981                                \n",
      "Iteration 139 maximum change in gamma: 0.00145981                               \n",
      "Iteration 140 maximum change in gamma: 0.00138681                                \n",
      "Iteration 140 maximum change in gamma: 0.00138681                               \n",
      "Iteration 141 maximum change in gamma: 0.0013171                                  \n",
      "Iteration 141 maximum change in gamma: 0.0013171                                \n",
      "Iteration 142 maximum change in gamma: 0.00125055                                 \n",
      "Iteration 142 maximum change in gamma: 0.00125055                               \n",
      "Iteration 143 maximum change in gamma: 0.00118707                                 \n",
      "Iteration 143 maximum change in gamma: 0.00118707                               \n",
      "Iteration 144 maximum change in gamma: 0.00112654                                 \n",
      "Iteration 144 maximum change in gamma: 0.00112654                               \n",
      "Iteration 145 maximum change in gamma: 0.00106885                                \n",
      "Iteration 145 maximum change in gamma: 0.00106885                                \n",
      "Iteration 146 maximum change in gamma: 0.00101389                                \n",
      "Iteration 146 maximum change in gamma: 0.00101389                               \n",
      "Iteration 147 maximum change in gamma: 0.000961562                                \n",
      "Iteration 147 maximum change in gamma: 0.000961562                              \n",
      "1502921989: [info]     Finished maximum iterations, or found convergence! (/tmp/pip-bneszy3v-build/deps/meta/src/topics/lda_cvb.cpp:60)\n",
      "1502921989: [info]     Finished maximum iterations, or found convergence! (/tmp/pip-bneszy3v-build/deps/meta/src/topics/lda_cvb.cpp:60)\n"
     ]
    }
   ],
   "source": [
    "lda_inf = metapy.topics.LDACollapsedVB(dset, num_topics=2, alpha=1.0, beta=0.01)\n",
    "lda_inf.run(num_iters=1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    "The above ran the CVB0 algorithm for 1000 iterations, or until an algorithm-specific convergence criterion was met. Now let's save the current estimate for our topics and topic proportions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "lda_inf.save('lda-cvb0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can interrogate the topic inference results by using the `TopicModel` query class. Let's load our inference results back in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " \r",
      " > Loading topic term probabilities: [===========>           ]  50% ETA 00:00:00 \r",
      " > Loading topic term probabilities: [===========>           ]  50% ETA 00:00:00 \r",
      " > Loading topic term probabilities: [=======================] 100% ETA 00:00:00 \r",
      " > Loading topic term probabilities: [=======================] 100% ETA 00:00:00 \n",
      " \n",
      " \r",
      " > Loading document topic probabilities: [>                  ]   0% ETA 00:00:00 \r",
      " > Loading document topic probabilities: [>                  ]   0% ETA 00:00:00 \r",
      " > Loading document topic probabilities: [===================] 100% ETA 00:00:00 \r",
      " > Loading document topic probabilities: [===================] 100% ETA 00:00:00 \n",
      " \n"
     ]
    }
   ],
   "source": [
    "model = metapy.topics.TopicModel('lda-cvb0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's have a look at our topics. A typical way of doing this is to print the top $k$ words in each topic, so let's do that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(3759, 0.06705637112528769),\n",
       " (1968, 0.05605930810442864),\n",
       " (2635, 0.05222307061872271),\n",
       " (3549, 0.04642939140343873),\n",
       " (665, 0.03488141234942433),\n",
       " (4157, 0.02906748539640022),\n",
       " (2322, 0.02885022388702368),\n",
       " (3729, 0.022331344581221765),\n",
       " (1790, 0.020755699719924883),\n",
       " (3554, 0.015483037834133842)]"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.top_k(tid=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The models operate on term ids instead of raw text strings, so let's convert this to a human readable format by using the vocabulary contained in our `ForwardIndex` to map the term ids to strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('time', 0.06705637112528769),\n",
       " ('job', 0.05605930810442864),\n",
       " ('part', 0.05222307061872271),\n",
       " ('student', 0.04642939140343873),\n",
       " ('colleg', 0.03488141234942433),\n",
       " ('work', 0.02906748539640022),\n",
       " ('money', 0.02885022388702368),\n",
       " ('think', 0.022331344581221765),\n",
       " ('import', 0.020755699719924883),\n",
       " ('studi', 0.015483037834133842)]"
      ]
     },
     "execution_count": 163,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=0)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('smoke', 0.13110394941553408),\n",
       " ('restaur', 0.054349311633512025),\n",
       " ('peopl', 0.036780087802958536),\n",
       " ('smoker', 0.03349263454160484),\n",
       " ('ban', 0.022530670096022554),\n",
       " ('think', 0.015620489442527752),\n",
       " ('japan', 0.012780916901417468),\n",
       " ('complet', 0.012635067649017825),\n",
       " ('cigarett', 0.011987181371938055),\n",
       " ('non', 0.011317738574939687)]"
      ]
     },
     "execution_count": 164,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can pretty clearly see that this particular dataset was about two major issues: part time jobs for students and smoking in public. This dataset is actually a collection of essays written by students, and there just so happen to be two different topics they can choose from!\n",
    "\n",
    "The topics are pretty clear in this case, but in some cases it is also useful to score the terms in a topic using some function of the probability of the word in the topic and the probability of the word in the other topics. Intuitively, we might want to select words from each topic that best reflect that topic's content by picking words that both have high probability in that topic **and** have low probability in the other topics. In other words, we want to balance between high probability terms and highly specific terms (this is kind of like a tf-idf weighting). One such scoring function is provided by the toolkit in `BLTermScorer`, which implements a scoring function proposed by Blei and Lafferty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('job', 0.34822058296128233),\n",
       " ('part', 0.31311075688049606),\n",
       " ('student', 0.2832893627599442),\n",
       " ('colleg', 0.20809000481963835),\n",
       " ('time', 0.17796675292712294),\n",
       " ('money', 0.16234684321361126),\n",
       " ('work', 0.1558533795913366),\n",
       " ('studi', 0.08228291023281153),\n",
       " ('learn', 0.06491900298193354),\n",
       " ('experi', 0.054945276562063716)]"
      ]
     },
     "execution_count": 165,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scorer = metapy.topics.BLTermScorer(model)\n",
    "[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=0, scorer=scorer)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('smoke', 0.874164081128221),\n",
       " ('restaur', 0.31746129947227786),\n",
       " ('smoker', 0.20060262327581713),\n",
       " ('ban', 0.128530349360076),\n",
       " ('cigarett', 0.06557605570188008),\n",
       " ('non', 0.061284206154067045),\n",
       " ('complet', 0.0610537364588466),\n",
       " ('japan', 0.0584657324517579),\n",
       " ('health', 0.05054833214552534),\n",
       " ('seat', 0.04533989023870699)]"
      ]
     },
     "execution_count": 166,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=1, scorer=scorer)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Here we can see that the uninformative word stem \"think\" was downweighted from the word list from each topic, since it had relatively high probability in either topic."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can also see the inferred topic distribution for each document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<metapy.stats.Multinomial {0: 0.978659, 1: 0.021341}>"
      ]
     },
     "execution_count": 167,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.topic_distribution(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "It looks like our first document was written by a student who chose the part-time job essay topic..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<metapy.stats.Multinomial {0: 0.021203, 1: 0.978797}>"
      ]
     },
     "execution_count": 168,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.topic_distribution(900)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "...whereas this document looks like it was written by a student who chose the public smoking essay topic."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}