@alkutnikar
Created March 5, 2015 21:54
{
"metadata": {
"name": "",
"signature": "sha256:a4baa3274925a195e78c7fe11589641b8a04e07e5a258f1e04cda1c58a775f14"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Author - Ajay Lakshminarayanarao**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this assignment you will investigate the ability of a POS tagging system to deal\n",
"with fine-grained and coarse-grained POS categories.\n",
"The goal of the assignment is to conduct three experiments, analyze the results, and\n",
"draw conclusions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.corpus import treebank\n",
"from decimal import *\n",
"from IPython.display import HTML\n",
"from IPython.display import *\n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words = treebank.words()\n",
"n = len(words)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sents = treebank.sents()\n",
"train = sents[:500]\n",
"test = sents[500:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 62
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Divide the data into training and test"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"trainData = treebank.tagged_sents()[:500]\n",
"testData = treebank.tagged_sents()[500:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 63
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train POS Tagger using training sentences"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import nltk\n",
"from nltk import UnigramTagger, BigramTagger, TrigramTagger,AffixTagger\n",
"t = nltk.DefaultTagger('NN')\n",
"affix = AffixTagger(trainData,backoff=t)\n",
"unigram = UnigramTagger(trainData, backoff = affix)\n",
"bigram = BigramTagger(trainData, backoff=unigram) \n",
"trigram = TrigramTagger(trainData, backoff=bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
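The chain above means each tagger consults its own statistics first and defers to its backoff when it has none for the current word or context. A minimal pure-Python sketch of the lookup mechanism (toy tables and class names, not NLTK's actual implementation):

```python
class ToyBackoffTagger:
    """Looks a word up in its own table, else defers to its backoff tagger."""
    def __init__(self, table, backoff=None, default=None):
        self.table = table
        self.backoff = backoff
        self.default = default

    def choose(self, word):
        if word in self.table:
            return self.table[word]
        if self.backoff is not None:
            return self.backoff.choose(word)
        return self.default

toy_default = ToyBackoffTagger({}, default="NN")
toy_unigram = ToyBackoffTagger({"the": "DT", "runs": "VBZ"}, backoff=toy_default)
print(toy_unigram.choose("runs"))   # found in the unigram table -> VBZ
print(toy_unigram.choose("xyzzy"))  # unknown word falls through -> NN
```

In the notebook's chain, the AffixTagger sits between the unigram table and the NN default, guessing from word suffixes before giving up.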
{
"cell_type": "code",
"collapsed": false,
"input": [
"csvContent=[]\n",
"def myEvaluate(tagger):\n",
" id=1\n",
" csvContent.append(['id','Actual Tag','Predicted Tag'])\n",
" getcontext().prec = 4\n",
" correct = 0\n",
" incorrect = 0\n",
" for sent, actualSent in zip(test,testData):\n",
" mylist=[]\n",
" predSet=[]\n",
" mylist.append(id)\n",
" id+=1\n",
" for word,actualword in zip(sent,actualSent):\n",
" # Each word is tagged in isolation here, so the n-gram taggers get no\n",
" # left context and effectively back off to the unigram tagger.\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] == actualword[1]:\n",
" correct += 1\n",
" else:\n",
" incorrect += 1\n",
" predSet.append(tagger.tag(nltk.word_tokenize(word)))\n",
" mylist.append(actualSent)\n",
" mylist.append(predSet)\n",
" csvContent.append(mylist)\n",
" print 'No of correct:' , correct\n",
" print 'No of incorrect:', incorrect\n",
" print 'Accuracy:', (Decimal(correct)/Decimal((correct + incorrect)))*100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
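One subtlety in myEvaluate: tagger.tag is called on each word in isolation, so the bigram and trigram taggers never receive any left context and fall back to the unigram answer for every token, which is consistent with the near-identical accuracies below. A toy illustration of the effect (hypothetical mini-tagger, not NLTK):

```python
# Most-frequent-tag table; "saw" alone looks like the noun (a cutting tool).
UNIGRAM = {"I": "PRP", "saw": "NN", "her": "PRP"}

def tag_sentence(words):
    tags, prev = [], None
    for w in words:
        if w == "saw" and prev == "PRP":
            t = "VBD"  # context-aware: "saw" after a pronoun is a verb
        else:
            t = UNIGRAM.get(w, "NN")
        tags.append((w, t))
        prev = t
    return tags

whole = tag_sentence(["I", "saw", "her"])
word_by_word = [tag_sentence([w])[0] for w in ["I", "saw", "her"]]
print(whole)         # context kept: ('saw', 'VBD')
print(word_by_word)  # context lost: ('saw', 'NN')
```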
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 24720\n",
"No of incorrect: 63366\n",
"Accuracy: 28.06\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 72481\n",
"No of incorrect: 15605\n",
"Accuracy: 82.28\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 72472\n",
"No of incorrect: 15614\n",
"Accuracy: 82.27\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 72472\n",
"No of incorrect: 15614\n",
"Accuracy: 82.27\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Generate TSV file\n",
"import csv\n",
"f = open('part1.tsv','wb') \n",
"fw = csv.writer(f,delimiter='\\t') \n",
"fw.writerows(csvContent) \n",
"f.close() "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
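A portability note on the cell above: opening the file with 'wb' is Python 2-specific. Under Python 3 the csv module expects a text-mode handle opened with newline='' so it can control line endings itself. A small sketch (demo rows and a temporary file, not the notebook's real data):

```python
import csv
import os
import tempfile

rows = [["id", "Actual Tag", "Predicted Tag"],
        [1, "NN", "NNP"]]
path = os.path.join(tempfile.gettempdir(), "part1_demo.tsv")

# Python 3: text mode with newline='' lets the csv module manage newlines.
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

with open(path, newline="") as f:
    header = f.readline().rstrip("\r\n").split("\t")
print(header)  # ['id', 'Actual Tag', 'Predicted Tag']
```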
{
"cell_type": "code",
"collapsed": false,
"input": [
"def display(mytags):\n",
" s = \"\"\"<table>\n",
" <tr>\n",
" <th>Tag Name</th>\n",
" <th>Tag Accuracy</th>\n",
" </tr>\n",
" \"\"\"\n",
" for key in mytags.keys():\n",
" s += \"\"\"\n",
" <tr>\n",
" <td>\"\"\"+key+\"\"\"</td>\n",
" <td>\"\"\"+str(mytags[key])+\"\"\"</td>\n",
" </tr>\"\"\"\n",
" s+=\"\"\"</table>\"\"\"\n",
" return HTML(s)  # return the HTML object so the notebook renders the table"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"def myPOSEvaluate(tagger):\n",
" getcontext().prec = 4\n",
" tags={}\n",
" accuracy=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1]\n",
" if(not tags.has_key(predSent)):\n",
" tags[predSent]=[0,0]\n",
" accuracy = tags[predSent]\n",
" if predSent == actualword[1]:\n",
" accuracy[0] += 1\n",
" else:\n",
" accuracy[1] += 1\n",
" tags[predSent] = accuracy\n",
" finaltags=[]\n",
" for key in tags.keys():\n",
" correct = tags[key][0]\n",
" incorrect = tags[key][1]\n",
" finaltags.append((Decimal(correct)/Decimal((correct + incorrect)))*100)\n",
"\n",
" DF = pd.DataFrame()\n",
" DF['Tag Name'] = tags.keys()\n",
" DF['Accuracy'] = finaltags\n",
" print DF"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluate(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 PRP$ 100\n",
"1 VBG 70.06\n",
"2 VBD 82.48\n",
"3 `` 100\n",
"4 POS 88.61\n",
"5 '' 100\n",
"6 VBP 64.12\n",
"7 WDT 94.88\n",
"8 JJ 75.79\n",
"9 WP 98.58\n",
"10 VBZ 89.63\n",
"11 DT 98.19\n",
"12 RP 53.23\n",
"13 $ 100\n",
"14 NN 54.69\n",
"15 , 100\n",
"16 . 100\n",
"17 TO 99.90\n",
"18 PRP 97.71\n",
"19 RB 86.74\n",
"20 -LRB- 100\n",
"21 : 100\n",
"22 NNS 84.81\n",
"23 NNP 79.51\n",
"24 VB 61.01\n",
"25 WRB 100\n",
"26 CC 99.40\n",
"27 RBR 38.50\n",
"28 VBN 57.85\n",
"29 -NONE- 99.91\n",
"30 EX 71.05\n",
"31 IN 92.65\n",
"32 WP$ 100\n",
"33 CD 98.44\n",
"34 MD 99.61\n",
"35 NNPS 43.90\n",
"36 -RRB- 100\n",
"37 JJS 70.05\n",
"38 JJR 63.26\n",
"\n",
"[39 rows x 2 columns]\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class ListTable(list): \n",
" def _repr_html_(self):\n",
" html = [\"<table>\"]\n",
" for row in self:\n",
" html.append(\"<tr>\")\n",
" \n",
" for col in row:\n",
" html.append(\"<td>{0}</td>\".format(col))\n",
" \n",
" html.append(\"</tr>\")\n",
" html.append(\"</table>\")\n",
" return ''.join(html)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Matrix = [[0 for x in range(11)] for x in range(11)]\n",
"def getConfusionMatrix2(tagger):\n",
" dict={'JJ':0, 'NN':1, 'NNP':2, 'NNPS':3, 'RB':4, 'RP':5, 'IN':6, 'VB':7, 'VBD':8, 'VBN':9, 'VBP':10}\n",
" mylist=[]\n",
" tags={}\n",
" pred=[]\n",
" actual=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1] \n",
" if ((predSent in dict.keys()) and (actualword[1] in dict.keys())):\n",
" Matrix[dict[actualword[1]]][dict[predSent]] +=1\n",
" \n",
"getConfusionMatrix2(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = ListTable()\n",
"tagList=['JJ', 'NN', 'NNP', 'NNPS', 'RB', 'RP', 'IN', 'VB', 'VBD', 'VBN', 'VBP']\n",
"table.append([' '] + tagList)\n",
"for i in xrange(11):\n",
" table.append([tagList[i]] + Matrix[i])\n",
"print 'Confusion Matrix'\n",
"table"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix\n"
]
},
{
"html": [
"<table><tr><td> </td><td>JJ</td><td>NN</td><td>NNP</td><td>NNPS</td><td>RB</td><td>RP</td><td>IN</td><td>VB</td><td>VBD</td><td>VBN</td><td>VBP</td></tr><tr><td>JJ</td><td>3272</td><td>932</td><td>190</td><td>0</td><td>123</td><td>0</td><td>68</td><td>56</td><td>28</td><td>207</td><td>15</td></tr><tr><td>NN</td><td>324</td><td>9595</td><td>459</td><td>0</td><td>51</td><td>0</td><td>81</td><td>390</td><td>2</td><td>49</td><td>106</td></tr><tr><td>NNP</td><td>384</td><td>3779</td><td>3337</td><td>13</td><td>39</td><td>0</td><td>69</td><td>37</td><td>26</td><td>25</td><td>11</td></tr><tr><td>NNPS</td><td>0</td><td>16</td><td>13</td><td>18</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><td>RB</td><td>97</td><td>116</td><td>6</td><td>0</td><td>2002</td><td>54</td><td>109</td><td>31</td><td>0</td><td>1</td><td>1</td></tr><tr><td>RP</td><td>0</td><td>1</td><td>0</td><td>0</td><td>14</td><td>132</td><td>31</td><td>6</td><td>0</td><td>0</td><td>0</td></tr><tr><td>IN</td><td>19</td><td>66</td><td>19</td><td>0</td><td>43</td><td>62</td><td>8348</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>VB</td><td>135</td><td>623</td><td>58</td><td>0</td><td>12</td><td>0</td><td>26</td><td>1114</td><td>0</td><td>31</td><td>266</td></tr><tr><td>VBD</td><td>8</td><td>101</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0</td><td>16</td><td>1841</td><td>647</td><td>2</td></tr><tr><td>VBN</td><td>23</td><td>116</td><td>31</td><td>0</td><td>7</td><td>0</td><td>1</td><td>13</td><td>335</td><td>1326</td><td>5</td></tr><tr><td>VBP</td><td>20</td><td>194</td><td>12</td><td>0</td><td>2</td><td>0</td><td>9</td><td>162</td><td>0</td><td>3</td><td>731</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"[[' ', 'JJ', 'NN', 'NNP', 'NNPS', 'RB', 'RP', 'IN', 'VB', 'VBD', 'VBN', 'VBP'],\n",
" ['JJ', 3272, 932, 190, 0, 123, 0, 68, 56, 28, 207, 15],\n",
" ['NN', 324, 9595, 459, 0, 51, 0, 81, 390, 2, 49, 106],\n",
" ['NNP', 384, 3779, 3337, 13, 39, 0, 69, 37, 26, 25, 11],\n",
" ['NNPS', 0, 16, 13, 18, 0, 0, 0, 0, 0, 0, 0],\n",
" ['RB', 97, 116, 6, 0, 2002, 54, 109, 31, 0, 1, 1],\n",
" ['RP', 0, 1, 0, 0, 14, 132, 31, 6, 0, 0, 0],\n",
" ['IN', 19, 66, 19, 0, 43, 62, 8348, 1, 0, 0, 1],\n",
" ['VB', 135, 623, 58, 0, 12, 0, 26, 1114, 0, 31, 266],\n",
" ['VBD', 8, 101, 3, 0, 0, 0, 0, 16, 1841, 647, 2],\n",
" ['VBN', 23, 116, 31, 0, 7, 0, 1, 13, 335, 1326, 5],\n",
" ['VBP', 20, 194, 12, 0, 2, 0, 9, 162, 0, 3, 731]]"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" *Give plausible explanations for the two categories on which the POS tagger performs the worst and the top two categories on which it performs the best. For example, the tagger may get PRP with a high accuracy and may make most mistakes on RBP. Think what factors could affect the performance on a category.\n",
" \n",
" Ans: The largest error occurs when the actual category is NNP but NN is predicted, with 3779 incorrect entries. The next-largest error is verbs predicted as nouns. In the first case the tagger fails to distinguish proper nouns from common nouns: proper nouns form an open class, so many of the names in the test set never appear in the 500 training sentences, and unseen words tend to default to NN through the backoff chain. In the second case, many English words function as either a noun or a verb depending on context, so without reliable context information it is difficult for the tagger to disambiguate them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Briefly (in a couple of sentences) mention what tests you can do to verify if your explanations are indeed true\n",
"\n",
"Ans: Measuring the out-of-vocabulary rate of test-set proper nouns against the training data would verify the sparsity explanation. Incorporating knowledge of word probabilities and of the POS tags of neighbouring words should help minimize these errors, and comparing against more advanced models such as an HMM or a feature-rich tagger would help confirm the cause.\n"
]
},
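One concrete test for the data-sparsity explanation above is to measure, per gold tag, how many test tokens never occur in the training data. A sketch with toy sentences (the real check would pass the notebook's 500-sentence training split and the remaining test split):

```python
def oov_rate_by_tag(train_tagged, test_tagged):
    """Per gold tag, the fraction of test tokens never seen in training."""
    vocab = {w for sent in train_tagged for w, _ in sent}
    seen, unseen = {}, {}
    for sent in test_tagged:
        for w, t in sent:
            bucket = seen if w in vocab else unseen
            bucket[t] = bucket.get(t, 0) + 1
    return {t: unseen.get(t, 0) / float(unseen.get(t, 0) + seen.get(t, 0))
            for t in set(seen) | set(unseen)}

toy_train = [[("the", "DT"), ("dog", "NN"), ("ran", "VBD")]]
toy_test = [[("the", "DT"), ("cat", "NN"), ("Pierre", "NNP")]]
print(sorted(oov_rate_by_tag(toy_train, toy_test).items()))
# [('DT', 0.0), ('NN', 1.0), ('NNP', 1.0)]
```

A high OOV rate for NNP relative to other tags would support the explanation that the 500-sentence training set simply does not cover the test-set names.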
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"PART 2"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myDict = {\"NN\":\"SNN\", \"NNS\":\"SNN\", \"NNP\":\"SNN\", \"NNPS\":\"SNN\", \"PRP\":\"SNN\", \"PRP$\":\"SNN\", \\\n",
" \"VB\":\"SVB\", \"VBP\":\"SVB\", \"VBD\":\"SVB\", \"VBN\":\"SVB\", \"VBZ\":\"SVB\", \"VBG\":\"SVB\", \\\n",
" \"JJ\":\"SJJ\", \"JJR\":\"SJJ\", \"JJS\":\"SJJ\", \\\n",
" \"RB\":\"SRB\", \"RBR\":\"SRB\", \"RBS\":\"SRB\"}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 48
},
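A note on the mapping above: the repeated `in myDict.keys()` / `else "MISC"` branches used in the evaluation functions below can be collapsed with dict.get, which supplies the default for any tag outside the table. A sketch of applying the fine-to-coarse mapping to a whole tagged sentence:

```python
MY_DICT = {"NN": "SNN", "NNS": "SNN", "NNP": "SNN", "NNPS": "SNN",
           "PRP": "SNN", "PRP$": "SNN",
           "VB": "SVB", "VBP": "SVB", "VBD": "SVB", "VBN": "SVB",
           "VBZ": "SVB", "VBG": "SVB",
           "JJ": "SJJ", "JJR": "SJJ", "JJS": "SJJ",
           "RB": "SRB", "RBR": "SRB", "RBS": "SRB"}

def to_coarse(tagged_sent):
    # dict.get supplies "MISC" for any tag outside the mapping.
    return [(w, MY_DICT.get(t, "MISC")) for w, t in tagged_sent]

print(to_coarse([("Pierre", "NNP"), ("will", "MD"), ("join", "VB")]))
# [('Pierre', 'SNN'), ('will', 'MISC'), ('join', 'SVB')]
```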
{
"cell_type": "code",
"collapsed": false,
"input": [
"csvContent=[]\n",
"def myEvaluateCoarse(tagger):\n",
" getcontext().prec = 4\n",
" correct = 0\n",
" incorrect = 0\n",
" id=1\n",
" csvContent.append(['id','Actual Tag','Predicted Tag'])\n",
" for sent, actualSent in zip(test,testData):\n",
" mylist=[]\n",
" predSet=[]\n",
" mylist.append(id)\n",
" id += 1\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] in myDict.keys():\n",
" val = myDict[tagger.tag(nltk.word_tokenize(word))[0][1]]\n",
" else:\n",
" val = \"MISC\"\n",
" if actualword[1] in myDict.keys():\n",
" val2 = myDict[actualword[1]]\n",
" else:\n",
" val2 = \"MISC\"\n",
" \n",
" if val == val2:\n",
" correct += 1\n",
" else:\n",
" incorrect += 1\n",
" predSet.append(tagger.tag(nltk.word_tokenize(word)))\n",
" mylist.append(actualSent)\n",
" mylist.append(predSet)\n",
" csvContent.append(mylist)\n",
" print 'No of correct:' , correct\n",
" print 'No of incorrect:', incorrect\n",
" print 'Accuracy:', (Decimal(correct)/Decimal((correct + incorrect)))*100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 49
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 88086\n",
"No of incorrect: 0\n",
"Accuracy: 100\n"
]
}
],
"prompt_number": 50
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79579\n",
"No of incorrect: 8507\n",
"Accuracy: 90.34\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79579\n",
"No of incorrect: 8507\n",
"Accuracy: 90.34\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79579\n",
"No of incorrect: 8507\n",
"Accuracy: 90.34\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Generate TSV file\n",
"import csv\n",
"f = open('part2.tsv','wb') \n",
"fw = csv.writer(f,delimiter='\\t') \n",
"fw.writerows(csvContent) \n",
"f.close() "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 51
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"def myPOSEvaluateCoarse(tagger):\n",
" getcontext().prec = 4\n",
" tags={}\n",
" accuracy=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] in myDict.keys():\n",
" val = myDict[tagger.tag(nltk.word_tokenize(word))[0][1]]\n",
" else:\n",
" val = \"MISC\"\n",
" if actualword[1] in myDict.keys():\n",
" val2 = myDict[actualword[1]]\n",
" else:\n",
" val2 = \"MISC\"\n",
" \n",
" predSent = val\n",
" if(not tags.has_key(predSent)):\n",
" tags[predSent]=[0,0]\n",
" accuracy = tags[predSent]\n",
" if predSent == val2:\n",
" accuracy[0] += 1\n",
" else:\n",
" accuracy[1] += 1\n",
" tags[predSent] = accuracy\n",
" finaltags=[]\n",
" for key in tags.keys():\n",
" correct = tags[key][0]\n",
" incorrect = tags[key][1]\n",
" finaltags.append((Decimal(correct)/Decimal((correct + incorrect)))*100)\n",
"\n",
" DF = pd.DataFrame()\n",
" DF['Tag Name'] = tags.keys()\n",
" DF['Accuracy'] = finaltags\n",
" print DF"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 33.48\n",
"1 SRB 76.94\n",
"2 SVB 67.97\n",
"3 SJJ 59.51\n",
"4 MISC 84.65\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.65\n",
"1 SRB 83.21\n",
"2 SJJ 75.42\n",
"3 MISC 98.30\n",
"4 SVB 84.41\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.65\n",
"1 SRB 83.21\n",
"2 SJJ 75.42\n",
"3 MISC 98.30\n",
"4 SVB 84.41\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.65\n",
"1 SRB 83.21\n",
"2 SJJ 75.42\n",
"3 MISC 98.30\n",
"4 SVB 84.41\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Confusion Matrix\n",
"Matrix = [[0 for x in range(5)] for x in range(5)]\n",
"def getConfusionMatrixCoarse(tagger):\n",
" dict={'SNN':0, 'SVB':1, 'SJJ':2, 'SRB':3, 'MISC':4}\n",
" tags={}\n",
" pred=[]\n",
" actual=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] in myDict.keys():\n",
" val = myDict[tagger.tag(nltk.word_tokenize(word))[0][1]]\n",
" else:\n",
" val = \"MISC\"\n",
" if actualword[1] in myDict.keys():\n",
" val2 = myDict[actualword[1]]\n",
" else:\n",
" val2 = \"MISC\"\n",
" \n",
" predSent = val \n",
" \n",
" Matrix[dict[val2]][dict[val]] +=1\n",
" \n",
"getConfusionMatrixCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = ListTable()\n",
"tagList=['SNN', 'SVB', 'SJJ', 'SRB', 'MISC']\n",
"table.append([' '] + tagList)\n",
"for i in xrange(5):\n",
" table.append([tagList[i]] + Matrix[i])\n",
"print 'Confusion Matrix'\n",
"table"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix\n"
]
},
{
"html": [
"<table><tr><td> </td><td>SNN</td><td>SVB</td><td>SJJ</td><td>SRB</td><td>MISC</td></tr><tr><td>SNN</td><td>25046</td><td>1185</td><td>773</td><td>92</td><td>243</td></tr><tr><td>SVB</td><td>1622</td><td>9099</td><td>196</td><td>21</td><td>142</td></tr><tr><td>SJJ</td><td>1171</td><td>451</td><td>3611</td><td>238</td><td>98</td></tr><tr><td>SRB</td><td>129</td><td>33</td><td>172</td><td>2076</td><td>206</td></tr><tr><td>MISC</td><td>1619</td><td>12</td><td>36</td><td>68</td><td>39747</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"[[' ', 'SNN', 'SVB', 'SJJ', 'SRB', 'MISC'],\n",
" ['SNN', 25046, 1185, 773, 92, 243],\n",
" ['SVB', 1622, 9099, 196, 21, 142],\n",
" ['SJJ', 1171, 451, 3611, 238, 98],\n",
" ['SRB', 129, 33, 172, 2076, 206],\n",
" ['MISC', 1619, 12, 36, 68, 39747]]"
]
}
],
"prompt_number": 31
},
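Each row of the matrix above is a gold (actual) tag, so the diagonal entry divided by the row sum gives per-tag recall, a useful complement to the per-predicted-tag figures reported earlier. A small sketch, with the numbers copied from the bigram coarse matrix above:

```python
def per_tag_recall(matrix, labels):
    """Row i = gold tag, column j = predicted tag; diagonal / row sum = recall."""
    recalls = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        recalls[label] = 100.0 * matrix[i][i] / total if total else 0.0
    return recalls

coarse_labels = ["SNN", "SVB", "SJJ", "SRB", "MISC"]
coarse_matrix = [[25046, 1185, 773, 92, 243],
                 [1622, 9099, 196, 21, 142],
                 [1171, 451, 3611, 238, 98],
                 [129, 33, 172, 2076, 206],
                 [1619, 12, 36, 68, 39747]]
recalls = per_tag_recall(coarse_matrix, coarse_labels)
print(sorted((k, round(v, 2)) for k, v in recalls.items()))
```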
{
"cell_type": "code",
"collapsed": false,
"input": [
"tsent = treebank.tagged_sents()\n",
"mytsent=[]\n",
"for sent in tsent:\n",
" mysent=[]\n",
" for word in sent:\n",
" if word[1] in myDict:\n",
" myword=(word[0], myDict[word[1]])\n",
" else:\n",
" myword = (word[0], \"MISC\")\n",
" mysent.append(myword)\n",
" mytsent.append(mysent)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 80
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"trainData = mytsent[:500]\n",
"testData = mytsent[500:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 81
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t = nltk.DefaultTagger('SNN')\n",
"affix = AffixTagger(trainData,backoff=t)\n",
"unigram = UnigramTagger(trainData, backoff = affix)\n",
"bigram = BigramTagger(trainData, backoff=unigram) \n",
"trigram = TrigramTagger(trainData, backoff=bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 82
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csvContentPre=[]\n",
"def myEvaluatePreCoarse(tagger):\n",
" getcontext().prec = 4\n",
" correct = 0\n",
" incorrect = 0\n",
" id=1\n",
" csvContentPre.append(['id','Actual Tag','Predicted Tag'])\n",
" for sent, actualSent in zip(test,testData):\n",
" mylist=[]\n",
" predSet=[]\n",
" mylist.append(id)\n",
" id += 1\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] == actualword[1]:\n",
" correct += 1\n",
" else:\n",
" incorrect += 1\n",
" predSet.append(tagger.tag(nltk.word_tokenize(word)))\n",
" mylist.append(actualSent)\n",
" mylist.append(predSet)\n",
" csvContentPre.append(mylist)\n",
" print 'No of correct:' , correct\n",
" print 'No of incorrect:', incorrect\n",
" print 'Accuracy:', (Decimal(correct)/Decimal((correct + incorrect)))*100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 35213\n",
"No of incorrect: 52873\n",
"Accuracy: 39.98\n"
]
}
],
"prompt_number": 105
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79839\n",
"No of incorrect: 8247\n",
"Accuracy: 90.64\n"
]
}
],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79795\n",
"No of incorrect: 8291\n",
"Accuracy: 90.59\n"
]
}
],
"prompt_number": 107
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79795\n",
"No of incorrect: 8291\n",
"Accuracy: 90.59\n"
]
}
],
"prompt_number": 108
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Generate TSV file\n",
"import csv\n",
"f = open('part2b.tsv','wb') \n",
"fw = csv.writer(f,delimiter='\\t') \n",
"fw.writerows(csvContentPre) \n",
"f.close() "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 40
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def myPOSEvaluatePre(tagger):\n",
" getcontext().prec = 4\n",
" tags={}\n",
" accuracy=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1]\n",
" if(not tags.has_key(predSent)):\n",
" tags[predSent]=[0,0]\n",
" accuracy = tags[predSent]\n",
" if predSent == actualword[1]:\n",
" accuracy[0] += 1\n",
" else:\n",
" accuracy[1] += 1\n",
" tags[predSent] = accuracy\n",
" finaltags=[]\n",
" for key in tags.keys():\n",
" correct = tags[key][0]\n",
" incorrect = tags[key][1]\n",
" finaltags.append((Decimal(correct)/Decimal((correct + incorrect)))*100)\n",
"\n",
" DF = pd.DataFrame()\n",
" DF['Tag Name'] = tags.keys()\n",
" DF['Accuracy'] = finaltags\n",
" print DF"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 33.62\n",
"1 SRB 76.51\n",
"2 SVB 68.97\n",
"3 SJJ 63.08\n",
"4 MISC 85.24\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.40\n",
"1 SRB 82.33\n",
"2 SJJ 77.79\n",
"3 MISC 98.36\n",
"4 SVB 86.43\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.38\n",
"1 SRB 82.33\n",
"2 SJJ 77.02\n",
"3 MISC 98.36\n",
"4 SVB 86.43\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.38\n",
"1 SRB 82.33\n",
"2 SJJ 77.02\n",
"3 MISC 98.36\n",
"4 SVB 86.43\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Matrix = [[0 for x in range(5)] for x in range(5)]\n",
"def getConfusionMatrixPreCoarse(tagger):\n",
" dict={'SNN':0, 'SVB':1, 'SJJ':2, 'SRB':3, 'MISC':4}\n",
" tags={}\n",
" pred=[]\n",
" actual=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1] \n",
" Matrix[dict[actualword[1]]][dict[predSent]] +=1\n",
" \n",
"getConfusionMatrixPreCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = ListTable()\n",
"tagList=['SNN', 'SVB', 'SJJ', 'SRB', 'MISC']\n",
"table.append([' '] + tagList)\n",
"for i in xrange(5):\n",
" table.append([tagList[i]] + Matrix[i])\n",
"print 'Confusion Matrix'\n",
"table"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix\n"
]
},
{
"html": [
"<table><tr><td> </td><td>SNN</td><td>SVB</td><td>SJJ</td><td>SRB</td><td>MISC</td></tr><tr><td>SNN</td><td>25303</td><td>943</td><td>737</td><td>121</td><td>235</td></tr><tr><td>SVB</td><td>1726</td><td>9056</td><td>131</td><td>25</td><td>142</td></tr><tr><td>SJJ</td><td>1204</td><td>455</td><td>3590</td><td>235</td><td>85</td></tr><tr><td>SRB</td><td>129</td><td>18</td><td>167</td><td>2101</td><td>201</td></tr><tr><td>MISC</td><td>1625</td><td>6</td><td>36</td><td>70</td><td>39745</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 47,
"text": [
"[[' ', 'SNN', 'SVB', 'SJJ', 'SRB', 'MISC'],\n",
" ['SNN', 25303, 943, 737, 121, 235],\n",
" ['SVB', 1726, 9056, 131, 25, 142],\n",
" ['SJJ', 1204, 455, 3590, 235, 85],\n",
" ['SRB', 129, 18, 167, 2101, 201],\n",
" ['MISC', 1625, 6, 36, 70, 39745]]"
]
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part II \n",
"\u2022 Did you a priori (before experimentation) expect Method A or B to perform better? Why? There is no correct answer here. This exercise is to test your ability to articulate your intuitions based on what you\u2019ve learnt in class. \n",
"\n",
"Ans: My a priori expectation was that Method B would perform better, since its taggers are trained directly on the coarse categories. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\u2022 Give plausible explanations for the observed differences in overall accuracy between Method A and B. Again there is no correct answer here. The purpose of this exercise is for you to connect concepts we\u2019ve learnt in class to what you observe in practice\n",
"\n",
"Ans: In Method A the fine-grained predictions are simply mapped to coarse categories, so it inherits the fine-grained tagger's behaviour. In Method B, coarse classes that pool tags with few training examples can still lead to incorrect predictions. As a result Method B shows no drastic improvement over Method A, even though in principle it could have performed better, and the two confusion matrices are very similar."
]
}
],
"metadata": {}
}
]
}