@alkutnikar
Created March 5, 2015 21:54
{
"metadata": {
"name": "",
"signature": "sha256:a4baa3274925a195e78c7fe11589641b8a04e07e5a258f1e04cda1c58a775f14"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Author - Ajay Lakshminarayanarao**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this assignment you will investigate the ability of a POS tagging system to deal\n",
"with fine-grained and coarse-grained POS categories.\n",
"The goal of the assignment is to conduct three experiments, analyze the results, and\n",
"draw conclusions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.corpus import treebank\n",
"from decimal import *\n",
"from IPython.display import HTML\n",
"from IPython.display import *\n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words = treebank.words()\n",
"n = len(words)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sents = treebank.sents()\n",
"train = sents[:500]\n",
"test = sents[500:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 62
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Divide the data into training and test"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"trainData = treebank.tagged_sents()[:500]\n",
"testData = treebank.tagged_sents()[500:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 63
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train POS Tagger using training sentences"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import nltk\n",
"from nltk import UnigramTagger, BigramTagger, TrigramTagger,AffixTagger\n",
"t = nltk.DefaultTagger('NN')\n",
"affix = AffixTagger(trainData,backoff=t)\n",
"unigram = UnigramTagger(trainData, backoff = affix)\n",
"bigram = BigramTagger(trainData, backoff=unigram) \n",
"trigram = TrigramTagger(trainData, backoff=bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
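The chain above means each tagger consults its own statistics first and defers to its backoff when it has none for the current word or context. A minimal pure-Python sketch of the lookup mechanism (toy tables and class names, not NLTK's actual implementation):

```python
class ToyBackoffTagger:
    """Looks a word up in its own table, else defers to its backoff tagger."""
    def __init__(self, table, backoff=None, default=None):
        self.table = table
        self.backoff = backoff
        self.default = default

    def choose(self, word):
        if word in self.table:
            return self.table[word]
        if self.backoff is not None:
            return self.backoff.choose(word)
        return self.default

toy_default = ToyBackoffTagger({}, default="NN")
toy_unigram = ToyBackoffTagger({"the": "DT", "runs": "VBZ"}, backoff=toy_default)
print(toy_unigram.choose("runs"))   # found in the unigram table -> VBZ
print(toy_unigram.choose("xyzzy"))  # unknown word falls through -> NN
```

In the notebook's chain, the AffixTagger sits between the unigram table and the NN default, guessing from word suffixes before giving up.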
{
"cell_type": "code",
"collapsed": false,
"input": [
"csvContent=[]\n",
"def myEvaluate(tagger):\n",
" id=1\n",
" csvContent.append(['id','Actual Tag','Predicted Tag'])\n",
" getcontext().prec = 4\n",
" correct = 0\n",
" incorrect = 0\n",
" for sent, actualSent in zip(test,testData):\n",
" mylist=[]\n",
" predSet=[]\n",
" mylist.append(id)\n",
" id+=1\n",
" for word,actualword in zip(sent,actualSent):\n",
" # Each word is tagged in isolation here, so the n-gram taggers get no\n",
" # left context and effectively back off to the unigram tagger.\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] == actualword[1]:\n",
" correct += 1\n",
" else:\n",
" incorrect += 1\n",
" predSet.append(tagger.tag(nltk.word_tokenize(word)))\n",
" mylist.append(actualSent)\n",
" mylist.append(predSet)\n",
" csvContent.append(mylist)\n",
" print 'No of correct:' , correct\n",
" print 'No of incorrect:', incorrect\n",
" print 'Accuracy:', (Decimal(correct)/Decimal((correct + incorrect)))*100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
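One subtlety in myEvaluate: tagger.tag is called on each word in isolation, so the bigram and trigram taggers never receive any left context and fall back to the unigram answer for every token, which is consistent with the near-identical accuracies below. A toy illustration of the effect (hypothetical mini-tagger, not NLTK):

```python
# Most-frequent-tag table; "saw" alone looks like the noun (a cutting tool).
UNIGRAM = {"I": "PRP", "saw": "NN", "her": "PRP"}

def tag_sentence(words):
    tags, prev = [], None
    for w in words:
        if w == "saw" and prev == "PRP":
            t = "VBD"  # context-aware: "saw" after a pronoun is a verb
        else:
            t = UNIGRAM.get(w, "NN")
        tags.append((w, t))
        prev = t
    return tags

whole = tag_sentence(["I", "saw", "her"])
word_by_word = [tag_sentence([w])[0] for w in ["I", "saw", "her"]]
print(whole)         # context kept: ('saw', 'VBD')
print(word_by_word)  # context lost: ('saw', 'NN')
```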
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 24720\n",
"No of incorrect: 63366\n",
"Accuracy: 28.06\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 72481\n",
"No of incorrect: 15605\n",
"Accuracy: 82.28\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 72472\n",
"No of incorrect: 15614\n",
"Accuracy: 82.27\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluate(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 72472\n",
"No of incorrect: 15614\n",
"Accuracy: 82.27\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Generate TSV file\n",
"import csv\n",
"f = open('part1.tsv','wb') \n",
"fw = csv.writer(f,delimiter='\\t') \n",
"fw.writerows(csvContent) \n",
"f.close() "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
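A portability note on the cell above: opening the file with 'wb' is Python 2-specific. Under Python 3 the csv module expects a text-mode handle opened with newline='' so it can control line endings itself. A small sketch (demo rows and a temporary file, not the notebook's real data):

```python
import csv
import os
import tempfile

rows = [["id", "Actual Tag", "Predicted Tag"],
        [1, "NN", "NNP"]]
path = os.path.join(tempfile.gettempdir(), "part1_demo.tsv")

# Python 3: text mode with newline='' lets the csv module manage newlines.
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

with open(path, newline="") as f:
    header = f.readline().rstrip("\r\n").split("\t")
print(header)  # ['id', 'Actual Tag', 'Predicted Tag']
```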
{
"cell_type": "code",
"collapsed": false,
"input": [
"def display(mytags):\n",
" s = \"\"\"<table>\n",
" <tr>\n",
" <th>Tag Name</th>\n",
" <th>Tag Accuracy</th>\n",
" </tr>\n",
" \"\"\"\n",
" for key in mytags.keys():\n",
" s += \"\"\"\n",
" <tr>\n",
" <td>\"\"\"+key+\"\"\"</td>\n",
" <td>\"\"\"+str(mytags[key])+\"\"\"</td>\n",
" </tr>\"\"\"\n",
" s+=\"\"\"</table>\"\"\"\n",
" return HTML(s)  # return the HTML object so the notebook renders the table"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"def myPOSEvaluate(tagger):\n",
" getcontext().prec = 4\n",
" tags={}\n",
" accuracy=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1]\n",
" if(not tags.has_key(predSent)):\n",
" tags[predSent]=[0,0]\n",
" accuracy = tags[predSent]\n",
" if predSent == actualword[1]:\n",
" accuracy[0] += 1\n",
" else:\n",
" accuracy[1] += 1\n",
" tags[predSent] = accuracy\n",
" finaltags=[]\n",
" for key in tags.keys():\n",
" correct = tags[key][0]\n",
" incorrect = tags[key][1]\n",
" finaltags.append((Decimal(correct)/Decimal((correct + incorrect)))*100)\n",
"\n",
" DF = pd.DataFrame()\n",
" DF['Tag Name'] = tags.keys()\n",
" DF['Accuracy'] = finaltags\n",
" print DF"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluate(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 PRP$ 100\n",
"1 VBG 70.06\n",
"2 VBD 82.48\n",
"3 `` 100\n",
"4 POS 88.61\n",
"5 '' 100\n",
"6 VBP 64.12\n",
"7 WDT 94.88\n",
"8 JJ 75.79\n",
"9 WP 98.58\n",
"10 VBZ 89.63\n",
"11 DT 98.19\n",
"12 RP 53.23\n",
"13 $ 100\n",
"14 NN 54.69\n",
"15 , 100\n",
"16 . 100\n",
"17 TO 99.90\n",
"18 PRP 97.71\n",
"19 RB 86.74\n",
"20 -LRB- 100\n",
"21 : 100\n",
"22 NNS 84.81\n",
"23 NNP 79.51\n",
"24 VB 61.01\n",
"25 WRB 100\n",
"26 CC 99.40\n",
"27 RBR 38.50\n",
"28 VBN 57.85\n",
"29 -NONE- 99.91\n",
"30 EX 71.05\n",
"31 IN 92.65\n",
"32 WP$ 100\n",
"33 CD 98.44\n",
"34 MD 99.61\n",
"35 NNPS 43.90\n",
"36 -RRB- 100\n",
"37 JJS 70.05\n",
"38 JJR 63.26\n",
"\n",
"[39 rows x 2 columns]\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class ListTable(list): \n",
" def _repr_html_(self):\n",
" html = [\"<table>\"]\n",
" for row in self:\n",
" html.append(\"<tr>\")\n",
" \n",
" for col in row:\n",
" html.append(\"<td>{0}</td>\".format(col))\n",
" \n",
" html.append(\"</tr>\")\n",
" html.append(\"</table>\")\n",
" return ''.join(html)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Matrix = [[0 for x in range(11)] for x in range(11)]\n",
"def getConfusionMatrix2(tagger):\n",
" dict={'JJ':0, 'NN':1, 'NNP':2, 'NNPS':3, 'RB':4, 'RP':5, 'IN':6, 'VB':7, 'VBD':8, 'VBN':9, 'VBP':10}\n",
" mylist=[]\n",
" tags={}\n",
" pred=[]\n",
" actual=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1] \n",
" if ((predSent in dict.keys()) and (actualword[1] in dict.keys())):\n",
" Matrix[dict[actualword[1]]][dict[predSent]] +=1\n",
" \n",
"getConfusionMatrix2(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = ListTable()\n",
"tagList=['JJ', 'NN', 'NNP', 'NNPS', 'RB', 'RP', 'IN', 'VB', 'VBD', 'VBN', 'VBP']\n",
"table.append([' '] + tagList)\n",
"for i in xrange(11):\n",
" table.append([tagList[i]] + Matrix[i])\n",
"print 'Confusion Matrix'\n",
"table"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix\n"
]
},
{
"html": [
"<table><tr><td> </td><td>JJ</td><td>NN</td><td>NNP</td><td>NNPS</td><td>RB</td><td>RP</td><td>IN</td><td>VB</td><td>VBD</td><td>VBN</td><td>VBP</td></tr><tr><td>JJ</td><td>3272</td><td>932</td><td>190</td><td>0</td><td>123</td><td>0</td><td>68</td><td>56</td><td>28</td><td>207</td><td>15</td></tr><tr><td>NN</td><td>324</td><td>9595</td><td>459</td><td>0</td><td>51</td><td>0</td><td>81</td><td>390</td><td>2</td><td>49</td><td>106</td></tr><tr><td>NNP</td><td>384</td><td>3779</td><td>3337</td><td>13</td><td>39</td><td>0</td><td>69</td><td>37</td><td>26</td><td>25</td><td>11</td></tr><tr><td>NNPS</td><td>0</td><td>16</td><td>13</td><td>18</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><td>RB</td><td>97</td><td>116</td><td>6</td><td>0</td><td>2002</td><td>54</td><td>109</td><td>31</td><td>0</td><td>1</td><td>1</td></tr><tr><td>RP</td><td>0</td><td>1</td><td>0</td><td>0</td><td>14</td><td>132</td><td>31</td><td>6</td><td>0</td><td>0</td><td>0</td></tr><tr><td>IN</td><td>19</td><td>66</td><td>19</td><td>0</td><td>43</td><td>62</td><td>8348</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>VB</td><td>135</td><td>623</td><td>58</td><td>0</td><td>12</td><td>0</td><td>26</td><td>1114</td><td>0</td><td>31</td><td>266</td></tr><tr><td>VBD</td><td>8</td><td>101</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0</td><td>16</td><td>1841</td><td>647</td><td>2</td></tr><tr><td>VBN</td><td>23</td><td>116</td><td>31</td><td>0</td><td>7</td><td>0</td><td>1</td><td>13</td><td>335</td><td>1326</td><td>5</td></tr><tr><td>VBP</td><td>20</td><td>194</td><td>12</td><td>0</td><td>2</td><td>0</td><td>9</td><td>162</td><td>0</td><td>3</td><td>731</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"[[' ', 'JJ', 'NN', 'NNP', 'NNPS', 'RB', 'RP', 'IN', 'VB', 'VBD', 'VBN', 'VBP'],\n",
" ['JJ', 3272, 932, 190, 0, 123, 0, 68, 56, 28, 207, 15],\n",
" ['NN', 324, 9595, 459, 0, 51, 0, 81, 390, 2, 49, 106],\n",
" ['NNP', 384, 3779, 3337, 13, 39, 0, 69, 37, 26, 25, 11],\n",
" ['NNPS', 0, 16, 13, 18, 0, 0, 0, 0, 0, 0, 0],\n",
" ['RB', 97, 116, 6, 0, 2002, 54, 109, 31, 0, 1, 1],\n",
" ['RP', 0, 1, 0, 0, 14, 132, 31, 6, 0, 0, 0],\n",
" ['IN', 19, 66, 19, 0, 43, 62, 8348, 1, 0, 0, 1],\n",
" ['VB', 135, 623, 58, 0, 12, 0, 26, 1114, 0, 31, 266],\n",
" ['VBD', 8, 101, 3, 0, 0, 0, 0, 16, 1841, 647, 2],\n",
" ['VBN', 23, 116, 31, 0, 7, 0, 1, 13, 335, 1326, 5],\n",
" ['VBP', 20, 194, 12, 0, 2, 0, 9, 162, 0, 3, 731]]"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" *Give plausible explanations for the two categories on which the POS tagger performs the worst and the top two categories on which it performs the best. For example, the tagger may get PRP with a high accuracy and may make most mistakes on RBP. Think what factors could affect the performance on a category.\n",
" \n",
" Ans: The largest error occurs when the actual category is NNP but NN is predicted, with 3779 incorrect entries. The next-largest error is verbs predicted as nouns. In the first case the tagger fails to distinguish proper nouns from common nouns: proper nouns form an open class, so many of the names in the test set never appear in the 500 training sentences, and unseen words tend to default to NN through the backoff chain. In the second case, many English words function as either a noun or a verb depending on context, so without reliable context information it is difficult for the tagger to disambiguate them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Briefly (in a couple of sentences) mention what tests you can do to verify if your explanations are indeed true\n",
"\n",
"Ans: Measuring the out-of-vocabulary rate of test-set proper nouns against the training data would verify the sparsity explanation. Incorporating knowledge of word probabilities and of the POS tags of neighbouring words should help minimize these errors, and comparing against more advanced models such as an HMM or a feature-rich tagger would help confirm the cause.\n"
]
},
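One concrete test for the data-sparsity explanation above is to measure, per gold tag, how many test tokens never occur in the training data. A sketch with toy sentences (the real check would pass the notebook's 500-sentence training split and the remaining test split):

```python
def oov_rate_by_tag(train_tagged, test_tagged):
    """Per gold tag, the fraction of test tokens never seen in training."""
    vocab = {w for sent in train_tagged for w, _ in sent}
    seen, unseen = {}, {}
    for sent in test_tagged:
        for w, t in sent:
            bucket = seen if w in vocab else unseen
            bucket[t] = bucket.get(t, 0) + 1
    return {t: unseen.get(t, 0) / float(unseen.get(t, 0) + seen.get(t, 0))
            for t in set(seen) | set(unseen)}

toy_train = [[("the", "DT"), ("dog", "NN"), ("ran", "VBD")]]
toy_test = [[("the", "DT"), ("cat", "NN"), ("Pierre", "NNP")]]
print(sorted(oov_rate_by_tag(toy_train, toy_test).items()))
# [('DT', 0.0), ('NN', 1.0), ('NNP', 1.0)]
```

A high OOV rate for NNP relative to other tags would support the explanation that the 500-sentence training set simply does not cover the test-set names.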
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"PART 2"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myDict = {\"NN\":\"SNN\", \"NNS\":\"SNN\", \"NNP\":\"SNN\", \"NNPS\":\"SNN\", \"PRP\":\"SNN\", \"PRP$\":\"SNN\", \\\n",
" \"VB\":\"SVB\", \"VBP\":\"SVB\", \"VBD\":\"SVB\", \"VBN\":\"SVB\", \"VBZ\":\"SVB\", \"VBG\":\"SVB\", \\\n",
" \"JJ\":\"SJJ\", \"JJR\":\"SJJ\", \"JJS\":\"SJJ\", \\\n",
" \"RB\":\"SRB\", \"RBR\":\"SRB\", \"RBS\":\"SRB\"}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 48
},
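A note on the mapping above: the repeated `in myDict.keys()` / `else "MISC"` branches used in the evaluation functions below can be collapsed with dict.get, which supplies the default for any tag outside the table. A sketch of applying the fine-to-coarse mapping to a whole tagged sentence:

```python
MY_DICT = {"NN": "SNN", "NNS": "SNN", "NNP": "SNN", "NNPS": "SNN",
           "PRP": "SNN", "PRP$": "SNN",
           "VB": "SVB", "VBP": "SVB", "VBD": "SVB", "VBN": "SVB",
           "VBZ": "SVB", "VBG": "SVB",
           "JJ": "SJJ", "JJR": "SJJ", "JJS": "SJJ",
           "RB": "SRB", "RBR": "SRB", "RBS": "SRB"}

def to_coarse(tagged_sent):
    # dict.get supplies "MISC" for any tag outside the mapping.
    return [(w, MY_DICT.get(t, "MISC")) for w, t in tagged_sent]

print(to_coarse([("Pierre", "NNP"), ("will", "MD"), ("join", "VB")]))
# [('Pierre', 'SNN'), ('will', 'MISC'), ('join', 'SVB')]
```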
{
"cell_type": "code",
"collapsed": false,
"input": [
"csvContent=[]\n",
"def myEvaluateCoarse(tagger):\n",
" getcontext().prec = 4\n",
" correct = 0\n",
" incorrect = 0\n",
" id=1\n",
" csvContent.append(['id','Actual Tag','Predicted Tag'])\n",
" for sent, actualSent in zip(test,testData):\n",
" mylist=[]\n",
" predSet=[]\n",
" mylist.append(id)\n",
" id += 1\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] in myDict.keys():\n",
" val = myDict[tagger.tag(nltk.word_tokenize(word))[0][1]]\n",
" else:\n",
" val = \"MISC\"\n",
" if actualword[1] in myDict.keys():\n",
" val2 = myDict[actualword[1]]\n",
" else:\n",
" val2 = \"MISC\"\n",
" \n",
" if val == val2:\n",
" correct += 1\n",
" else:\n",
" incorrect += 1\n",
" predSet.append(tagger.tag(nltk.word_tokenize(word)))\n",
" mylist.append(actualSent)\n",
" mylist.append(predSet)\n",
" csvContent.append(mylist)\n",
" print 'No of correct:' , correct\n",
" print 'No of incorrect:', incorrect\n",
" print 'Accuracy:', (Decimal(correct)/Decimal((correct + incorrect)))*100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 49
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 88086\n",
"No of incorrect: 0\n",
"Accuracy: 100\n"
]
}
],
"prompt_number": 50
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79579\n",
"No of incorrect: 8507\n",
"Accuracy: 90.34\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79579\n",
"No of incorrect: 8507\n",
"Accuracy: 90.34\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluateCoarse(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79579\n",
"No of incorrect: 8507\n",
"Accuracy: 90.34\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Generate TSV file\n",
"import csv\n",
"f = open('part2.tsv','wb') \n",
"fw = csv.writer(f,delimiter='\\t') \n",
"fw.writerows(csvContent) \n",
"f.close() "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 51
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"def myPOSEvaluateCoarse(tagger):\n",
" getcontext().prec = 4\n",
" tags={}\n",
" accuracy=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] in myDict.keys():\n",
" val = myDict[tagger.tag(nltk.word_tokenize(word))[0][1]]\n",
" else:\n",
" val = \"MISC\"\n",
" if actualword[1] in myDict.keys():\n",
" val2 = myDict[actualword[1]]\n",
" else:\n",
" val2 = \"MISC\"\n",
" \n",
" predSent = val\n",
" if(not tags.has_key(predSent)):\n",
" tags[predSent]=[0,0]\n",
" accuracy = tags[predSent]\n",
" if predSent == val2:\n",
" accuracy[0] += 1\n",
" else:\n",
" accuracy[1] += 1\n",
" tags[predSent] = accuracy\n",
" finaltags=[]\n",
" for key in tags.keys():\n",
" correct = tags[key][0]\n",
" incorrect = tags[key][1]\n",
" finaltags.append((Decimal(correct)/Decimal((correct + incorrect)))*100)\n",
"\n",
" DF = pd.DataFrame()\n",
" DF['Tag Name'] = tags.keys()\n",
" DF['Accuracy'] = finaltags\n",
" print DF"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 33.48\n",
"1 SRB 76.94\n",
"2 SVB 67.97\n",
"3 SJJ 59.51\n",
"4 MISC 84.65\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.65\n",
"1 SRB 83.21\n",
"2 SJJ 75.42\n",
"3 MISC 98.30\n",
"4 SVB 84.41\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.65\n",
"1 SRB 83.21\n",
"2 SJJ 75.42\n",
"3 MISC 98.30\n",
"4 SVB 84.41\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluateCoarse(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.65\n",
"1 SRB 83.21\n",
"2 SJJ 75.42\n",
"3 MISC 98.30\n",
"4 SVB 84.41\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Confusion Matrix\n",
"Matrix = [[0 for x in range(5)] for x in range(5)]\n",
"def getConfusionMatrixCoarse(tagger):\n",
" dict={'SNN':0, 'SVB':1, 'SJJ':2, 'SRB':3, 'MISC':4}\n",
" tags={}\n",
" pred=[]\n",
" actual=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] in myDict.keys():\n",
" val = myDict[tagger.tag(nltk.word_tokenize(word))[0][1]]\n",
" else:\n",
" val = \"MISC\"\n",
" if actualword[1] in myDict.keys():\n",
" val2 = myDict[actualword[1]]\n",
" else:\n",
" val2 = \"MISC\"\n",
" \n",
" predSent = val \n",
" \n",
" Matrix[dict[val2]][dict[val]] +=1\n",
" \n",
"getConfusionMatrixCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = ListTable()\n",
"tagList=['SNN', 'SVB', 'SJJ', 'SRB', 'MISC']\n",
"table.append([' '] + tagList)\n",
"for i in xrange(5):\n",
" table.append([tagList[i]] + Matrix[i])\n",
"print 'Confusion Matrix'\n",
"table"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix\n"
]
},
{
"html": [
"<table><tr><td> </td><td>SNN</td><td>SVB</td><td>SJJ</td><td>SRB</td><td>MISC</td></tr><tr><td>SNN</td><td>25046</td><td>1185</td><td>773</td><td>92</td><td>243</td></tr><tr><td>SVB</td><td>1622</td><td>9099</td><td>196</td><td>21</td><td>142</td></tr><tr><td>SJJ</td><td>1171</td><td>451</td><td>3611</td><td>238</td><td>98</td></tr><tr><td>SRB</td><td>129</td><td>33</td><td>172</td><td>2076</td><td>206</td></tr><tr><td>MISC</td><td>1619</td><td>12</td><td>36</td><td>68</td><td>39747</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"[[' ', 'SNN', 'SVB', 'SJJ', 'SRB', 'MISC'],\n",
" ['SNN', 25046, 1185, 773, 92, 243],\n",
" ['SVB', 1622, 9099, 196, 21, 142],\n",
" ['SJJ', 1171, 451, 3611, 238, 98],\n",
" ['SRB', 129, 33, 172, 2076, 206],\n",
" ['MISC', 1619, 12, 36, 68, 39747]]"
]
}
],
"prompt_number": 31
},
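Each row of the matrix above is a gold (actual) tag, so the diagonal entry divided by the row sum gives per-tag recall, a useful complement to the per-predicted-tag figures reported earlier. A small sketch, with the numbers copied from the bigram coarse matrix above:

```python
def per_tag_recall(matrix, labels):
    """Row i = gold tag, column j = predicted tag; diagonal / row sum = recall."""
    recalls = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        recalls[label] = 100.0 * matrix[i][i] / total if total else 0.0
    return recalls

coarse_labels = ["SNN", "SVB", "SJJ", "SRB", "MISC"]
coarse_matrix = [[25046, 1185, 773, 92, 243],
                 [1622, 9099, 196, 21, 142],
                 [1171, 451, 3611, 238, 98],
                 [129, 33, 172, 2076, 206],
                 [1619, 12, 36, 68, 39747]]
recalls = per_tag_recall(coarse_matrix, coarse_labels)
print(sorted((k, round(v, 2)) for k, v in recalls.items()))
```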
{
"cell_type": "code",
"collapsed": false,
"input": [
"tsent = treebank.tagged_sents()\n",
"mytsent=[]\n",
"for sent in tsent:\n",
" mysent=[]\n",
" for word in sent:\n",
" if word[1] in myDict:\n",
" myword=(word[0], myDict[word[1]])\n",
" else:\n",
" myword = (word[0], \"MISC\")\n",
" mysent.append(myword)\n",
" mytsent.append(mysent)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 80
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"trainData = mytsent[:500]\n",
"testData = mytsent[500:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 81
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t = nltk.DefaultTagger('SNN')\n",
"affix = AffixTagger(trainData,backoff=t)\n",
"unigram = UnigramTagger(trainData, backoff = affix)\n",
"bigram = BigramTagger(trainData, backoff=unigram) \n",
"trigram = TrigramTagger(trainData, backoff=bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 82
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csvContentPre=[]\n",
"def myEvaluatePreCoarse(tagger):\n",
" getcontext().prec = 4\n",
" correct = 0\n",
" incorrect = 0\n",
" id=1\n",
" csvContentPre.append(['id','Actual Tag','Predicted Tag'])\n",
" for sent, actualSent in zip(test,testData):\n",
" mylist=[]\n",
" predSet=[]\n",
" mylist.append(id)\n",
" id += 1\n",
" for word,actualword in zip(sent,actualSent):\n",
" if tagger.tag(nltk.word_tokenize(word))[0][1] == actualword[1]:\n",
" correct += 1\n",
" else:\n",
" incorrect += 1\n",
" predSet.append(tagger.tag(nltk.word_tokenize(word)))\n",
" mylist.append(actualSent)\n",
" mylist.append(predSet)\n",
" csvContentPre.append(mylist)\n",
" print 'No of correct:' , correct\n",
" print 'No of incorrect:', incorrect\n",
" print 'Accuracy:', (Decimal(correct)/Decimal((correct + incorrect)))*100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 35213\n",
"No of incorrect: 52873\n",
"Accuracy: 39.98\n"
]
}
],
"prompt_number": 105
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79839\n",
"No of incorrect: 8247\n",
"Accuracy: 90.64\n"
]
}
],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79795\n",
"No of incorrect: 8291\n",
"Accuracy: 90.59\n"
]
}
],
"prompt_number": 107
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myEvaluatePreCoarse(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No of correct: 79795\n",
"No of incorrect: 8291\n",
"Accuracy: 90.59\n"
]
}
],
"prompt_number": 108
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Generate TSV file\n",
"import csv\n",
"f = open('part2b.tsv','wb') \n",
"fw = csv.writer(f,delimiter='\\t') \n",
"fw.writerows(csvContentPre) \n",
"f.close() "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 40
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def myPOSEvaluatePre(tagger):\n",
" getcontext().prec = 4\n",
" tags={}\n",
" accuracy=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1]\n",
" if(not tags.has_key(predSent)):\n",
" tags[predSent]=[0,0]\n",
" accuracy = tags[predSent]\n",
" if predSent == actualword[1]:\n",
" accuracy[0] += 1\n",
" else:\n",
" accuracy[1] += 1\n",
" tags[predSent] = accuracy\n",
" finaltags=[]\n",
" for key in tags.keys():\n",
" correct = tags[key][0]\n",
" incorrect = tags[key][1]\n",
" finaltags.append((Decimal(correct)/Decimal((correct + incorrect)))*100)\n",
"\n",
" DF = pd.DataFrame()\n",
" DF['Tag Name'] = tags.keys()\n",
" DF['Accuracy'] = finaltags\n",
" print DF"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(affix)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 33.62\n",
"1 SRB 76.51\n",
"2 SVB 68.97\n",
"3 SJJ 63.08\n",
"4 MISC 85.24\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(unigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.40\n",
"1 SRB 82.33\n",
"2 SJJ 77.79\n",
"3 MISC 98.36\n",
"4 SVB 86.43\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.38\n",
"1 SRB 82.33\n",
"2 SJJ 77.02\n",
"3 MISC 98.36\n",
"4 SVB 86.43\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"myPOSEvaluatePre(trigram)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" Tag Name Accuracy\n",
"0 SNN 84.38\n",
"1 SRB 82.33\n",
"2 SJJ 77.02\n",
"3 MISC 98.36\n",
"4 SVB 86.43\n",
"\n",
"[5 rows x 2 columns]\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Matrix = [[0 for x in range(5)] for x in range(5)]\n",
"def getConfusionMatrixPreCoarse(tagger):\n",
" dict={'SNN':0, 'SVB':1, 'SJJ':2, 'SRB':3, 'MISC':4}\n",
" tags={}\n",
" pred=[]\n",
" actual=[]\n",
" for sent, actualSent in zip(test,testData):\n",
" for word,actualword in zip(sent,actualSent):\n",
" predSent = tagger.tag(nltk.word_tokenize(word))[0][1] \n",
" Matrix[dict[actualword[1]]][dict[predSent]] +=1\n",
" \n",
"getConfusionMatrixPreCoarse(bigram)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"table = ListTable()\n",
"tagList=['SNN', 'SVB', 'SJJ', 'SRB', 'MISC']\n",
"table.append([' '] + tagList)\n",
"for i in xrange(5):\n",
" table.append([tagList[i]] + Matrix[i])\n",
"print 'Confusion Matrix'\n",
"table"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix\n"
]
},
{
"html": [
"<table><tr><td> </td><td>SNN</td><td>SVB</td><td>SJJ</td><td>SRB</td><td>MISC</td></tr><tr><td>SNN</td><td>25303</td><td>943</td><td>737</td><td>121</td><td>235</td></tr><tr><td>SVB</td><td>1726</td><td>9056</td><td>131</td><td>25</td><td>142</td></tr><tr><td>SJJ</td><td>1204</td><td>455</td><td>3590</td><td>235</td><td>85</td></tr><tr><td>SRB</td><td>129</td><td>18</td><td>167</td><td>2101</td><td>201</td></tr><tr><td>MISC</td><td>1625</td><td>6</td><td>36</td><td>70</td><td>39745</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 47,
"text": [
"[[' ', 'SNN', 'SVB', 'SJJ', 'SRB', 'MISC'],\n",
" ['SNN', 25303, 943, 737, 121, 235],\n",
" ['SVB', 1726, 9056, 131, 25, 142],\n",
" ['SJJ', 1204, 455, 3590, 235, 85],\n",
" ['SRB', 129, 18, 167, 2101, 201],\n",
" ['MISC', 1625, 6, 36, 70, 39745]]"
]
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part II \n",
"\u2022 Did you a priori (before experimentation) expect Method A or B to perform better? Why? There is no correct answer here. This exercise is to test your ability to articulate your intuitions based on what you\u2019ve learnt in class. \n",
"\n",
"Ans: My a priori expectation was that Method B would perform better, since its taggers are trained directly on the coarse categories. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\u2022 Give plausible explanations for the observed differences in overall accuracy between Method A and B. Again there is no correct answer here. The purpose of this exercise is for you to connect concepts we\u2019ve learnt in class to what you observe in practice\n",
"\n",
"Ans: In Method A the fine-grained predictions are simply mapped to coarse categories, so it inherits the fine-grained tagger's behaviour. In Method B, coarse classes that pool tags with few training examples can still lead to incorrect predictions. As a result Method B shows no drastic improvement over Method A, even though in principle it could have performed better, and the two confusion matrices are very similar."
]
}
],
"metadata": {}
}
]
}