alkutnikar/relation_extraction

## relation_extraction
{
	"metadata": {
		"name": "",
		"signature": "sha256:27ba14dd948ce47454e111a60aec9e03aa3ef3d3649f5e52256948d157666cab"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [{
		"cells": [{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"\n"
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"We are building a relation extractor to identify instances of the institution relation in Wikipedia sentences. An institution relation institution(x, y) indicates that a person x studied at the institution y."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"1. Manual Rule-Based Extractor\n",
					"\n",
					"Write a rule-based extractor. Specify five regular expression patterns (word-based) to identify “institution” relations. Use the training set to develop your patterns and report the performance on the test set. \n",
					"\n",
					"e.g., If sentence matches “(.*?)[space]graduate.*[space]in” then output “Yes” for the relation otherwise output “No”."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"TEST_DATA_PATH = \"test.tsv\"\n",
					"TRAIN_DATA_PATH = \"train.tsv\"\n",
					"PATHS_DATA_PATH = \"paths\""
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 4
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####Function to compute the evaluation metrics"
				]
			},
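			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"from __future__ import division\n",
					"\n",
					"def evaluate(tp,fp,tn,fn):\n",
					"    \"\"\"Print precision, recall and F1 for the 'yes' class from the confusion counts.\"\"\"\n",
					"    # Guard against empty denominators (e.g. when no sentence matches any pattern)\n",
					"    precision = tp / (tp + fp) if (tp + fp) else 0.0\n",
					"    recall = tp / (tp + fn) if (tp + fn) else 0.0\n",
					"    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0\n",
					"    print 'Precision = {0}, Recall = {1}, F1 Score = {2}'.format(precision, recall, f1)"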
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 5
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####Regular Expression patterns"
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"import csv\n",
					"import re\n",
					"institution=[]\n",
					"name=[]\n",
					"sentence=[]\n",
					"judgment=[]\n",
					"tp=0\n",
					"tn=0\n",
					"fp=0\n",
					"fn=0\n",
					"\n",
					"# Five case-insensitive keyword patterns; an instance is labelled 'yes' if any of them matches\n",
					"r_list = [re.compile(r'.*graduate.*',re.IGNORECASE),\n",
					"re.compile(r'.*completed.*education.*',re.IGNORECASE),\n",
					"re.compile(r'.*bachelor.*',re.IGNORECASE),\n",
					"re.compile(r'.*attended.*',re.IGNORECASE),\n",
					"re.compile(r'.*student.*',re.IGNORECASE)]\n",
					"\n",
					"\n",
					"with open(TRAIN_DATA_PATH,'r') as f:\n",
					"    next(f) # skip headings\n",
					"    reader=csv.reader(f,delimiter='\\t')\n",
					"\n",
					"    for line in reader:\n",
					"        # line[3] is the intermediate text, line[4] the gold judgment\n",
					"        if any(r.match(line[3]) for r in r_list):\n",
					"\n",
					"            if line[4] == 'yes':\n",
					"                tp += 1\n",
					"            else:\n",
					"                fp += 1\n",
					"        else:\n",
					"\n",
					"            if line[4] == 'no':\n",
					"                tn += 1\n",
					"            else:\n",
					"                fn += 1\n",
					"    print 'For Training Set'\n",
					"    evaluate(tp,fp,tn,fn) # evaluate() prints the metrics itself and returns None"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [{
					"output_type": "stream",
					"stream": "stdout",
					"text": [
						"For Training Set\n",
						"Precision = 0.714383094751, Recall = 0.531440162272, F1 Score = 0.609479499855\n"
					]
				}],
				"prompt_number": 8
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####Measuring performance on Test Set"
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"institution=[]\n",
					"name=[]\n",
					"sentence=[]\n",
					"judgment=[]\n",
					"# Reset the counts before scoring the same patterns (r_list) on the test set\n",
					"tp=0\n",
					"tn=0\n",
					"fp=0\n",
					"fn=0\n",
					"with open(TEST_DATA_PATH,'r') as f:\n",
					"    next(f) # skip headings\n",
					"    reader=csv.reader(f,delimiter='\\t')\n",
					"\n",
					"    for line in reader:\n",
					"        # line[3] is the intermediate text, line[4] the gold judgment\n",
					"        if any(r.match(line[3]) for r in r_list):\n",
					"\n",
					"            if line[4] == 'yes':\n",
					"                tp += 1\n",
					"            else:\n",
					"                fp += 1\n",
					"        else:\n",
					"\n",
					"            if line[4] == 'no':\n",
					"                tn += 1\n",
					"            else:\n",
					"                fn += 1\n",
					"    print 'For Test Set'\n",
					"    evaluate(tp,fp,tn,fn) # evaluate() prints the metrics itself and returns None"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [{
					"output_type": "stream",
					"stream": "stdout",
					"text": [
						"For Test Set\n",
						"Precision = 0.737739872068, Recall = 0.529862174579, F1 Score = 0.616755793226\n"
					]
				}],
				"prompt_number": 10
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"##Supervised Classification"
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"At a high level, the following steps are involved:\n",
					"1) Convert the training and test instances into feature vectors + the label.\n",
					"2) Use a binary classification algorithm such as SVM to learn a classifier on the training data. \n",
					"3) Use the trained classifier to make predictions on the test data.\n",
					"4) Evaluate the predictions."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"1) Bag-of-words – Every word that occurs in the span between the PERSON mention and the INSTITUTION mention is used as a feature in a supervised classification setup. Classifiers typically expect you to assign a unique id to each word in the vocabulary and to represent the text via these ids."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####Generating the feature vectors and the ARFF files for the training and test sets"
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"TEST_DATA_PATH = \"test.tsv\"\n",
					"TRAIN_DATA_PATH = \"train.tsv\"\n",
					"PATHS_DATA_PATH = \"paths\""
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 137
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def parse_data(train_data, test_data):\n",
					"    \"\"\"\n",
					"    Input: paths to the training and test data files\n",
					"    Output: (1) a list of tuples, one for each instance of the data, and\n",
					"            (2) a list of all unique tokens in the data\n",
					"\n",
					"    Parses the data file to extract all instances of the data as tuples of the form:\n",
					"    (person, institution, judgment, full snippet, intermediate text)\n",
					"    where the intermediate text is all tokens that occur between the first occurrence of\n",
					"    the person and the first occurrence of the institution.\n",
					"\n",
					"    Also extracts a list of all tokens that appear in the intermediate text for the\n",
					"    purpose of creating feature vectors.\n",
					"    \"\"\"\n",
					"    all_tokens = []\n",
					"    data = []\n",
					"    for fp in [train_data, test_data]:\n",
					"        with open(fp) as f:\n",
					"            for line in f:\n",
					"                institution, person, snippet, intermediate_text, judgment = line.split(\"\\t\")\n",
					"                judgment = judgment.strip()\n",
					"\n",
					"                # Build up a list of unique tokens that occur in the intermediate text\n",
					"                # This is needed to create BOW feature vectors\n",
					"                tokens = intermediate_text.split()\n",
					"                for t in tokens:\n",
					"                    t = re.sub('[^A-Za-z0-9]+', '', t)\n",
					"                    if t not in all_tokens:\n",
					"                        all_tokens.append(t)\n",
					"                data.append((person, institution, judgment, snippet, intermediate_text))\n",
					"    return data, all_tokens"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 10
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def create_feature_vectors(data, all_tokens):\n",
					"    \"\"\"\n",
					"    Input: (1) The parsed data from parse_data()\n",
					"             (2) a list of all unique tokens found in the intermediate text\n",
					"    Output: A list of lists representing the feature vectors for each data instance\n",
					"\n",
					"    Creates feature vectors from the parsed data file. These features include\n",
					"    bag of words features representing the number of occurrences of each\n",
					"    token in the intermediate text (text that comes between the first occurrence\n",
					"    of the person and the first occurrence of the institution).\n",
					"    This is also where any additional user-defined features can be added.\n",
					"    \"\"\"\n",
					"    # Map each token to its column once; list.index() would rescan the vocabulary per token\n",
					"    token_index = dict((t, i) for i, t in enumerate(all_tokens))\n",
					"    feature_vectors = []\n",
					"    for instance in data:\n",
					"        # BOW features\n",
					"        # Gets the number of occurrences of each token\n",
					"        # in the intermediate text\n",
					"        feature_vector = [0 for t in all_tokens]\n",
					"        intermediate_text = instance[4]\n",
					"        for token in intermediate_text.split():\n",
					"            # Normalize the token the same way parse_data() does before looking it up;\n",
					"            # otherwise punctuated tokens miss the vocabulary and the instance was silently dropped\n",
					"            token = re.sub('[^A-Za-z0-9]+', '', token)\n",
					"            index = token_index.get(token)\n",
					"            if index is not None:\n",
					"                feature_vector[index] += 1\n",
					"\n",
					"        ### ADD ADDITIONAL FEATURES HERE ###\n",
					"\n",
					"        # Class label\n",
					"        judgment = instance[2]\n",
					"        feature_vector.append(judgment)\n",
					"\n",
					"        feature_vectors.append(feature_vector)\n",
					"    return feature_vectors"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 44
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def generate_arff_file(feature_vectors, all_tokens, out_path):\n",
					"    \"\"\"\n",
					"    Input: (1) A list of all feature vectors for the data\n",
					"             (2) A list of all unique tokens that occurred in the intermediate text\n",
					"             (3) The name and path of the ARFF file to be output\n",
					"    Output: an ARFF file output to the location specified in out_path\n",
					"\n",
					"    Converts a list of feature vectors to an ARFF file for use with Weka.\n",
					"    \"\"\"\n",
					"    with open(out_path, 'w') as f:\n",
					"        # Header info\n",
					"        f.write(\"@RELATION institutions\\n\")\n",
					"        for i in range(len(all_tokens)):\n",
					"            f.write(\"@ATTRIBUTE token_{} INTEGER\\n\".format(i))\n",
					"\n",
					"        ### SPECIFY ADDITIONAL FEATURES HERE ###\n",
					"        # For example: f.write(\"@ATTRIBUTE custom_1 REAL\\n\")\n",
					"\n",
					"        # Classes\n",
					"        f.write(\"@ATTRIBUTE class {yes,no}\\n\")\n",
					"\n",
					"        # Data instances\n",
					"        f.write(\"\\n@DATA\\n\")\n",
					"        for fv in feature_vectors:\n",
					"            features = []\n",
					"            for i in range(len(fv)):\n",
					"                value = fv[i]\n",
					"                if value != 0:\n",
					"                    features.append(\"{} {}\".format(i, value))\n",
					"            entry = \",\".join(features)\n",
					"            f.write(\"{\" + entry + \"}\\n\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 45
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
					"feature_vectors = create_feature_vectors(data, all_tokens)\n",
					"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_cluster.arff\")\n",
					"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_cluster.arff\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 46
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"##Classifier"
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"######The following tests were conducted by training a LibLINEAR model on the training ARFF file generated above. I re-evaluated the model on the test ARFF file for various values of C, as shown below."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"Performance for c = 0.001\n",
					"\n",
					"Correctly Classified Instances         719               71.9    %\n",
					"Incorrectly Classified Instances       281               28.1    %\n",
					"\n",
					"\n",
					"Performance for c = 0.01\n",
					"\n",
					"Correctly Classified Instances         747               74.7    %\n",
					"Incorrectly Classified Instances       253               25.3    %\n",
					"\n",
					"Performance for c = 0.1\n",
					"\n",
					"Correctly Classified Instances         735               73.5    %\n",
					"Incorrectly Classified Instances       265               26.5    %\n",
					"\n",
					"Performance for c = 1.0\n",
					"\n",
					"Correctly Classified Instances         703               70.3    %\n",
					"Incorrectly Classified Instances       297               29.7    %\n",
					"\n",
					"Performance for c = 10\n",
					"\n",
					"Correctly Classified Instances         670               67      %\n",
					"Incorrectly Classified Instances       330               33      %\n",
					"\n",
					"Performance for c = 100\n",
					"\n",
					"Correctly Classified Instances         635               63.5    %\n",
					"Incorrectly Classified Instances       365               36.5    %"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"We observe that the best test accuracy (74.7%) is obtained for c = 0.01; accuracy falls steadily as c increases beyond that, down to 63.5% at c = 100."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####We will use the training ARFF file to train a classifier in Weka with c = 0.01"
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"=== Re-evaluation on test set ===\n",
					"\n",
					"User supplied test set\n",
					"Relation:     institutions\n",
					"Instances:     unknown (yet). Reading incrementally\n",
					"Attributes:   28059\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances         747               74.7    %\n",
					"Incorrectly Classified Instances       253               25.3    %\n",
					"Kappa statistic                          0.3826\n",
					"Mean absolute error                      0.253 \n",
					"Root mean squared error                  0.503 \n",
					"Coverage of cases (0.95 level)          74.7    %\n",
					"Total Number of Instances             1000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.913    0.566    0.753      0.913    0.825      0.407    0.673     0.744     yes\n",
					"                 0.434    0.087    0.725      0.434    0.542      0.407    0.673     0.510     no\n",
					"Weighted Avg.    0.747    0.401    0.743      0.747    0.727      0.407    0.673     0.663     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"   a   b   <-- classified as\n",
					" 597  57 |   a = yes\n",
					" 196 150 |   b = no"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"from nltk.tokenize import word_tokenize\n",
					"def parse_data(train_data, test_data):\n",
					"    \"\"\"\n",
					"    Input: paths to the training and test data files\n",
					"    Output: (1) a list of tuples, one for each instance of the data, and\n",
					"            (2) a list of all unique tokens in the data\n",
					"\n",
					"    Parses the data file to extract all instances of the data as tuples of the form:\n",
					"    (person, institution, judgment, full snippet, intermediate text)\n",
					"    where the intermediate text is all tokens that occur between the first occurrence of\n",
					"    the person and the first occurrence of the institution.\n",
					"\n",
					"    Also extracts a list of all tokens that appear in the intermediate text for the\n",
					"    purpose of creating feature vectors.\n",
					"    \"\"\"\n",
					"    all_tokens = []\n",
					"    data = []\n",
					"    for fp in [train_data, test_data]:\n",
					"        with open(fp) as f:\n",
					"            for line in f:\n",
					"                institution, person, snippet, intermediate_text, judgment = line.split(\"\\t\")\n",
					"                judgment = judgment.strip()\n",
					"\n",
					"                # Build up a list of unique tokens that occur in the intermediate text\n",
					"                # This is needed to create BOW feature vectors\n",
					"              \n",
					"                tokens = word_tokenize(intermediate_text.decode('utf8'))\n",
					"             \n",
					"                for t in tokens:\n",
					"                    #t = re.sub('[^A-Za-z0-9]+', '', t)\n",
					"                    if t not in all_tokens:\n",
					"                        all_tokens.append(t)\n",
					"                data.append((person, institution, judgment, snippet, intermediate_text))\n",
					"    return data, all_tokens"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 61
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"\n",
					"import unicodedata\n",
					"def create_feature_vectors_tokenized(data, all_tokens):\n",
					"    \"\"\"\n",
					"    Input: (1) The parsed data from parse_data()\n",
					"             (2) a list of all unique tokens found in the intermediate text\n",
					"    Output: A list of lists representing the feature vectors for each data instance\n",
					"\n",
					"    Creates feature vectors from the parsed data file. These features include\n",
					"    bag of words features representing the number of occurrences of each\n",
					"    token in the intermediate text (text that comes between the first occurrence\n",
					"    of the person and the first occurrence of the institution).\n",
					"    This is also where any additional user-defined features can be added.\n",
					"    \"\"\"\n",
					"    # Map each token to its column once instead of calling list.index() per token\n",
					"    token_index = dict((t, i) for i, t in enumerate(all_tokens))\n",
					"    feature_vectors = []\n",
					"    for instance in data:\n",
					"        feature_vector = [0 for t in all_tokens]\n",
					"\n",
					"        # BOW features\n",
					"        # Gets the number of occurrences of each token\n",
					"        # in the intermediate text\n",
					"        intermediate_text = instance[4]\n",
					"        tokens = word_tokenize(intermediate_text.decode('utf8'))\n",
					"        for token in tokens:\n",
					"            feature_vector[token_index[token]] += 1\n",
					"\n",
					"        ### ADD ADDITIONAL FEATURES HERE ###\n",
					"\n",
					"        # Class label\n",
					"        judgment = instance[2]\n",
					"        feature_vector.append(judgment)\n",
					"\n",
					"        feature_vectors.append(feature_vector)\n",
					"\n",
					"    return feature_vectors"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 62
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
					"feature_vectors = create_feature_vectors_tokenized(data, all_tokens)\n",
					"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_tokenize.arff\")\n",
					"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_tokenize.arff\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 63
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####Evaluating the LibLINEAR model on nltk.tokenize features with c = 0.01"
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"=== Re-evaluation on test set ===\n",
					"\n",
					"User supplied test set\n",
					"Relation:     institutions\n",
					"Instances:     unknown (yet). Reading incrementally\n",
					"Attributes:   21945\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances         727               72.7    %\n",
					"Incorrectly Classified Instances       273               27.3    %\n",
					"Kappa statistic                          0.3225\n",
					"Mean absolute error                      0.273 \n",
					"Root mean squared error                  0.5225\n",
					"Coverage of cases (0.95 level)          72.7    %\n",
					"Total Number of Instances             1000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.914    0.627    0.734      0.914    0.814      0.352    0.644     0.727     yes\n",
					"                 0.373    0.086    0.697      0.373    0.486      0.352    0.644     0.477     no\n",
					"Weighted Avg.    0.727    0.440    0.721      0.727    0.701      0.352    0.644     0.640     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"   a   b   <-- classified as\n",
					" 598  56 |   a = yes\n",
					" 217 129 |   b = no"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"#####2) Clustering Features – Using words directly as features often leads to sparsity. A standard technique to overcome sparsity is to use a generalized representation of words via clustering. The idea is to first cluster words based on their usage and then represent each word by the cluster to which it belongs.\n",
					"\n",
					"Brown clustering is a widely used method for clustering words. You can use Percy Liang’s code to generate hierarchical word clusters. The algorithm takes a text file as input, along with a cluster parameter (c) that controls the number of clusters. The output includes a paths file that contains the cluster ID of each word (a path in the hierarchical binary clustering tree)."
				]
			},
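			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"import csv\n",
					"import re\n",
					"xmap={}        # cleaned word -> Brown cluster ID (bit-string path)\n",
					"keymap={}      # cluster ID -> a cleaned word from that cluster (the last one read)\n",
					"convertmap={}  # cluster ID -> integer attribute index\n",
					"with open(PATHS_DATA_PATH,'r') as f:\n",
					"    # Each row of the paths file is: cluster path, word, count\n",
					"    for row in csv.reader(f,delimiter='\\t'):\n",
					"        word = re.sub('[^A-Za-z0-9]+', '', row[1])\n",
					"        xmap[word] = row[0]\n",
					"        keymap[row[0]] = word\n",
					"\n",
					"# Number the clusters in sorted order so the attribute indices are reproducible\n",
					"for val, key in enumerate(sorted(keymap)):\n",
					"    convertmap[key] = val\n"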
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 94
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def parse_data(train_data, test_data):\n",
					"    \"\"\"\n",
					"    Input: paths to the training and test data files\n",
					"    Output: (1) a list of tuples, one for each instance of the data, and\n",
					"            (2) a list of all unique tokens in the data\n",
					"\n",
					"    Parses the data file to extract all instances of the data as tuples of the form:\n",
					"    (person, institution, judgment, full snippet, intermediate text)\n",
					"    where the intermediate text is all tokens that occur between the first occurrence of\n",
					"    the person and the first occurrence of the institution.\n",
					"\n",
					"    Also extracts a list of all tokens that appear in the intermediate text for the\n",
					"    purpose of creating feature vectors.\n",
					"    \"\"\"\n",
					"    all_tokens = []\n",
					"    data = []\n",
					"    for fp in [train_data, test_data]:\n",
					"        with open(fp) as f:\n",
					"            for line in f:\n",
					"                institution, person, snippet, intermediate_text, judgment = line.split(\"\\t\")\n",
					"                judgment = judgment.strip()\n",
					"\n",
					"                # Build up a list of unique tokens that occur in the intermediate text\n",
					"                # This is needed to create BOW feature vectors\n",
					"              \n",
					"                tokens = intermediate_text.split()\n",
					"             \n",
					"                for t in tokens:\n",
					"                    t = re.sub('[^A-Za-z0-9]+', '', t)\n",
					"                    if t not in all_tokens:\n",
					"                        all_tokens.append(t)\n",
					"                data.append((person, institution, judgment, snippet, intermediate_text))\n",
					"    return data, all_tokens"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 132
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def create_feature_vectors_cluster(data, all_tokens):\n",
					"    \"\"\"\n",
					"    Input: (1) The parsed data from parse_data()\n",
					"             (2) a list of all unique tokens found in the intermediate text\n",
					"    Output: A list of lists representing the feature vectors for each data instance\n",
					"\n",
					"    Creates feature vectors from the parsed data file. Each feature is the\n",
					"    number of tokens in the intermediate text (text that comes between the first\n",
					"    occurrence of the person and the first occurrence of the institution) that\n",
					"    belong to a given Brown cluster, keyed by cluster ID.\n",
					"    This is also where any additional user-defined features can be added.\n",
					"    \"\"\"\n",
					"    feature_vectors = []\n",
					"    for instance in data:\n",
					"        # Cluster features\n",
					"        # Counts how many tokens of the intermediate text\n",
					"        # fall into each Brown cluster\n",
					"        feature_vector = {}\n",
					"        intermediate_text = instance[4]\n",
					"        tokens = intermediate_text.split()\n",
					"        for token in tokens:\n",
					"            token = re.sub('[^A-Za-z0-9]+', '', token)\n",
					"            cluster = xmap.get(token)\n",
					"            if cluster is None:\n",
					"                continue  # token does not appear in the paths file\n",
					"            feature_vector[cluster] = feature_vector.get(cluster, 0) + 1\n",
					"\n",
					"        ### ADD ADDITIONAL FEATURES HERE ###\n",
					"\n",
					"        # Class label\n",
					"        judgment = instance[2]\n",
					"        feature_vectors.append([feature_vector, judgment])\n",
					"    return feature_vectors"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 96
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"import collections\n",
					"from operator import itemgetter\n",
					"def generate_arff_file(feature_vectors, all_tokens, out_path):\n",
					"    \"\"\"\n",
					"    Input: (1) A list of all feature vectors for the data\n",
					"             (2) A list of all unique tokens that occurred in the intermediate text\n",
					"             (3) The name and path of the ARFF file to be output\n",
					"    Output: an ARFF file output to the location specified in out_path\n",
					"\n",
					"    Converts a list of feature vectors to an ARFF file for use with Weka.\n",
					"    \"\"\"\n",
					"    # One attribute per Brown cluster; the class attribute comes right after them\n",
					"    num_clusters = len(keymap)\n",
					"    with open(out_path, 'w') as f:\n",
					"        # Header info\n",
					"        f.write(\"@RELATION institutions\\n\")\n",
					"\n",
					"        for i in range(num_clusters):\n",
					"            f.write(\"@ATTRIBUTE token_{} INTEGER\\n\".format(i))\n",
					"\n",
					"        ### SPECIFY ADDITIONAL FEATURES HERE ###\n",
					"        # For example: f.write(\"@ATTRIBUTE custom_1 REAL\\n\")\n",
					"\n",
					"        # Classes\n",
					"        f.write(\"@ATTRIBUTE class {yes,no}\\n\")\n",
					"\n",
					"        # Data instances\n",
					"        f.write(\"\\n@DATA\\n\")\n",
					"\n",
					"        for cluster_counts, judgment in feature_vectors:\n",
					"            # Sort by numeric attribute index, as sparse ARFF expects ascending indices\n",
					"            indexed = sorted((convertmap[c], n) for c, n in cluster_counts.items() if n != 0)\n",
					"            features = [\"{} {}\".format(i, n) for i, n in indexed]\n",
					"            # Use the real cluster count as the class index instead of a hard-coded 50\n",
					"            features.append(\"{} {}\".format(num_clusters, judgment))\n",
					"            entry = \",\".join(features)\n",
					"            f.write(\"{\" + entry + \"}\\n\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 97
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
					"feature_vectors = create_feature_vectors_cluster(data, all_tokens)\n",
					"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_cluster.arff\")\n",
					"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_cluster.arff\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 98
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"=== Run information ===\n",
					"\n",
					"Scheme:       weka.classifiers.functions.LibLINEAR -S 1 -C 0.01 -E 0.001 -B 1.0\n",
					"Relation:     institutions\n",
					"Instances:    6000\n",
					"Attributes:   51\n",
					"              token_0\n",
					"              token_1\n",
					"              token_2\n",
					"              token_3\n",
					"              token_4\n",
					"              token_5\n",
					"              token_6\n",
					"              token_7\n",
					"              token_8\n",
					"              token_9\n",
					"              token_10\n",
					"              token_11\n",
					"              token_12\n",
					"              token_13\n",
					"              token_14\n",
					"              token_15\n",
					"              token_16\n",
					"              token_17\n",
					"              token_18\n",
					"              token_19\n",
					"              token_20\n",
					"              token_21\n",
					"              token_22\n",
					"              token_23\n",
					"              token_24\n",
					"              token_25\n",
					"              token_26\n",
					"              token_27\n",
					"              token_28\n",
					"              token_29\n",
					"              token_30\n",
					"              token_31\n",
					"              token_32\n",
					"              token_33\n",
					"              token_34\n",
					"              token_35\n",
					"              token_36\n",
					"              token_37\n",
					"              token_38\n",
					"              token_39\n",
					"              token_40\n",
					"              token_41\n",
					"              token_42\n",
					"              token_43\n",
					"              token_44\n",
					"              token_45\n",
					"              token_46\n",
					"              token_47\n",
					"              token_48\n",
					"              token_49\n",
					"              class\n",
					"Test mode:    evaluate on training data\n",
					"\n",
					"=== Classifier model (full training set) ===\n",
					"\n",
					"LibLINEAR wrapper\n",
					"\n",
					"Time taken to build model: 1.2 seconds\n",
					"\n",
					"=== Evaluation on training set ===\n",
					"\n",
					"Time taken to test model on training data: 0.16 seconds\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances        3977               66.2833 %\n",
					"Incorrectly Classified Instances      2023               33.7167 %\n",
					"Kappa statistic                          0.0569\n",
					"Mean absolute error                      0.3372\n",
					"Root mean squared error                  0.5807\n",
					"Relative absolute error                 74.8587 %\n",
					"Root relative squared error            122.3613 %\n",
					"Coverage of cases (0.95 level)          66.2833 %\n",
					"Mean rel. region size (0.95 level)      50      %\n",
					"Total Number of Instances             6000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.968    0.923    0.668      0.968    0.791      0.101    0.523     0.668     yes\n",
					"                 0.077    0.032    0.556      0.077    0.136      0.101    0.523     0.359     no\n",
					"Weighted Avg.    0.663    0.618    0.630      0.663    0.566      0.101    0.523     0.562     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"    a    b   <-- classified as\n",
					" 3818  127 |    a = yes\n",
					" 1896  159 |    b = no\n",
					"\n",
					"\n",
					"=== Re-evaluation on test set ===\n",
					"\n",
					"User supplied test set\n",
					"Relation:     institutions\n",
					"Instances:     unknown (yet). Reading incrementally\n",
					"Attributes:   51\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances         660               66      %\n",
					"Incorrectly Classified Instances       340               34      %\n",
					"Kappa statistic                          0.056 \n",
					"Mean absolute error                      0.34  \n",
					"Root mean squared error                  0.5831\n",
					"Coverage of cases (0.95 level)          66      %\n",
					"Total Number of Instances             1000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.969    0.925    0.665      0.969    0.789      0.101    0.522     0.664     yes\n",
					"                 0.075    0.031    0.565      0.075    0.133      0.101    0.522     0.362     no\n",
					"Weighted Avg.    0.660    0.615    0.630      0.660    0.562      0.101    0.522     0.560     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"   a   b   <-- classified as\n",
					" 634  20 |   a = yes\n",
					" 320  26 |   b = no\n",
					"\n"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
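			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"import csv\n",
					"import re\n",
					"\n",
					"# Brown cluster file: column 0 holds the cluster bit path, column 1 the word\n",
					"xmap = {}        # normalized word -> cluster path\n",
					"keymap = {}      # cluster path -> a normalized word from that cluster\n",
					"convertmap = {}  # cluster path -> integer attribute index\n",
					"with open('paths10000.txt', 'r') as f:\n",
					"    for row in csv.reader(f, delimiter='\\t'):\n",
					"        word = re.sub('[^A-Za-z0-9]+', '', row[1])\n",
					"        xmap[word] = row[0]\n",
					"        keymap[row[0]] = word\n",
					"\n",
					"# Sort the cluster paths so the attribute indices are the same on every run\n",
					"for val, key in enumerate(sorted(keymap)):\n",
					"    convertmap[key] = val\n"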
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 99
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def create_feature_vectors_cluster10k(data, all_tokens):\n",
					"    \"\"\"\n",
					"    Input: (1) The parsed data from parse_data()\n",
					"             (2) a list of all unique tokens found in the intermediate text\n",
					"    Output: A list of [feature_vector, judgment] pairs, one per data instance,\n",
					"            where feature_vector maps a cluster path to its number of occurrences\n",
					"\n",
					"    Creates feature vectors from the parsed data file. Each token in the\n",
					"    intermediate text (text that comes between the first occurrence\n",
					"    of the person and the first occurrence of the institution) is mapped\n",
					"    to its cluster via xmap, and occurrences are counted per cluster.\n",
					"    This is also where any additional user-defined features can be added.\n",
					"    \"\"\"\n",
					"    feature_vectors = []\n",
					"    for instance in data:\n",
					"        # Cluster features\n",
					"        # Gets the number of occurrences of each cluster\n",
					"        # in the intermediate text\n",
					"        feature_vector = {}\n",
					"        intermediate_text = instance[4]\n",
					"        for token in intermediate_text.split():\n",
					"            token = re.sub('[^A-Za-z0-9]+', '', token)\n",
					"            if token not in xmap:\n",
					"                continue  # token does not appear in the cluster file\n",
					"            cluster = xmap[token]\n",
					"            feature_vector[cluster] = feature_vector.get(cluster, 0) + 1\n",
					"\n",
					"        ### ADD ADDITIONAL FEATURES HERE ###\n",
					"\n",
					"        # Class label\n",
					"        judgment = instance[2]\n",
					"        feature_vectors.append([feature_vector, judgment])\n",
					"    return feature_vectors"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 106
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"def generate_arff_file(feature_vectors, all_tokens, out_path):\n",
					"    \"\"\"\n",
					"    Input: (1) A list of all feature vectors for the data\n",
					"             (2) A list of all unique tokens that occurred in the intermediate text\n",
					"             (3) The name and path of the ARFF file to be output\n",
					"    Output: an ARFF file output to the location specified in out_path\n",
					"\n",
					"    Converts a list of feature vectors to a sparse ARFF file for use with Weka.\n",
					"    There is one integer attribute per cluster path in keymap.\n",
					"    \"\"\"\n",
					"    num_clusters = len(keymap)\n",
					"    with open(out_path, 'w') as f:\n",
					"        # Header info\n",
					"        f.write(\"@RELATION institutions\\n\")\n",
					"\n",
					"        for i in range(num_clusters):\n",
					"            f.write(\"@ATTRIBUTE token_{} INTEGER\\n\".format(i))\n",
					"\n",
					"        ### SPECIFY ADDITIONAL FEATURES HERE ###\n",
					"        # For example: f.write(\"@ATTRIBUTE custom_1 REAL\\n\")\n",
					"\n",
					"        # Classes\n",
					"        f.write(\"@ATTRIBUTE class {yes,no}\\n\")\n",
					"\n",
					"        # Data instances\n",
					"        f.write(\"\\n@DATA\\n\")\n",
					"\n",
					"        for fv in feature_vectors:\n",
					"            # Sparse rows list \"index value\" pairs in increasing index order\n",
					"            pairs = sorted((convertmap[path], count)\n",
					"                           for path, count in fv[0].items() if count != 0)\n",
					"            features = [\"{} {}\".format(idx, count) for idx, count in pairs]\n",
					"            # The class attribute comes right after the cluster attributes\n",
					"            features.append(\"{} {}\".format(num_clusters, fv[1]))\n",
					"            entry = \",\".join(features)\n",
					"            f.write(\"{\" + entry + \"}\\n\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 109
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
					"feature_vectors = create_feature_vectors_cluster10k(data, all_tokens)\n",
					"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_cluster10000.arff\")\n",
					"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_cluster10000.arff\")"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				],
				"prompt_number": 110
			},
			{
				"cell_type": "raw",
				"metadata": {

				},
				"source": [
					"=== Re-evaluation on test set ===\n",
					"\n",
					"User supplied test set\n",
					"Relation:     institutions\n",
					"Instances:     unknown (yet). Reading incrementally\n",
					"Attributes:   10\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances         654               65.4    %\n",
					"Incorrectly Classified Instances       346               34.6    %\n",
					"Kappa statistic                          0.0123\n",
					"Mean absolute error                      0.346 \n",
					"Root mean squared error                  0.5882\n",
					"Coverage of cases (0.95 level)          65.4    %\n",
					"Total Number of Instances             1000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.989    0.980    0.656      0.989    0.789      0.039    0.505     0.656     yes\n",
					"                 0.020    0.011    0.500      0.020    0.039      0.039    0.505     0.349     no\n",
					"Weighted Avg.    0.654    0.644    0.602      0.654    0.529      0.039    0.505     0.550     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"   a   b   <-- classified as\n",
					" 647   7 |   a = yes\n",
					" 339   7 |   b = no\n",
					"\n"
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"###3) Dependency Features "
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####I am making use of the Stanford Dependency Parser for this assignment.\n",
					"\n",
					"####Version : stanford-parser-full-2015-01-30"
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"The parser takes the complete text as input, so the first step is preprocessing: each paragraph is split into its constituent sentences."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"###Before forming the final features, I run all the sentences through the Stanford dependency parser with the shell command: ./lexparser.sh file_name. This produces output in the format below."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"(ROOT\n",
					"  (S\n",
					"    (NP (NNP David) (NNP Maxwell))\n",
					"    (VP (VBD was)\n",
					"      (ADJP (VBN educated)\n",
					"        (PP\n",
					"          (PP (IN at)\n",
					"            (NP\n",
					"              (NP (NNP Eton) (NNP College))\n",
					"              (, ,)\n",
					"              (SBAR\n",
					"                (WHADVP (WRB where))\n",
					"                (S\n",
					"                  (NP (PRP he))\n",
					"                  (VP (VBD was)\n",
					"                    (NP\n",
					"                      (NP\n",
					"                        (NP (DT a) (NNP King) (POS 's))\n",
					"                        (NNS Scholar)\n",
					"                        (CC and)\n",
					"                        (NNS Captain))\n",
					"                      (PP (IN of)\n",
					"                        (NP (NNP Boats)))))))))\n",
					"          (, ,)\n",
					"          (CC and)\n",
					"          (PP (IN at)\n",
					"            (NP (NNP Cambridge) (NNP University)))))\n",
					"      (SBAR\n",
					"        (WHADVP (WRB where))\n",
					"        (S\n",
					"          (NP (PRP he))\n",
					"          (VP (VBD rowed)\n",
					"            (PP (IN in)\n",
					"              (NP\n",
					"                (NP (DT the) (VBG winning) (NNP Cambridge) (NN boat))\n",
					"                (PP (IN in)\n",
					"                  (NP (DT the) (CD 1971)\n",
					"                    (CC and)\n",
					"                    (CD 1972) (NNP Boat) (NN Races)))))))))\n",
					"    (. .)))\n",
					"\n",
					"nn(Maxwell-2, David-1)\n",
					"nsubjpass(educated-4, Maxwell-2)\n",
					"auxpass(educated-4, was-3)\n",
					"root(ROOT-0, educated-4)\n",
					"nn(College-7, Eton-6)\n",
					"prep_at(educated-4, College-7)\n",
					"advmod(Scholar-15, where-9)\n",
					"nsubj(Scholar-15, he-10)\n",
					"cop(Scholar-15, was-11)\n",
					"det(King-13, a-12)\n",
					"poss(Scholar-15, King-13)\n",
					"rcmod(College-7, Scholar-15)\n",
					"rcmod(College-7, Captain-17)\n",
					"conj_and(Scholar-15, Captain-17)\n",
					"prep_of(Scholar-15, Boats-19)\n",
					"nn(University-24, Cambridge-23)\n",
					"prep_at(educated-4, University-24)\n",
					"conj_and(College-7, University-24)\n",
					"advmod(rowed-27, where-25)\n",
					"nsubj(rowed-27, he-26)\n",
					"advcl(educated-4, rowed-27)\n",
					"det(boat-32, the-29)\n",
					"amod(boat-32, winning-30)\n",
					"nn(boat-32, Cambridge-31)\n",
					"prep_in(rowed-27, boat-32)\n",
					"det(1971-35, the-34)\n",
					"prep_in(boat-32, 1971-35)\n",
					"num(Races-39, 1972-37)\n",
					"nn(Races-39, Boat-38)\n",
					"prep_in(boat-32, Races-39)\n",
					"conj_and(1971-35, Races-39)\n",
					"\n",
					"(ROOT\n",
					"  (NP\n",
					"    (NP ($ $)\n",
					"      (QP ($ $) (CD $) (CD $)))\n",
					"    (ADJP ($ $)\n",
					"      (QP ($ $) (CD $)))\n",
					"    (. .)))\n",
					"\n",
					"root(ROOT-0, $-1)\n",
					"num($-1, $-2)\n",
					"number($-4, $-3)\n",
					"num($-2, $-4)\n",
					"amod($-1, $-5)\n",
					"dep($-5, $-6)\n",
					"num($-6, $-7)"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"Each instance is delimited by 6 $ symbols, and the final training output file produced by the parser is 29.5 MB in size."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"###For the dependency features I have also made use of the Jython wrapper provided by Viktor Pekar."
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"##Feature 1 - Number of Verbs between the Person and the Institution\n",
					"\n",
					"This feature first collects all the nodes on the dependency path from the Person to the Institution, which runs through their common node. It then counts the number of verbs on this path. I observed that the more verbs there are, the greater the chance that there is no relation between the two."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"#Copied from Jython file\n",
					"\n",
					"def get_num_common_verb(self, node_i_idx, node_j_idx):\n",
					"    # Paths from each node up to the root of the dependency tree\n",
					"    path1 = self.path2root(node_i_idx)\n",
					"    path2 = self.path2root(node_j_idx)\n",
					"    count = 0\n",
					"\n",
					"    # First node on path1 that also lies on path2\n",
					"    common_node = None\n",
					"    for idx_i in path1:\n",
					"        if idx_i in path2:\n",
					"            common_node = idx_i\n",
					"            break\n",
					"\n",
					"    if common_node is not None:\n",
					"        # Count verb tags (VB, VBD, ...) from node i up to and including the common node\n",
					"        for idx_i in path1:\n",
					"            # [:1] is empty for a missing tag, so untagged nodes are skipped\n",
					"            if self.tag.get(idx_i, '')[:1] in ('V', 'v'):\n",
					"                count += 1\n",
					"            if idx_i == common_node:\n",
					"                break\n",
					"        # Same for node j. The tag check comes before the break, so a verb\n",
					"        # at the common node is counted here as well as in the loop above\n",
					"        for idx_i in path2:\n",
					"            if self.tag.get(idx_i, '')[:1] in ('V', 'v'):\n",
					"                count += 1\n",
					"            if idx_i == common_node:\n",
					"                break\n",
					"    return count"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"##Feature 2 - Whether the person is the subject of the sentence\n",
					"\n",
					"This feature checks whether the person is the subject of the sentence. The Jython Stanford wrapper exposes the dependency relation of each token; the person counts as the subject when that relation is 'nsubj' or 'nsubjpass'."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"#Copied from Jython file\n",
					"\n",
					"def is_a_nsubjpass(self, node_idx):\n",
					"    # 'yes' if the token's dependency relation marks it as an active or passive subject\n",
					"    if self.rel[node_idx] in ('nsubjpass', 'nsubj'):\n",
					"        return 'yes'\n",
					"    else:\n",
					"        return 'no'"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"##Feature 3 - Path length from the Person and the Institution to their Common Node (usually a verb)\n",
					"\n",
					"This feature again follows the paths of both the Person and the Institution, this time up to their common node, and uses the length of the combined path. I observed that the common node is usually a verb when there is a direct relation between the person and the institution. In cases where this failed, the person was usually not the subject of the sentence."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"#Copied from Jython File\n",
					"def get_least_common_node_length(self, node_i_idx, node_j_idx):\n",
					"        \n",
					"\n",
					"        common_node = None\n",
					"        shortest_path = []\n",
					"        path1 = self.path2root(node_i_idx)\n",
					"        path2 = self.path2root(node_j_idx)\n",
					"\n",
					"        for idx_i in path1:\n",
					"            if common_node != None:\n",
					"                break\n",
					"            for idx_j in path2:\n",
					"                if idx_i == idx_j:\n",
					"                    common_node = idx_i\n",
					"                    break\n",
					"\n",
					"        if common_node != None:\n",
					"            for idx_i in path1:\n",
					"                shortest_path.append(idx_i)\n",
					"                if idx_i == common_node:\n",
					"                    break\n",
					"            for idx_i in path2:\n",
					"                if idx_i == common_node:\n",
					"                    break\n",
					"                shortest_path.append(idx_i)\n",
					"\n",
					"        return len(shortest_path)"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"###Few Other Important Code Snippets used:\n",
					"\n",
					"import sys\n",
					"sys.path.append('/home/alakshminara/NLP Assignments/Assignment2/stanford-parser.jar')\n",
					"import unittest\n",
					"from stanford import StanfordParser, PySentence\n",
					"PARSER = StanfordParser('englishPCFG.ser.gz')"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"## Feature Generator\n",
					"\n",
					"I designed the feature generator in Jython and wrote the features to a CSV file. I later converted the CSV to an ARFF file in Weka itself."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"#Copied from Jython File\n",
					"import sys\n",
					"import csv\n",
					"sys.path.append('/home/alakshminara/NLP Assignments/Assignment2/stanford-parser.jar')\n",
					"from stanford import StanfordParser, PySentence\n",
					"parser = StanfordParser('englishPCFG.ser.gz')\n",
					"\n",
					"# This run processes the test split and writes its features to NewFeaturesTest.txt\n",
					"DATA_PATH = \"test.tsv\"\n",
					"f = open(DATA_PATH, 'r')\n",
					"reader = csv.reader(f, delimiter='\\t')\n",
					"k = open('NewFeaturesTest.txt', 'w')\n",
					"spamwriter = csv.writer(k, delimiter='\\t')\n",
					"for line in reader:\n",
					"    # Truncated context with line[0] appended, so line[0] ends the parsed text\n",
					"    required_line = line[2][:160] + line[3][-30:] + line[0]\n",
					"    sentence = parser.parse(required_line)\n",
					"    tokens = required_line.split()\n",
					"\n",
					"    # 1-based position of the person's name (line[1]): the last of its\n",
					"    # tokens that is found in the text, defaulting to the first token\n",
					"    index1 = 1\n",
					"    for name in line[1].split():\n",
					"        if name in tokens:\n",
					"            index1 = tokens.index(name) + 1\n",
					"\n",
					"    # line[0] (the institution) was appended last, so it ends at the final token\n",
					"    index2 = len(tokens)\n",
					"\n",
					"    a = sentence.get_least_common_node_length(index1, index2)\n",
					"    b = sentence.get_num_common_verb(index1, index2)\n",
					"    # NOTE: 'yes' and 'no' are both non-empty strings and therefore truthy,\n",
					"    # so this `or` chain always evaluates to the result for index1\n",
					"    c = sentence.is_a_nsubjpass(index1) or sentence.is_a_nsubjpass(index1+1) or sentence.is_a_nsubjpass(index1+2)\n",
					"    print(line[0])\n",
					"    spamwriter.writerow([a, b, c])\n",
					"k.close()\n",
					"f.close()"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"#Snapshot of Features\n",
					"\n",
					"AA\tAB\tAC\tTarget\n",
					"5\t3\tyes\tyes\n",
					"4\t3\tno\tyes\n",
					"7\t3\tyes\tno\n",
					"3\t2\tyes\tno\n",
					"6\t1\tyes\tyes\n",
					"0\t0\tyes\tyes\n",
					"5\t3\tyes\tyes\n",
					"6\t3\tyes\tyes\n",
					"6\t3\tyes\tyes\n",
					"7\t4\tyes\tyes\n",
					"0\t0\tno\tyes\n",
					"0\t0\tyes\tno\n",
					"0\t0\tno\tyes\n",
					"7\t0\tno\tyes\n",
					"0\t0\tyes\tyes\n",
					"3\t1\tyes\tyes\n",
					"4\t2\tyes\tyes"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"=== Run information ===\n",
					"\n",
					"Scheme:       weka.classifiers.functions.LibLINEAR -S 1 -C 1.0 -E 0.001 -B 1.0\n",
					"Relation:     FeaturesTrain\n",
					"Instances:    6000\n",
					"Attributes:   4\n",
					"              AA\n",
					"              AB\n",
					"              AC\n",
					"              AD\n",
					"Test mode:    evaluate on training data\n",
					"\n",
					"=== Classifier model (full training set) ===\n",
					"\n",
					"LibLINEAR wrapper\n",
					"\n",
					"Time taken to build model: 10.02 seconds\n",
					"\n",
					"=== Evaluation on training set ===\n",
					"\n",
					"Time taken to test model on training data: 0 seconds\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances          3918               65.3333 %\n",
					"Incorrectly Classified Instances        2082               34.6667 %\n",
					"Kappa statistic                          0.1975\n",
					"Mean absolute error                      0.3467\n",
					"Root mean squared error                  0.5888\n",
					"Relative absolute error                 72.1441 %\n",
					"Root relative squared error            120.1834 %\n",
					"Coverage of cases (0.95 level)          65.3333 %\n",
					"Mean rel. region size (0.95 level)      50      %\n",
					"Total Number of Instances               6000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.911    0.733    0.651      0.911    0.759      0.238    0.589     0.646     yes\n",
					"                 0.267    0.089    0.667      0.267    0.381      0.238    0.589     0.471     no\n",
					"Weighted Avg.    0.653    0.476    0.657      0.653    0.608      0.238    0.589     0.576     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"  a    b   <-- classified as\n",
					" 3280  320 |  a = yes\n",
					" 1760  640 |  b = no\n",
					"\n",
					"\n",
					"=== Re-evaluation on test set ===\n",
					"\n",
					"User supplied test set\n",
					"Relation:     FeaturesTest\n",
					"Instances:     unknown (yet). Reading incrementally\n",
					"Attributes:   4\n",
					"\n",
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances          710               71 %\n",
					"Incorrectly Classified Instances        290               29 %\n",
					"Kappa statistic                         -0.1644\n",
					"Mean absolute error                      0.2941\n",
					"Root mean squared error                  0.5423\n",
					"Coverage of cases (0.95 level)          71 %\n",
					"Total Number of Instances               1000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.857    1.000    0.800      0.857    0.828      -0.169   0.429     0.803     yes\n",
					"                 0.000    0.143    0.000      0.000    0.000      -0.169   0.429     0.176     no\n",
					"Weighted Avg.    0.706    0.849    0.659      0.706    0.682      -0.169   0.429     0.693     \n",
					"\n",
					"=== Confusion Matrix ===\n",
					"\n",
					"  a  b   <-- classified as\n",
					" 635 131 |  a = yes\n",
					" 165  69 |  b = no\n",
					"\n"
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"##Kitchen Sink"
				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"Combining the regular expression, clustering (both bag-of-words and Brown clustering) and dependency features gives the following output in Weka."
				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"Weka gives a case-by-case output for the clustering:\n",
					"\n",
					"\n",
					" inst#     actual  predicted error prediction\n",
					"     1      1:yes      1:yes       1 \n",
					"     2       2:no      1:yes   +   1 \n",
					"     3      1:yes      1:yes       1 \n",
					"     4       2:no      1:yes   +   1 \n",
					"     5       2:no      1:yes   +   1 \n",
					"     6       2:no      1:yes   +   1 \n",
					"     7       2:no      1:yes   +   1 \n",
					"     8      1:yes      1:yes       1 \n",
					"     9      1:yes       2:no   +   1 \n",
					"    10       2:no      1:yes   +   1 \n",
					"    11      1:yes      1:yes       1 \n",
					"    12      1:yes      1:yes       1 \n",
					"    13      1:yes      1:yes       1 \n",
					"    14       2:no       2:no       1 \n",
					"    15      1:yes      1:yes       1 \n",
					"    16       2:no      1:yes   +   1 \n",
					"    17      1:yes      1:yes       1 \n",
					"    18      1:yes      1:yes       1 \n",
					"    19      1:yes      1:yes       1 \n",
					"    20      1:yes      1:yes       1 \n",
					"    21       2:no      1:yes   +   1 \n",
					"    22      1:yes      1:yes       1 \n",
					"    23      1:yes      1:yes       1 \n",
					"    24       2:no      1:yes   +   1 \n",
					"    25      1:yes      1:yes       1 \n",
					"    26      1:yes      1:yes       1 \n",
					"    27      1:yes      1:yes       1 \n",
					"    28      1:yes      1:yes       1 \n",
					"    29      1:yes      1:yes       1 \n",
					"    30       2:no      1:yes   +   1 \n",
					"    31       2:no      1:yes   +   1 "
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"Final Training Set Format:\n",
					"\n",
					"feature 1\tfeature 2\tfeature 3\tRegular Expression\tClustering\tTarget\n",
					"5\t2\tno\tyes\tyes\tyes\n",
					"6\t3\tno\tyes\tyes\tno\n",
					"0\t0\tno\tno\tyes\tyes\n",
					"0\t0\tno\tyes\tyes\tno\n",
					"5\t3\tyes\tyes\tyes\tno\n",
					"5\t1\tyes\tyes\tyes\tno\n",
					"5\t0\tyes\tno\tyes\tno\n",
					"8\t4\tyes\tno\tyes\tyes\n",
					"0\t0\tyes\tyes\tno\tyes\n",
					"7\t3\tyes\tno\tyes\tno\n",
					"4\t3\tyes\tyes\tyes\tyes\n",
					"5\t3\tyes\tno\tyes\tyes\n",
					"4\t0\tno\tno\tyes\tyes\n",
					"5\t2\tyes\tyes\tno\tno\n",
					"3\t3\tyes\tyes\tyes\tyes\n",
					"0\t0\tyes\tyes\tyes\tno\n",
					"4\t3\tyes\tno\tyes\tyes\n",
					"0\t0\tyes\tno\tyes\tyes\n",
					"0\t0\tyes\tyes\tyes\tyes\n",
					"0\t0\tno\tno\tyes\tyes\n",
					"4\t1\tyes\tyes\tyes\tno\n",
					"6\t3\tyes\tno\tyes\tyes\n",
					"0\t0\tno\tno\tyes\tyes\n",
					"6\t3\tyes\tyes\tyes\tno\n",
					"0\t0\tyes\tyes\tyes\tyes\n",
					"0\t0\tno\tyes\tyes\tyes\n",
					"0\t0\tno\tyes\tyes\tyes\n",
					"6\t3\tno\tyes\tyes\tyes\n",
					"4\t3\tyes\tyes\tyes\tyes\n",
					"5\t3\tno\tyes\tyes\tno\n",
					"4\t2\tyes\tno\tyes\tno\n",
					"\n",
					"    \n",
					"    "
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "code",
				"collapsed": false,
				"input": [
					"=== Summary ===\n",
					"\n",
					"Correctly Classified Instances          764               76.4 %\n",
					"Incorrectly Classified Instances        236               23.6 %\n",
					"Kappa statistic                         -0.0968\n",
					"Mean absolute error                      0.2353\n",
					"Root mean squared error                  0.4851\n",
					"Coverage of cases (0.95 level)          76.4 %\n",
					"Total Number of Instances               1000     \n",
					"\n",
					"=== Detailed Accuracy By Class ===\n",
					"\n",
					"                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class\n",
					"                 0.929    1.000    0.813      0.929    0.867      -0.116   0.464     0.813     yes\n",
					"                 0.000    0.071    0.000      0.000    0.000      -0.116   0.464     0.176     no\n",
					"Weighted Avg.    0.765    0.836    0.669      0.765    0.714      -0.116   0.464     0.701     "
				],
				"language": "python",
				"metadata": {

				},
				"outputs": [

				]
			},
			{
				"cell_type": "markdown",
				"metadata": {

				},
				"source": [
					"####We observe that the Kitchen Sink gave the best result in the end: 76.4% accuracy on the test set, compared with 71% for the dependency features and about 66% for the cluster features. The reason might be that the feature sets are each weak on a few different cases when taken individually, which lowers their accuracy, whereas together most of them tend to point to the correct output.\n"
				]
			}
		],
		"metadata": {

		}
	}]
}