@alkutnikar
Last active September 7, 2017 16:22
Relation Extraction
{
"metadata": {
"name": "",
"signature": "sha256:27ba14dd948ce47454e111a60aec9e03aa3ef3d3649f5e52256948d157666cab"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [{
"cells": [{
"cell_type": "markdown",
"metadata": {
},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"We are building a relation extractor that identifies institution relation instances in Wikipedia sentences. An institution relation institution(x, y) indicates that a person x studied at the institution y. "
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"1. Manual Rule-Based Extractor\n",
"\n",
"Write a rule-based extractor. Specify five regular expression patterns (word-based) to identify “institution” relations. Use the training set to develop your patterns and report the performance on the test set. \n",
"\n",
"e.g., If sentence matches “(.*?)[space]graduate.*[space]in” then output “Yes” for the relation otherwise output “No”."
]
},
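{
"cell_type": "markdown",
"metadata": {
},
"source": [
"A minimal sketch of the rule-based idea described above, run on two hand-written sentences rather than the actual data. The two patterns and the helper name `predict_institution` are illustrative assumptions, not the patterns used later in this notebook.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical word-based patterns; each one looks for a cue phrase.\n",
"PATTERNS = [\n",
"    re.compile(r'graduated? (from|in) ', re.IGNORECASE),\n",
"    re.compile(r'studied at ', re.IGNORECASE),\n",
"]\n",
"\n",
"def predict_institution(sentence):\n",
"    # Answer 'yes' if any pattern occurs anywhere in the sentence, else 'no'.\n",
"    if any(p.search(sentence) for p in PATTERNS):\n",
"        return 'yes'\n",
"    return 'no'\n",
"\n",
"examples = [\n",
"    'He graduated from Stanford University in 1990.',\n",
"    'She visited the campus last year.',\n",
"]\n",
"predictions = [predict_institution(s) for s in examples]\n",
"```\n",
"\n",
"Using `search` instead of `match` removes the need for a leading `.*` in every pattern."
]
},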
{
"cell_type": "code",
"collapsed": false,
"input": [
"TEST_DATA_PATH = \"test.tsv\"\n",
"TRAIN_DATA_PATH = \"train.tsv\"\n",
"PATHS_DATA_PATH = \"paths\""
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####Function to get Evaluation Parameters"
]
},
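{
"cell_type": "code",
"collapsed": false,
"input": [
"from __future__ import division\n",
"def evaluate(tp,fp,tn,fn):\n",
"    # Precision, recall and F1 for the positive ('yes') class; tn is accepted but not used.\n",
"    # Each ratio falls back to 0.0 when its denominator is zero.\n",
"    precision = tp / (tp + fp) if (tp + fp) else 0.0\n",
"    recall = tp / (tp + fn) if (tp + fn) else 0.0\n",
"    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0\n",
"    print 'Precision = {0}, Recall = {1}, F1 Score = {2}'.format(precision, recall, f1)"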
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####Regular Expression patterns"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"import re\n",
"tp=0\n",
"tn=0\n",
"fp=0\n",
"fn=0\n",
"\n",
"# Five case-insensitive cue-word patterns; an instance is labelled 'yes' if any of them matches.\n",
"r_list = [re.compile(r'.*graduate.*',re.IGNORECASE),\n",
"          re.compile(r'.*completed.*education.*',re.IGNORECASE),\n",
"          re.compile(r'.*bachelor.*',re.IGNORECASE),\n",
"          re.compile(r'.*attended.*',re.IGNORECASE),\n",
"          re.compile(r'.*student.*',re.IGNORECASE)]\n",
"\n",
"\n",
"with open(TRAIN_DATA_PATH,'r') as f:\n",
"    next(f) # skip headings\n",
"    reader=csv.reader(f,delimiter='\\t')\n",
"\n",
"    for line in reader:\n",
"        # line[3] is the text the patterns are applied to, line[4] the gold judgment\n",
"        if any(r.match(line[3]) for r in r_list):\n",
"            if line[4] == 'yes':\n",
"                tp += 1\n",
"            else:\n",
"                fp += 1\n",
"        else:\n",
"            if line[4] == 'no':\n",
"                tn += 1\n",
"            else:\n",
"                fn += 1\n",
"print 'For Training Set'\n",
"evaluate(tp,fp,tn,fn)  # evaluate() prints the scores itself and returns nothing"
],
"language": "python",
"metadata": {
},
"outputs": [{
"output_type": "stream",
"stream": "stdout",
"text": [
"For Training Set\n",
"Precision = 0.714383094751, Recall = 0.531440162272, F1 Score = 0.609479499855\n"
]
}],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####Measuring performance on Test Set"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tp=0\n",
"tn=0\n",
"fp=0\n",
"fn=0\n",
"with open(TEST_DATA_PATH,'r') as f:\n",
"    next(f) # skip headings\n",
"    reader=csv.reader(f,delimiter='\\t')\n",
"\n",
"    for line in reader:\n",
"        # Same patterns (r_list) as on the training set\n",
"        if any(r.match(line[3]) for r in r_list):\n",
"            if line[4] == 'yes':\n",
"                tp += 1\n",
"            else:\n",
"                fp += 1\n",
"        else:\n",
"            if line[4] == 'no':\n",
"                tn += 1\n",
"            else:\n",
"                fn += 1\n",
"print 'For Test Set'\n",
"evaluate(tp,fp,tn,fn)  # evaluate() prints the scores itself and returns nothing"
],
"language": "python",
"metadata": {
},
"outputs": [{
"output_type": "stream",
"stream": "stdout",
"text": [
"For Test Set\n",
"Precision = 0.737739872068, Recall = 0.529862174579, F1 Score = 0.616755793226\n"
]
}],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"##Supervised Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"At a high level, the following steps are involved:\n",
"1) Convert the training and test instances into feature vectors plus a class label.\n",
"2) Use a binary classification algorithm such as SVM to learn a classifier on the training data. \n",
"3) Use the trained classifier to make predictions on the test data.\n",
"4) Evaluate the predictions."
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"1) Bag-of-words – Every word that occurs in the span between the PERSON mention and the INSTITUTION mention is used as a feature in a supervised classification setup. The classifiers typically expect you to assign a unique id to each word in the vocabulary and represent the text via these ids"
]
},
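{
"cell_type": "markdown",
"metadata": {
},
"source": [
"To make the id assignment concrete, here is a minimal sketch of a bag-of-words encoding on three made-up spans. The helper names `build_vocabulary` and `bow_vector` and the example spans are assumptions for illustration; the functions used for the actual data follow below.\n",
"\n",
"```python\n",
"def build_vocabulary(texts):\n",
"    # Give each distinct word a unique integer id, in order of first appearance.\n",
"    vocab = {}\n",
"    for text in texts:\n",
"        for word in text.lower().split():\n",
"            if word not in vocab:\n",
"                vocab[word] = len(vocab)\n",
"    return vocab\n",
"\n",
"def bow_vector(text, vocab):\n",
"    # Count how often each vocabulary word occurs in the text.\n",
"    vector = [0] * len(vocab)\n",
"    for word in text.lower().split():\n",
"        if word in vocab:\n",
"            vector[vocab[word]] += 1\n",
"    return vector\n",
"\n",
"# Hypothetical spans between a PERSON mention and an INSTITUTION mention.\n",
"spans = ['graduated from', 'was a professor at', 'graduated with honors from']\n",
"vocab = build_vocabulary(spans)\n",
"vectors = [bow_vector(s, vocab) for s in spans]\n",
"```\n",
"\n",
"Keeping the vocabulary in a dictionary makes each id lookup constant time, which matters once the vocabulary reaches tens of thousands of words."
]
},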
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####To get the feature vectors and the arff files for training and test"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"TEST_DATA_PATH = \"test.tsv\"\n",
"TRAIN_DATA_PATH = \"train.tsv\"\n",
"PATHS_DATA_PATH = \"paths\""
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 137
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def parse_data(train_data, test_data):\n",
"    \"\"\"\n",
"    Input: paths to the training and test data files\n",
"    Output: (1) a list of tuples, one for each instance of the data, and\n",
"            (2) a list of all unique tokens in the data\n",
"\n",
"    Parses the data files to extract all instances of the data as tuples of the form:\n",
"    (person, institution, judgment, full snippet, intermediate text)\n",
"    where the intermediate text is all tokens that occur between the first occurrence of\n",
"    the person and the first occurrence of the institution.\n",
"\n",
"    Also extracts a list of all tokens that appear in the intermediate text for the\n",
"    purpose of creating feature vectors.\n",
"    \"\"\"\n",
"    all_tokens = []\n",
"    data = []\n",
"    for fp in [train_data, test_data]:\n",
"        with open(fp) as f:\n",
"            for line in f:\n",
"                institution, person, snippet, intermediate_text, judgment = line.split(\"\\t\")\n",
"                judgment = judgment.strip()\n",
"\n",
"                # Build up a list of unique tokens that occur in the intermediate text\n",
"                # This is needed to create BOW feature vectors\n",
"                tokens = intermediate_text.split()\n",
"                for t in tokens:\n",
"                    # Strip every non-alphanumeric character from the token\n",
"                    t = re.sub('[^A-Za-z0-9]+', '', t)\n",
"                    if t not in all_tokens:\n",
"                        all_tokens.append(t)\n",
"                data.append((person, institution, judgment, snippet, intermediate_text))\n",
"    return data, all_tokens"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def create_feature_vectors(data, all_tokens):\n",
"    \"\"\"\n",
"    Input: (1) The parsed data from parse_data()\n",
"           (2) a list of all unique tokens found in the intermediate text\n",
"    Output: A list of lists representing the feature vectors for each data instance\n",
"\n",
"    Creates feature vectors from the parsed data file. These features include\n",
"    bag of words features representing the number of occurrences of each\n",
"    token in the intermediate text (text that comes between the first occurrence\n",
"    of the person and the first occurrence of the institution).\n",
"    This is also where any additional user-defined features can be added.\n",
"    \"\"\"\n",
"    feature_vectors = []\n",
"    for instance in data:\n",
"        # BOW features\n",
"        # Gets the number of occurrences of each token\n",
"        # in the intermediate text\n",
"        feature_vector = [0 for t in all_tokens]\n",
"        intermediate_text = instance[4]\n",
"        tokens = intermediate_text.split()\n",
"        for token in tokens:\n",
"            # Normalize exactly as parse_data() does, so that every token is found in\n",
"            # all_tokens and no instance has to be dropped by a catch-all except clause\n",
"            token = re.sub('[^A-Za-z0-9]+', '', token)\n",
"            index = all_tokens.index(token)\n",
"            feature_vector[index] += 1\n",
"\n",
"        ### ADD ADDITIONAL FEATURES HERE ###\n",
"\n",
"        # Class label\n",
"        judgment = instance[2]\n",
"        feature_vector.append(judgment)\n",
"\n",
"        feature_vectors.append(feature_vector)\n",
"    return feature_vectors"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def generate_arff_file(feature_vectors, all_tokens, out_path):\n",
"    \"\"\"\n",
"    Input: (1) A list of all feature vectors for the data\n",
"           (2) A list of all unique tokens that occurred in the intermediate text\n",
"           (3) The name and path of the ARFF file to be output\n",
"    Output: an ARFF file output to the location specified in out_path\n",
"\n",
"    Converts a list of feature vectors to an ARFF file for use with Weka.\n",
"    \"\"\"\n",
"    with open(out_path, 'w') as f:\n",
"        # Header info\n",
"        f.write(\"@RELATION institutions\\n\")\n",
"        for i in range(len(all_tokens)):\n",
"            f.write(\"@ATTRIBUTE token_{} INTEGER\\n\".format(i))\n",
"\n",
"        ### SPECIFY ADDITIONAL FEATURES HERE ###\n",
"        # For example: f.write(\"@ATTRIBUTE custom_1 REAL\\n\")\n",
"\n",
"        # Classes\n",
"        f.write(\"@ATTRIBUTE class {yes,no}\\n\")\n",
"\n",
"        # Data instances\n",
"        f.write(\"\\n@DATA\\n\")\n",
"        for fv in feature_vectors:\n",
"            features = []\n",
"            for i in range(len(fv)):\n",
"                value = fv[i]\n",
"                if value != 0:\n",
"                    features.append(\"{} {}\".format(i, value))\n",
"            entry = \",\".join(features)\n",
"            f.write(\"{\" + entry + \"}\\n\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 45
},
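{
"cell_type": "markdown",
"metadata": {
},
"source": [
"Each data row written by `generate_arff_file` uses the sparse ARFF notation `{index value, ...}`, in which only non-zero attributes are listed. A minimal sketch of that row format, with a hypothetical helper `sparse_row` and a made-up five-element vector:\n",
"\n",
"```python\n",
"def sparse_row(feature_vector):\n",
"    # Keep only the non-zero entries as 'index value' pairs, in index order.\n",
"    pairs = ['{} {}'.format(i, v) for i, v in enumerate(feature_vector) if v != 0]\n",
"    return '{' + ','.join(pairs) + '}'\n",
"\n",
"# Four token counts followed by the class label in the last position.\n",
"row = sparse_row([0, 2, 0, 1, 'yes'])\n",
"```\n",
"\n",
"With a vocabulary of tens of thousands of tokens and only a handful of tokens per instance, the sparse form keeps the file far smaller than a dense row of zeros would."
]
},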
{
"cell_type": "code",
"collapsed": false,
"input": [
"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
"feature_vectors = create_feature_vectors(data, all_tokens)\n",
"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_cluster.arff\")\n",
"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_cluster.arff\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"##Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"######The following tests train a LibLINEAR model in Weka on the training ARFF file generated above and re-evaluate it on the test ARFF file for several values of the cost parameter C."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Performance for c = 0.001\n",
"\n",
"Correctly Classified Instances 719 71.9 %\n",
"Incorrectly Classified Instances 281 28.1 %\n",
"\n",
"\n",
"Performance for c = 0.01\n",
"\n",
"Correctly Classified Instances 747 74.7 %\n",
"Incorrectly Classified Instances 253 25.3 %\n",
"\n",
"Performance for c = 0.1\n",
"\n",
"Correctly Classified Instances 735 73.5 %\n",
"Incorrectly Classified Instances 265 26.5 %\n",
"\n",
"Performance for c = 1.0\n",
"\n",
"Correctly Classified Instances 703 70.3 %\n",
"Incorrectly Classified Instances 297 29.7 %\n",
"\n",
"Performance for c = 10\n",
"\n",
"Correctly Classified Instances 670 67 %\n",
"Incorrectly Classified Instances 330 33 %\n",
"\n",
"Performance for c = 100\n",
"\n",
"Correctly Classified Instances 635 63.5 %\n",
"Incorrectly Classified Instances 365 36.5 %"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"The best test accuracy (74.7%) is obtained for C = 0.01; accuracy falls steadily as C grows beyond that."
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####We use the training ARFF file to train a classifier in Weka with C = 0.01"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"=== Re-evaluation on test set ===\n",
"\n",
"User supplied test set\n",
"Relation: institutions\n",
"Instances: unknown (yet). Reading incrementally\n",
"Attributes: 28059\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 747 74.7 %\n",
"Incorrectly Classified Instances 253 25.3 %\n",
"Kappa statistic 0.3826\n",
"Mean absolute error 0.253 \n",
"Root mean squared error 0.503 \n",
"Coverage of cases (0.95 level) 74.7 %\n",
"Total Number of Instances 1000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.913 0.566 0.753 0.913 0.825 0.407 0.673 0.744 yes\n",
" 0.434 0.087 0.725 0.434 0.542 0.407 0.673 0.510 no\n",
"Weighted Avg. 0.747 0.401 0.743 0.747 0.727 0.407 0.673 0.663 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 597 57 | a = yes\n",
" 196 150 | b = no"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
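{
"cell_type": "markdown",
"metadata": {
},
"source": [
"The per-class figures in the Weka report above follow directly from its confusion matrix. A minimal sketch that recomputes the `yes` row and the overall accuracy; the helper name `prf` is an assumption for illustration:\n",
"\n",
"```python\n",
"def prf(tp, fp, fn):\n",
"    # Precision, recall and F1 for one class from its confusion-matrix counts.\n",
"    precision = tp / float(tp + fp)\n",
"    recall = tp / float(tp + fn)\n",
"    f1 = 2 * precision * recall / (precision + recall)\n",
"    return precision, recall, f1\n",
"\n",
"# Counts from the matrix above: 597 yes->yes, 57 yes->no, 196 no->yes, 150 no->no.\n",
"precision, recall, f1 = prf(tp=597, fp=196, fn=57)\n",
"accuracy = (597 + 150) / 1000.0\n",
"```\n",
"\n",
"Rounded to three decimals these give 0.753, 0.913 and 0.825 for the `yes` class and an accuracy of 0.747, matching the report."
]
},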
{
"cell_type": "markdown",
"metadata": {
},
"source": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.tokenize import word_tokenize\n",
"def parse_data(train_data, test_data):\n",
"    \"\"\"\n",
"    Input: paths to the training and test data files\n",
"    Output: (1) a list of tuples, one for each instance of the data, and\n",
"            (2) a list of all unique tokens in the data\n",
"\n",
"    Parses the data files to extract all instances of the data as tuples of the form:\n",
"    (person, institution, judgment, full snippet, intermediate text)\n",
"    where the intermediate text is all tokens that occur between the first occurrence of\n",
"    the person and the first occurrence of the institution.\n",
"\n",
"    Also extracts a list of all tokens that appear in the intermediate text for the\n",
"    purpose of creating feature vectors. Tokens come from nltk's word_tokenize.\n",
"    \"\"\"\n",
"    all_tokens = []\n",
"    data = []\n",
"    for fp in [train_data, test_data]:\n",
"        with open(fp) as f:\n",
"            for line in f:\n",
"                institution, person, snippet, intermediate_text, judgment = line.split(\"\\t\")\n",
"                judgment = judgment.strip()\n",
"\n",
"                # Build up a list of unique tokens that occur in the intermediate text\n",
"                # This is needed to create BOW feature vectors\n",
"                tokens = word_tokenize(intermediate_text.decode('utf8'))\n",
"\n",
"                for t in tokens:\n",
"                    #t = re.sub('[^A-Za-z0-9]+', '', t)\n",
"                    if t not in all_tokens:\n",
"                        all_tokens.append(t)\n",
"                data.append((person, institution, judgment, snippet, intermediate_text))\n",
"    return data, all_tokens"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 61
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def create_feature_vectors_tokenized(data, all_tokens):\n",
"    \"\"\"\n",
"    Input: (1) The parsed data from parse_data()\n",
"           (2) a list of all unique tokens found in the intermediate text\n",
"    Output: A list of lists representing the feature vectors for each data instance\n",
"\n",
"    Creates feature vectors from the parsed data file. These features include\n",
"    bag of words features representing the number of occurrences of each\n",
"    token in the intermediate text (text that comes between the first occurrence\n",
"    of the person and the first occurrence of the institution).\n",
"    This is also where any additional user-defined features can be added.\n",
"    \"\"\"\n",
"    feature_vectors = []\n",
"    for instance in data:\n",
"        feature_vector = [0 for t in all_tokens]\n",
"\n",
"        # BOW features\n",
"        # Gets the number of occurrences of each token\n",
"        # in the intermediate text\n",
"        intermediate_text = instance[4]\n",
"        tokens = word_tokenize(intermediate_text.decode('utf8'))\n",
"        for token in tokens:\n",
"            index = all_tokens.index(token)\n",
"            feature_vector[index] += 1\n",
"\n",
"        ### ADD ADDITIONAL FEATURES HERE ###\n",
"\n",
"        # Class label\n",
"        judgment = instance[2]\n",
"        feature_vector.append(judgment)\n",
"\n",
"        feature_vectors.append(feature_vector)\n",
"\n",
"    return feature_vectors"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 62
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
"feature_vectors = create_feature_vectors_tokenized(data, all_tokens)\n",
"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_tokenize.arff\")\n",
"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_tokenize.arff\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 63
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####Evaluation of the LibLINEAR model on the nltk.tokenize features with C = 0.01"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"=== Re-evaluation on test set ===\n",
"\n",
"User supplied test set\n",
"Relation: institutions\n",
"Instances: unknown (yet). Reading incrementally\n",
"Attributes: 21945\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 727 72.7 %\n",
"Incorrectly Classified Instances 273 27.3 %\n",
"Kappa statistic 0.3225\n",
"Mean absolute error 0.273 \n",
"Root mean squared error 0.5225\n",
"Coverage of cases (0.95 level) 72.7 %\n",
"Total Number of Instances 1000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.914 0.627 0.734 0.914 0.814 0.352 0.644 0.727 yes\n",
" 0.373 0.086 0.697 0.373 0.486 0.352 0.644 0.477 no\n",
"Weighted Avg. 0.727 0.440 0.721 0.727 0.701 0.352 0.644 0.640 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 598 56 | a = yes\n",
" 217 129 | b = no"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"#####2) Clustering Features – Using words directly as features often leads to sparsity. A standard technique to overcome sparsity is to use a generalized representation of words via clustering. The idea is to first cluster words based on their usage and then represent each word by the cluster to which it belongs. \n",
"\n",
"Brown clustering is a widely used method for clustering words. Percy Liang’s implementation can be used to generate hierarchical word clusters. The algorithm takes a text file as input and a cluster parameter (c) that controls the number of clusters. The output includes a paths file that gives the cluster ID of each word, which is its path in the hierarchical (binary) clustering. "
]
},
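{
"cell_type": "markdown",
"metadata": {
},
"source": [
"A minimal sketch of how cluster IDs replace words as features. It assumes each line of the paths file holds a bit path, a word and a count separated by whitespace; the four example lines and the helper names `load_paths` and `cluster_counts` are made up for illustration.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def load_paths(lines):\n",
"    # Map each word to the bit path of the cluster it belongs to.\n",
"    word_to_cluster = {}\n",
"    for line in lines:\n",
"        path, word, _count = line.split()\n",
"        word_to_cluster[word] = path\n",
"    return word_to_cluster\n",
"\n",
"def cluster_counts(text, word_to_cluster):\n",
"    # Count cluster ids instead of words; words without a cluster are skipped.\n",
"    return Counter(word_to_cluster[w] for w in text.split() if w in word_to_cluster)\n",
"\n",
"# Hypothetical excerpt of a paths file.\n",
"paths_lines = ['0010 graduated 12', '0010 studied 9', '110 from 40', '110 at 55']\n",
"word_to_cluster = load_paths(paths_lines)\n",
"features = cluster_counts('studied at and graduated from', word_to_cluster)\n",
"```\n",
"\n",
"Here `studied` and `graduated` share one feature and `at` and `from` share another, so the five-word span is described by two cluster counts."
]
},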
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"import re\n",
"xmap={}        # cleaned word -> cluster path\n",
"keymap={}      # cluster path -> a cleaned word from that cluster\n",
"convertmap={}  # cluster path -> attribute index\n",
"with open('paths','r') as f:\n",
"\n",
"    lines=csv.reader(f,delimiter='\\t')\n",
"    for reader in lines:\n",
"        xmap[re.sub('[^A-Za-z0-9]+', '', reader[1])]=reader[0]\n",
"        keymap[reader[0]] = re.sub('[^A-Za-z0-9]+', '', reader[1])\n",
"val=0\n",
"for key in keymap.keys():\n",
"    convertmap[key] = val\n",
"    val+=1\n"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 94
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def parse_data(train_data, test_data):\n",
"    \"\"\"\n",
"    Input: paths to the training and test data files\n",
"    Output: (1) a list of tuples, one for each instance of the data, and\n",
"            (2) a list of all unique tokens in the data\n",
"\n",
"    Parses the data files to extract all instances of the data as tuples of the form:\n",
"    (person, institution, judgment, full snippet, intermediate text)\n",
"    where the intermediate text is all tokens that occur between the first occurrence of\n",
"    the person and the first occurrence of the institution.\n",
"\n",
"    Also extracts a list of all tokens that appear in the intermediate text for the\n",
"    purpose of creating feature vectors.\n",
"    \"\"\"\n",
"    all_tokens = []\n",
"    data = []\n",
"    for fp in [train_data, test_data]:\n",
"        with open(fp) as f:\n",
"            for line in f:\n",
"                institution, person, snippet, intermediate_text, judgment = line.split(\"\\t\")\n",
"                judgment = judgment.strip()\n",
"\n",
"                # Build up a list of unique tokens that occur in the intermediate text\n",
"                # This is needed to create BOW feature vectors\n",
"                tokens = intermediate_text.split()\n",
"\n",
"                for t in tokens:\n",
"                    t = re.sub('[^A-Za-z0-9]+', '', t)\n",
"                    if t not in all_tokens:\n",
"                        all_tokens.append(t)\n",
"                data.append((person, institution, judgment, snippet, intermediate_text))\n",
"    return data, all_tokens"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 132
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def create_feature_vectors_cluster(data, all_tokens):\n",
"    \"\"\"\n",
"    Input: (1) The parsed data from parse_data()\n",
"           (2) a list of all unique tokens found in the intermediate text\n",
"    Output: A list of [cluster counts, judgment] pairs, one for each data instance\n",
"\n",
"    Creates feature vectors from the parsed data file. Each token in the\n",
"    intermediate text (text that comes between the first occurrence of the person\n",
"    and the first occurrence of the institution) is replaced by its cluster path\n",
"    from xmap, and the features are the number of occurrences of each cluster.\n",
"    This is also where any additional user-defined features can be added.\n",
"    \"\"\"\n",
"    feature_vectors = []\n",
"    for instance in data:\n",
"        # Cluster features\n",
"        # Gets the number of occurrences of each cluster\n",
"        # in the intermediate text\n",
"        feature_vector = {}\n",
"        intermediate_text = instance[4]\n",
"        tokens = intermediate_text.split()\n",
"        for token in tokens:\n",
"            token = re.sub('[^A-Za-z0-9]+', '', token)\n",
"            if xmap[token] in feature_vector:\n",
"                feature_vector[xmap[token]] += 1\n",
"            else:\n",
"                feature_vector[xmap[token]] = 1\n",
"\n",
"        ### ADD ADDITIONAL FEATURES HERE ###\n",
"\n",
"        # Class label\n",
"        judgment = instance[2]\n",
"        feature_list=[]\n",
"        feature_list.append(feature_vector)\n",
"        feature_list.append(judgment)\n",
"\n",
"        feature_vectors.append(feature_list)\n",
"    return feature_vectors"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 96
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def generate_arff_file(feature_vectors, all_tokens, out_path):\n",
"    \"\"\"\n",
"    Input: (1) A list of [cluster counts, judgment] pairs for the data\n",
"           (2) A list of all unique tokens that occurred in the intermediate text (unused here)\n",
"           (3) The name and path of the ARFF file to be output\n",
"    Output: an ARFF file output to the location specified in out_path\n",
"\n",
"    Converts a list of feature vectors to an ARFF file for use with Weka.\n",
"    \"\"\"\n",
"    with open(out_path, 'w') as f:\n",
"        # Header info: one attribute per cluster\n",
"        f.write(\"@RELATION institutions\\n\")\n",
"\n",
"        for i in range(len(keymap.keys())):\n",
"            f.write(\"@ATTRIBUTE token_{} INTEGER\\n\".format(i))\n",
"\n",
"        ### SPECIFY ADDITIONAL FEATURES HERE ###\n",
"        # For example: f.write(\"@ATTRIBUTE custom_1 REAL\\n\")\n",
"\n",
"        # Classes\n",
"        f.write(\"@ATTRIBUTE class {yes,no}\\n\")\n",
"\n",
"        # Data instances\n",
"        f.write(\"\\n@DATA\\n\")\n",
"\n",
"        for fv in feature_vectors:\n",
"            features_map = fv[0]\n",
"            # Sort numerically by attribute index instead of relying on zero-padded strings\n",
"            pairs = sorted((convertmap[path], count) for path, count in features_map.items() if count != 0)\n",
"            features = [\"{} {}\".format(index, count) for index, count in pairs]\n",
"            # The class attribute follows all cluster attributes, so its index is\n",
"            # len(keymap) (50 for this paths file) rather than a hard-coded constant\n",
"            features.append(\"{} {}\".format(len(keymap), fv[1]))\n",
"            entry = \",\".join(features)\n",
"            f.write(\"{\" + entry + \"}\\n\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 97
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
"feature_vectors = create_feature_vectors_cluster(data, all_tokens)\n",
"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_cluster.arff\")\n",
"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_cluster.arff\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 98
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"=== Run information ===\n",
"\n",
"Scheme: weka.classifiers.functions.LibLINEAR -S 1 -C 0.01 -E 0.001 -B 1.0\n",
"Relation: institutions\n",
"Instances: 6000\n",
"Attributes: 51\n",
" token_0\n",
" token_1\n",
" token_2\n",
" token_3\n",
" token_4\n",
" token_5\n",
" token_6\n",
" token_7\n",
" token_8\n",
" token_9\n",
" token_10\n",
" token_11\n",
" token_12\n",
" token_13\n",
" token_14\n",
" token_15\n",
" token_16\n",
" token_17\n",
" token_18\n",
" token_19\n",
" token_20\n",
" token_21\n",
" token_22\n",
" token_23\n",
" token_24\n",
" token_25\n",
" token_26\n",
" token_27\n",
" token_28\n",
" token_29\n",
" token_30\n",
" token_31\n",
" token_32\n",
" token_33\n",
" token_34\n",
" token_35\n",
" token_36\n",
" token_37\n",
" token_38\n",
" token_39\n",
" token_40\n",
" token_41\n",
" token_42\n",
" token_43\n",
" token_44\n",
" token_45\n",
" token_46\n",
" token_47\n",
" token_48\n",
" token_49\n",
" class\n",
"Test mode: evaluate on training data\n",
"\n",
"=== Classifier model (full training set) ===\n",
"\n",
"LibLINEAR wrapper\n",
"\n",
"Time taken to build model: 1.2 seconds\n",
"\n",
"=== Evaluation on training set ===\n",
"\n",
"Time taken to test model on training data: 0.16 seconds\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 3977 66.2833 %\n",
"Incorrectly Classified Instances 2023 33.7167 %\n",
"Kappa statistic 0.0569\n",
"Mean absolute error 0.3372\n",
"Root mean squared error 0.5807\n",
"Relative absolute error 74.8587 %\n",
"Root relative squared error 122.3613 %\n",
"Coverage of cases (0.95 level) 66.2833 %\n",
"Mean rel. region size (0.95 level) 50 %\n",
"Total Number of Instances 6000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.968 0.923 0.668 0.968 0.791 0.101 0.523 0.668 yes\n",
" 0.077 0.032 0.556 0.077 0.136 0.101 0.523 0.359 no\n",
"Weighted Avg. 0.663 0.618 0.630 0.663 0.566 0.101 0.523 0.562 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 3818 127 | a = yes\n",
" 1896 159 | b = no\n",
"\n",
"\n",
"=== Re-evaluation on test set ===\n",
"\n",
"User supplied test set\n",
"Relation: institutions\n",
"Instances: unknown (yet). Reading incrementally\n",
"Attributes: 51\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 660 66 %\n",
"Incorrectly Classified Instances 340 34 %\n",
"Kappa statistic 0.056 \n",
"Mean absolute error 0.34 \n",
"Root mean squared error 0.5831\n",
"Coverage of cases (0.95 level) 66 %\n",
"Total Number of Instances 1000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.969 0.925 0.665 0.969 0.789 0.101 0.522 0.664 yes\n",
" 0.075 0.031 0.565 0.075 0.133 0.101 0.522 0.362 no\n",
"Weighted Avg. 0.660 0.615 0.630 0.660 0.562 0.101 0.522 0.560 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 634 20 | a = yes\n",
" 320 26 | b = no\n",
"\n"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"import re\n",
"xmap={}        # cleaned word -> cluster path\n",
"keymap={}      # cluster path -> a cleaned word from that cluster\n",
"convertmap={}  # cluster path -> attribute index\n",
"with open('paths10000.txt','r') as f:\n",
"\n",
"    lines=csv.reader(f,delimiter='\\t')\n",
"    for reader in lines:\n",
"        xmap[re.sub('[^A-Za-z0-9]+', '', reader[1])]=reader[0]\n",
"        keymap[reader[0]] = re.sub('[^A-Za-z0-9]+', '', reader[1])\n",
"val=0\n",
"for key in keymap.keys():\n",
"    convertmap[key] = val\n",
"    val+=1\n"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 99
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def create_feature_vectors_cluster10k(data, all_tokens):\n",
"    \"\"\"\n",
"    Input: (1) The parsed data from parse_data()\n",
"           (2) a list of all unique tokens found in the intermediate text\n",
"    Output: A list of [cluster counts, judgment] pairs, one for each data instance\n",
"\n",
"    Creates feature vectors from the parsed data file. Each token in the\n",
"    intermediate text (text that comes between the first occurrence of the person\n",
"    and the first occurrence of the institution) is replaced by its cluster path\n",
"    from xmap, and the features are the number of occurrences of each cluster.\n",
"    This is also where any additional user-defined features can be added.\n",
"    \"\"\"\n",
"    feature_vectors = []\n",
"    for instance in data:\n",
"        # Cluster features\n",
"        # Gets the number of occurrences of each cluster\n",
"        # in the intermediate text\n",
"        feature_vector = {}\n",
"        intermediate_text = instance[4]\n",
"        tokens = intermediate_text.split()\n",
"        for token in tokens:\n",
"            token = re.sub('[^A-Za-z0-9]+', '', token)\n",
"            # Skip tokens that have no cluster in the paths file\n",
"            if token not in xmap:\n",
"                continue\n",
"            if xmap[token] in feature_vector:\n",
"                feature_vector[xmap[token]] += 1\n",
"            else:\n",
"                feature_vector[xmap[token]] = 1\n",
"\n",
"        ### ADD ADDITIONAL FEATURES HERE ###\n",
"\n",
"        # Class label\n",
"        judgment = instance[2]\n",
"        feature_list=[]\n",
"        feature_list.append(feature_vector)\n",
"        feature_list.append(judgment)\n",
"\n",
"        feature_vectors.append(feature_list)\n",
"    return feature_vectors"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def generate_arff_file(feature_vectors, all_tokens, out_path):\n",
"    \"\"\"\n",
"    Input: (1) A list of [cluster counts, judgment] pairs for the data\n",
"           (2) A list of all unique tokens that occurred in the intermediate text (unused here)\n",
"           (3) The name and path of the ARFF file to be output\n",
"    Output: an ARFF file output to the location specified in out_path\n",
"\n",
"    Converts a list of feature vectors to an ARFF file for use with Weka.\n",
"    \"\"\"\n",
"    with open(out_path, 'w') as f:\n",
"        # Header info: one attribute per cluster\n",
"        f.write(\"@RELATION institutions\\n\")\n",
"\n",
"        for i in range(len(keymap.keys())):\n",
"            f.write(\"@ATTRIBUTE token_{} INTEGER\\n\".format(i))\n",
"\n",
"        ### SPECIFY ADDITIONAL FEATURES HERE ###\n",
"        # For example: f.write(\"@ATTRIBUTE custom_1 REAL\\n\")\n",
"\n",
"        # Classes\n",
"        f.write(\"@ATTRIBUTE class {yes,no}\\n\")\n",
"\n",
"        # Data instances\n",
"        f.write(\"\\n@DATA\\n\")\n",
"\n",
"        for fv in feature_vectors:\n",
"            features_map = fv[0]\n",
"            # Sort numerically by attribute index instead of relying on zero-padded strings\n",
"            pairs = sorted((convertmap[path], count) for path, count in features_map.items() if count != 0)\n",
"            features = [\"{} {}\".format(index, count) for index, count in pairs]\n",
"            # The class attribute follows all cluster attributes, so its index is\n",
"            # len(keymap) (9 for this paths file) rather than a hard-coded constant\n",
"            features.append(\"{} {}\".format(len(keymap), fv[1]))\n",
"            entry = \",\".join(features)\n",
"            f.write(\"{\" + entry + \"}\\n\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 109
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data, all_tokens = parse_data(TRAIN_DATA_PATH, TEST_DATA_PATH)\n",
"feature_vectors = create_feature_vectors_cluster10k(data, all_tokens)\n",
"generate_arff_file(feature_vectors[:6000], all_tokens, \"train_cluster10000.arff\")\n",
"generate_arff_file(feature_vectors[6000:], all_tokens, \"test_cluster10000.arff\")"
],
"language": "python",
"metadata": {
},
"outputs": [
],
"prompt_number": 110
},
{
"cell_type": "raw",
"metadata": {
},
"source": [
"=== Re-evaluation on test set ===\n",
"\n",
"User supplied test set\n",
"Relation: institutions\n",
"Instances: unknown (yet). Reading incrementally\n",
"Attributes: 10\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 654 65.4 %\n",
"Incorrectly Classified Instances 346 34.6 %\n",
"Kappa statistic 0.0123\n",
"Mean absolute error 0.346 \n",
"Root mean squared error 0.5882\n",
"Coverage of cases (0.95 level) 65.4 %\n",
"Total Number of Instances 1000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.989 0.980 0.656 0.989 0.789 0.039 0.505 0.656 yes\n",
" 0.020 0.011 0.500 0.020 0.039 0.039 0.505 0.349 no\n",
"Weighted Avg. 0.654 0.644 0.602 0.654 0.529 0.039 0.505 0.550 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 647 7 | a = yes\n",
" 339 7 | b = no\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"###3) Dependency Features "
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####I am making use of the Stanford Dependency Parser for this assignment.\n",
"\n",
"####Version : stanford-parser-full-2015-01-30"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"The parser operates on the complete text, so the first step is to preprocess it: I split each paragraph into its constituent sentences."
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"###Before forming the final features, I run all the sentences through the Stanford dependency parser via the shell script command `./lexparser.sh file_name`. This produces output in the format below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"(ROOT\n",
" (S\n",
" (NP (NNP David) (NNP Maxwell))\n",
" (VP (VBD was)\n",
" (ADJP (VBN educated)\n",
" (PP\n",
" (PP (IN at)\n",
" (NP\n",
" (NP (NNP Eton) (NNP College))\n",
" (, ,)\n",
" (SBAR\n",
" (WHADVP (WRB where))\n",
" (S\n",
" (NP (PRP he))\n",
" (VP (VBD was)\n",
" (NP\n",
" (NP\n",
" (NP (DT a) (NNP King) (POS 's))\n",
" (NNS Scholar)\n",
" (CC and)\n",
" (NNS Captain))\n",
" (PP (IN of)\n",
" (NP (NNP Boats)))))))))\n",
" (, ,)\n",
" (CC and)\n",
" (PP (IN at)\n",
" (NP (NNP Cambridge) (NNP University)))))\n",
" (SBAR\n",
" (WHADVP (WRB where))\n",
" (S\n",
" (NP (PRP he))\n",
" (VP (VBD rowed)\n",
" (PP (IN in)\n",
" (NP\n",
" (NP (DT the) (VBG winning) (NNP Cambridge) (NN boat))\n",
" (PP (IN in)\n",
" (NP (DT the) (CD 1971)\n",
" (CC and)\n",
" (CD 1972) (NNP Boat) (NN Races)))))))))\n",
" (. .)))\n",
"\n",
"nn(Maxwell-2, David-1)\n",
"nsubjpass(educated-4, Maxwell-2)\n",
"auxpass(educated-4, was-3)\n",
"root(ROOT-0, educated-4)\n",
"nn(College-7, Eton-6)\n",
"prep_at(educated-4, College-7)\n",
"advmod(Scholar-15, where-9)\n",
"nsubj(Scholar-15, he-10)\n",
"cop(Scholar-15, was-11)\n",
"det(King-13, a-12)\n",
"poss(Scholar-15, King-13)\n",
"rcmod(College-7, Scholar-15)\n",
"rcmod(College-7, Captain-17)\n",
"conj_and(Scholar-15, Captain-17)\n",
"prep_of(Scholar-15, Boats-19)\n",
"nn(University-24, Cambridge-23)\n",
"prep_at(educated-4, University-24)\n",
"conj_and(College-7, University-24)\n",
"advmod(rowed-27, where-25)\n",
"nsubj(rowed-27, he-26)\n",
"advcl(educated-4, rowed-27)\n",
"det(boat-32, the-29)\n",
"amod(boat-32, winning-30)\n",
"nn(boat-32, Cambridge-31)\n",
"prep_in(rowed-27, boat-32)\n",
"det(1971-35, the-34)\n",
"prep_in(boat-32, 1971-35)\n",
"num(Races-39, 1972-37)\n",
"nn(Races-39, Boat-38)\n",
"prep_in(boat-32, Races-39)\n",
"conj_and(1971-35, Races-39)\n",
"\n",
"(ROOT\n",
" (NP\n",
" (NP ($ $)\n",
" (QP ($ $) (CD $) (CD $)))\n",
" (ADJP ($ $)\n",
" (QP ($ $) (CD $)))\n",
" (. .)))\n",
"\n",
"root(ROOT-0, $-1)\n",
"num($-1, $-2)\n",
"number($-4, $-3)\n",
"num($-2, $-4)\n",
"amod($-1, $-5)\n",
"dep($-5, $-6)\n",
"num($-6, $-7)"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"Each instance is delimited by 6 $ symbols, and the final training output file produced by the parser is 29.5 MB in size."
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"###For the dependency features I have also made use of the Jython wrapper provided by Viktor Pekar"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"##Feature 1 - Number of Verbs between the Person and the Institution\n",
"\n",
"This feature first considers all the nodes on the dependency path from the Person to the Institution. It then counts the verbs on this path. I observed that the more verbs there are on the path, the greater the chance that there is no relation between the two."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Copied from Jython file\n",
"\n",
"def get_num_common_verb(self, node_i_idx, node_j_idx):\n",
"    # Count the verb-tagged nodes on the dependency path between two nodes.\n",
"    common_node = None\n",
"    path1 = self.path2root(node_i_idx)\n",
"    path2 = self.path2root(node_j_idx)\n",
"    count = 0\n",
"\n",
"    # The first node of path1 that also lies on path2 is where the two paths meet.\n",
"    for idx_i in path1:\n",
"        if idx_i in path2:\n",
"            common_node = idx_i\n",
"            break\n",
"\n",
"    if common_node is not None:\n",
"        # Walk up from node i, counting verbs up to and including the common node.\n",
"        for idx_i in path1:\n",
"            # Slicing with [:1] avoids an IndexError when a node has no tag.\n",
"            if self.tag.get(idx_i, '')[:1] in ('V', 'v'):\n",
"                count += 1\n",
"            if idx_i == common_node:\n",
"                break\n",
"        # Walk up from node j, stopping before the common node so it is counted once.\n",
"        for idx_i in path2:\n",
"            if idx_i == common_node:\n",
"                break\n",
"            if self.tag.get(idx_i, '')[:1] in ('V', 'v'):\n",
"                count += 1\n",
"    return count"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"##Feature 2 - Whether the person is the subject of the sentence\n",
"\n",
"This feature checks whether the person is the subject of the sentence. The Jython Stanford package exposes the dependency relation of each node; the relations 'nsubj' (nominal subject) and 'nsubjpass' (passive nominal subject) mark a subject."
]
},
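{
"cell_type": "code",
"collapsed": false,
"input": [
"#Copied from Jython file\n",
"\n",
"def is_a_nsubjpass(self, node_idx):\n",
"    # 'yes' if the node is an active (nsubj) or passive (nsubjpass) subject, else 'no'.\n",
"    # .get() treats a node without a recorded relation as 'no' instead of raising KeyError.\n",
"    if self.rel.get(node_idx) in ('nsubj', 'nsubjpass'):\n",
"        return 'yes'\n",
"    else:\n",
"        return 'no'"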
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"##Feature 3 - Path length from the Person and the Institution to their common node (usually a verb)\n",
"\n",
"This feature again follows the paths of both the Person and the Institution, this time up to their common node. I observed that this node is usually a verb when there is a direct relation between the person and the institution. In cases where this failed, the person was usually not the subject of the sentence."
]
},
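{
"cell_type": "code",
"collapsed": false,
"input": [
"#Copied from Jython File\n",
"def get_least_common_node_length(self, node_i_idx, node_j_idx):\n",
"        # Returns the number of nodes collected along both root paths up to their\n",
"        # first shared node (0 if the two paths share no node).\n",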
" common_node = None\n",
" shortest_path = []\n",
" path1 = self.path2root(node_i_idx)\n",
" path2 = self.path2root(node_j_idx)\n",
"\n",
" for idx_i in path1:\n",
" if common_node != None:\n",
" break\n",
" for idx_j in path2:\n",
" if idx_i == idx_j:\n",
" common_node = idx_i\n",
" break\n",
"\n",
" if common_node != None:\n",
" for idx_i in path1:\n",
" shortest_path.append(idx_i)\n",
" if idx_i == common_node:\n",
" break\n",
" for idx_i in path2:\n",
" if idx_i == common_node:\n",
" break\n",
" shortest_path.append(idx_i)\n",
"\n",
" return len(shortest_path)"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"###A few other important code snippets used:\n",
"\n",
"import sys\n",
"sys.path.append('/home/alakshminara/NLP Assignments/Assignment2/stanford-parser.jar')\n",
"import unittest\n",
"from stanford import StanfordParser, PySentence\n",
"PARSER = StanfordParser('englishPCFG.ser.gz')"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"## Feature Generator\n",
"\n",
"I designed the feature generator in Jython and wrote the features to a CSV file. I later converted the CSV to an ARFF file in Weka itself."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Copied from Jython File\n",
"import sys\n",
"import csv\n",
"sys.path.append('/home/alakshminara/NLP Assignments/Assignment2/stanford-parser.jar')\n",
"from stanford import StanfordParser, PySentence\n",
"parser = StanfordParser('englishPCFG.ser.gz')\n",
"\n",
"# This run reads the test split and writes the test features.\n",
"DATA_PATH = \"test.tsv\"\n",
"f = open(DATA_PATH, 'r')\n",
"reader = csv.reader(f, delimiter='\\t')\n",
"k = open('NewFeaturesTest.txt', 'w')\n",
"spamwriter = csv.writer(k, delimiter='\\t')\n",
"for line in reader:\n",
"    # Text to parse: a prefix of line[2], a suffix of line[3], and line[0] at the end.\n",
"    required_line = line[2][:160] + line[3][-30:] + line[0]\n",
"    sentence = parser.parse(required_line)\n",
"    tokens = required_line.split()\n",
"\n",
"    # index1: 1-based token position of the last word of line[1] that occurs in the text (1 if none does).\n",
"    index1 = 1\n",
"    for name in line[1].split():\n",
"        if name in tokens:\n",
"            index1 = tokens.index(name) + 1\n",
"\n",
"    # index2: the last token, where the appended line[0] ends.\n",
"    index2 = len(tokens)\n",
"\n",
"    a = sentence.get_least_common_node_length(index1, index2)\n",
"    b = sentence.get_num_common_verb(index1, index2)\n",
"    # is_a_nsubjpass returns 'yes' or 'no', both truthy strings, so chaining the calls with\n",
"    # `or` always yields the first result; check the three positions explicitly instead.\n",
"    c = 'no'\n",
"    for i in (index1, index1 + 1, index1 + 2):\n",
"        if i <= index2 and sentence.is_a_nsubjpass(i) == 'yes':\n",
"            c = 'yes'\n",
"    print line[0]\n",
"    spamwriter.writerow([a, b, c])\n",
"\n",
"f.close()\n",
"k.close()"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Snapshot of Features\n",
"\n",
"AA\tAB\tAC\tTarget\n",
"5\t3\tyes\tyes\n",
"4\t3\tno\tyes\n",
"7\t3\tyes\tno\n",
"3\t2\tyes\tno\n",
"6\t1\tyes\tyes\n",
"0\t0\tyes\tyes\n",
"5\t3\tyes\tyes\n",
"6\t3\tyes\tyes\n",
"6\t3\tyes\tyes\n",
"7\t4\tyes\tyes\n",
"0\t0\tno\tyes\n",
"0\t0\tyes\tno\n",
"0\t0\tno\tyes\n",
"7\t0\tno\tyes\n",
"0\t0\tyes\tyes\n",
"3\t1\tyes\tyes\n",
"4\t2\tyes\tyes"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"=== Run information ===\n",
"\n",
"Scheme: weka.classifiers.functions.LibLINEAR -S 1 -C 1.0 -E 0.001 -B 1.0\n",
"Relation: FeaturesTrain\n",
"Instances: 6000\n",
"Attributes: 4\n",
" AA\n",
" AB\n",
" AC\n",
" AD\n",
"Test mode: evaluate on training data\n",
"\n",
"=== Classifier model (full training set) ===\n",
"\n",
"LibLINEAR wrapper\n",
"\n",
"Time taken to build model: 10.02 seconds\n",
"\n",
"=== Evaluation on training set ===\n",
"\n",
"Time taken to test model on training data: 0 seconds\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 3918 65.3333 %\n",
"Incorrectly Classified Instances 2082 34.6667 %\n",
"Kappa statistic 0.1975\n",
"Mean absolute error 0.3467\n",
"Root mean squared error 0.5888\n",
"Relative absolute error 72.1441 %\n",
"Root relative squared error 120.1834 %\n",
"Coverage of cases (0.95 level) 65.3333 %\n",
"Mean rel. region size (0.95 level) 50 %\n",
"Total Number of Instances 6000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.911 0.733 0.651 0.911 0.759 0.238 0.589 0.646 yes\n",
" 0.267 0.089 0.667 0.267 0.381 0.238 0.589 0.471 no\n",
"Weighted Avg. 0.653 0.476 0.657 0.653 0.608 0.238 0.589 0.576 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 3280 320 | a = yes\n",
" 1760 640 | b = no\n",
"\n",
"\n",
"=== Re-evaluation on test set ===\n",
"\n",
"User supplied test set\n",
"Relation: FeaturesTest\n",
"Instances: unknown (yet). Reading incrementally\n",
"Attributes: 4\n",
"\n",
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 710 71 %\n",
"Incorrectly Classified Instances 290 29 %\n",
"Kappa statistic -0.1644\n",
"Mean absolute error 0.2941\n",
"Root mean squared error 0.5423\n",
"Coverage of cases (0.95 level) 71 %\n",
"Total Number of Instances 1000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.857 1.000 0.800 0.857 0.828 -0.169 0.429 0.803 yes\n",
" 0.000 0.143 0.000 0.000 0.000 -0.169 0.429 0.176 no\n",
"Weighted Avg. 0.706 0.849 0.659 0.706 0.682 -0.169 0.429 0.693 \n",
"\n",
"=== Confusion Matrix ===\n",
"\n",
" a b <-- classified as\n",
" 635 131 | a = yes\n",
" 165 69 | b = no\n",
"\n"
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"##Kitchen Sink"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"Combining the regular-expression, clustering (both bag-of-words and Brown clustering), and dependency features, the following output is obtained in Weka."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Weka gives a case-by-case output for the clustering\n",
"\n",
"\n",
" inst# actual predicted error prediction\n",
" 1 1:yes 1:yes 1 \n",
" 2 2:no 1:yes + 1 \n",
" 3 1:yes 1:yes 1 \n",
" 4 2:no 1:yes + 1 \n",
" 5 2:no 1:yes + 1 \n",
" 6 2:no 1:yes + 1 \n",
" 7 2:no 1:yes + 1 \n",
" 8 1:yes 1:yes 1 \n",
" 9 1:yes 2:no + 1 \n",
" 10 2:no 1:yes + 1 \n",
" 11 1:yes 1:yes 1 \n",
" 12 1:yes 1:yes 1 \n",
" 13 1:yes 1:yes 1 \n",
" 14 2:no 2:no 1 \n",
" 15 1:yes 1:yes 1 \n",
" 16 2:no 1:yes + 1 \n",
" 17 1:yes 1:yes 1 \n",
" 18 1:yes 1:yes 1 \n",
" 19 1:yes 1:yes 1 \n",
" 20 1:yes 1:yes 1 \n",
" 21 2:no 1:yes + 1 \n",
" 22 1:yes 1:yes 1 \n",
" 23 1:yes 1:yes 1 \n",
" 24 2:no 1:yes + 1 \n",
" 25 1:yes 1:yes 1 \n",
" 26 1:yes 1:yes 1 \n",
" 27 1:yes 1:yes 1 \n",
" 28 1:yes 1:yes 1 \n",
" 29 1:yes 1:yes 1 \n",
" 30 2:no 1:yes + 1 \n",
" 31 2:no 1:yes + 1 "
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Final Training Set Format:\n",
"\n",
"feature 1\tfeature 2\tfeature 3\tRegular Expression\tClustering\tTarget\n",
"5\t2\tno\tyes\tyes\tyes\n",
"6\t3\tno\tyes\tyes\tno\n",
"0\t0\tno\tno\tyes\tyes\n",
"0\t0\tno\tyes\tyes\tno\n",
"5\t3\tyes\tyes\tyes\tno\n",
"5\t1\tyes\tyes\tyes\tno\n",
"5\t0\tyes\tno\tyes\tno\n",
"8\t4\tyes\tno\tyes\tyes\n",
"0\t0\tyes\tyes\tno\tyes\n",
"7\t3\tyes\tno\tyes\tno\n",
"4\t3\tyes\tyes\tyes\tyes\n",
"5\t3\tyes\tno\tyes\tyes\n",
"4\t0\tno\tno\tyes\tyes\n",
"5\t2\tyes\tyes\tno\tno\n",
"3\t3\tyes\tyes\tyes\tyes\n",
"0\t0\tyes\tyes\tyes\tno\n",
"4\t3\tyes\tno\tyes\tyes\n",
"0\t0\tyes\tno\tyes\tyes\n",
"0\t0\tyes\tyes\tyes\tyes\n",
"0\t0\tno\tno\tyes\tyes\n",
"4\t1\tyes\tyes\tyes\tno\n",
"6\t3\tyes\tno\tyes\tyes\n",
"0\t0\tno\tno\tyes\tyes\n",
"6\t3\tyes\tyes\tyes\tno\n",
"0\t0\tyes\tyes\tyes\tyes\n",
"0\t0\tno\tyes\tyes\tyes\n",
"0\t0\tno\tyes\tyes\tyes\n",
"6\t3\tno\tyes\tyes\tyes\n",
"4\t3\tyes\tyes\tyes\tyes\n",
"5\t3\tno\tyes\tyes\tno\n",
"4\t2\tyes\tno\tyes\tno\n",
"\n",
" \n",
" "
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"=== Summary ===\n",
"\n",
"Correctly Classified Instances 764 76.4 %\n",
"Incorrectly Classified Instances 236 23.6 %\n",
"Kappa statistic -0.0968\n",
"Mean absolute error 0.2353\n",
"Root mean squared error 0.4851\n",
"Coverage of cases (0.95 level) 76.4 %\n",
"Total Number of Instances 1000 \n",
"\n",
"=== Detailed Accuracy By Class ===\n",
"\n",
" TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class\n",
" 0.929 1.000 0.813 0.929 0.867 -0.116 0.464 0.813 yes\n",
" 0.000 0.071 0.000 0.000 0.000 -0.116 0.464 0.176 no\n",
"Weighted Avg. 0.765 0.836 0.669 0.765 0.714 -0.116 0.464 0.701 "
],
"language": "python",
"metadata": {
},
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"####We observe that the kitchen sink gave the best result in the end: 76.4% accuracy on the test set, compared with 71% for the dependency features alone. The reason might be that, taken together, most of the feature sets tend to point to the correct output, whereas each one taken individually is weak on some cases, which lowers its accuracy.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
],
"language": "python",
"metadata": {
},
"outputs": [
]
}
],
"metadata": {
}
}]
}