@ravenscroftj
Created December 11, 2015 16:32
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Citation classifier excite\n",
"\n",
"\n",
"GIT repo: `git clone git@nopro.be:james/excite.git` \n",
"\n",
"Implementation of an LSTM that classifies citation strings into regions denoting author, article title, journal, volume, year, pages, DOI, notes and several other classes. There are 13 classes in total.\n",
"\n",
"## Preparing Data regions\n",
"\n",
"The following function takes citations that have been annotated and builds a mapping from characters to classes. Since neural networks are purely numerical constructs, we also create an alphabet that maps each letter to a numerical index in a vector.\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import lxml.etree as ET\n",
"from collections import Counter\n",
"\n",
"def prepare_data(citefile):\n",
"\n",
"    #a list of all citations in training data\n",
"    citations = []\n",
"    #letters is our alphabet, used to map alphanumeric chars to integers\n",
"    letters = set()\n",
"    #counts how often each class tag appears in the training data\n",
"    classes = Counter()\n",
"\n",
"    with open(citefile) as f:\n",
"\n",
"        for line in f:\n",
"            #wrap each example line in a citation element so that we can parse it as an xml doc\n",
"            root = ET.XML(\"<citation>\" + line.replace(\"&\",\"&amp;\") + \"</citation>\")\n",
"            cite = \"\"\n",
"            regions = []\n",
"\n",
"            #iterate over child elements of our citation doc\n",
"            for el in root.iterchildren():\n",
"                classes[el.tag] += 1\n",
"                regions.append( (el.tag, len(cite)) )\n",
"                cite += el.text.replace(\"&amp;\",\"&\")\n",
"\n",
"            #letters is a set so we union with the new citation string to collect unique chars\n",
"            letters = letters.union( set(cite) )\n",
"            citations.append( (cite, regions) )\n",
"\n",
"    return citations, sorted(classes.keys()), sorted(letters)"
]
},
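To make the wrap-and-parse trick concrete, here is a sketch of what `prepare_data` does to a single annotated line. It uses the standard library's `xml.etree` rather than the `lxml` used in the gist, and the example line is made up:

```python
import xml.etree.ElementTree as ET

# a single annotated example line, in the same shape as the training file (hypothetical)
line = "<author>J. Smith</author><title>An Example</title>"

# wrap the line in a citation element so it parses as a well-formed document
root = ET.XML("<citation>" + line + "</citation>")

cite = ""
regions = []
for el in root:
    # record which class starts at which character offset
    regions.append((el.tag, len(cite)))
    cite += el.text

print(cite)     # J. SmithAn Example
print(regions)  # [('author', 0), ('title', 8)]
```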
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this function to build training and test sets. Here we import the data from the attached training file:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total classes found: 13\n",
"Classes: ['author', 'booktitle', 'date', 'editor', 'institution', 'journal', 'location', 'note', 'pages', 'publisher', 'tech', 'title', 'volume']\n"
]
}
],
"source": [
"#extract training data from examples\n",
"citations, classes, alphabet = prepare_data(\"excite/data/citeseerx.tagged.txt\")\n",
"\n",
"# Find out how many classes there were in the training data as a sanity check\n",
"print(\"Total classes found: {}\".format(len(classes)))\n",
"print(\"Classes:\", classes)"
]
},
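The notebook never actually performs the split into training and test sets; a minimal sketch is below (the `split_data` name, the 80/20 ratio and the fixed seed are our own choices, not from the gist):

```python
import random

def split_data(citations, ratio=0.8, seed=42):
    # shuffle a copy and hold out the last (1 - ratio) of examples for testing
    rng = random.Random(seed)
    shuffled = list(citations)
    rng.shuffle(shuffled)
    cut = int(ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, test = split_data(range(10))
print(len(train), len(test))  # 8 2
```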
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Architecture of Network\n",
"\n",
"Now that we have an idea of the number of classes we can start to plan network architecture.\n",
"\n",
"### Network input\n",
"\n",
"We use a 5-character context window over our time series, which essentially means sliding the window along one character at a time. We need a context window function, $c(s,t)$, which, given an input citation $s$ and an offset $t$, produces the network input $x$.\n",
"\n",
"So you might expect the following:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x for t=1: ['t', 'h', 'i', 's', ' ']\n",
"x for t=2: ['h', 'i', 's', ' ', 'i']\n"
]
}
],
"source": [
"#given a citation string that looks like this\n",
"citation = \"this is an example\"\n",
"\n",
"x_t_1 = list(citation[0:5])\n",
"\n",
"print (\"x for t=1: \", x_t_1)\n",
"\n",
"x_t_2 = list(citation[1:6])\n",
"\n",
"print (\"x for t=2: \", x_t_2)"
]
},
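The two slices above generalise to a small helper. This is one possible sketch of the context function $c(s,t)$; the name `context_window` and the 1-based `t` convention are our own, not from the gist:

```python
def context_window(s, t, width=5):
    # the t-th window starts at character offset t-1 (t is 1-based, as in the text above)
    return list(s[t-1:t-1+width])

citation = "this is an example"
print(context_window(citation, 1))  # ['t', 'h', 'i', 's', ' ']
print(context_window(citation, 2))  # ['h', 'i', 's', ' ', 'i']
```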
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem is that neural networks are purely numerical, so strings must be encoded as numbers before they can be passed in. We therefore use the alphabet collected above to map our $x$ values to something more RNN-friendly."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Encoded x for t=1 [73 61 62 72 0]\n",
"Encoded x for t=2 [61 62 72 0 62]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"def amap(letters):\n",
"    return np.array([ alphabet.index(x) for x in letters ])\n",
"\n",
"print( \"Encoded x for t=1\", amap(x_t_1) )\n",
"print( \"Encoded x for t=2\", amap(x_t_2) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Network output\n",
"\n",
"The output will be a vector of values between 0 and 1, as wide as the number of classes.\n",
"\n",
"The aim is to use backpropagation through time to train the network so that each input produces an output of all zeros except for a one at the index of the correct class."
]
},
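The target described above can be sketched as a one-hot vector over the 13 classes printed earlier (the `one_hot` helper is our own, not part of the gist):

```python
import numpy as np

# the 13 classes found in the training data, as printed earlier in the notebook
classes = ['author', 'booktitle', 'date', 'editor', 'institution', 'journal',
           'location', 'note', 'pages', 'publisher', 'tech', 'title', 'volume']

def one_hot(class_name, classes):
    # target vector: all zeros except a one at the index of the correct class
    y = np.zeros(len(classes))
    y[classes.index(class_name)] = 1.0
    return y

print(one_hot('journal', classes))
```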
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import theano\n",
"import theano.tensor as T\n",
"\n",
"#we set up the various layers of the network \n",
"\n",
"#the input layer is just a vector - yes it is 5 wide but we don't care about this yet\n",
"L_in = T.vector('x_in')\n",
"\n",
"# States is a vector of memory unit values\n",
"S_lstm = T.vector('S_lstm')\n",
"\n",
"#output layer is a vector of values which will eventually be len(classes) wide\n",
"L_out = T.vector('y_out')\n",
"\n",
"# set up all the weights - these are matrices whose values weight the connections between layers\n",
"# W[l,m] is the weight of the connection from unit m to unit l\n"
]
},
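The weight matrices themselves are never created in the cell above. Theano aside, the shape bookkeeping implied by the `W[l,m]` convention can be sketched in NumPy (the hidden size of 50 is an assumption):

```python
import numpy as np

rng = np.random.RandomState(42)

# sizes: 5-char context window in, 13 classes out; the hidden size is an assumption
n_in, n_hidden, n_out = 5, 50, 13

# W[l, m] weights the connection from unit m to unit l, so shapes are (to, from)
W_in = rng.uniform(-0.1, 0.1, (n_hidden, n_in))
W_out = rng.uniform(-0.1, 0.1, (n_out, n_hidden))

print(W_in.shape, W_out.shape)  # (50, 5) (13, 50)
```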
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Forget gate value:\n",
"\n",
"$$y_{\\varphi_j}(t)= f_{\\varphi_j}(z_{\\varphi_j}(t))$$\n",
"\n",
"$$ z_{\\varphi_j}(t)=\\sum_m w_{\\varphi_j m} y_m (t-1)$$\n",
"\n",
"Initial cell state is zero \n",
"\n",
"$$ s_{c^v_j} (0) = 0 $$\n",
"\n",
"If $ t \\gt 0$ then it is calculated like so:\n",
"\n",
"$$s_{c^v_j} (t) = y_{\\varphi_j}(t) s_{c^v_j} (t-1) + y_{in_j}(t) g(z_{c^v_j}(t)) $$\n",
"\n",
"Note that the gate activations $y_{\\varphi_j}(t)$ and $y_{in_j}(t)$ enter this value as multiplicative factors.\n",
"\n",
"The previous state value remains inside the CEC as a component of $s_{c^v_j} (t)$ provided that $y_{\\varphi_j}(t) \\approx 1$."
]
},
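The recurrence above can be checked numerically. This sketch assumes the logistic sigmoid for the gate activation $f$ and $\tanh$ for $g$ (common choices, but assumptions here); the function names are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cell_state(s_prev, z_phi, z_in, z_c, g=np.tanh):
    # s(t) = y_phi(t) * s(t-1) + y_in(t) * g(z_c(t))
    y_phi = sigmoid(z_phi)   # forget gate activation
    y_in = sigmoid(z_in)     # input gate activation
    return y_phi * s_prev + y_in * g(z_c)

# with the forget gate saturated near 1 and the input gate near 0,
# the previous state passes through the CEC almost unchanged
print(cell_state(s_prev=1.0, z_phi=100.0, z_in=-100.0, z_c=0.5))
```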
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}