act65/deeplearning.ipynb

## deeplearning.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import functools\n",
    "import itertools\n",
    "\n",
    "import numpy as np\n",
    "import sklearn.datasets\n",
    "import sklearn.preprocessing\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Layers and activation functions\n",
    "\n",
    "The core idea behind deep learning is that \"deep\" functions should be built using simpler, parameterised, differentiable functions (aka layers). So, let's define some simple layers we can work with.\n",
    "\n",
    "Each layer needs a forward method -- `__call__` and a method to calculate the gradients of the outputs w.r.t the inputs (evaluated at a given input) -- `grad`. They also need to save their inputs/activations so we can calculate their gradients at a given input later.\n",
    "\n",
    "#### Automatic differentiation\n",
    "\n",
    "So how can we actually calculate the derivatives of our layers? We can use symbolic differentiation to do the work for us. Done naively this can be quite expensive.\n",
    "\n",
    "Automatic differentiation is an efficient way to calculate the derivatives of complex functions coposed of simpler ones. More on this later.\n",
    "\n",
    "#### Matrix calculus\n",
    "\n",
    "When we take the derivative of a function with respect to its inputs we are asking; how (much) do changes in _each_ input effect _each_ of the outputs. Thus `grad` should return an array of shape `[n_inputs, batch_size, n_outputs, batch_size]`.\n",
    "\n",
    "However, we can flatten across the 2nd batch dimension as we expect batches to be computed independently of each other, thus the `[batch_size , batch_size]` dimensions will be diagonal and flattening across one of them preserves all the info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class Sigmoid():\n",
    "    def __call__(self, x):\n",
    "        self.x = x\n",
    "        return 1/(1+np.exp(-x))\n",
    "    \n",
    "    def grad(self, x=None):\n",
    "        if x is None:\n",
    "            x = self.x\n",
    "            \n",
    "        y = self.__call__(x)\n",
    "        g = np.zeros((x.shape[1], x.shape[0], y.shape[1]))\n",
    "        \n",
    "        # sigmoids act element wise, thus have a diagonal\n",
    "        # jacobian/grad\n",
    "        i, j = np.diag_indices(x.shape[1])\n",
    "        g[i, :, j] = (y*(1-y)).T  \n",
    "        return g\n",
    "    \n",
    "class Linear():\n",
    "    def __init__(self, shape):\n",
    "        # this layer actually has some state.\n",
    "        # its parameters (which we will train later)\n",
    "        self.weights = (2/(shape[0]+shape[1]))*np.random.standard_normal(shape)\n",
    "        self.biases = np.random.standard_normal((1,shape[-1]))\n",
    "        \n",
    "    def __call__(self, x):\n",
    "        self.x = x  \n",
    "        return np.dot(x, self.weights) + self.biases\n",
    "\n",
    "    def grad(self, x=None):\n",
    "        if x is None:\n",
    "            x = self.x\n",
    "        return np.tile(np.expand_dims(self.weights, axis=1), (1, x.shape[0], 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Empirical differentiation\n",
    "\n",
    "Cool, now that we have some layers, but we would really like to check that we have got them right. We can use an empirical estimate of the gradients and compare them to our `grad`s.\n",
    "\n",
    "Since a gradient asks how each input effects each output, we can literally just change each input and observe the change in each output. This is known as a finite difference approximation, and as the change in inputs approaches zero, the accuracy of this estimate approaches the true derivative.\n",
    "\n",
    "$$\n",
    "\\lim_{h\\rightarrow 0}\\frac{\\partial f(x)}{\\partial x} = \\frac{f(x+h) - f(x)}{h}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def finite_difference(func, inputs, direction, epsilon=1e-8):\n",
    "    \"\"\"\n",
    "    Calculate a finite difference on the first argument of\n",
    "    `func` according to the direction vector.\n",
    "    \"\"\"\n",
    "    return (func(inputs[0]+epsilon*direction, *inputs[1:]) - func(*inputs))/epsilon\n",
    "\n",
    "def empirical_gradient(func, inputs, targets=None):\n",
    "    \"\"\"\n",
    "    Do our gradients match finite difference approximations?\n",
    "    \n",
    "    Args:\n",
    "        func (callable):  The target function. `func: inputs, targets -> outputs`\n",
    "        inputs (np.array): an array of inputs [batch_size, ...]\n",
    "        targets optional(np.array): an array of targets [batch_size, ...]\n",
    "            \n",
    "    Returns:\n",
    "        the gradients of `func` w.r.t inputs summed over the batch.\n",
    "    \"\"\" \n",
    "    grads = [None]*inputs.shape[0]\n",
    "    # for each element of the batch\n",
    "    for i, x in enumerate(inputs):\n",
    "        grad = [None]*inputs.shape[-1]\n",
    "        x = x.reshape((1, inputs.shape[-1]))\n",
    "        \n",
    "        # for each input variable\n",
    "        for j in range(x.shape[-1]):\n",
    "            # create a direction to perturb along\n",
    "            direction = np.zeros(x.shape)\n",
    "            direction[0,j] = 1\n",
    "            \n",
    "            # finite difference to approximate\n",
    "            if targets is None:\n",
    "                grad[j] = finite_difference(func, [x], direction)\n",
    "            else:\n",
    "                grad[j] = finite_difference(func, [x, targets[i, :]], direction)\n",
    "\n",
    "        grads[i] = grad\n",
    "    grads = np.array(grads)\n",
    "    grads = np.squeeze(grads)\n",
    "    grads = np.transpose(grads, [1, 0, 2])\n",
    "    return grads\n",
    "\n",
    "def sanity_check(fn, *args):\n",
    "    \"\"\"\n",
    "    Check that the layers works as expected.\n",
    "    Estimate their symbolic and empirical gradients\n",
    "    and print the difference.\n",
    "    \"\"\"\n",
    "    print('### {}'.format(fn.__class__.__name__))\n",
    "    print('Output shape {}'.format(fn(*args).shape))\n",
    "    print('Grad shape {}'.format(fn.grad(*args).shape))\n",
    "    empirical = empirical_gradient(fn, *args)\n",
    "    symbolic = fn.grad(*args)\n",
    "    difference = empirical - symbolic\n",
    "    print('Empirical: {:.3f} {} Symbolic : {:.3f} {}'.format(\n",
    "        np.mean(empirical), empirical.shape, np.mean(symbolic), symbolic.shape))\n",
    "    print('Difference: {}'.format(np.mean(np.abs(difference))))\n",
    "    print('\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "batch_size = 50\n",
    "n_inputs = 64\n",
    "n_classes = 10\n",
    "\n",
    "# some variables for testing with\n",
    "x = np.random.random((batch_size, n_inputs))\n",
    "y = np.random.random((batch_size, n_classes))\n",
    "T = np.random.randint(0, 2, (batch_size, n_classes)).astype(np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Linear\n",
      "Output shape (50, 10)\n",
      "Grad shape (64, 50, 10)\n",
      "Empirical: -0.001 (64, 50, 10) Symbolic : -0.001 (64, 50, 10)\n",
      "Difference: 3.2874463269992963e-09\n",
      "\n",
      "\n",
      "### Sigmoid\n",
      "Output shape (50, 64)\n",
      "Grad shape (64, 50, 64)\n",
      "Empirical: 0.004 (64, 50, 64) Symbolic : 0.004 (64, 50, 64)\n",
      "Difference: 7.781569369080744e-11\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sanity_check(Linear((64, 10)), x)\n",
    "sanity_check(Sigmoid(), x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1: Implement some new layers\n",
    "\n",
    "There are many types of 'simple' layer that can be used, it just needs to be differentiable. You have been started with a few above. \n",
    "\n",
    "__Implement__ some activation functions;\n",
    "* rectified linear unit,\n",
    "* elu,\n",
    "* and an activation function of your own invention.\n",
    "\n",
    "_(what makes a good activation function?)._\n",
    "\n",
    "__Implement__ softmax for classification (layer). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class ReLU():\n",
    "    def __call__(self, x):\n",
    "        self.x = x\n",
    "        return np.maximum(0, x)\n",
    "        \n",
    "    def grad(self, x=None):\n",
    "        if x is None:\n",
    "            x = self.x\n",
    "        \n",
    "        g = np.zeros((x.shape[1], x.shape[0], x.shape[1]))\n",
    "        i, j = np.diag_indices(x.shape[1])\n",
    "        g[i, :, j] = ((x>0).astype(np.float32)).T\n",
    "        return g\n",
    "    \n",
    "class Softmax():\n",
    "    def __call__(self, x):\n",
    "        self.x = x\n",
    "        return np.exp(x)/np.sum(np.exp(x), axis=1, keepdims=True)\n",
    "        \n",
    "    def grad(self, x=None):\n",
    "        if x is None:\n",
    "            x = self.x\n",
    "        y = self.__call__(x)\n",
    "        \n",
    "        g = np.zeros((x.shape[1], x.shape[0], y.shape[1]))\n",
    "        i, j = np.diag_indices(x.shape[1])\n",
    "        g[:, :, :] = -np.expand_dims(y, 0)*np.expand_dims(y.T, -1)\n",
    "        g[i, :, j] = y.T*(1-y.T)\n",
    "        return g"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Softmax\n",
      "Output shape (50, 64)\n",
      "Grad shape (64, 50, 64)\n",
      "Empirical: -0.000 (64, 50, 64) Symbolic : -0.000 (64, 50, 64)\n",
      "Difference: 1.1914430442511248e-10\n",
      "\n",
      "\n",
      "### ReLU\n",
      "Output shape (50, 64)\n",
      "Grad shape (64, 50, 64)\n",
      "Empirical: 0.016 (64, 50, 64) Symbolic : 0.016 (64, 50, 64)\n",
      "Difference: 4.249758826313857e-11\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sanity_check(Softmax(), x)\n",
    "sanity_check(ReLU(), x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deep networks\n",
    "\n",
    "Ok, now that we have some working layers we can compose them together to build 'deeper' functions. What is the point of this? We want to construct a set of functions that are able to approximate relatively arbitrary target functions of interest. By controlling the type of layers, how many of them, etc... we can control how well our network can approximate certain (more or less complex) functions. \n",
    "\n",
    "Assuming these layers are linear transformations followed by element-wise non-linearities, this stack of layers is typically refered to as a neural network.\n",
    "\n",
    "#### Topology\n",
    "\n",
    "There are many different ways to compose these layers together (connection topologies), however, here we are only going to consider the simplest case, feed-forward architectures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input shape: (50, 64), Output shape: (50, 10)\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "Now we want to define a neural network that takes inputs to \n",
    "f: x -> y.\n",
    "\"\"\"\n",
    "layers = [\n",
    "          Linear((n_inputs, 30)),\n",
    "          Sigmoid(),\n",
    "          Linear((30, 20)),\n",
    "          Sigmoid(),\n",
    "          Linear((20, n_classes)),\n",
    "          Softmax()\n",
    "         ]\n",
    "\n",
    "def compose(f1, f2):\n",
    "    # nested function composition\n",
    "    return lambda x: f2(f1(x))\n",
    "\n",
    "# you may be interested in `reduce` and `accumulate` \n",
    "# if you haven't come across them before\n",
    "forward_prop = functools.reduce(compose, layers)\n",
    "logits = forward_prop(x)\n",
    "print('Input shape: {}, Output shape: {}'.format(x.shape, logits.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning\n",
    "\n",
    "Cool, we can construct classes of functions, but how can we find the one that accurately approximates our target function? This is known as learning. A few problems arise if we want to train these deep functions.\n",
    "\n",
    "* How can we estimate how well our approximation is performing and get useful __feedback__?\n",
    "* Given some feedback how can we __assign credit__ to layers?\n",
    "* Once we know which layers were responsible for the feedback we recieved, how should we __update__ them?\n",
    "\n",
    "## Feedback\n",
    "\n",
    "All learning requires that we have some measure of how well we are doing, the loss, or risk, or cost. Using this metric we can get feedback by asking how well we have done on some data. In deep learning we use metrics that are differentiable so we can ask them: how we can we change our outputs to have done better.\n",
    "\n",
    "We want to estimate $\\frac{\\partial \\mathcal L}{\\partial y}$, effectively this asks: how should our outputs have been different to achieve a lower loss, to do a better job?\n",
    "\n",
    "<!-- So, we want to estimate the gradient of the loss with respect to outputs. -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class CrossEntropy():\n",
    "    def __call__(self, y, T):\n",
    "        self.y = y\n",
    "        self.T = T\n",
    "        return -T*np.log(y+1e-6)\n",
    "\n",
    "    def grad(self, y=None, T=None):\n",
    "        if y is None:\n",
    "            y = self.y\n",
    "        if T is None:\n",
    "            T = self.T\n",
    "            \n",
    "        g = np.zeros((y.shape[1], y.shape[0], T.shape[1]))\n",
    "        i, j = np.diag_indices(y.shape[1])\n",
    "        g[i, :, j] = (-T/(y+1e-6)).T\n",
    "        return g"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### CrossEntropy\n",
      "Output shape (50, 10)\n",
      "Grad shape (10, 50, 10)\n",
      "Empirical: -1.582 (10, 50, 10) Symbolic : -1.582 (10, 50, 10)\n",
      "Difference: 3.306297476259714e-05\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Check that cross entropy works as expected\n",
    "ce = CrossEntropy()\n",
    "# hold T constant over the batch\n",
    "t = np.random.randint(0, 2, (1, n_classes)).astype(np.float32)\n",
    "T = np.concatenate([t for n in range(batch_size)], axis=0)\n",
    "\n",
    "sanity_check(CrossEntropy(), y, T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regularisers\n",
    "\n",
    "In addition of a metric of loss, we may also have other beliefs about what we want.\n",
    "\n",
    "__Implement__ Weight decay."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class WeightDecay():\n",
    "    def __call__(self, parameters):\n",
    "        self.parameters = parameters\n",
    "        return 0.5*np.sum(np.square(self.parameters))\n",
    "    \n",
    "    def grad(self, parameters=None):\n",
    "        if parameters is None:\n",
    "            self.parameters = parameters\n",
    "        return np.expand_dims(parameters, -1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check that weight decay works as expected\n",
    "wd = WeightDecay()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assigning credit\n",
    "\n",
    "Awesome, we have some feedback with respect to the output of our approximation, but how did each layer contribute to that feedback? We want to get feedback for each layer (which we use to update our parameters).  \n",
    "\n",
    "#### Chain rule\n",
    "\n",
    "We want to propagate feedback to functions nested in our heirarchy. A simple nested function is $\\frac{d}{dx} f(g(x))$. But how can we differentiate this with respect to $x$? We can take a linear approximation of $f, g$, at $x$ and use that estimate $\\frac{d}{dx} f(g(x))$.\n",
    "\n",
    "So if we let $u = g(x)$ and $y = f(u)$ then we can calculate $\\frac{dy}{dx} = \\frac{dy}{du}\\cdot\\frac{du}{dx}$. Aka the [chain rule](https://en.wikipedia.org/wiki/Chain_rule).\n",
    "\n",
    "<!--\n",
    "Note $\\sum \\frac{dL}{dy} \\frac{dy}{dz} \\neq \\sum \\frac{dL}{dy} \\sum \\frac{dy}{dz}$ -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def chain_rule(x, y, reverse=False):\n",
    "    \"\"\"\n",
    "    Nested function composition gives multiplication \n",
    "    of the nested function's gradients.\n",
    "    \"\"\"\n",
    "    x = evaluate(x)\n",
    "    y = evaluate(y)\n",
    "    if reverse:\n",
    "        y = y.T\n",
    "    # TODO not sure about this. the sum over axis 1 feels weird\n",
    "    return np.sum(np.tensordot(x, y, axes=(2, 0)), axis=1)  \n",
    "\n",
    "def evaluate(func):\n",
    "    \"\"\"\n",
    "    This is just a helper for partial evaluation\n",
    "    so we dont need to store all the gradients in memory.\n",
    "    we can just calculate them as needed.\n",
    "    This relates to why reverse-mode AD is more\n",
    "    efficient (in memory) than forward-mode.\n",
    "    \"\"\"\n",
    "    if not isinstance(func, np.ndarray):\n",
    "        return func()\n",
    "    else:\n",
    "        return func"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Backward propagation\n",
    "\n",
    "So we know how to propagate feedback/gradients through two functions, use the chain rule. So all that is left is to recursively apply the chain rule and propagate gradients backwards (aka reverse mode) from the gradient of the loss w.r.t outputs through our layers.\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\big[f_1, \\dots &f_L, \\mathcal L \\big] \\tag{nested fns} \\\\\n",
    "\\big[\n",
    "\\frac{\\partial z_1}{\\partial x}, \\dots &\\frac{\\partial y}{\\partial z_L}, \\frac{\\partial \\mathcal L}{\\partial y} \\big] \\tag{gradients at $a$}\\\\\n",
    "\\big[ (n_{inputs} \\times n_{b}\\times n_{1}), \\; \\dots \\; &(n_{L} \\times n_{b}\\times n_{outputs}), \\;  (n_{outputs} \\times n_{b} \\times 1) \\big] \\tag{grad shapes}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "<i>\n",
    "* The derivative of a vector w.r.t a vector is a matrix, but in some special cases (when inputs independently effect the outputs) you can get away with only using the diagonal.\n",
    "* The derivative of a matrix w.r.t a matrix is a 4-tensor. However, in this case, we can take the sum over one of the batch dimensions. This is because each batch is independent of the others and therefore the gradient w.r.t batch by batch is diagonal. Therefore we just turn this matrix into a vector by summing across on of the batch dimensions.\n",
    "</i>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(64, 50, 10), (30, 50, 10), (30, 50, 10), (20, 50, 10), (20, 50, 10), (10, 50, 10), (10, 50, 10)]\n"
     ]
    }
   ],
   "source": [
    "### Reverse mode automatic differentiation\n",
    "# need to call forward propagation first to get the activations\n",
    "# (but we already did it above)\n",
    "\n",
    "# now go back through the network and propagate the gradients\n",
    "grads = [layer.grad for layer in reversed(layers)]  # get handles to the grad fns\n",
    "dLdy = np.diag(np.ones(10, dtype=np.float32))  # make up a pseudo feedback vector\n",
    "dLdy = np.stack([dLdy for _ in range(50)], axis=1)  # stack into a batch\n",
    "\n",
    "chain = functools.partial(chain_rule, reverse=True)  \n",
    "deltas = list(itertools.accumulate([dLdy] + grads, chain))  \n",
    "\n",
    "print([d.T.shape for d in reversed(deltas)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Feedforward():\n",
    "    def __init__(self, layers):\n",
    "        self.layers = layers\n",
    "        self.fn = functools.reduce(compose, layers)\n",
    "        self.chain = functools.partial(chain_rule, reverse=True)  \n",
    "\n",
    "    def __call__(self, x):\n",
    "        self.x = x\n",
    "        return self.fn(x)\n",
    "    \n",
    "    def grad(self, x=None):\n",
    "        if x is None:\n",
    "            x = self.x\n",
    "            \n",
    "        # ahh, need this as __call__ might \n",
    "        # not be called directly before grad\n",
    "        self.fn(x)  \n",
    "\n",
    "        grads = [layer.grad for layer in reversed(self.layers)]\n",
    "        self.grads = [evaluate(g).shape for g in grads]\n",
    "        return np.array(functools.reduce(self.chain, grads)).T "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Feedforward\n",
      "Output shape (50, 10)\n",
      "Grad shape (64, 50, 10)\n",
      "Empirical: -0.000 (64, 50, 10) Symbolic : -0.000 (64, 50, 10)\n",
      "Difference: 1401.932542531316\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "ff = Feedforward(layers)\n",
    "sanity_check(ff, x)  # something isnt right here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Forward propagation (of gradients)\n",
    "\n",
    "Alternatively, we could propagate the gradients forward, aka forward-mode automatic differentiation.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "grads\n",
      "[(64, 50, 30), (30, 50, 30), (30, 50, 20), (20, 50, 20), (20, 50, 10), (10, 50, 10)]\n",
      "deltas\n",
      "[(64, 50, 10), (30, 50, 10), (30, 50, 10), (20, 50, 10), (20, 50, 10), (10, 50, 10), (10, 50, 10)]\n"
     ]
    }
   ],
   "source": [
    "### Forward mode automatic differentiation\n",
    "# need to call a layers activation before we can evaluate its gradient\n",
    "\n",
    "def forward_mode(var, layer):\n",
    "    x, delta = var\n",
    "    # calculate the output at x\n",
    "    activation = layer(x)\n",
    "    # calculate the grad at x\n",
    "    gradient = layer.grad(x) \n",
    "    # note that in forward mode we dont want the \n",
    "    # calculation of the gradients to be lazy as\n",
    "    # if it was we would have to wait longer.\n",
    "    return activation, gradient\n",
    "\n",
    "# propagate inputs forward through the network\n",
    "# and calculate activations and gradients as you go.\n",
    "results = itertools.accumulate([(x, None), layers[0]] + layers[1:], forward_mode)\n",
    "acts, grads = tuple(zip(*results))  # reorganise list of tuples to tuple of lists\n",
    "\n",
    "# each grad is dz_l/dx\n",
    "# note the shapes!!!\n",
    "print('grads')\n",
    "print([g.shape for g in grads[1:]])\n",
    "\n",
    "# note the memory footprint here. we needed to construct every gradient matrix \n",
    "# as we went before we could reduce them back down to [n x m x 1]\n",
    "\n",
    "# now we can apply chain rule to find dL/dz\n",
    "print('deltas')\n",
    "deltas = list(itertools.accumulate([dLdy] + list(reversed(grads[1:])), chain))\n",
    "print([d.T.shape for d in reversed(list(deltas))])\n",
    "\n",
    "## TODO want to profile forward vs backward\n",
    "## TODO this could be done with partial eval!?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You may have noted that when we did backprop we required a list containing each of the layers. This allowed us to reverse the computational steps and propagate the gradients in the correct order. This concept (gradient propagation through computations) can be far more general by representing the networks as computation graphs, allowing you to propagate gradients here there and everywhere.\n",
    "\n",
    "Also, `for` loops, `if` statements, ... can yield differentiable results (wrt to previous operations, not these operations).\n",
    "\n",
    "#### Updating parameters\n",
    "\n",
    "Finally, we need to propagate the error from the outputs of a layer into the parameters of a layer. This can be done using an `updage` method. For the linear layer, we need to propagate the gradients to the weights and biases, this can be done using the same method as before, chain rule.\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "y &= Wx + b \\\\\n",
    "\\frac{\\partial y}{\\partial W} &= x \\\\\n",
    "\\frac{\\partial y}{\\partial W} &= \\mathbb 1 \\\\\n",
    "\\end{align}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "class LinearLayer(Linear):  # want to write this as a wrapper/decorator!?!\n",
    "    def __init__(self, shape):\n",
    "        super(self.__class__, self).__init__(shape)\n",
    "        \n",
    "    def init(self, opt):\n",
    "        self.w_opt = opt(self.weights)\n",
    "        self.b_opt = opt(self.biases)\n",
    "\n",
    "    def update(self, delta):\n",
    "        # need to be able to take a delta from the layers outputs and \n",
    "        # propagate the gradients to the desired parameters\n",
    "        delta_w = np.dot(np.mean(self.x, axis=0, keepdims=True).T, delta)\n",
    "        delta_b = delta\n",
    "        \n",
    "        self.weights += self.w_opt(delta_w)\n",
    "        self.biases += self.b_opt(delta_b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimisation \n",
    "\n",
    "Because we have been provided a gradient of the loss w.r.t our parameters, via feedback and backprop, we can use gradient descent to follow that gradient to a (local) minima.\n",
    "\n",
    "$$\n",
    "x_{t+1} = x_{t} - \\eta \\nabla f(x) \\tag{gradient descent}\n",
    "$$\n",
    "\n",
    "#### Aggregation\n",
    "\n",
    "We define our loss as the mean of the loss over a batch, thus to make an update we take the average of the gradients over that batch. (Alternatively you can view this as make a more accurate estimate of the gradient by taking many samples at the current parameter setting)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class GradientDescent():\n",
    "    def __init__(self, param, lr, momentum=0.0):\n",
    "        self.lr = lr\n",
    "        self.momentum = momentum\n",
    "        \n",
    "        # for momentum\n",
    "        if momentum:\n",
    "            self.g = np.zeros_like(param)\n",
    "        \n",
    "    def __call__(self, delta):\n",
    "        if self.momentum:\n",
    "            self.g = self.momentum*self.g + delta\n",
    "            return - self.lr*self.g\n",
    "        else:\n",
    "            return - self.lr*delta\n",
    "        \n",
    "# TODO how can we verify our optimisers?\n",
    "# empirical checks? plots of attention over past grads?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2: Training\n",
    "\n",
    "Like the structure of our network, regularisers are also used to enforce priors. However, regularisers allow the learning algorithms to find a balance between regularisation terms and loss.\n",
    "Regularisers are soft priors as opposed to the ones hardwired into the structure of our network.\n",
    "\n",
    "__Implement__ some regularisers;\n",
    "* certainty, \n",
    "* sparsity,\n",
    "* orthogonal.\n",
    "\n",
    "_(What are some reasonable prior beliefs about our parameters w.r.t the data?)_\n",
    "\n",
    "***\n",
    "\n",
    "There is a vast array of optimisers. Here we only consider first-order gradient based optimisers.\n",
    "\n",
    "\n",
    "__Implement__ some momentum optimisers;\n",
    "* nesterov momentum,\n",
    "* coordinate descent,\n",
    "* adam.\n",
    "\n",
    "_(How should we update our parameters given a noisy estimate of the gradient? How should past estimates influence our current decision?)_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Adam():\n",
    "    def __init__(self, param, lr, d1=0.9, d2=0.999):\n",
    "        self.lr = lr\n",
    "        self.d1 = d1\n",
    "        self.d2 = d2\n",
    "        \n",
    "        # for momentum\n",
    "        self.m = np.zeros_like(param)\n",
    "        self.v = np.zeros_like(param)\n",
    "\n",
    "    def __call__(self, delta):\n",
    "        self.m = self.d1*self.m + delta\n",
    "        self.v = self.d2*self.v + delta**2\n",
    "        \n",
    "        return - self.lr*self.m/np.sqrt(self.v + 1e-8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# TODO. these are non trivial in my framework. \n",
    "# The lack of a graph to help propagate gradients make these\n",
    "# hard to work with.\n",
    "\n",
    "class Certainty():\n",
    "    pass\n",
    "\n",
    "class Sparsity():\n",
    "    pass\n",
    "\n",
    "class Orthogonal():\n",
    "    pass\n",
    "\n",
    "# visualisations of these!! vector flow plots?!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data\n",
    "\n",
    "The most important part is the data. It is the data that contains all the patterns we are interested in. The approximator is just a way to extract them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1797, 64)\n",
      "(1797, 10)\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA5wAAABuCAYAAABLABxeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAACU5JREFUeJzt3bFSW9caBeDtO/QWeYEQ8gCQwb3lmbiGhrSQyqXpTGfo7AqlJE2gppFqa8a4j8fiBRjyAgE9gW6XzK32yuXscIS/r17+z/bR0easUbGfLBaLAgAAAF37z0MvAAAAgMdJ4QQAAKAJhRMAAIAmFE4AAACaUDgBAABoQuEEAACgCYUTAACAJhROAAAAmlA4AQAAaGKlxdAnT54supq1u7tbzbx79y6aNZ1Oq5nDw8No1u3tbZRLLBaLJ//Pv+vyPicuLy+j3GAwqGbevn0bzZpMJlEu8f/e51L+/Xs9HA6j3Hg8rmZms1mn10z05V6/efOmmkn3j+vr62rm2bNn0aw+7B+l/PvPdbI3lFLK2dlZNbOzs3PP1fxzfbnXyV58c3MTzdrf37/XWlr5Gv8ubm5u3nM1/1xfnumDg4NqJt0/kr1hY2MjmjWfz6uZtbW1aNbt7W0v7vVoNKpm0v012auT65VSyt3dXZRL9OW5Tt7R0ue6y3e0LqX32i+cAAAANKFwAgAA0ITCCQAAQBMKJwAAAE0onAAAADShcAIAANCEwgkAAEATCicAAABNrDz0AmqSQ9nX19ejWaurq9XMn3/+Gc366aefqpmLi4to1rJID+V9/vx5NfPixYto1mQyiXLLJDnc++PHj9GsLg+lXibJvlBKKbu7u9XMq1evolmnp6fVzNbWVjRrOp1Gucdmf38/ys1ms7YLWXLJdzrZh0spZW9vr5r5448/olmPba/Z3t6uZtL7fHx8fN/lfPXSd5CDg4NOMqWUMhgMqpl0XX2RvIOkkj19OBxGs9JcH6R7XbKHpBaLRTVzdXUVzeryGUj5hRMAAIAmFE4AAACaUDgBAABoQuEEAACgCYUTAACAJhROAAAAmlA4AQAAaELhBAAAoAmFEwAAgCZWHurCW1tbUW59fb2a+f7776NZ19fX1cyHDx+iWcn6Ly4uoll9sLm5Wc0Mh8POrjebzTqbtWx2dnaqmaurq2jWeDyuZt6+fRvNWia//vprlHv//n018/vvv0ezkv1jOp1Gsx6jwWBQzezv70ezRqNRNbO2thbNStzc3HQ2699wd3dXzXz77bfRrPl8Xs1cXl5Gs5JnIFl7XxwfH3c2K9mrv2bJdz51dHRUzaT7R5fvPX2RvH+le2Kyp6ff+eRep3tRa8lel/r06VOUSz6TPj+vfuEEAACgCYUTAACAJhROAAAAmlA4AQAAaELhBAAAoAmFEwAAgCYUTgAAAJpQOAEAAGhi5aEuvLq6GuU+f/5czSQHsqeS6y2Tg4ODKJcclPz06dN7ruZvfTm89yEkB1ynhy4nsyaTSTRrmaTf+fX19U4ypZQynU6rmXRfu729jXLLJDkAPD1s/ezsrJpJD4pPDh1P9r8+SfaHjY2NaFayrycHxZeSH/C+LJLD3a+urqJZ6T18bNKD6Ls8sD5970ns7OxUM8l+1SfJer98+RLNSvb0dF9I33v6oMu1Js9YKaWMx+NqJtmzHopfOAEAAGhC4QQAAKAJhRMAAIAmFE4AAACaUDgBAABoQuEEAACgCYUTAACAJhROAAAAmlA4AQAAaGLloS68uroa5abTaeOV/K90Xbe3t41X0o3RaBTlzs7Oqpku/8+DwaCzWX2R/p8ODg6qmZ2dnfsu5y/7+/udzVo219fX1cw333wTzfrw4UMnmVJKefnyZTXTlz1me3s7yp2cnFQz5+fn913OX16/fh3lfv75586u2RfJ/jAcDqNZm5ub1Uzy2abSv0l9kOzpNzc30axk3x+Px9Gs9Jp9kK41eQ7TZzqR/o29vLzs7Jp90eX71/Pnz6uZ7777Lpq1TM/13d1dlLu6uqpm0r/1v/zySzWTfI9KKWVtba2a6frz8AsnAAAATSicAAAANKFwAgAA0ITCCQAAQBMKJwAAAE0onAAAADShcAIAANCEwgkAAEATKw914fSg062trc6uubq62tn1Li4u7rucr1p6OO1sNmu8ku4cHR1FufTA+kRyeHV6QPHXKt2LXr58Wc2cnp5Gs968eVPNHB4eRrNam8/nneX29vaiWen+kBiPx53NWiYPcWB9cpj4MkkOPk8Ovi+llMFgUM2cnJxEs3744Ydqpi9/O9PD45O/ZYvForNZD/H9aC3dNz9+/FjNHB8fR7OS73y6ByefW/o89UXymTzE+/BoNKpmks/jn/ALJwAAAE0onAAAADShcAIAANCEwgkAAEATCicAAABNKJwAAAA0oXACAADQhMIJAABAEwonAAAATaw81IWvr6+j3NbWVjWzu7sbzUpziffv33c2i8fh7Owsyg2Hw2pmY2MjmjUej6uZyWQSzfrtt986m9UX7969q2am02k0a3V1tZr58ccfo1kXFxdRrg8uLy+j3GAwqGY2Nzc7u+b5+Xk06+7uLsotk+3t7WpmPp9Hs46Oju65mr8l+9EySfb0k5OTaNbNzU01s7a2Fs3a2dmpZmazWTSrL0ajUTWTPtOfPn2673KWUvKMlZLdx+TzKCV7Zr98+RLN2t/fr2a63K/6Iv2uJp9Jcg9LyfaQrvmFEwAAgCYUTgAAAJpQOAEAAGhC4QQAAKAJhRMAAIAmFE4AAACaUDgBAABoQuEEAACgiZWHuvD19XWUOzw8rGaSw91LKeXz58/VzLNnz6JZj01yOPpkMolmJYeSD4fDaFZy8HZfpIf3bm5udpIpJTsEOfk8SskOjU6fgb64vb2tZk5PTzu73sXFRZR79epVZ9dcJsk+U0opT58+rWaWaW/o2osXL6qZ169fd3a98/PzKHd5ednZNfsgecaSg+9LyQ5kT+/feDyOcsskeSfY29uLZqX7zGOT/r+T5yz521lKKfP5vJpJ3xtGo1GUWybJ/yl93xsMBtVM+m6dvq92yS+cAAAANKFwAgAA0ITCCQAAQBMKJwAAAE0onAAAADShcAIAANCEwgkAAEATCicAAABNKJwAAAA08WSxWDz0GgAAAHiE/MIJAABAEwonAAAATSicAAAANKFwAgAA0ITCCQAAQBMKJwAAAE0onAAAADShcAIAANCEwgkAAEATCicAAABNKJwAAAA0oXACAADQhMIJAABAEwonAAAATSicAAAANKFwAgAA0ITCCQAAQBMKJwAAAE0onAAAADShcAIAANCEwgkAAEATCicAAABNKJwAAAA08V/YG2T1WKmSmAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f6de34c4470>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "digits = sklearn.datasets.load_digits(n_class=10)\n",
    "onehot = sklearn.preprocessing.OneHotEncoder(sparse=False)\n",
    "\n",
    "plt.figure(figsize=(16,4))\n",
    "for i in range(10):\n",
    "    plt.subplot(1,10,i+1)\n",
    "    plt.imshow(digits.images[i], cmap='gray', interpolation='nearest')\n",
    "    plt.axis('off')\n",
    "    \n",
    "images = digits.images.reshape((-1, 8*8)).astype(np.float32)/np.max(digits.images)\n",
    "print(images.shape)\n",
    "labels = digits.target\n",
    "labels = onehot.fit_transform(labels.reshape((-1, 1))).astype(np.float32)\n",
    "print(labels.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def make_batches(x, y, batch_size):\n",
    "    n_of_data = x.shape[0]\n",
    "    assert n_of_data == y.shape[0]\n",
    "    for i in range(n_of_data//batch_size):\n",
    "        yield (x[batch_size*i:batch_size*(i+1), ...],\n",
    "               y[batch_size*i:batch_size*(i+1), ...])\n",
    "        \n",
    "def train(inputs, targets, layers, epochs, batch_size, optimiser):\n",
    "    \"\"\"\n",
    "    Arg:\n",
    "        inputs (np.array): the input vectors in [N_examples, N_features]\n",
    "        targets (np.array): the target vectors in [N_examples, N_classes]\n",
    "        layers (list): a list of layer objects (must be callable and have\n",
    "            a gradient method defined)\n",
    "        epochs (int): the number of times to pass through the N_examples\n",
    "        batch_size (int): the number of examples to use to estimate the gradient\n",
    "            for each parameter update\n",
    "        optimiser (): such as GradientDescent\n",
    "        \n",
    "    Returns:\n",
    "        layers (list): a list of trained layers.\n",
    "    \"\"\"\n",
    "    # init an optimiser for each layer\n",
    "    for layer in layers:\n",
    "        if hasattr(layer, \"init\"):\n",
    "            layer.init(optimiser)\n",
    "    \n",
    "    forward_prop = functools.reduce(compose, layers)\n",
    "    loss_fn = CrossEntropy()\n",
    "    \n",
    "    for e in range(epochs):\n",
    "        # shuffle the data and make batches\n",
    "        idx = np.random.permutation(range(len(inputs)))\n",
    "        images, labels = inputs[idx, ...], targets[idx, ...]\n",
    "        for x, T in make_batches(images, labels, batch_size):\n",
    "            \n",
    "            ### Estimate the gradient (dloss/doutput)\n",
    "            preds = forward_prop(x)\n",
    "            dLdy = np.sum(loss_fn.grad(preds, T), axis=0, keepdims=True)\n",
    "\n",
    "            ### Assign credit via backprop\n",
    "            grads = [layer.grad for layer in reversed(layers)]\n",
    "            deltas = itertools.accumulate([dLdy] + grads, chain)\n",
    "            \n",
    "            ### Update the parameters\n",
    "            for layer, delta in zip(reversed(layers), deltas):\n",
    "                if hasattr(layer, \"update\"):\n",
    "                    layer.update(np.mean(delta, axis=1))\n",
    "\n",
    "        loss = loss_fn(preds, T)\n",
    "        print('\\repoch: {} loss: {:.4f}'.format(e, \n",
    "                                                np.mean(np.sum(loss, axis=1))),\n",
    "              end='', flush=True)\n",
    "    return layers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3: Inference\n",
    "\n",
    "How do we know we have learn anything useful?\n",
    "__Prove__ to us that this network has 'learned'.\n",
    "\n",
    "_(What were we training it to do? How good is it at doing that?)_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Pick a subset of the images/labels for training\n",
    "# and use the left overs for testing our accuracy.\n",
    "# It is easy to fool ourselves by memorising/overfitting to the training data\n",
    "# we need to check that the model generalises to unseen data.\n",
    "\n",
    "# (this is the difference between optimisation and learning. \n",
    "# we expect learners to generalise)\n",
    "\n",
    "idx = np.random.permutation(range(len(images)))\n",
    "split = int(0.8 * len(images))\n",
    "train_images, train_labels = images[:split, ...], labels[:split, ...]\n",
    "test_images, test_labels = images[split:, ...], labels[split:, ...]\n",
    "\n",
    "n_repeats = 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def accuracy(images, labels, layers):\n",
    "    forward_prop = functools.reduce(compose, layers)\n",
    "    predictions = forward_prop(images)\n",
    "    return np.mean(np.equal(np.argmax(predictions, axis=1), \n",
    "                            np.argmax(labels, axis=1)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 49 loss: 0.5636\n",
      " accuracy: 0.869\n",
      "epoch: 49 loss: 0.4582\n",
      " accuracy: 0.872\n",
      "epoch: 49 loss: 0.7004\n",
      " accuracy: 0.881\n"
     ]
    }
   ],
   "source": [
    "for i in range(n_repeats):\n",
    "    layers = [\n",
    "          LinearLayer((n_inputs, n_classes)),\n",
    "          Softmax()\n",
    "         ]\n",
    "\n",
    "    layers = train(images, \n",
    "                   labels, \n",
    "                   layers, \n",
    "                   epochs=50, \n",
    "                   batch_size=12, \n",
    "                   optimiser=lambda x: GradientDescent(x, 0.001, momentum=0.9))\n",
    "    \n",
    "    valid_accuracy = accuracy(test_images, test_labels, layers)\n",
    "    print('\\n accuracy: {:.3f}'.format(valid_accuracy))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA5wAAABuCAYAAABLABxeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAADwtJREFUeJzt3VuI1uXaB+D/OOPMOI5Oo1m20crSasxsZ2kboj0l7aAwKqiDjoRKMmiDbSgoiIqIqGiLVHhUYVkUFUk725CWqZRCaVpqOuOoo+mMOut4feuD+17z+nzrY3Fdxz9+8/q8/81z+x48df39/RUAAADsb4P+0x8AAACA/04GTgAAAIowcAIAAFCEgRMAAIAiDJwAAAAUYeAEAACgCAMnAAAARRg4AQAAKMLACQAAQBENJUrnzp3bH2Xq6+tTXd3d3WHmmGOOSXX99ddfYeaLL75IdZ1wwglhZsSIEamuG2+8sS4V/B9uu+22cJ2zfvvttzCze/fuVFdDQ3xZTZkyJdXV1dUVZurqcsv39NNPD2idq6qqnnnmmXCtW1paUl2nn356mFm5cmWq66uvvgozmTWsqtx9lL1v77rrrgGv9dNPPx2udXt7e6qro6MjzGS7hgwZEmZWr16d6vr+++/DzKBBuf8PnDlz5oDX+t577w3XeufOnamujz76aKAf41+cd955YSb7rD7ttNPCzJFHHpnqmjNnzoDX+oUXXgjXuqmpKdV1/fXXh5n33nsv1dXW1hZmMt9HVVXV4sWLw8xnn32W6po1a9aA1vqpp54K13n79u2prkWLFoWZCy64INWVeTasXbs21XXssceGmYMPPjjVdccddxR9VmefY5nrMLvXa25uDjPZe76zszPMLF++PNV15ZVXDnitH3300XCtBw8enOpas2ZNmOnr60t1HXjggWEmu3fMrHX2Grj33nsHvNaPPfZYuNZbtmwZaP2/6OnpSeV6e3vDzGGHHZbqam1t3S9/r6qq6u67706ttV84AQAAKMLACQAAQBEGTgAAAIowcAIAAFCEgRMAAIAiDJwAAAAUYeAEAACgCAMnAAAARTSUKM0ckL5u3bpU16RJk8JM9mDysWPHhpkbb7wx1bVkyZIwkz00daCGDh0aZpYtW5bqWrBgQZh56KGHUl0nnXRSmPnwww9TXdu2bQszmYOH/y+ceOKJqVzmwOCurq5U15lnnhlmPv7441RX5qDn7MHStcgc2j1+/PhUV+bA6RdffDHVlTngvaOjI9W1cuXKMJM9nL4WTU1NYWbDhg2prlWrVoWZOXPmpLoWLlwYZn755ZdU15QpU8JMQ0ORV+E/2bNnT5jJHka/efPmMLNp06ZU17fffhtm3njjjVTXGWecEWba2tpSXQOVeb5mr+nRo0eHmQkTJqS6vvjiizCT+S6qqqpGjRoVZkqvc1Xl7pvsPqilpSXMdHd3p7qGDRsWZjJ7i6qqqrq6+Ez79evXp7pqkdlXZ679qqqqadOmhZmPPvoo1bVixYow88ADD6S6vvzyyzCT/d5q0dPTE2ZGjhyZ6mpvbw8z8+bNS3WdfPLJYWbq1KmprtWrV4eZ7P2W5RdOAAAAijBwAgAAUISBEwAAgCIMnAAAABRh4AQAAKAIAycAAABFGDgBAAAowsAJAABAEQZOAAAAimgoUdrf3x9mTj311FTXqFGjwsxbb72V6lq1alWYmT17dqrruOOOCzNdXV2proGqq6sLMwceeGCq69JLLw0z9913X6qrs7MzzCxYsCDVNWhQ/H8i+/btS3XVorGxMcz88ccfqa758+eHmd7e3lTX7bffHmZGjBiR6vr/InNdf/3116muzDU7ffr0VNeFF14YZt58881U1+bNm8NMS0tLqqsWfX19YSZzD1ZVVd1///1h5pxzzkl1/fnnn2FmwoQJqa6pU6eGmQ0bNqS6apG5rrPf+ZAhQ8LM2rVrU10ffvhhmFm8eHGqa9iwYWFm4sSJqa6Bam5uDjOZPUpVVdVVV10VZnbs2JHqWrFiRZhZv359qmvw4MFhprW1NdVV2saNG1O5hQsXhpns/ZG55zPXalXlnkUNDUW20v9kf+73hg4dGmYye++qqqpDDjkkzLS1taW6Mnu5zDrUKrPfa29vT3Vl3lNHH310qmvMmDFh5uKLL051vfbaa2Hmr7/+SnVl+YUTAACAIgycAAAAFGHgBAAAoAgDJwAAAEUYOAEAACjCwAkAAEARBk4AAACKMHACAABQRJHTajOHt2YP8P3777/DTPaw9R9++CHMXHvttamuSZMmhZns4fQlDR8+PJXLHGK7aNGiVNfSpUvDzK5du1Jdme8/+2+sReYA3F9//TXV9eqrr4aZI444ItV11llnhZmtW7emupqamsLMuHHjUl21yBzK3tXVlerq6ekJM+eee26qq7u7O8zMnz8/1TV69OgwM3HixFRXLTIHyF900UWprksuuSTM3HrrramuzEHqV199daorcxj6smXLUl21GDQo/v/d1tbWVFfmmZc9ADzzLu7o6Eh1ZQ6U7+vrS3UNVOYzZJ9jw4YNCzPff/99qivzjj377LNTXb29vWEm87yq1c6dO8NM9vtubm4OM5l7uapyz87Jkyenuurr68PM6tWrU1212L59e5j5/fffU12Z3D333JPqWrlyZZh57LHHUl2dnZ1h5owzzkh11SKzB1mzZk2qK3Pf33DDDamuzH4489mrKvfuz2T+HX7hBAAAoAgDJwAAAEUYOAEAACjCwAkAAEARBk4AAACKMHACAABQhIETAACAIgycAAAAFGHgBAAAoIiGEqV1dXVhZtSoUamuvXv3hpne3t5UV3Nzc5jp7OxMde3evTvMDBpUdp7fsWNHmOnu7k51DR8+PMx89913qa7Mv3v79u2prqampjAzZMiQVFct2trawszWrVtTXePGjQsz119/faorc3988MEHqa4zzzwzzGzZsiXVVYvM8yP7nZ977rlhZvHixamu6dOnh5kHH3ww1fXOO++kcqVlnmMzZsxIdfX09ISZww8/PNV1/vnnh5mjjjoq1fXTTz+Fmcy7oVY7d+4MM59//nmqa9myZWEme13v2rUrzGTuo6qqqoaGeEuRfV8PVOaabm9vT3VNmzYtzAwbNizVNX78+DAzePDgVNfGjRvDTOl1rqrce7GxsTHVdeWVV4aZ7D2feT9k1rCqcnuovr6+VFctMtfZtm3bUl2ZPdqIESNSXZm93HPPPZfq6ujoCDOTJ09OddUi8xz78ccfU13z588PM9m1Xr58eZh55ZVXUl39/f1hJvMs/Xf4hRMAAIAiDJwAAAAUYeAEAACgCAMnAAAARRg4AQAAKMLACQAAQBEGTgAAAIowcAIAAFBEfLppIatWrUrlLrroojBz/PHHp7oOPfTQMHPyySenujIH7O7ZsyfVNVB1dXVhpr6+PtW1YcOGMDNhwoRU1wknnBBmFi5cmOravHlzmBkzZkyqqxaZQ3Iz11dVVdXMmTPDzOzZs1NdTz75ZJjp6upKdWWulX379qW6apE5lHrHjh2prtbW1jDz1Vdfpbqef/75MPPwww+nuq655pow8/7776e6apE5cPq3335Lda1bty7M3HXXXamuAw44IMy89NJLqa533303zGTfIbVoaWkJM+vXr091zZ07N8wMHz481TV16tRULiNzUHhvb+9++3v/m8wzKvv8aG5u3m9d33zzTZg566yzUl3XXXddmJk3b16qqxY9PT1hJrsHWbJkSZg5/fTTU10///xzmHnkkUdSXePGjQszY8eOTXXVInOdnXLKKamuWbNmhZnXX3891ZW5Bl5++eVUV+ad19jYmOqqxd69e8NM5rqoqqp6++23w0x2X3XMMceEmR9++GG/de3vGcYvnAAAABRh4AQAAKAIAycAAABFGDgBAAAowsAJAABAEQZOAAAAijBwAgAAUISBEwAAgCIMnAAAABTRUKJ00KB4jt2+fXuqq7e3N8zccsstqa6dO3eGmUMOOSTVNXfu3DDT3t6e6hqozNo0NjamujZs2BBmenp6Ul1bt24NM0cffXSqq6urK5Urra6uLsyMHj061XXKKaeEmV27dqW6Lr744jAzbty4VFdHR0eYWbRoUaqrFg0N8WMp+28aM2ZMmFmxYkWqa/z48WEm8+yrqtx1XV9fn+qqRWdnZ5iZP39+quupp54KM2effXaq66abbgozs2bNSnWNGjUqzGSvp1r09fWFmbFjx6a6rrjiijDT3d2d6po2bVqYWbJkSapr06ZNYebYY49NdQ3U/tx/fPLJJ2Eme5+uWbMmzFx++eWprhEjRoSZffv2pbpqMXjw4DCzZ8+eVNfKlSvDzOrVq1NdTzzxRJj59NNPU10jR44MM/39/amuWgwfPjzMZL/ztWvXhpm33nor1ZXZF958882prsy9u2PHjlRXLTL3dEtLS6prxowZYWbp0qWprubm5lQuY9u2bWGmra1tv/29qvILJwAAAIUYOAEAACjCwAkAAEARBk4AAACKMHACAABQhIETAACAIgycAAAAFGHgBAAAoIj4hPUByBwE29ramup69tlnw8xxxx2X6sr8zddeey3VlTnot7e3N9U1UEOHDg0ze/fuTXVlPus777yT6sr8zVNPPTXV9ccff4SZ0utcVVW1e/fuMLN169ZUV2Z9FixYkOpqamoKMxdeeGGqK3N//PTTT6muWmS+z4aG3KNr+vTp+yWT9fXXX6dymQPlDzrooFo/TqixsTHMdHd3p7pGjx4dZn755ZdUV+b6zx4APnXq1DCTub9rlTnUPHtd33nnnWFm06ZNqa7MO/abb75JdU2ZMiXMZK+ngdq1a1eYGTJkSKrr8ccfDzOTJk1Kdc2ePTvMjB8/PtX1/vvvh5ns/VGLffv2hZmRI0emui677LIws3Tp0lRXZh930kknpboyn7++vj7VVYv9uc+ZM2dOmMmuz/HHHx9mVqxYkeo64IADwkxm712rzFpnnudVlVuf5ubmVFdmz595D1dVbk7b3+9Fv3ACAABQhIETAACAIgycAAAAFGHgBAAAoAgDJwAAAEUYOAEAACjCwAkAAEARBk4AAACKMHACAABQRF1/f/9/+jMAAADwX8gvnAAAABRh4AQAAKAIAycAAABFGDgBAAAowsAJAABAEQZOAAAAijBwAgAAUISBEwAAgCIMnAAAABRh4AQAAKAIAycAAABFGDgBAAAowsAJAABAEQZOAAAAijBwAgAAUISBEwAAgCIMnAAAABRh4AQAAKAIAycAAABFGDgBAAAowsAJAABAEQZOAAAAijBwAgAAUMQ/ADA9skYZhFyqAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f6de34f3e80>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# lets have a peek at the linear fns layers\n",
    "plt.figure(figsize=(16,4))\n",
    "for i in range(10):\n",
    "    plt.subplot(1,10,i+1)\n",
    "    plt.imshow(layers[0].weights[:, i].reshape((8,8)), cmap='gray', interpolation='nearest')\n",
    "    plt.axis('off')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Play\n",
    "\n",
    "Cool, every thing seems to work. Now we just\\* need to tinker with some hyper parameters to get everything running smoothly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 49 loss: 1.3402\n",
      " accuracy: 0.800\n",
      "epoch: 49 loss: 1.4363\n",
      " accuracy: 0.683\n",
      "epoch: 49 loss: 1.3971\n",
      " accuracy: 0.789\n"
     ]
    }
   ],
   "source": [
    "for i in range(n_repeats):\n",
    "    layers = [\n",
    "          LinearLayer((n_inputs, 30)),\n",
    "          Sigmoid(),\n",
    "          LinearLayer((30, n_classes)),\n",
    "          Softmax()\n",
    "         ]\n",
    "\n",
    "    layers = train(images, \n",
    "                   labels, \n",
    "                   layers, \n",
    "                   epochs=50, \n",
    "                   batch_size=4, \n",
    "                   optimiser=lambda x: Adam(x, 0.0005))\n",
    "    \n",
    "    valid_accuracy = accuracy(test_images, test_labels, layers)\n",
    "    print('\\n accuracy: {:.3f}'.format(valid_accuracy))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 99 loss: 1.9657\n",
      " accuracy: 0.369\n",
      "epoch: 99 loss: 1.8371\n",
      " accuracy: 0.389\n",
      "epoch: 99 loss: 2.1320\n",
      " accuracy: 0.319\n"
     ]
    }
   ],
   "source": [
    "for i in range(n_repeats):\n",
    "    layers = [\n",
    "          LinearLayer((n_inputs, 30)),\n",
    "          ReLU(),\n",
    "          LinearLayer((30, 30)),\n",
    "          ReLU(),\n",
    "          LinearLayer((30, n_classes)),\n",
    "          Softmax()\n",
    "         ]\n",
    "\n",
    "    layers = train(images, \n",
    "                   labels, \n",
    "                   layers, \n",
    "                   epochs=100, \n",
    "                   batch_size=4, \n",
    "                   optimiser=lambda x: Adam(x, 0.0001))\n",
    "    \n",
    "    valid_accuracy = accuracy(test_images, test_labels, layers)\n",
    "    print('\\n accuracy: {:.3f}'.format(valid_accuracy))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 19 loss: 2.4913\n",
      " accuracy: 0.100\n",
      "epoch: 19 loss: 2.5088\n",
      " accuracy: 0.092\n",
      "epoch: 19 loss: 2.7148\n",
      " accuracy: 0.103\n"
     ]
    }
   ],
   "source": [
    "for i in range(n_repeats):\n",
    "    layers = [\n",
    "              Linear((n_inputs, 128)),\n",
    "              Sigmoid(),\n",
    "              Linear((128, n_classes)),\n",
    "              Softmax()\n",
    "             ]\n",
    "    layers = train(images, \n",
    "                   labels, \n",
    "                   layers, \n",
    "                   epochs=20, \n",
    "                   batch_size=16, \n",
    "                   optimiser=lambda x: GradientDescent(x, 0.0001, momentum=0.9))\n",
    "    \n",
    "    valid_accuracy = accuracy(test_images, test_labels, layers)\n",
    "    print('\\n accuracy: {:.3f}'.format(valid_accuracy))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 199 loss: 2.9327\n",
      " accuracy: 0.092\n",
      "epoch: 199 loss: 3.9877\n",
      " accuracy: 0.103\n",
      "epoch: 199 loss: 3.3845\n",
      " accuracy: 0.103\n"
     ]
    }
   ],
   "source": [
    "for i in range(n_repeats):\n",
    "    \n",
    "    width = 8\n",
    "    depth = 8\n",
    "    layers = [Linear((n_inputs, width))]\n",
    "    for _ in range(depth):\n",
    "        layers += [Sigmoid(), Linear((width, width))]\n",
    "    layers += [Linear((width, n_classes)), Softmax()]\n",
    "    \n",
    "    layers = train(images, \n",
    "                   labels, \n",
    "                   layers, \n",
    "                   epochs=200, \n",
    "                   batch_size=8, \n",
    "                   optimiser=lambda x: GradientDescent(x, 0.0001, momentum=0.9))\n",
    "    \n",
    "    valid_accuracy = accuracy(test_images, test_labels, layers)\n",
    "    print('\\n accuracy: {:.3f}'.format(valid_accuracy))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}