{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cheat sheet on Neural Networks 1 \n",
"\n",
"The following article is a cheat sheet on neural networks. My sources are based on the following course and article:\n",
"- the excellent [Machine Learning course](https://www.coursera.org/learn/machine-learning) on Coursera from Professor Andrew Ng, Stanford,\n",
"- the very good [article from Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html), explaining the backpropagation algorithm.\n",
"\n",
"\n",
"## Why the neural networks are powerful ?\n",
"\n",
"It is proven mathematically that: \n",
"\n",
"> Suppose we’re given a [continuous] function f(x) which we’d like to compute to within some desired accuracy ϵ>0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies:\n",
"|g(x)−f(x)|<ϵ, for all inputs x. \n",
"\n",
"_Michael Nielsen — From the following [article](http://neuralnetworksanddeeplearning.com/chap4.html)_\n",
"\n",
"## Conventions \n",
"Let’s define a neural network with the following convention:\n",
"\n",
"L = total number of layers in the network. \n",
"$s_l$ = number of units (not counting bias unit) in layer l. \n",
"K = number of units in output layer ( = $s_L$ ). \n",
"\n",
"<img src=\"images/Neural_Network_definition.png\" />\n",
"\n",
"___With___:\n",
"\n",
"$$ \n",
"\\forall l \\in\\ {2, ..., L},\\ a^{(l)} \\in \\mathbb{R}^{s_{l}+1} and\\ \n",
"\\begin{cases}\n",
"a^{(l)}_0\\ =\\ 1\\ (bias\\ node)\\\\\n",
"a^{(l)}_i = g(z^{(l)}_i), \\forall i \\in {1, ..., s_l}\\\\ \n",
"\\end{cases}\\\\\n",
"z^{(l)}_i = [ \\theta^{(l)}_i ]^T . a^{(l-1)}, \\forall i \\in {1, ..., s_l}\\\\ \n",
"with\\ \\theta^{(l)}_i \\in \\mathbb{R}^{s_{l-1}+1}\\ and\\ z^{(l)} \\in \\mathbb{R}^{s_{l}}\\ \n",
"$$\n",
"\n",
"We define the matrix θ of the weights for the layer l as following:\n",
"\n",
"$$\n",
"\\theta^{(l)} \\in \\mathbb{R}^{s_l \\times (s_{(l-1)}+1)}\n",
"$$\n",
"\n",
"$$\n",
"\\theta^{(l)} = \n",
"\\begin{bmatrix}\n",
" [ \\theta^{(l)}_1 ]^T \\\\\n",
" [ \\theta^{(l)}_2 ]^T \\\\\n",
" \\vdots \\\\\n",
" [ \\theta^{(l)}_{s_{l}} ]^T\n",
"\\end{bmatrix} =\n",
"\\begin{bmatrix}\n",
" \\theta_{1,0} & \\dots & \\theta_{1,j} & \\dots & \\theta_{1,s_{l-1}} \\\\\n",
" \\vdots & & \\vdots & & \\vdots \\\\\n",
" \\theta_{i,0} & \\dots & \\theta_{i,j} & \\dots & \\theta_{i,s_{l-1}} \\\\\n",
" \\vdots & & \\vdots & & \\vdots \\\\\n",
" \\theta_{s_l,0} & \\dots & \\theta_{s_l,j} & \\dots & \\theta_{s_l,s_{l-1}} \\\\\n",
"\\end{bmatrix}\n",
"$$\n",
"\n",
"Hence, we have the following relation: \n",
"$$\n",
"a^{(l)} =\n",
"\\left(\\begin{smallmatrix}1 \\\\ g(\\theta^{(l)}.a^{(l-1)})\\end{smallmatrix}\\right)\n",
"$$\n",
"\n",
"\n",
"## The cost function of a Neural Network\n",
"\n",
"The training set is defined by: $ { (x^1,y^1), ..., (x^m,y^m) } $\n",
"\n",
"x and y are vectors, with respectively the same dimensions as the input and output layers of the neural network. \n",
"\n",
"The cost function of a neural network is the following:\n",
"\n",
"\n",
"$$\n",
"J(\\theta) = - \\frac{1}{m} \\sum_{i=1}^m \\sum_{k=1}^K \\left[ cost( a^{(L)}_k, y^{(i)}_k) \\right] + \\frac{\\lambda}{2m}\\sum_{l=2}^{L} \\sum_{j=0}^{s_l} \\sum_{i=1}^{s_{l+1}} ( \\theta_{i,j}^{(l)})^2\n",
"$$\n",
"\n",
"$a^{(L)}_k$ is the output of the neural network, and is dependent of the weights 𝜃 of the neural network. \n",
"\n",
"Now, the objective is to train the neural network and find the minimum of the cost function J(𝜃).\n",
"\n",
"## Mathematic reminder: the chain rule\n",
"\n",
"Let’s define the functions f, g and h as following:\n",
"\n",
"$$ f:\\mathbb{R}^n \\rightarrow \\mathbb{R} $$\n",
"\n",
"$$ g:\\mathbb{R}^p \\rightarrow \\mathbb{R}^n $$\n",
"\n",
"$$ h = f \\circ g $$\n",
"\n",
"The derivative of h is given by the chain rule theorem:\n",
"\n",
"$$\n",
"\\forall_i \\in \\{ \\!1, ... , \\!p \\}, \n",
"\\frac{\\partial h}{\\partial x_i} = \n",
"\\sum_{k=1}^{n} \\frac{\\partial f}{\\partial g_k} \\frac{\\partial g_k}{\\partial x_i}\n",
"$$\n",
"\n",
"(See the following [course online](https://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/) on partial derivation from the MIT)\n",
"\n",
"\n",
"## The backpropagation algorithm\n",
"\n",
"We use the __gradient descent__ to find the minimum of J on 𝜃: $ \\min\\limits_{\\theta} J(\\theta)$\n",
"\n",
"The gradient descent requires to compute: \n",
"\n",
"$$ \\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} $$\n",
"\n",
"___In the following parts, we consider only the first part of J(θ) (as if the regularisation term λ=0). The partial derivative of the second term of J(θ) is easy to compute.___\n",
"\n",
"\n",
"### Definition of ẟ\n",
"\n",
"Let’s define the function ẟ. When ẟ of the layer l is multiplied by the output of the layer (l-1), we obtain the partial derivative of the cost function on θ.\n",
"\n",
"Let’s use the chain rule and develop this derivative on z:\n",
"\n",
"$$ \n",
"\\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} \n",
"=\n",
"\\sum^{s_l}_{k = 1} \\frac{\\partial J(\\theta)}{\\partial z^{(l)}_k} \\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}}\n",
"$$\n",
"\n",
"(Remind that J is dependent of z)\n",
"\n",
"As: \n",
"$$z^{(l)}_k = [ \\theta^{(l)}_k ]^T . a^{(l-1)} = \\sum_{p=0}^{s_{l-1}} \\theta^{(l)}_{k,p} \\times a^{(l-1)}_p$$\n",
"\n",
"$$\\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}} = 0\\ for\\ k\\ ≠\\ i\\ and\\ p\\ ≠\\ j\\ in\\ the\\ sum.$$\n",
"\n",
"$$And\\ \\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}} = a^{(l-1)}_j\\ for\\ k\\ =\\ i\\ and\\ p\\ =\\ j\\ in\\ the\\ sum.$$\n",
"\n",
"We define the __output error 𝛿__: \n",
"$$ \n",
"\\delta^{(l)}_k = \n",
"\\frac{\\partial J(\\theta)}{\\partial z^{(l)}_k},\\ \\forall k \\in {1, ..., s_l},\\\n",
"and\\ \n",
"\\delta^{(l)} = \\nabla_{z^{(l)}} J(\\theta) \\in \\mathbb{R}^{s_{l}}\n",
"$$\n",
"\n",
"So we have:\n",
"\n",
"---\n",
"\n",
"$$ \n",
"\\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} \n",
"=\n",
"\\delta^{(l)}_i . a^{(l-1)}_j\n",
"$$\n",
"\n",
"---\n",
"\n",
"### Value of ẟ for the layer L\n",
"\n",
"Now let’s find 𝛿 for the output layer (layer L):\n",
"\n",
"$$\n",
"\\delta^L_i = \\frac{\\partial J(\\theta)}{\\partial z^{(L)}_i} = \\sum^{s_{L}}_{k = 0} \\frac{\\partial J(\\theta)}{\\partial a^{(L)}_k} \\frac{\\partial a^{(L)}_k}{\\partial z^{(L)}_i}\n",
"$$\n",
"\n",
"As:\n",
"\n",
"$$\n",
"a^{(l)}_k = g(z^{(l)}_k),\\ \\frac{\\partial a^{(L)}_k}{\\partial z^{(L)}_i} = 0\\ if\\ k\\ ≠\\ i\n",
"$$\n",
"\n",
"Hence:\n",
"\n",
"---\n",
"\n",
"$$\n",
"\\delta^L_i = \n",
"\\frac{\\partial J(\\theta)}{\\partial a^{(L)}_i} \\frac{\\partial g(z^{(L)}_i)}{\\partial z^{(L)}_i} =\n",
"\\frac{\\partial J(\\theta)}{\\partial a^{(L)}_i} . g'(z^{(L)}_i)\n",
"$$\n",
"\n",
"---\n",
"\n",
"### Relationship of ẟ between the layer l and the layer (l-1)\n",
"\n",
"Now, we try to find a relation between 𝛿 of the layer l and 𝛿 of the next layer (l+1):\n",
"\n",
"$$\n",
"\\delta^{(l)}_i = \\frac{\\partial J(\\theta)}{\\partial z^{(l)}_i} = \\sum^{s_{l+1}}_{k = 1} \\frac{\\partial J(\\theta)}{\\partial z^{(l+1)}_k} \\frac{\\partial z^{(l+1)}_k}{\\partial z^{(l)}_i}\n",
"$$\n",
"\n",
"With:\n",
"\n",
"$$\n",
"z^{(l+1)}_k = [ \\theta^{(l+1)}_k ]^T . g(z^{(l)}_p) = \\sum_{p=0}^{s_{l}} \\theta^{(l+1)}_{k,p} \\times g(z^{(l)}_p)\n",
"$$\n",
"\n",
"And:\n",
"\n",
"$$\n",
"\\frac{\\partial z^{(l+1)}_k}{\\partial z^{(l)}_i} = \\theta^{(l+1)}_{k,i} . g'(z^{(l)}_i)\\ \\ for\\ p\\ =\\ i,\\ else\\ 0.\n",
"$$\n",
"\n",
"Hence:\n",
"\n",
"---\n",
"\n",
"$$\n",
"\\delta^{(l)}_i \n",
"=\n",
"\\sum^{s_{l+1}}_{k = 0} \\delta^{(l+1)}_k . \\theta^{(l+1)}_{k,i} . g'(z^{(l)}_i) \n",
"$$\n",
"\n",
"---\n",
"\n",
"### The backpropagation algorithm explained\n",
"\n",
"We have the following:\n",
"- we have found a function ẟ for the layer l such that when we multiply this function by the output of the layer (l-1), we obtain the partial derivative of the cost function J on the weights θ of the layer l,\n",
"- the function ẟ for the layer l has a relation with ẟ of the layer (l+1),\n",
"- as we have a training set, we can compute the values of ẟ for the layer L.\n",
"\n",
"So, we start to compute the values of ẟ for the layer L. As we have a relation between ẟ for the layer l and ẟ for the layer (l+1), we can compute the values for the layers (L-1), (L-2), …, 2.\n",
"\n",
"We can then compute the partial derivative of the cost function J on the weights θ of the layer l, by multiplying ẟ for the layer l by the output of the layer (l-1).\n",
"\n",
"\n",
"### The vectorized backpropagation’s equations\n",
"\n",
"The __first equation__ gives the partial derivatives of J with respect to θ. We have added the regularization term.\n",
"\n",
"$$\n",
"\\nabla_{\\theta^{(l)}} J(\\theta) = \n",
"\\begin{bmatrix}\n",
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,s_{l-1}}} \\\\\n",
" \\vdots & & \\vdots & & \\vdots \\\\\n",
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,s_{l-1}}} \\\\\n",
" \\vdots & & \\vdots & & \\vdots \\\\\n",
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,s_{l-1}}} \\\\\n",
"\\end{bmatrix}\n",
"= \\delta^{(l)} \\otimes [a^{(l-1)}]^T + \\frac{\\lambda}{m} \\theta^{(l)}\n",
"$$\n",
"\n",
"The __second equation__ gives the relation between ẟ in layer l and ẟ in layer (l+1):\n",
"\n",
"$$\n",
"\\delta^{(l)} = [(\\theta^{(l+1)})^T . \\delta^{(l+1)}] \\odot g'(z^l)\n",
"$$\n",
"\n",
"The __third equation__ gives the value of ẟ for the layer L:\n",
"\n",
"$$\n",
"\\delta^{(L)} = \\nabla_{a^{(L)}} J(\\theta) \\odot g'(z^L)\n",
"$$\n",
"\n",
"\n",
"## Conclusion\n",
"\n",
"This cheat sheet explains the backpropagation algorithm used to train a neural network. I have created this article after following the great [Machine Learning’s course of Professor Andrew Ng](https://www.coursera.org/learn/machine-learning) on Coursera. The conventions used in this article are not exactly the ones used in the course of Professor Ng, nor exactly the ones used in [the article of Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html).\n",
"\n",
"If you notice any error, please do not hesitate to contact me. \n",
"\n",
"<div style=\"text-align: right\"> To Victor, Oscar and all those will follow </div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}