{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 2\n",
"## Key concepts\n",
"A training set consists of m training examples\n",
"\n",
"One concept for this week: We don't use a for loop to step over these m examples - when implementing a neural network, we want to process our entire training set without a for loop\n",
"\n",
"Another concept for this week: When you organize the computation of a neural network, first you have a forward pass (forward propagation step), followed by a backward pass (backprop).\n",
"\n",
"## Binary Classification and key notation\n",
"Logistic regression is an algorithm for binary classification. Example: cat vs. non cat. Input is an image, output is a label (1 or 0). The output label is denoted by $y$.\n",
"\n",
"An image is represented by three separate matrices (representing the red, green and blue color channels). These matrices are concatenated in linear form to make up the training example (a \"feature vector\"). The pixel values are unrolled into the input feature vector. If the feature vector is $x$ then $n_{x}$ is the dimension of this feature vector, or sometimes just $n$.\n",
"\n",
"In binary classification our goal is to learn a classifier that can input an image represented by this feature vector x, and predict whether the corresponding label y is 1 or 0, that is, whether this is a cat image or a non-cat image.\n",
"\n",
"A single training example is represented by $(x,y)$ where x is an $n_{x}$-size feature vector and $y$, the label, is either $0$ or $1$.\n",
"\n",
"$m$ is the number of training examples.\n",
"\n",
"The training sets are written $(x^{1}, y^{1})$ for the first training example, up to $(x^{m},y^{m})$ for the last training example.\n",
"\n",
"$m_{train}$ might be the number of examples in a training set and $m_{test}$ might be the number of examples in a test set.\n",
"\n",
"$X$ would be a matrix where each feature vector is a column, so it would have $m$ columns and $n_{x}$ rows.\n",
"\n",
"If using numpy, X.shape would yield $(n_{x},m)$\n",
"\n",
"Using examples as columns makes the implementation of a neural network easier. This applies to $y$ as well, so with a matrix $Y$ containing the labels, it would have 1 row and its shape would be $(1,m)$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic Regression\n",
"Logistic regression is used when the output labels Y in a supervised learning problem are all either 0 or 1 (binary classification problems).\n",
"\n",
"$\\hat{y}$ is the estimate of the probability that $y$ is 1 for each example, so it has to be between 0 and 1.\n",
"\n",
"$\\hat{y}=P(y=1|x)$\n",
"\n",
"$x\\in\\mathbb{R}^{n_x}$\n",
"\n",
"The parameters of logistic regression are $w\\in\\mathbb{R}^{n_x}$ and $b\\in\\mathbb{R}$.\n",
"\n",
"### The sigmoid function\n",
"We could try $\\hat{y}=w^Tx+b$ (which we would use in linear regression), but that would not work because it would not be constrained to $0\\leq\\hat{y}\\leq1$, and $\\hat{y}$ is supposed to be a probability. But we can apply the sigmoid ($\\sigma$) function to constrain it. So our output $\\hat{y}$ becomes $\\hat{y}=\\sigma(w^Tx+b)$\n",
"\n",
"The sigmoid function is $\\sigma(z)=\\frac{1}{1+e^{-z}}$\n",
"\n",
"Where $z$ is large and negative, this approaches 0. Where $z$ is large and positive, this approaches 1.\n",
"\n",
"So our job is to try to learn parameters $w$ and $b$ so that $\\hat{y}$ becomes a good estimate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic Regression Cost Function\n",
"Measures how well our parameters, w and b, are doing on our entire training set. We need to define a cost function in order to train these parameters.\n",
"\n",
"Given $\\{(x^{(1)},y^{(1)}),(x^{(m)},y^{(m)})\\}$, we want to find parameters w and b such that $\\hat{y}^{(i)}$ (our estimated $y$) is as close to the ground truth $y^{(i)}$ as possible.\n",
"\n",
"Superscript $i$ refers to the $i$th training example.\n",
"\n",
"Our loss function, $\\mathscr{L}$, measures how good our output $\\hat y$ is compared to the true label $y$. $$\\mathscr{L}(\\hat y,y)$$ For reasons, a \"square error\" function doesn't work well here - we want a function that gives a convex gradient without local minima so that gradient descent works better and is easier to optimize.\n",
"\n",
"In logistic regression, we use $$\\mathscr{L}(\\hat y,y)=-(y\\log \\hat y+(1-y)\\log (1-\\hat y))$$\n",
"Looking at how this multiplies out, we see that:\n",
"\n",
"if $y=1: \\mathscr{L}(\\hat y,y)=-\\log \\hat y \\leftarrow$ we want $\\log \\hat y$ large, so that $\\hat y$ large.<br />\n",
"if $y=0: \\mathscr{L}(\\hat y,y)=-\\log(1-\\hat y) \\leftarrow$ want $\\log(1-\\hat y)$ large, so that $\\hat y$ small.\n",
"\n",
"And $\\hat y$ is always constrained to 0-1 by the sigmoid function.\n",
"\n",
"This loss function was for a single training example - the cost function: $J(w,b)$ is the average loss across the training set.\n",
"$$\\begin{align}\n",
"J(w,b) &= \\frac{1}{m} \\sum_{i=1}^{m} \\mathscr{L}(\\hat y^{(i)},y^{(i)}) \\\\\n",
"&= -\\frac{1}{m} \\sum_{i=1}^{m} \\left[ y^{(i)}\\log \\hat y^{(i)} + (1-y^{(i)})\\log (1-\\hat y^{(i)}) \\right]\n",
"\\end{align}$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gradient Descent\n",
"The gradient descent algorithm is how we actually train or learn the parameters w and b on our training set. We want to find $w$ and $b$ that minimize $J(w,b)$. We pick $J$ functions (as above) that form a single convex surface to search for the minimum loss.\n",
"\n",
"Process: Initialize $w$ and $b$ to some start value. Gradient descent takes one step in the steepest downhill direction. Stops when it reaches the global optimum.\n",
"\n",
"In detail, using only w for simplicity: At each step, gradient descent updates w by the learning rate $\\alpha$ times the derivative of J.\n",
"\n",
"$$w := w - \\alpha \\frac{dJ(w)}{dw}$$\n",
"\n",
"$dw$ is conventionally used as shorthand for the entire derivative while writing code.$\n",
"\n",
"For tuning both $w$ and $b$,\n",
"$$\n",
"\\begin{align}\n",
"w & := w - \\alpha \\frac{dJ(w,b)}{dw} \\\\\n",
"b & := b - \\alpha \\frac{dJ(w,b)}{db}\n",
"\\end{align}\n",
"$$\n",
"\n",
"Tuning the learning rate is another topic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computation Graph\n",
"The computation graph shows how the end result of a mathematical computation dependent on multiple variables is reached. If $a+bc$ and $d=2a$, then $d$ depends on $a$ which depends on $b$ and $c$. The graph can be drawn from left to right with the \"root\" variables at the left. Math can be done from left to right and derivatives can be computed (using the chain rule) from right to left."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Derivatives with a Computation Graph\n",
"\n",
"## Logistic Regression Gradient Descent\n",
"Recap, our setup:\n",
"\n",
"$$z = w^{T}x+b \\\\\n",
"\\hat y = a = \\sigma(z) \\\\\n",
"\\mathscr{L}(a,y)=-(y\\log a+(1-y)\\log (1-a))$$\n",
"\n",
"So, if there are are 2 training samples in $x$ (and thus 2 numbers in $w$), then z multiplies out as (remembering that $w^T$ is w transposed for vector dot product with $x$):\n",
"\n",
"$$z = w_{1}x_{1}+w_{2}x_{2}+b$$\n",
"\n",
"So our computation graph for the loss function is [$x_1$, $w_1$, $x_2$, $w_2$, $b$] $\\rightarrow$ $z$ $\\rightarrow$ $\\hat y$ $\\rightarrow$ $\\mathscr{L}$\n",
"\n",
"So in logistic regression we want to figure out $w$ and $b$ such that we minimize the loss. We can work from right to left to compute the derivatives and do gradient descent to minimize $\\mathscr{L}$.\n",
"\n",
"Calculus tells us that `da`, our shorthand for $\\frac{d\\mathscr{L}(a,y)}{da}$ or the slope of the loss function with respect to the sigmoid-ized output, is $-\\frac{y}{a}+\\frac{1-y}{1-a}$.\n",
"\n",
"`dz`, or $\\frac{d\\mathscr{L}}{dz}$ or $\\frac{d\\mathscr{L}(a,y)}{dz}$, or the slope of the loss function with respect to the output, comes out as $a-y$.\n",
"\n",
"Continuing the chain of derivatives,\n",
"$$\\begin{align}\n",
"\\frac{\\partial \\mathscr{L}}{\\partial w_1} = dw_1 &= x_{1}dz \\\\\n",
"dw_2 &= x_{2}dz \\\\\n",
"db &= dz\n",
"\\end{align}\n",
"$$\n",
"\n",
"__So if we wanted to do gradient descent in code on just training example 1, we would calculate `dz` as `a-y` and use the above formulas to perform the following updates:__\n",
"\n",
"$$\\begin{align}\n",
"w_1 &:= w_1 - \\alpha dw_1 \\\\\n",
"w_2 &:= w_2 - \\alpha dw_2 \\\\\n",
"b &:= b - \\alpha db\n",
"\\end{align}\n",
"$$\n",
"($\\alpha$ being the learning rate)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gradient Descent on m Examples\n",
"But we don't usually work on just one example. We want to do this for $m$ training examples! This means also that we are not just working with the loss function $\\mathscr{L}$, but with the cost function $J$ (the average of the loss across all the training examples).\n",
"\n",
"*Recap*:\n",
"$$J(w,b)=\\frac{1}{m}\\sum_{i=1}^{m}\\mathscr{L}(a^{(i)},y) \\\\\n",
"a^{(i)} = \\hat y^{(i)} = \\sigma (z^{(i)}) = \\sigma (w^T x^{(i)}+b)$$\n",
"\n",
"In the previous section, we saw how to compute the derivatives for just one training example. Using the formula for $J$ shown in the recap, calculus shows us that the derivative of J for $w_1$ is:\n",
"$$\\frac{\\partial}{\\partial w_1}J(w,b)=\\frac{1}{m} \\sum_{i=1} ^{m} \\frac{\\partial}{\\partial w_1}\\mathscr{L}(a^{(i)},y^{(i)})$$\n",
"We already know how to calculate everything after the sum sign (from the previous lesson) so all we have to do to find the slope of the cost function is take the average of the slopes of the losses for each training example.\n",
"\n",
"**So what do we actually have to implement?** \n",
"Here's how we can do it: \n",
"```J=0, dw1=0, dw2=0, db=0 \n",
"for i=1 to m ```<br />\n",
"&nbsp;&nbsp;$z^{(i)}$ = $w^{T}x^{(i)}+b$ \n",
"&nbsp;&nbsp;$a^{(i)} = \\sigma(z^{(i)})$ # prediction, $\\hat y$ \n",
"&nbsp;&nbsp;$J$ += $-\\left[ y^{(i)}\\log a^{(i)} + (1-y^{(i)})\\log (1-a^{(i)}) \\right]$ \n",
"&nbsp;&nbsp;$dz^{(i)} = a^{(i)}-y^{(i)}$ # superscript i because this is with respect to a single training example \n",
"&nbsp;&nbsp;$dw_1 += x_1^{(i)}dz^{(i)}$ # no superscript i because we are using it as an accumulator \n",
"&nbsp;&nbsp;$dw_2 += x_2^{(i)}dz^{(i)}$ \n",
"&nbsp;&nbsp;$db += dz^{(i)}$ \n",
"*(this all assumes only 2 features)* \n",
"Then, \n",
"$J /= m$ \n",
"$dw_1 /= m; dw_2 /=m; db /=m$\n",
"\n",
"Phew. With all these calculations we've computed the derivatives of the cost function J with respect to w1, w2, and b. The variables $dw_1$ and $dw_2$ were used as accumulators then divided to get averages at the end. That is why they were not superscripted with $i$.\n",
"\n",
"With the calculations above, we can now complete a single step of gradient descent by updating the variables:\n",
"$$\\begin{align}\n",
"w_1 &:= w_1 - \\alpha dw_1 \\\\\n",
"w_2 &:= w_2 - \\alpha dw_2 \\\\\n",
"b &:= b - \\alpha db\n",
"\\end{align}\n",
"$$\n",
"($\\alpha$ being the learning rate)\n",
"\n",
"Then all the calculations must be run again to take the next step of gradient descent.\n",
"\n",
"In the main loop above, we manually calculated for 2 features in $w$ and $x$. In reality there are probably more, so this would be implemented as a second for loop inside the main for loop. This is not very efficient, so vectorization is important in allowing us to scale to larger data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorization\n",
"Vectorization is the art of getting rid of explicit for loops. For example, when computing $z=w^{T}x +b$ where both w and x are one-column vectors, we might do this non-vectorized:\n",
"\n",
"`z=0\n",
"for i in range(n_x):\n",
" z += w[i] * x[i]\n",
"z += b`\n",
"\n",
"But if we did this vectorized, we might do:\n",
"\n",
"`z = np.dot(w, x) + b`\n",
"\n",
"Both GPUs and CPUs have SIMD instructions (single instruction multiple data)\n",
"\n",
"Most operations with numpy are vectorized.\n",
"\n",
"## More vectorization examples\n",
"The code from \"Gradient Descent on m Examples\" only supports two features. We can replace `dw1` and `dw2` with a vector `dw` initialized by `dw = np.zeros((n_x, 1))`, and instead of assigning to `dw1` and `dw2` we can use dw += x$^{(i)}$dz$^{(i)}$. At the end, instead of dividing `dw1` and `dw2` by `m`, we can do `dw /= m`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorizing logistic regression\n",
"How do we calculate the output and $\\hat y$ of an entire training set for one iteration of logistic regression, without looping over the training set?\n",
"\n",
"Recall that we set up the training examples as vector columns within a matrix X, making an $n_x$ by $m$ matrix.\n",
"\n",
"We can calculate a 1 by $m$ matrix with $Z=\\left[ z^{(1)}, z^{(2)},z^{(3)}, \\ldots, z^{(m)} \\right] =$ $w^{T}X+[b\\ b\\ b\\ \\ldots b]$ (last being a 1 by $m$ row vector.)\n",
"\n",
"In Python: ```Z = np.dot(w.T, X) + b```\n",
"\n",
"(Then: ```A = sigmoid(Z)``` (if we have a sigmoid function))\n",
"\n",
"**Broadcasting** is when the scalar `b` in the above line is automatically expanded to a row vector at addition time. https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorizing logistic regression's gradient output\n",
"How do we now calculate the gradients for an entire set of training examples all at the same time?\n",
"\n",
"From the lesson on logistic regression gradient descent above, $$dz^{(1)} = a^{(1)} - y^{(1)}$$ $$dz^{(2)} = a^{(2)} - y^{(2)}$$ $$\\ldots$$\n",
"\n",
"We define a row vector (in numpy, a 1 x m matrix) **dZ**, $$dZ = \\left[ dz^{(1)}, dz^{(2)}, dz^{(3)}, \\ldots, dz^{(m)}\\right]$$\n",
"We also have $$A = \\left[ a^{(1)}, a^{(2)}, a^{(3)}, \\ldots, a^{(m)}\\right]$$\n",
"and $$Y = \\left[ y^{(1)}, y^{(2)}, y^{(3)}, \\ldots, y^{(m)}\\right]$$\n",
"Using numpy, the calculation of dZ can be vectorized as \n",
"**```dZ = A - Y```**\n",
"\n",
"dw and db are calculated similarly. In the lesson *Gradient Descent on m Examples*, both were used as accumulators then divided by the number of iterations. We can just use `(1 / m) * np.sum` to do this for `db`, but `np.mean` works just as well.\n",
"\n",
"**```db = np.mean(dZ)\n",
"dw = (1 / m) * np.dot(X, dZ.T)```**\n",
"\n",
"When we go to update with the backpropagation,\n",
"\n",
"**```w = w - (lr * dw) # lr is learning rate\n",
"b = b - (lr * db)```**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Programming assignment 1 (Logistic regression with a neural network mindset)\n",
"**What you need to remember:**\n",
"\n",
"Common steps for pre-processing a new dataset are:\n",
"- Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, ...)\n",
"- Reshape the datasets such that each example is now a vector of size (num_px \\* num_px \\* 3, 1)\n",
"- \"Standardize\" the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Python Basics with Numpy\n",
"Key functions: np.exp, np.log, np.reshape\n",
"\n",
"math.exp: e to the x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 3: Shallow Neural Network\n",
"## Neural Networks Overview\n",
"In a neural network, instead of the inputs directly going into the calculation of $z$ (the output) then $a$ (the final activation) and $\\mathscr{L}$ (the loss), they can produce activations that are then inputs into other nodes, leading to a final node where the activation and loss are calculated.\n",
"\n",
"The property of belonging to the first stack of neural nodes (the first layer) is superscript square bracket, like $^{[1]}$. So instead of\n",
"$$z=wx+b \\rightarrow a=\\sigma(z)$$\n",
"we can have, for a two-layer neural network,\n",
"$$z^{[1]}=W^{[1]}x+b^{[1]} \\rightarrow a^{[1]}=\\sigma(z^{[1]}) \\rightarrow z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} \\rightarrow a^{[2]} = \\sigma (z^{[2]}) \\rightarrow \\mathscr{L}(a^{[2]},y)$$\n",
"We still use superscript round bracket to refer to individual training examples.\n",
"\n",
"Similarly, for the backpropagation step, we calculate the derivative respective to each layer while working backward through the neural network."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Neural Network Representation\n",
"The features are the **input layer**. \n",
"The first layer, represented by circles, is the first **hidden layer** of nodes. The hidden layer is not generally observed during training.\n",
"The last layer, represented by a circle, is the **output layer** responsible for generating $\\hat y$. \n",
"An alternate name for the input layer, where before we used $X$, is $a^{[0]}$, the input activations. The input layer passes the value X to the hidden layer. \n",
"The activations produced by the first hidden layer are called $a^{[1]}_1$, $a^{[1]}_2$ and so on. \n",
"$a^{[1]}$ refers to the column vector of the entire set of output activations for the first hidden layer. \n",
"$a^{[2]}$, in this one hidden layer example, would be the output of the neural network, or $\\hat y$. \n",
"This example would be called a **two layer neural network** because we do not count the input layer. \n",
"Each layer gets its own learned set of parameters, so w and b become $w^{[1]}$, $b^{[1]}$ for the first hidden layer and so on. \n",
"\n",
"The entire 2-layer neural network, with 1 hidden layer, can be vectorized with these 4 operations:\n",
"$$\\begin{align}\n",
"z^{[1]} & = W^{[1]} a^{[0]} + b^{[1]} (where\\ a^{[0]} = x) \\\\\n",
"a^{[1]} & = \\sigma (z^{[1]}) \\\\\n",
"z^{[2]} & = W^{[2]}a^{[1]} + b^{[2]} \\\\\n",
"a^{[2]} & = \\sigma(z^{[2]})\n",
"\\end{align}$$\n",
"This is for only one training example, so wherever an example index is required it would be like $x^{(1)}$ for the first training example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorizing across multiple (training) examples\n",
"This is the forward propagation step. For the backward propagation step, see <a href=\"#Gradient-descent-(backprop)-for-Neural-Networks\">Gradient descent (backprop) for Neural Networks</a> section below.\n",
"$$\\begin{align}\n",
"Z^{[1]} & = W^{[1]}X + b^{[1]} \\\\\n",
"A^{[1]} & = \\sigma (Z^{[1]}) \\\\\n",
"Z^{[2]} & = W^{[2]}A^{[1]} + b^{[2]} \\\\\n",
"A^{[2]} & = \\sigma (Z^{[2]})\n",
"\\end{align}$$\n",
"\n",
"W doesn't need to be transposed anymore because each **row** of W is the set of weights for one neuron."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explanation (justification) for vectorized implementation\n",
"It works because we've been laying out our matrices to allow for this.\n",
"<img src=\"2017-09-06 21_24_19-Explanation for Vectorized Implementation - deeplearning.ai _ Coursera.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Activation functions\n",
"In the more general case, we can have a function g(z) instead of $\\sigma (z)$ that acts as our activation function, that may or may not be the sigmoid function.\n",
"\n",
"**The tanh function** (hyperbolic tangent) is almost always strictly superior to sigmoid for activation. Instead of going between 0 and 1, it goes between -1 and +1.\n",
"\n",
"The formula for tanh(z) is $\\frac{e^z - e^{-z}}{e^z + e^{-z}}$\n",
"\n",
"tanh is a shifted version of the sigmoid function. Since it runs from -1 to +1, the outputs of the hidden layer tend to have a mean around 0. This is similar to how we center and scale our input data.\n",
"\n",
"Sigmoid activation does make sense for the output layer when using binary classification, where $\\hat y$ should be between 0 and 1.\n",
"\n",
"To indicate a particular activation function is used on a certain layer, the notation is like $g^{[1]}(z^{[1]})=\\tanh(z^{[1]})$\n",
"\n",
"**Another choice** that is popular is the Rectified Linear Unit (**ReLU**), $a=\\max(0,z)$\n",
"\n",
"With ReLU, the derivative is 1 with positive z and 0 with negative 0, with an inflection point at (0,0) which is not differentiable (but you can just fake it). One advantage with ReLU is that even with large positive values of $z$ the derivative does not flatten toward zero, which would slow down gradient descent. Of course the derivative is 0 for z values <1, but in practice most training examples have z>0 and learning still occurs quickly.\n",
"\n",
"**Leaky relu** is a variant on ReLU that has a slight positive differential for z<0. It might be implemented like $a=\\max(0.01z,z)$ (the 0.01 could even be parameterized as part of the learning...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why do you need non-linear activation functions?\n",
"If you just set $g(z)=z$ (a \"linear\" or \"identity\" activation), then $\\hat y$, even with hidden layers, simply becomes a linear function of the input features $x$. In fact, no matter how many hidden layers you add, with this activation you are still only computing a function of the input.\n",
"\n",
"One place where a linear activation function might be used is in the *output* node if $\\hat y$ is to be a real number (e.g. machine learning on real estate data where $\\$0 \\leq \\hat y \\leq \\$1000000$)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Derivatives of activation functions\n",
"### Sigmoid function\n",
"$$\\begin{align}\n",
"g(z) =\\sigma(z) & =\\frac{1}{1+e^{-z}} \\\\\n",
"g'(z) =\\frac{d}{dz}g(z) & =g(z)(1-g(z)) \\\\\n",
"& =a(1-a)\n",
"\\end{align}$$\n",
"\n",
"### TanH function\n",
"$$\\begin{align}\n",
"g(z)=\\tanh(z) & =\\frac{e^z-e^{-z}}{e^z+e^{-z}} \\\\\n",
"g'(z)=\\frac{d}{dz}g(z) & =1-(\\tanh(z))^2 \\\\\n",
"& =1-a^2\n",
"\\end{align}$$\n",
"\n",
"### ReLU\n",
"$$\\begin{align}\n",
"g(z) & =\\max(0,z) \\\\\n",
"g'(z) & = \\begin{cases}\n",
"0, & \\text{if } z<0 \\\\\n",
"1, & \\text{if } z>0 \\\\\n",
"\\text{undefined}, & \\text{if } z=0\n",
"\\end{cases}\n",
"\\end{align}$$\n",
"(In code, we would just let $g'=1 \\text{ if } z \\geq 0$.)\n",
"\n",
"## Leaky ReLU\n",
"$$\\begin{align}\n",
"g(z) & = \\max(0.01z, z) \\\\\n",
"g'(z) & = \\begin{cases}\n",
"0.01, & \\text{if } z<0 \\\\\n",
"1, & \\text{if } z>0 \\\\\n",
"\\text{undefined}, & \\text{if } z=0\n",
"\\end{cases}\n",
"\\end{align}$$\n",
"(In code, we would treat the zero condition like ReLU above.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gradient descent (backprop) for Neural Networks\n",
"In a network like the one we've been considering with 1 hidden layer, the matrix sizes are as follows (where $n^{[0]}=n_x$):\n",
"$$W^{[1]}:\\ (n^{[1]},n^{[0]}) \\\\\n",
"b^{[1]}:\\ (n^{[1]},1) \\\\\n",
"W^{[2]}:\\ (n^{[2]},n^{[1]}) \\\\\n",
"b^{[2]}:\\ (n^{[2]},1)$$\n",
"(and $n^{[2]}=1$ because we have only one output node.) \n",
"Cost function:\n",
"$$J(W^{[1]},b^{[1]},W^{[2]},b^{[2]})=\\frac{1}{m}\\sum_{i=1}^{n}\\mathscr{L}(\\hat y,y)$$\n",
"(and $\\hat y$ is $a^{[2]}$).\n",
"\n",
"To do gradient descent: \n",
"Repeat: \n",
"&nbsp;&nbsp;Forward propagation (compute $\\hat y^{(i)}, i=1,\\ldots,m$) \n",
"&nbsp;&nbsp;Backward propagation (compute $dW^{[1]}=\\frac{dJ}{dW^{[1]}}, db^{[1]}=\\frac{dJ}{db^{[1]}}, \\ldots$ other parameters) \n",
"&nbsp;&nbsp;Update ($W^{[1]}:=W^{[1]}-\\alpha dW^{[1]},\\ b^{[1]}:=b^{[1]}-\\alpha db^{[1]},\\ \\ldots$ other parameters)\n",
"\n",
"For the **forward propagation** step, this is copied from the <a href=\"#Vectorizing-across-multiple-(training)-examples\">Vectorizing across multiple (training) examples</a> section above:\n",
"$$\\begin{align}\n",
"Z^{[1]} & = W^{[1]}X + b^{[1]} \\\\\n",
"A^{[1]} & = \\sigma (Z^{[1]}) \\\\\n",
"Z^{[2]} & = W^{[2]}A^{[1]} + b^{[2]} \\\\\n",
"A^{[2]} & = \\sigma (Z^{[2]})\n",
"\\end{align}$$\n",
"For the **backward propagation step**, we compute the derivatives as follows:\n",
"$$\\begin{align}\n",
"dZ^{[2]} & = A^{[2]}-Y \\\\\n",
"dW^{[2]} & = \\frac{1}{m}dZ^{[2]}A^{[1]T} \\\\\n",
"db^{[2]} & = \\frac{1}{m}\\texttt{np.sum(dZ}^{[2]}\\texttt{, axis=1, keepdims=True)} \\tag a \\\\\n",
"dZ^{[1]} & = W^{[2]T}dZ^{[2]} \\times g^{[1]'}(Z^{[1]}) \\tag b \\\\\n",
"dW^{[1]} & = \\frac{1}{m}dZ^{[1]}X^T \\\\\n",
"db^{[1]} & = \\frac{1}{m}\\texttt{np.sum(dZ}^{[1]}\\texttt{, axis=1, keepdims=True)} \\\\\n",
"\\end{align}$$\n",
"$(\\text a)$ the `keepdims=True` option keeps numpy from turning a matrix with one row or column into a rank 1 array. \n",
"$(\\text b)$ both the first and second terms of the product are of size $(n^{[1]},m)$ and the product is element-wise, not a matrix multiplication."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random initialization\n",
"If we initialize all the weights to zero, all the neurons start out identical and their derivatives are also identical. Sad! After each training iteration our neurons would still be the same. (We can initialize the bias to zero without hurting anything.)\n",
"\n",
"The solution is to initialize all our parameters randomly. \n",
"```W1 = np.random.randn((dimsize1, dimsize2)) * 0.01\n",
"b1 = np.zero((dimsize1, 1))\n",
"W2 = ...\n",
"b2 = 0```\n",
"\n",
"We multiply by 0.01 to get very small random values. This is to keep us from ending up with large negative or positive $z$ values and ending up on the \"flat\" areas of the sigmoid or tanh curves."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 4\n",
"## Deep L-layer neural network\n",
"### Notation\n",
"$L$: the number of layers in a neural network \n",
"$n^{[\\mathscr{l}]}$: the number of units in layer $\\mathscr{l}$ \n",
"$a^{[\\mathscr{l}]}$: the activations in layer $\\mathscr{l}$ \n",
"$a^{[\\mathscr{l}]}=g^{[\\mathscr{l}]}(z^{[\\mathscr{l}]})$ \n",
"$W^{[\\mathscr{l}]}$: the weights for computing $z^{[\\mathscr{l}]}$ \n",
"$b^{[\\mathscr{l}]}$: the bias for computing $z^{[\\mathscr{l}]}$ \n",
"The input layer is $x$, and $x = a^{[0]}$ \n",
"The final output is $\\hat y$, and $\\hat y = a^{[L]}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Forward propagation in a deep neural network\n",
"For each layer, the forward propagation looks the same:\n",
"$$z^{[\\mathscr{l}]}=w^{[\\mathscr{l}]}a^{[\\mathscr{l-1}]}+b^{[\\mathscr{l}]} \\\\\n",
"a^{[\\mathscr{l}]}=g^{[\\mathscr{l}]}(z^{[\\mathscr{l}]})$$\n",
"for one training example. How about vectorizing it? The matrices and bias vectors are set up so that it multiplies correctly. $Z$ contains outputs stacked in one column per training example.\n",
"$$Z^{[\\mathscr{l}]}=W^{[\\mathscr{l}]}A^{[\\mathscr{l-1}]}+b^{[\\mathscr{l}]} \\\\\n",
"A^{[\\mathscr{l}]}=g^{[\\mathscr{l}]}(Z^{[\\mathscr{l}]})$$\n",
"\n",
"When implementing this across $L$ layers, it is ok to use a `for` loop!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting your matrix dimensions right\n",
"Andrew Ng likes to pull out pencil and paper to check his matrix dimensions to eliminate one common source of implementation bugs.\n",
"\n",
"### Matrix multiplication refresher\n",
"You can only matrix-multiply matrices $I$ and $J$, $IJ$, if the number of columns in $I$ is equal to the number of rows in $J$. \n",
"When you do, the resulting matrix will have the number of columns of $J$ and the number of rows of $I$. \n",
"Each position in the resulting matrix gets the dot product of the corresponding *row* of the first matrix and the corresponding *column* of the second matrix.\n",
"\n",
"**Remember:** Can I multiply them? Count right, then down. \n",
"What will the result be? Count down, then right.\n",
"\n",
"**Example:**\n",
"$$\\begin{bmatrix}\n",
"3 & 8 & 1 \\\\\n",
"2 & 9 & 3\n",
"\\end{bmatrix}\n",
"\\begin{bmatrix}\n",
"7 \\\\\n",
"1 \\\\\n",
"3\n",
"\\end{bmatrix}$$\n",
"Counting right in the first matrix, then down in the second, gets you $3$ and $3$ so yes, you can multiply them. \n",
"Counting down in the first matrix and across in the second gets you $(2,1)$ so that will be the dimensions of the product.\n",
"\n",
"### Non-vectorized (single training example) matrix sizes\n",
"The output, $z^{[\\mathscr l]}$, is an $(n^{[\\mathscr l]}, 1)$ dimensional matrix, because for each neural node in layer $\\mathscr l$, $z^{[\\mathscr l]}$ has one number in a row by itself.\n",
"$$Z=\\begin{bmatrix}1 \\\\ 2 \\\\ 3 \\\\ \\vdots \\end{bmatrix}$$\n",
"\n",
"On the input layer, for each input example, lowercase x is a $(n^0,1)$ dimensional matrix. \n",
"Therefore since $z=Wx+b$, $W$'s dimensions must be $(n^{[1]}, n^{[0]})$. \n",
"**More generally**, the dimensions of $W^{[\\mathscr l]}$ must be $(n^{[\\mathscr l]}, n^{[\\mathscr{l} - 1]})$. \n",
"So in this neural network, in a non-vectorized, single-training example version:\n",
"<img src=\"2017-09-10 19_19_10-Getting your matrix dimensions right - deeplearning.ai _ Coursera.png\"> \n",
"The dimensions of $z^{[1]}$ are $(3,1)$ \n",
"The dimensions of $z^{[2]}$ are $(5,1)$ \n",
"$\\ldots$\n",
"\n",
"The dimensions of $W^{[1]}$ are $(3,2)$ \n",
"The dimensions of $W^{[2]}$ are $(5,3)$ \n",
"The dimensions of $W^{[3]}$ are $(4,5)$ \n",
"The dimensions of $W^{[4]}$ are $(2,4)$ \n",
"The dimensions of $W^{[L]}$ are $(1,2)$ \n",
"\n",
"Going on to think about the dimensions of $b^{[\\mathscr l]}$, it must be the same as $Z$ and $W^{[\\mathscr l]}A^{[\\mathscr{l}-1]}$ in order to be added element-wise. (One bias number must be added to each output.) \n",
"**More generally**, the dimensions of $b^{[\\mathscr l]}$ must be $(n^{[\\mathscr l]},1)$.\n",
"\n",
"**In backpropagation**, the dimensions of $dW$ and $db$ for each layer are the same as $W$ and $b$ for that layer.\n",
"\n",
"### Dimensions of a vectorized implementation\n",
"Non-vectorized (single training example:)\n",
"$$\\begin{align}\n",
"z^{[1]} & = & W^{[1]} & x & + & b^{[1]} \\\\\n",
"(n^{[1]},1) & & (n^{[1]},n^{[0]}) & (n^{[0]},1) & & (n^{[1]},1) \\\\\n",
"\\end{align}$$\n",
"Vectorized (all $m$ training examples simultaneously:)\n",
"$$\\begin{align}\n",
"Z^{[1]} & = & W^{[1]} & X & + & b^{[1]} \\\\\n",
"(n^{[1]},m) & & (n^{[1]},n^{[0]}) & (n^{[0]},m) & & (n^{[1]},1) \\\\\n",
"\\end{align}$$\n",
"The size of W is the same. \n",
"Note that the product of $W^{[1]}$ and $X$ is a $(n^{[1]},m)$ dimension matrix. The (1) dimension in $b$ is broadcast across the result. \n",
"**More generally**, $Z^{[\\mathscr{l}]}, A^{[\\mathscr{l}]}$ are of dimension $(n^{[\\mathscr{l}]},m)$. In backpropagation, $dZ^{[\\mathscr{l}]}$ and $dA^{[\\mathscr{l}]}$ have the same dimensions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why *deep* representations?\n",
"Early layers of a deep neural network can detect simpler features like edges, allowing the deeper layers to compose them together to perform more complex functions. In this case the earlier layers would be looking at smaller segments of the image and the later layers would be looking at larger areas.\n",
"\n",
"### Circuit theory and deep learning\n",
"Informally: There are functions that you can compute with relatively \"small\" L-layer deep neural networks that shallower neural networks require exponentially more hidden units to compute."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building blocks of deep neural networks\n",
"For a given layer $\\mathscr l$ of a deep neural network, we need to implement: \n",
"&nbsp;&nbsp;Forward propagation: Takes $a^{[\\mathscr l - 1]}$, uses W and b, and gives $a^{[\\mathscr l]}$, caches W, b, $a^{[\\mathscr l]}$ and $z^{[\\mathscr l]}$ for later use \n",
"&nbsp;&nbsp;Backward propagation: Takes $da^{[\\mathscr l]}$, uses W, b and the cached $z$ and gives $da^{[\\mathscr l - 1]}$ as well as $dw^{[\\mathscr l]}$ and $db^{[\\mathscr l]}$ for gradient descent \n",
"### Sketch of a single layer's computations\n",
"<img src=\"2017-09-13 20_50_32-Building blocks of deep neural networks - deeplearning.ai _ Coursera.png\"> \n",
"### Sketch of an entire deep neural network's computations\n",
"<img src=\"2017-09-13 21_29_33-Building blocks of deep neural networks - deeplearning.ai _ Coursera.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing forward and backward propagation in a deep neural network\n",
"**Vectorized forwardprop:**\n",
"$$\\begin{align}\n",
"& Z^{[\\mathscr{l}]} = W^{[\\mathscr{l}]} A^{[\\mathscr{l-1}]}+b^{[\\mathscr{l}]} \\\\\n",
"& A^{[\\mathscr{l}]} = g^{[\\mathscr{l}]}(Z^{[\\mathscr{l}]})\n",
"\\end{align}$$\n",
"\n",
"**Vectorized backprop:**\n",
"$$\\begin{align}\n",
"& dZ^{[\\mathscr{l}]} = dA^{[\\mathscr{l}]} * g^{[\\mathscr{l}]\\prime} (Z^{[\\mathscr{l}]}) \\\\\n",
"& dW^{[\\mathscr{l}]} = \\frac{1}{m}dZ^{[\\mathscr{l}]}A^{[\\mathscr{l-1}]T} \\\\\n",
"& db^{[\\mathscr{l}]} = \\frac{1}{m}\\texttt{np.sum(dZ}^{[\\mathscr{l}]}\\texttt{, axis=1, keepdims=True)} \\\\\n",
"& dA^{[\\mathscr{l}]} = W^{[\\mathscr{l}]T}dZ^{[\\mathscr{l}]}\n",
"\\end{align}$$\n",
"\n",
"The initial input to forward propagation is obviously X, but the initial input to backprop is the derivative of loss with respect to $\\hat y$, which is $-\\frac{y}{a}+\\frac{1-y}{1-a}$ (or rather the vector containing the result of this formula for each training example)."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Parameters vs. hyperparameters\n",
"Parameters are W1, b1, W2, b2, etc. \n",
"Hyperparameters are things like learning rate $\\alpha$, number of iterations of gradient descent, number of hidden layers $L$, number of hidden units like $n^{[1]}$, $n^{[2]}$, etc., choice of activation functions. These are called hyperparameters because they end up determining the final value of the parameters. \n",
"In the second course, we will see more hyperparameters like momentum, mini batch size, and regularization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}