# 2017-11-03 Matrix Math
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\numel}[1]{|#1|}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3&\text{otherwise}\end{cases}}
\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
\newcommand{\overdim}[3]{\overset{\big[#1 \times #2 \big]}{#3}}
$$
## The Loss
Correction from the pen-and-paper assignment, question 2, step 2.3:
$$
\mathcal L = 0.5 \cdot (y_{out}-y)^2
$$
Should be
$$
\mathcal L = 0.5 \cdot \|y_{out}-y\|^2 = 0.5 \cdot \sum_i^N (y_{out, i} - y_i)^2
$$
i.e. the loss is a scalar, or, if you prefer matrix notation, $\mathcal L \in \mathbb R^{1\times 1}$. The derivative of the loss with respect to the output will be a vector:
$$
\pderiv{\mathcal L}{y_{out}} \in \mathbb R^{N \times 1}
$$
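As a quick sanity check on those shapes, here is a minimal numpy sketch (variable names are illustrative, not from the assignment):

```python
import numpy as np

N = 5                                    # minibatch size
y_out = np.random.randn(N, 1)            # stands in for the network output
y = np.random.randn(N, 1)                # stands in for the targets

loss = 0.5 * np.sum((y_out - y) ** 2)    # a scalar
dL_dy_out = y_out - y                    # gradient of the loss w.r.t. the output
print(np.shape(loss), dL_dy_out.shape)   # () (5, 1)
```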
## On order of operations
There are two ways to describe the forward and backward pass. It's useful to be aware of both of them.
Let's say $N$ is the number of samples in a minibatch, and $D_k$ is the dimensionality of layer $k$ (where $D_0$ is the input dimension and $z_0 := x$ is the input).
**1: Dimension-first in the forward pass, sample-first in the backward pass (common in mathematical notation, this pen-and-paper assignment, and Matlab code; avoids matrix transposes).**
$$
\overdim{D_k}{N}{s_k} = \overdim{D_{k}}{D_{k-1}}{W_k} \cdot \overdim{D_{k-1}}{N}{z_{k-1}}
$$
$$
\overdim{N}{D_{k-1}}{\delta_{k-1}} = \overdim{N}{D_k}{\delta_k} \cdot \overdim{D_k}{D_{k-1}} {W_k}\odot \overdim{N}{D_{k-1}}{\left[\pderiv{z_{k-1,ji}}{s_{k-1,ji}}\right]}_{\begin{matrix}i\in 1..N\\j \in 1..D_{k-1}\end{matrix}}
$$
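A numpy sketch of this convention, just to check the shapes (assuming a ReLU nonlinearity; the random arrays only stand in for real activations and errors):

```python
import numpy as np

N, D_prev, D_k = 5, 4, 3                   # minibatch size, size of layer k-1, size of layer k

W_k    = np.random.randn(D_k, D_prev)      # [D_k x D_{k-1}]
z_prev = np.random.randn(D_prev, N)        # [D_{k-1} x N], dimension-first

# Forward pass: no transpose needed
s_k = W_k @ z_prev                         # [D_k x N]

# Backward pass: delta is stored sample-first, again no transpose needed
delta_k    = np.random.randn(N, D_k)       # stands in for dL/ds_k, [N x D_k]
dz_ds      = (z_prev > 0).astype(float).T  # ReLU derivative of layer k-1, [N x D_{k-1}]
delta_prev = (delta_k @ W_k) * dz_ds       # [N x D_{k-1}]
print(s_k.shape, delta_prev.shape)         # (3, 5) (5, 4)
```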
**2: Sample-first (the common convention in TensorFlow, Theano, and PyTorch).**
$$
\overdim{N}{D_k}{s_k} = \overdim{N}{D_{k-1}}{z_{k-1}} \cdot \overdim{D_{k-1}}{D_{k}} {W_k}
$$
$$
\overdim{N}{D_{k-1}}{\delta_{k-1}} = \overdim{N}{D_{k}}{\delta_k}\cdot \overdim{D_{k}}{D_{k-1}} {W_k^T}\odot \overdim{N}{D_{k-1}}{\left[\pderiv{z_{k-1,ij}}{s_{k-1,ij}}\right]}_{\begin{matrix}i\in 1..N\\j \in 1..D_{k-1}\end{matrix}}
$$
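And the same shape check for the sample-first convention (again an illustrative sketch, not the assignment code):

```python
import numpy as np

N, D_prev, D_k = 5, 4, 3

W_k    = np.random.randn(D_prev, D_k)    # [D_{k-1} x D_k]
z_prev = np.random.randn(N, D_prev)      # [N x D_{k-1}], sample-first

# Forward pass
s_k = z_prev @ W_k                       # [N x D_k]

# Backward pass: the transpose shows up here instead
delta_k    = np.random.randn(N, D_k)     # stands in for dL/ds_k, [N x D_k]
dz_ds      = (z_prev > 0).astype(float)  # ReLU derivative of layer k-1, [N x D_{k-1}]
delta_prev = (delta_k @ W_k.T) * dz_ds   # [N x D_{k-1}]
print(s_k.shape, delta_prev.shape)       # (5, 3) (5, 4)
```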
## On Matrix Calculus
Let's take a small network with input, hidden, output sizes of $D_x, D_z, 1$ respectively, and a minibatch of size $N$:
\begin{align}
x &\in \mathbb R^{D_x \times N} \\
W &\in \mathbb R^{D_z \times D_x} \\
w &\in \mathbb R^{1 \times D_z} \\
s_z &:= W\cdot x &\in \mathbb R^{D_z \times N} \\
z &:= \mathrm{relu}(s_z) &\in \mathbb R^{D_z \times N}\\
y_{out} &:= w\cdot z &\in \mathbb R^{1 \times N} \\
\mathcal L &:= 0.5 \cdot \|y-y_{out}\|^2 &\in \mathbb R^{1\times 1}
\end{align}
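The forward pass of this small network might look like the following numpy sketch (sizes chosen arbitrarily for illustration):

```python
import numpy as np

D_x, D_z, N = 4, 3, 5

x = np.random.randn(D_x, N)              # input,          [D_x x N]
W = np.random.randn(D_z, D_x)            # hidden weights, [D_z x D_x]
w = np.random.randn(1, D_z)              # output weights, [1 x D_z]
y = np.random.randn(1, N)                # targets,        [1 x N]

s_z   = W @ x                            # [D_z x N]
z     = np.maximum(s_z, 0)               # relu, [D_z x N]
y_out = w @ z                            # [1 x N]
L     = 0.5 * np.sum((y - y_out) ** 2)   # scalar
```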
Matrix notation is just a compact way to express the gradients, but we can also write them in terms of scalars:
**In terms of scalars**
\begin{align}
{\pderiv{\mathcal L}{y_{out, i}}} &\in \mathbb R \\
{\pderiv{\mathcal L}{z_i}} &= \sum_j \pderiv{\mathcal L}{y_{out, j}} \pderiv{y_{out,j}}{z_i} \\
&= \sum_j \pderiv{\mathcal L}{y_{out, j}} w_{j,i} \\
...
\end{align}
**In terms of vectors**
\begin{align}
\pderiv{\mathcal L}{y_{out}} &\in \mathbb R^{N\times 1} \\
\pderiv{\mathcal L}{z} &= \pderiv{\mathcal L}{y_{out}} \pderiv{y_{out}}{z}\in \mathbb R^{N\times D_z} \\
&= \pderiv{\mathcal L}{y_{out}} \cdot w \\
...
\end{align}
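The two views can be checked against each other numerically. Here is a small numpy sketch comparing the matrix expression for $\pderiv{\mathcal L}{z}$ with the element-wise chain rule (the sum over output units $j$ collapses because the output here is one-dimensional; all names are illustrative):

```python
import numpy as np

D_x, D_z, N = 4, 3, 5
x = np.random.randn(D_x, N)
W = np.random.randn(D_z, D_x)
w = np.random.randn(1, D_z)
y = np.random.randn(1, N)

z     = np.maximum(W @ x, 0)             # [D_z x N]
y_out = w @ z                            # [1 x N]

# Vector form: dL/dy_out is [N x 1], dL/dz = dL/dy_out . w is [N x D_z]
dL_dy_out = (y_out - y).T                # [N x 1]
dL_dz     = dL_dy_out @ w                # [N x D_z]

# Scalar form: entry (n, i) is dL/dy_out[n] * w[0, i], the single term of the sum over output units
dL_dz_scalar = np.array([[dL_dy_out[n, 0] * w[0, i] for i in range(D_z)]
                         for n in range(N)])

assert np.allclose(dL_dz, dL_dz_scalar)
```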