# 2017-11-03 Matrix Math
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\numel}[1]{|#1|}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3&\text{otherwise}\end{cases}}
\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
\newcommand{\overdim}[3]{\overset{\big[#1 \times #2 \big]}{#3}}
$$
## The Loss
Correction from pen-and-paper assignment 2, step 2.3:
$$
\mathcal L = 0.5 \cdot (y_{out}-y)^2
$$
should be
$$
\mathcal L = 0.5 \cdot \|y_{out}-y\|^2 = 0.5 \cdot \sum_{i=1}^N (y_{out, i} - y_i)^2
$$
i.e. the loss is a scalar, or, if you're using matrix notation, $\mathcal L \in \mathbb R^{1 \times 1}$. The derivative of the loss with respect to the output will be a vector:
$$
\pderiv{\mathcal L}{y_{out}} \in \mathbb R^{N \times 1}
$$
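To make these shapes concrete, here is a minimal numpy sketch (the variable names are mine, not the assignment's):

```python
import numpy as np

N = 5
y_out = np.random.randn(N, 1)   # network outputs, one per sample
y = np.random.randn(N, 1)       # targets

loss = 0.5 * np.sum((y_out - y) ** 2)   # scalar: 0.5 * ||y_out - y||^2
dL_dy_out = y_out - y                   # gradient dL/dy_out

assert dL_dy_out.shape == (N, 1)        # a vector, as claimed above
```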
## On Order of Operations
There are two ways to describe the forward and backward pass. It's useful to be aware of both of them.
Let's say $N$ is the number of samples in a minibatch, and $D_k$ is the dimensionality of layer $k$ (where $D_0$ is the input dimension, i.e. $z_0:=x$ is the input).
**1: Dimension-first in forward, sample-first in backward (common in mathematical notation, this pen-and-paper assignment, and Matlab code. Avoids matrix transposes.)**
$$
\overdim{D_k}{N}{s_k} = \overdim{D_{k}}{D_{k-1}}{W_k} \cdot \overdim{D_{k-1}}{N}{z_{k-1}}
$$
$$
\overdim{N}{D_{k-1}}{\delta_{k-1}} = \overdim{N}{D_k}{\delta_k} \cdot \overdim{D_k}{D_{k-1}} {W_k}\odot \overdim{N}{D_{k-1}}{\left[\pderiv{z_{k-1,ji}}{s_{k-1,ji}}\right]}_{\begin{matrix}i\in 1..N\\j \in 1..D_{k-1}\end{matrix}}
$$
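A small numpy sketch of convention 1, assuming activations are stored dimension-first and deltas sample-first (all names here are illustrative):

```python
import numpy as np

N, D_prev, D_k = 4, 3, 2                  # minibatch size, D_{k-1}, D_k
W_k = np.random.randn(D_k, D_prev)        # [D_k x D_{k-1}]
z_prev = np.random.randn(D_prev, N)       # z_{k-1}, [D_{k-1} x N]

# Forward: s_k = W_k . z_{k-1}, no transpose needed
s_k = W_k @ z_prev                        # [D_k x N]

# Backward: delta_{k-1} = (delta_k . W_k) (*) relu'(s_{k-1}), deltas sample-first
delta_k = np.random.randn(N, D_k)         # [N x D_k]
s_prev = np.random.randn(D_prev, N)       # pre-activations s_{k-1}, [D_{k-1} x N]
dz_ds = (s_prev > 0).astype(float).T      # relu'(s_{k-1}), transposed to [N x D_{k-1}]
delta_prev = (delta_k @ W_k) * dz_ds      # [N x D_{k-1}]
assert delta_prev.shape == (N, D_prev)
```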
**2: Sample-first in both (common convention in TensorFlow, Theano, PyTorch)**
$$
\overdim{N}{D_k}{s_k} = \overdim{N}{D_{k-1}}{z_{k-1}} \cdot \overdim{D_{k-1}}{D_{k}} {W_k}
$$
$$
\overdim{N}{D_{k-1}}{\delta_{k-1}} = \overdim{N}{D_{k}}{\delta_k}\cdot \overdim{D_{k}}{D_{k-1}} {W_k^T}\odot \overdim{N}{D_{k-1}}{\left[\pderiv{z_{k-1,ij}}{s_{k-1,ij}}\right]}_{\begin{matrix}i\in 1..N\\j \in 1..D_{k-1}\end{matrix}}
$$
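The same layer in convention 2, sample-first throughout; note the transpose on $W_k$ in the backward pass (again, a sketch with illustrative names):

```python
import numpy as np

N, D_prev, D_k = 4, 3, 2                  # minibatch size, D_{k-1}, D_k
W_k = np.random.randn(D_prev, D_k)        # [D_{k-1} x D_k]
z_prev = np.random.randn(N, D_prev)       # z_{k-1}, [N x D_{k-1}]

s_k = z_prev @ W_k                        # forward: [N x D_k]

delta_k = np.random.randn(N, D_k)         # [N x D_k]
s_prev = np.random.randn(N, D_prev)       # pre-activations s_{k-1}, [N x D_{k-1}]
dz_ds = (s_prev > 0).astype(float)        # relu'(s_{k-1}), already sample-first
delta_prev = (delta_k @ W_k.T) * dz_ds    # [N x D_{k-1}], transpose required here
assert delta_prev.shape == (N, D_prev)
```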
## On Matrix Calculus
Let's take a small network with input, hidden, and output sizes of $D_x, D_z, D_y=1$ respectively, and a minibatch of size $N$:
\begin{align}
x &\in \mathbb R^{D_x \times N} \\
W &\in \mathbb R^{D_z \times D_x} \\
w &\in \mathbb R^{D_y \times D_z} \\
s_z &:= W\cdot x &\in \mathbb R^{D_z \times N} \\
z &:= \operatorname{relu}(s_z) &\in \mathbb R^{D_z \times N}\\
y_{out} &:= w\cdot z &\in \mathbb R^{D_y \times N} \\
\mathcal L &:= 0.5 \cdot \|y_{out}-y\|^2 &\in \mathbb R^{1 \times 1}
\end{align}
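As a sanity check on these shapes, a quick numpy sketch of the forward pass (illustrative names, not the assignment's code):

```python
import numpy as np

N, D_x, D_z, D_y = 4, 3, 5, 1
x = np.random.randn(D_x, N)
W = np.random.randn(D_z, D_x)
w = np.random.randn(D_y, D_z)
y = np.random.randn(D_y, N)

s_z = W @ x                          # [D_z x N]
z = np.maximum(0, s_z)               # relu, [D_z x N]
y_out = w @ z                        # [D_y x N] = [1 x N]
L = 0.5 * np.sum((y_out - y) ** 2)   # scalar
```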
Matrix notation is just a compact way to express the gradients, but we can also write them in terms of scalars:
**In terms of scalars**
\begin{align}
{\pderiv{\mathcal L}{y_{out, i}}} &\in \mathbb R \\
{\pderiv{\mathcal L}{z_i}} &= \sum_j^{D_y} \pderiv{\mathcal L}{y_{out, j}} \pderiv{y_{out,j}}{z_i} \\
&= \sum_j^{D_y} \pderiv{\mathcal L}{y_{out, j}} w_{j,i} \\
&\;\;\vdots
\end{align}
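A loop version of this scalar chain rule, just to make the indexing explicit (a sketch with made-up names; here $D_y = 1$):

```python
import numpy as np

D_y, D_z = 1, 5
w = np.random.randn(D_y, D_z)
dL_dy_out = np.random.randn(D_y)          # dL/dy_out_j, one entry per output component

# dL/dz_i = sum_j dL/dy_out_j * w_{j,i}
dL_dz = np.zeros(D_z)
for i in range(D_z):
    for j in range(D_y):
        dL_dz[i] += dL_dy_out[j] * w[j, i]

assert np.allclose(dL_dz, dL_dy_out @ w)  # agrees with the vectorized form
```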
**In terms of vectors**
\begin{align}
\pderiv{\mathcal L}{y_{out}} &\in \mathbb R^{N\times 1} \\
\pderiv{\mathcal L}{z} &= \pderiv{\mathcal L}{y_{out}} \pderiv{y_{out}}{z} \in \mathbb R^{N\times D_z} \\
&= \pderiv{\mathcal L}{y_{out}} \cdot w \\
&\;\;\vdots
\end{align}
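And the vector form checked end-to-end in numpy (a self-contained sketch; names are illustrative):

```python
import numpy as np

N, D_x, D_z = 4, 3, 5
x = np.random.randn(D_x, N)
W = np.random.randn(D_z, D_x)
w = np.random.randn(1, D_z)
y = np.random.randn(1, N)

z = np.maximum(0, W @ x)        # [D_z x N]
y_out = w @ z                   # [1 x N]

dL_dy_out = (y_out - y).T       # [N x 1], matches dL/dy_out above
dL_dz = dL_dy_out @ w           # [N x 1] . [1 x D_z] = [N x D_z]
assert dL_dy_out.shape == (N, 1) and dL_dz.shape == (N, D_z)
```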