2017-08-25 Unbiased Online Recurrent Optimization
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\pderivdim}[2]{\overset{\big[\numel {#1} \times \numel {#2} \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimg}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimt}[2]{\overset{\big[\numel {#2} \times \numel {#1} \big]}{\frac{\partial #1}{\partial #2}^\intercal}}$
$\newcommand{\lderivdimt}[1]{\overset{\big[\numel {#1} \times 1 \big]}{\frac{\partial \mathcal L}{\partial #1}^\intercal}}$
$\newcommand{\lderivdim}[1]{\overset{\big[1\times |#1|\big]}{\frac{\partial \mathcal L}{\partial #1}}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
# Unbiased Online Recurrent Optimization
## Background 1: Forward Mode Automatic Differentiation
Backpropagation is not the only chain-rule method for calculating gradients.
Backprop is also known as "reverse-mode automatic differentiation".
Suppose you have a system:
\begin{align}
s_1 &= f_1(x, \theta_1) \\
s_2 &= f_2(s_1, \theta_2) \\
... \\
s_L &= f_L(s_{L-1}, \theta_L) \\
\mathcal L &= \ell(s_L, y)
\end{align}
If we want gradients w.r.t. $\theta_l$, we compute
$$
\lderivdim{\theta_l} =\lderivdim{s_L}\pderivdim{s_L}{s_{L-1}} ... \pderivdim{s_l}{\theta_l}
$$
That costs:
$$
\mathcal O(1\cdot\numel{s_L}\cdot \numel{s_{L-1}} + 1\cdot\numel{s_{L-1}}\cdot \numel {s_{L-2}} + ... +\blue{1\cdot \numel{s_l}\cdot \numel{\theta_l}})
$$
(In practice the $\blue{\text{last term}}$ is often much cheaper than it looks, because $\pderiv{s_l}{\theta_l}$ is sparse: not every element of $s_l$ depends on every element of $\theta_l$. E.g. for a fully-connected layer it costs $\blue{1\cdot \numel{s_l} \cdot \numel{s_{l-1}}}$ to compute $\lderiv{\theta_l}=\lderiv{s_l}\pderiv{s_l}{\theta_l}=\lderiv{s_l}\otimes s_{l-1}$.)
However, we could also do the computation in "forward mode":
$$
\lderivdim{\theta_l} =\Big(\pderivdimt{s_l}{\theta_l} ... \pderivdimt{s_L}{s_{L-1}} \lderivdimt{s_L} \Big)^\intercal
$$
Which gives costs:
$$
\mathcal O(\blue{\numel{\theta_l}\cdot \numel {s_l} \cdot \numel {s_{l+1}}} +\red{\numel{\theta_l}\cdot \numel {s_{l+1}} \cdot \numel {s_{l+2}}} + ... + \red{\numel{\theta_l}\cdot\numel{s_L}\cdot 1})
$$
Where the $\red{\text{red terms are REALLY EXPENSIVE!}}$ That's why you almost never see forward-mode automatic differentiation in machine learning. One way to think of this is that in forward mode we calculate, in advance, the parameter gradients for every possible loss, whereas in reverse mode (backprop) we start with one particular loss and calculate gradients only for that one.
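To make the cost asymmetry tangible, here is a minimal numpy sketch (my own illustration, not from the paper) that multiplies the same chain of Jacobians in both orders; reverse mode only ever carries a row vector, while forward mode carries full matrices:

```python
import numpy as np

np.random.seed(0)
sizes = [100, 100, 100, 100]                     # |s_1| ... |s_L|
jacobians = [np.random.randn(sizes[i + 1], sizes[i])
             for i in range(len(sizes) - 1)]     # ds_{l+1}/ds_l, dense for illustration
dL_dsL = np.random.randn(1, sizes[-1])           # dL/ds_L as a row vector

# Reverse mode: sweep backwards from the loss; every intermediate
# product is a (1 x |s_l|) row vector, so each step is cheap.
g = dL_dsL
for J in reversed(jacobians):
    g = g @ J

# Forward mode: sweep forwards from the bottom; every intermediate
# product is a full (|s_l| x |s_1|) matrix -- the "red" expensive terms.
G = jacobians[0]
for J in jacobians[1:]:
    G = J @ G
G = dL_dsL @ G

assert np.allclose(g, G)                         # same gradient, very different cost
```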
## Recurrent Networks
\begin{align}
s_t &= F_{state}(x_t, s_{t-1}, \theta) \\
o_t &= F_{out}(x_t, s_{t-1}, \theta) \\
\mathcal L_t &= \ell(o_t, y_t)\\
\end{align}
![](https://docs.google.com/drawings/d/e/2PACX-1vSh9t4_eX2cmmjiouLQxJ63ZfzJhVTIPXm0mIfkJrlt0PTMh1xsPUoRhQDAXOFrf3JboICVOiQ8UrRz/pub?w=598&h=332)
(Note that usually in RNNs we restrict this to
$$
F_{out}(x_t, s_{t-1}, \theta) = F_{toplayer}(F_{state}(x_t, s_{t-1}, \theta_{state}), \theta_{out})
$$ but the above formulation is more general.)
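For concreteness, one possible instance of this formulation is a vanilla tanh RNN with a linear readout, written in the restricted "top layer" form just noted (a hedged sketch of my own; the weight names are not from the paper):

```python
import numpy as np

def f_state(x_t, s_prev, theta):
    """F_state: the state transition of a vanilla tanh RNN."""
    W_x, W_s, b_s, W_o, b_o = theta
    return np.tanh(W_x @ x_t + W_s @ s_prev + b_s)

def f_out(x_t, s_prev, theta):
    """F_out, in the restricted 'top layer' form noted above."""
    W_x, W_s, b_s, W_o, b_o = theta
    return W_o @ f_state(x_t, s_prev, theta) + b_o
```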
## Backprop Through Time
BPTT can be used when we want to minimize the prediction error
$$
\mathcal L = \sum_t^T \mathcal L_t
$$
for some finite sequence of length $T$.
![](https://docs.google.com/drawings/d/e/2PACX-1vROEyAIgVgkM9gQQt2_OfA3y8KPOyaYNM9EZ9QhPDvXF-4rZ_-c3Uy3ATBdBCEb0GvDIterDr8MKkhc/pub?w=598&h=355)
## Truncated Backprop Through Time
However, if our sequence is infinite and we want to update online, we need to move to a regime where we stop backpropagating at some point:
![](https://docs.google.com/drawings/d/e/2PACX-1vSdHIqfAbtRI6L17Z5Xglars6NmAyuar9tZcrWKSl0g99VDtO-DmK2L0whGKECM9gFr4AarpP9xK6o7/pub?w=723&h=758)
The problem here is that memory and computation scale with the number of steps that we want to backprop through (2 in the above figure).
## Real-Time Recurrent Learning
RTRL avoids this backtracking by keeping a live, online estimate of $\pderiv{s_t}{\theta}$, updated at every step:
![](https://docs.google.com/drawings/d/e/2PACX-1vQR6XU-OETs26h6wqKTO4F5bXAzJFjTQY0PCy9cXrAvuEMPzsXyAHH6HwcTjAuet2QaPUx6244a-JqB/pub?w=723&h=758)
RTRL is simply the application of forward-mode automatic differentiation to the recurrent state update (Equation 5 in the paper):
$$
\pderiv{s_{t+1}}{\theta} = \overset{|s|\times |\theta|}{\pderiv{F_{state}}{\theta}(x_{t+1}, s_t, \theta)} + \overset{|s|\times |s|}{\pderiv {F_{state}}{s_t}(x_{t+1}, s_t, \theta)} \overset{|s|\times |\theta|}{\pderiv {s_t}{\theta}}
$$
The problem is: THIS IS REALLY EXPENSIVE. Storing $\pderiv{s_t}{\theta}$ takes $|s|\cdot|\theta|$ memory, and the Jacobian product in the update above costs $\mathcal O(|s|^2\cdot|\theta|)$ at every single time step.
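A minimal numpy sketch of this update for the tanh RNN above (my own illustration; to keep the Jacobians small, $\theta$ is just the flattened recurrent matrix $W_s$ and the input weights are taken as the identity):

```python
import numpy as np

np.random.seed(0)
n = 8                                  # |s|; here theta is W_s flattened, so |theta| = n^2
W_s = 0.5 * np.random.randn(n, n)
s = np.zeros(n)
ds_dtheta = np.zeros((n, n * n))       # running ds_t/dtheta: the |s| x |theta| matrix RTRL carries

for t in range(100):
    x = np.random.randn(n)             # input (input weights taken as identity for brevity)
    s_new = np.tanh(x + W_s @ s)
    d = 1.0 - s_new ** 2               # tanh'(pre-activation), elementwise

    dF_ds = d[:, None] * W_s                              # dF_state/ds_t:   |s| x |s|
    dF_dtheta = d[:, None] * np.kron(np.eye(n), s[None])  # dF_state/dtheta: |s| x |theta|

    # Equation 5: this matrix product is O(|s|^2 |theta|) at EVERY step
    ds_dtheta = dF_dtheta + dF_ds @ ds_dtheta
    s = s_new
```

The online gradient $\lderiv\theta = \lderiv{s_t}\pderiv{s_t}{\theta}$ is then just a vector-matrix product; it is the update itself that makes RTRL impractical for large networks.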
## The Trick
The trick is to do a stochastic "rank-1 approximation" of the big $|s|\times|\theta|$ matrix $\pderiv{s_t}{\theta}$ that RTRL carries.
I.e., say that:
$$
\pderiv{s_t}{\theta}\approx \tilde G_t \triangleq\tilde s_t \otimes \tilde \theta_t
$$
Where $\tilde s_t \in \mathbb R^{|s|}$, and $\tilde \theta_t \in \mathbb R^{|\theta|}$
Such that
$$
\mathbb E[\tilde G_t] = \pderiv{s_t}{\theta}
$$
So that instead of keeping track of the gradient with $|s|\times |\theta|$ parameters we only use $|s| + |\theta|$.
The rest of the paper shows:
1. How to do an efficient update $(\tilde s_t, \tilde \theta_t) \rightarrow (\tilde s_{t+1}, \tilde \theta_{t+1})$
2. How to efficiently approximate $\lderiv \theta$ given $\tilde s_t, \tilde \theta_t$
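For a flavour of point 2 (my own one-line derivation; the paper's actual estimator adds variance-reduction details): once we have $\pderiv{s_t}{\theta} \approx \tilde s_t \otimes \tilde \theta_t$, the chain rule collapses to
$$
\lderiv{\theta} = \lderiv{s_t}\pderiv{s_t}{\theta} \approx \left(\lderiv{s_t} \tilde s_t\right) \tilde\theta_t^\intercal
$$
a dot product followed by a scalar-vector product, costing only $\mathcal O(|s| + |\theta|)$ per step.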
## Rank 1 Trick
Suppose $A\in\mathbb R ^{M\times N}$ decomposes as a sum of $k$ outer products:
$$
A = \sum_i^k v_i \otimes w_i
$$
Sample independent random signs $\nu_i \sim U(\{-1, +1\})$ for $i \in [1..k]$.
Now define
\begin{align}
\tilde A &\triangleq \Big(\sum_i^k \nu_i v_i\Big) \otimes \Big(\sum_j^k \nu_j w_j\Big) = \sum_i^k \sum_j^k \nu_i \nu_j v_i \otimes w_j \\
&= \sum_i^k \nu_i^2 \, v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j \\
&= \sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j
\end{align}
which is rank one by construction (it is an outer product of two vectors), and where the last step uses $\nu_i^2 = 1$.
So,
\begin{align}
\mathbb E_\nu[\tilde A] &= \mathbb E\left[\sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j\right] \\
&= \sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \mathbb E[\nu_i \nu_j]\, v_i \otimes w_j \\
&= \sum_i^k v_i \otimes w_i \\
&= A
\end{align}
where the cross terms vanish because the signs are independent and zero-mean: $\mathbb E[\nu_i \nu_j] = \mathbb E[\nu_i]\,\mathbb E[\nu_j] = 0$ for $i \neq j$.
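As a quick numerical sanity check (my own sketch, not from the paper), we can verify that each draw of $\tilde A$ is rank one while the Monte-Carlo average converges to $A$:

```python
import numpy as np

np.random.seed(0)
M, N, k = 6, 5, 4
V = np.random.randn(k, M)              # rows are the v_i
W = np.random.randn(k, N)              # rows are the w_i
A = V.T @ W                            # A = sum_i v_i (outer) w_i

def rank1_sample():
    nu = np.random.choice([-1.0, 1.0], size=k)
    # (sum_i nu_i v_i) (outer) (sum_j nu_j w_j): rank one by construction
    return np.outer(nu @ V, nu @ W)

assert np.linalg.matrix_rank(rank1_sample()) == 1

acc = np.zeros_like(A)
n_samples = 200_000
for _ in range(n_samples):
    acc += rank1_sample()
print(np.abs(acc / n_samples - A).max())   # small: the average converges to A
```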