2017-08-25 Unbiased Online Recurrent Optimization
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\pderivdim}[2]{\overset{\big[\numel {#1} \times \numel {#2} \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimg}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimt}[2]{\overset{\big[\numel {#2} \times \numel {#1} \big]}{\frac{\partial #1}{\partial #2}^\intercal}}$
$\newcommand{\lderivdimt}[1]{\overset{\big[\numel {#1} \times 1 \big]}{\frac{\partial \mathcal L}{\partial #1}^\intercal}}$
$\newcommand{\lderivdim}[1]{\overset{\big[1\times |#1|\big]}{\frac{\partial \mathcal L}{\partial #1}}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
# Unbiased Online Recurrent Optimization
## Background 1: Forward-Mode Automatic Differentiation
Backpropagation is not the only chain-rule method for computing gradients. Backprop is also known as "reverse-mode automatic differentiation".
Suppose you have a system:
\begin{align}
s_1 &= f_1(x, \theta_1) \\
s_2 &= f_2(s_1, \theta_2) \\
... \\
s_L &= f_L(s_{L-1}, \theta_L) \\
\mathcal L &= \ell(s_L, y)
\end{align}
If we want the gradient w.r.t. $\theta_l$, we compute
$$
\lderivdim{\theta_l} =\lderivdim{s_L}\pderivdim{s_L}{s_{L-1}} ... \pderivdim{s_l}{\theta_l}
$$
That costs:
$$
\mathcal O(1\cdot\numel{s_L}\cdot \numel{s_{L-1}} + 1\cdot\numel{s_{L-1}}\cdot \numel {s_{L-2}} + ... +\blue{1\cdot \numel{s_l}\cdot \numel{\theta_l}})
$$
(In practice the $\blue{\text{last term}}$ is often much cheaper, because $\pderiv{s_l}{\theta_l}$ is sparse: not every element of $s_l$ depends on every element of $\theta_l$. E.g. for a fully-connected layer it costs $\blue{1\cdot \numel{s_l} \cdot \numel{s_{l-1}}}$ to compute $\lderiv{\theta_l}=\lderiv{s_l}\pderiv{s_l}{\theta_l}=\lderiv{s_l}\otimes s_{l-1}$.)
However, we could also do the computation in "forward mode":
$$
\lderivdim{\theta_l} =\Big(\pderivdimt{s_l}{\theta_l} ... \pderivdimt{s_L}{s_{L-1}} \lderivdimt{s_L} \Big)^\intercal
$$
which costs:
$$
\mathcal O(\blue{\numel{\theta_l}\cdot \numel {s_l} \cdot \numel {s_{l+1}}} +\red{\numel{\theta_l}\cdot \numel {s_{l+1}} \cdot \numel {s_{l+2}}} + ... + \red{\numel{\theta_l}\cdot\numel{s_L}\cdot 1})
$$
Here the $\red{\text{red terms are REALLY EXPENSIVE!}}$ That's why you almost never see forward-mode automatic differentiation in machine learning. One way to think of this is that in forward mode we calculate, in advance, the parameter gradients for every possible loss, whereas in reverse mode (backprop) we start with one particular loss and calculate gradients only for that one.
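To make the cost asymmetry concrete, here is a small NumPy sketch (the layer sizes and the random matrices standing in for the Jacobians $\pderiv{s_{l+1}}{s_l}$ are made up for illustration). Both orders of multiplication give the same gradient; only the cost differs:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 6, 3]  # |s_l|, ..., |s_L| (hypothetical)
# Random stand-ins for the Jacobians ds_{l+1}/ds_l
Js = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
dL_dsL = rng.normal(size=(1, sizes[-1]))  # dL/ds_L, a row vector

# Reverse mode: carry a [1 x |s|] row vector backwards (cheap)
g_rev = dL_dsL
for J in reversed(Js):
    g_rev = g_rev @ J

# Forward mode: carry a full Jacobian forwards (expensive: matrix-matrix products)
J_fwd = Js[0]
for J in Js[1:]:
    J_fwd = J @ J_fwd
g_fwd = dL_dsL @ J_fwd

# Matrix multiplication is associative, so the two results agree
```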
## Recurrent Networks
\begin{align}
s_t &= F_{state}(x_t, s_{t-1}, \theta) \\
o_t &= F_{out}(x_t, s_{t-1}, \theta) \\
\mathcal L_t &= \ell(o_t, y_t)
\end{align}
![](https://docs.google.com/drawings/d/e/2PACX-1vSh9t4_eX2cmmjiouLQxJ63ZfzJhVTIPXm0mIfkJrlt0PTMh1xsPUoRhQDAXOFrf3JboICVOiQ8UrRz/pub?w=598&h=332)
(Note that usually in RNNs we restrict this to
$$
F_{out}(x_t, s_{t-1}, \theta) = F_{toplayer}(F_{state}(x_t, s_{t-1}, \theta_{state}), \theta_{out})
$$
but the above formulation is more general.)
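As a concrete instance of these equations, here is a minimal vanilla tanh RNN in NumPy, using the usual top-layer restriction (the tanh/linear parameterization is an illustrative choice, not something fixed by the formulation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_s, n_out = 3, 4, 2
# theta = (W_x, W_s, W_o): one hypothetical parameterization
W_x = rng.normal(size=(n_s, n_in))
W_s = rng.normal(size=(n_s, n_s))
W_o = rng.normal(size=(n_out, n_s))

def F_state(x, s_prev):
    return np.tanh(W_x @ x + W_s @ s_prev)

def F_out(x, s_prev):
    # the usual restriction: a "top layer" applied to the new state
    return W_o @ F_state(x, s_prev)

s = np.zeros(n_s)
for t in range(5):
    x = rng.normal(size=n_in)
    o = F_out(x, s)    # o_t
    s = F_state(x, s)  # s_t
```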
## Backprop Through Time
BPTT can be used when we want to minimize the total prediction error
$$
\mathcal L = \sum_t^T \mathcal L_t
$$
for some finite sequence of length $T$.
![](https://docs.google.com/drawings/d/e/2PACX-1vROEyAIgVgkM9gQQt2_OfA3y8KPOyaYNM9EZ9QhPDvXF-4rZ_-c3Uy3ATBdBCEb0GvDIterDr8MKkhc/pub?w=598&h=355)
## Truncated Backprop Through Time
However, if our sequence is infinite and we want to update online, we need to move to a regime where we stop backpropagating at some point:
![](https://docs.google.com/drawings/d/e/2PACX-1vSdHIqfAbtRI6L17Z5Xglars6NmAyuar9tZcrWKSl0g99VDtO-DmK2L0whGKECM9gFr4AarpP9xK6o7/pub?w=723&h=758)
The problem here is that memory and computation scale with the number of steps we backpropagate through (2 in the figure above).
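A sketch of truncated BPTT for a toy tanh RNN (here the parameters are just a recurrent matrix $W$ and the loss is $\|s_T\|^2$; both are simplifying assumptions). Memory is $\mathcal O(K)$ because we only keep the last $K+1$ states, and the backward pass stops after $K$ steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, T = 3, 2, 6                    # state size, truncation length, sequence length
W = rng.normal(size=(n, n)) * 0.5    # theta = W (recurrent weights only, for brevity)
xs = rng.normal(size=(T, n))

# Forward pass, keeping only the last K+1 states: memory is O(K), not O(T)
states = [np.zeros(n)]
for x in xs:
    states.append(np.tanh(x + W @ states[-1]))
    states = states[-(K + 1):]

# Backward pass, truncated after K steps (states[0] is treated as a constant)
dL_ds = 2 * states[-1]               # gradient of the loss L = ||s_T||^2
dL_dW = np.zeros_like(W)
for t in range(K, 0, -1):
    d = 1 - states[t] ** 2           # tanh'
    dL_dW += np.outer(dL_ds * d, states[t - 1])
    dL_ds = W.T @ (dL_ds * d)        # stop here instead of going all the way back
```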
## Real-Time Recurrent Learning
RTRL avoids this truncation by maintaining a running estimate of $\pderiv{s_t}{\theta}$ online.
![](https://docs.google.com/drawings/d/e/2PACX-1vQR6XU-OETs26h6wqKTO4F5bXAzJFjTQY0PCy9cXrAvuEMPzsXyAHH6HwcTjAuet2QaPUx6244a-JqB/pub?w=723&h=758)
RTRL is simply the application of forward-mode automatic differentiation to recurrent learning (Equation 5 in the paper):
$$
\pderiv{s_{t+1}}{\theta} = \overset{|s|\times |\theta|}{\pderiv{F_{state}}{\theta}(x_{t+1}, s_t, \theta)} + \overset{|s|\times |s|}{\pderiv {F_{state}}{s_t}(x_{t+1}, s_t, \theta)} \overset{|s|\times |\theta|}{\pderiv {s_t}{\theta}}
$$
The problem is: THIS IS REALLY EXPENSIVE. Storing $\pderiv{s_t}{\theta}$ takes $\mathcal O(|s|\cdot|\theta|)$ memory, and the matrix product in the update costs $\mathcal O(|s|^2\cdot|\theta|)$ per step.
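A minimal NumPy sketch of this update for a tanh RNN whose only parameter is the recurrent matrix $W$ (a simplifying assumption), with a finite-difference check that the carried Jacobian is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.normal(size=(n, n)) * 0.5   # theta = W, flattened to n*n parameters
x_seq = rng.normal(size=(5, n))

s = np.zeros(n)
dsdW = np.zeros((n, n * n))         # the O(|s| x |theta|) object RTRL carries

for x in x_seq:
    s_new = np.tanh(x + W @ s)
    d = 1 - s_new ** 2              # tanh'
    # dF_state/dtheta: ds_new[k]/dW[i,j] = delta_{ki} * d[i] * s[j]
    dF_dW = np.einsum('ki,i,j->kij', np.eye(n), d, s).reshape(n, n * n)
    dF_ds = d[:, None] * W          # dF_state/ds_t
    dsdW = dF_dW + dF_ds @ dsdW     # the RTRL update (Equation 5)
    s = s_new

# Finite-difference sanity check on one weight
eps = 1e-6
i, j = 1, 2
W2 = W.copy()
W2[i, j] += eps
s2 = np.zeros(n)
for x in x_seq:
    s2 = np.tanh(x + W2 @ s2)
num = (s2 - s) / eps                # numerical ds_T/dW[i,j]
```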
## The Trick
The trick is to make a stochastic rank-1 approximation of $\pderiv{s_{t+1}}{\theta}$,
i.e. to say that
$$
\pderiv{s_{t+1}}{\theta}\approx \tilde G_t \triangleq\tilde s_t \otimes \tilde \theta_t
$$
where $\tilde s_t \in \mathbb R^{|s|}$ and $\tilde \theta_t \in \mathbb R^{|\theta|}$,
such that
$$
\mathbb E[\tilde G_t] = \pderiv{s_{t+1}}{\theta}
$$
So instead of keeping track of the Jacobian with $|s|\times |\theta|$ numbers, we only store $|s| + |\theta|$.
The rest of the paper shows:
1. How to do an efficient update $(\tilde s_t, \tilde \theta_t) \rightarrow (\tilde s_{t+1}, \tilde \theta_{t+1})$
2. How to efficiently approximate $\lderiv \theta$ given $\tilde s_t, \tilde \theta_t$
## Rank 1 Trick
Suppose $A\in\mathbb R ^{M\times N}$ decomposes as:
$$
A = \sum_i^k v_i \otimes w_i
$$
Sample independent random signs $\nu_i \sim U(\{-1, +1\})$ for $i \in [1..k]$.
Now define the rank-1 matrix
\begin{align}
\tilde A &\triangleq \Big(\sum_i^k \nu_i v_i\Big) \otimes \Big(\sum_j^k \nu_j w_j\Big) = \sum_i^k \sum_j^k \nu_i \nu_j v_i \otimes w_j \\
&= \sum_i^k \nu_i^2\, v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j \\
&= \sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j
\end{align}
Then, since $\mathbb E[\nu_i \nu_j] = 0$ for $i \neq j$,
\begin{align}
\mathbb E_\nu[\tilde A] &= \mathbb E\left[\sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j\right] \\
&= \sum_i^k v_i \otimes w_i + \mathbb E\left[\sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j\right] \\
&= \sum_i^k v_i \otimes w_i \\
&= A
\end{align}
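A quick empirical check of this unbiasedness (the random $v_i, w_i$ and the Monte-Carlo averaging are illustrative choices): draw many sign vectors $\nu$, form the rank-1 sample $\tilde A$ for each, and average:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k = 4, 5, 3
vs = rng.normal(size=(k, M))        # rows are v_i
ws = rng.normal(size=(k, N))        # rows are w_i
A = vs.T @ ws                       # = sum_i v_i (outer) w_i

S = 100_000                         # number of independent sign draws
nus = rng.choice([-1.0, 1.0], size=(S, k))
v_tilde = nus @ vs                  # each row: sum_i nu_i v_i
w_tilde = nus @ ws                  # each row: sum_j nu_j w_j (same nu draw!)
est = np.einsum('sm,sn->mn', v_tilde, w_tilde) / S   # average of rank-1 samples
```

The average converges to $A$ even though each sample is only rank 1, which is exactly what lets UORO store $|s|+|\theta|$ numbers instead of $|s|\times|\theta|$.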