2017-08-25 Unbiased Online Recurrent Optimization
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\pderivdim}[2]{\overset{\big[\numel {#1} \times \numel {#2} \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimg}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimt}[2]{\overset{\big[\numel {#2} \times \numel {#1} \big]}{\frac{\partial #1}{\partial #2}^\intercal}}$
$\newcommand{\lderivdimt}[1]{\overset{\big[\numel {#1} \times 1 \big]}{\frac{\partial \mathcal L}{\partial #1}^\intercal}}$
$\newcommand{\lderivdim}[1]{\overset{\big[1\times |#1|\big]}{\frac{\partial \mathcal L}{\partial #1}}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
# Unbiased Online Recurrent Optimization
## Background 1: Forward-Mode Automatic Differentiation
Backpropagation is not the only chain-rule method for computing gradients. Backprop is also known as "reverse-mode automatic differentiation".
Suppose you have a system:
\begin{align}
s_1 &= f_1(x, \theta_1) \\
s_2 &= f_2(s_1, \theta_2) \\
... \\
s_L &= f_L(s_{L-1}, \theta_L) \\
\mathcal L &= \ell(s_L, y)
\end{align}
If we want the gradient w.r.t. $\theta_l$, we compute
$$
\lderivdim{\theta_l} =\lderivdim{s_L}\pderivdim{s_L}{s_{L-1}} ... \pderivdim{s_l}{\theta_l}
$$
That costs:
$$
\mathcal O(1\cdot\numel{s_L}\cdot \numel{s_{L-1}} + 1\cdot\numel{s_{L-1}}\cdot \numel {s_{L-2}} + ... +\blue{1\cdot \numel{s_l}\cdot \numel{\theta_l}})
$$
(In practice the $\blue{\text{last term}}$ is often much cheaper, because $\pderiv{s_l}{\theta_l}$ is sparse: not every element of $s_l$ depends on every element of $\theta_l$. E.g. for a fully-connected layer it costs $\blue{1\cdot \numel{s_l} \cdot \numel{s_{l-1}}}$ to compute $\lderiv{\theta_l}=\lderiv{s_l}\pderiv{s_l}{\theta_l}=\lderiv{s_l}\otimes s_{l-1}$.)
However, we could also do the computation in "forward mode":
$$
\lderivdim{\theta_l} =\Big(\pderivdimt{s_l}{\theta_l} ... \pderivdimt{s_L}{s_{L-1}} \lderivdimt{s_L} \Big)^\intercal
$$
which costs:
$$
\mathcal O(\blue{\numel{\theta_l}\cdot \numel {s_l} \cdot \numel {s_{l+1}}} +\red{\numel{\theta_l}\cdot \numel {s_{l+1}} \cdot \numel {s_{l+2}}} + ... + \red{\numel{\theta_l}\cdot\numel{s_L}\cdot 1})
$$
Here the $\red{\text{red terms are REALLY EXPENSIVE!}}$ That's why you almost never see forward-mode automatic differentiation in machine learning. One way to think of this is that in forward mode we calculate, in advance, the parameter gradients for every possible loss, whereas in reverse mode (backprop) we start with one particular loss and calculate gradients only for that one.
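To make the cost asymmetry concrete, here is a small NumPy sketch (the layer sizes and the random matrices standing in for the Jacobians $\pderiv{s_{l+1}}{s_l}$ are made up for illustration). Both orders of multiplication give the same gradient; only the cost differs:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 6, 3]  # |s_l|, ..., |s_L| (hypothetical)
# Random stand-ins for the Jacobians ds_{l+1}/ds_l
Js = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
dL_dsL = rng.normal(size=(1, sizes[-1]))  # dL/ds_L, a row vector

# Reverse mode: carry a [1 x |s|] row vector backwards (cheap)
g_rev = dL_dsL
for J in reversed(Js):
    g_rev = g_rev @ J

# Forward mode: carry a full Jacobian forwards (expensive: matrix-matrix products)
J_fwd = Js[0]
for J in Js[1:]:
    J_fwd = J @ J_fwd
g_fwd = dL_dsL @ J_fwd

# Matrix multiplication is associative, so the two results agree
```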
## Recurrent Networks
\begin{align}
s_t &= F_{state}(x_t, s_{t-1}, \theta) \\
o_t &= F_{out}(x_t, s_{t-1}, \theta) \\
\mathcal L_t &= \ell(o_t, y_t)
\end{align}
![](https://docs.google.com/drawings/d/e/2PACX-1vSh9t4_eX2cmmjiouLQxJ63ZfzJhVTIPXm0mIfkJrlt0PTMh1xsPUoRhQDAXOFrf3JboICVOiQ8UrRz/pub?w=598&h=332)
(Note that usually in RNNs we restrict this to
$$
F_{out}(x_t, s_{t-1}, \theta) = F_{toplayer}(F_{state}(x_t, s_{t-1}, \theta_{state}), \theta_{out})
$$
but the above formulation is more general.)
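As a concrete instance of these equations, here is a minimal vanilla tanh RNN in NumPy, using the usual top-layer restriction (the tanh/linear parameterization is an illustrative choice, not something fixed by the formulation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_s, n_out = 3, 4, 2
# theta = (W_x, W_s, W_o): one hypothetical parameterization
W_x = rng.normal(size=(n_s, n_in))
W_s = rng.normal(size=(n_s, n_s))
W_o = rng.normal(size=(n_out, n_s))

def F_state(x, s_prev):
    return np.tanh(W_x @ x + W_s @ s_prev)

def F_out(x, s_prev):
    # the usual restriction: a "top layer" applied to the new state
    return W_o @ F_state(x, s_prev)

s = np.zeros(n_s)
for t in range(5):
    x = rng.normal(size=n_in)
    o = F_out(x, s)    # o_t
    s = F_state(x, s)  # s_t
```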
## Backprop Through Time
BPTT can be used when we want to minimize the total prediction error
$$
\mathcal L = \sum_t^T \mathcal L_t
$$
for some finite sequence of length $T$.
![](https://docs.google.com/drawings/d/e/2PACX-1vROEyAIgVgkM9gQQt2_OfA3y8KPOyaYNM9EZ9QhPDvXF-4rZ_-c3Uy3ATBdBCEb0GvDIterDr8MKkhc/pub?w=598&h=355)
## Truncated Backprop Through Time
However, if our sequence is infinite and we want to update online, we need to move to a regime where we stop backpropagating at some point:
![](https://docs.google.com/drawings/d/e/2PACX-1vSdHIqfAbtRI6L17Z5Xglars6NmAyuar9tZcrWKSl0g99VDtO-DmK2L0whGKECM9gFr4AarpP9xK6o7/pub?w=723&h=758)
The problem here is that memory and computation scale with the number of steps we backpropagate through (2 in the figure above).
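A sketch of truncated BPTT for a toy tanh RNN (here the parameters are just a recurrent matrix $W$ and the loss is $\|s_T\|^2$; both are simplifying assumptions). Memory is $\mathcal O(K)$ because we only keep the last $K+1$ states, and the backward pass stops after $K$ steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, T = 3, 2, 6                    # state size, truncation length, sequence length
W = rng.normal(size=(n, n)) * 0.5    # theta = W (recurrent weights only, for brevity)
xs = rng.normal(size=(T, n))

# Forward pass, keeping only the last K+1 states: memory is O(K), not O(T)
states = [np.zeros(n)]
for x in xs:
    states.append(np.tanh(x + W @ states[-1]))
    states = states[-(K + 1):]

# Backward pass, truncated after K steps (states[0] is treated as a constant)
dL_ds = 2 * states[-1]               # gradient of the loss L = ||s_T||^2
dL_dW = np.zeros_like(W)
for t in range(K, 0, -1):
    d = 1 - states[t] ** 2           # tanh'
    dL_dW += np.outer(dL_ds * d, states[t - 1])
    dL_ds = W.T @ (dL_ds * d)        # stop here instead of going all the way back
```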
## Real-Time Recurrent Learning
RTRL avoids this truncation by maintaining a running estimate of $\pderiv{s_t}{\theta}$ online.
![](https://docs.google.com/drawings/d/e/2PACX-1vQR6XU-OETs26h6wqKTO4F5bXAzJFjTQY0PCy9cXrAvuEMPzsXyAHH6HwcTjAuet2QaPUx6244a-JqB/pub?w=723&h=758)
RTRL is simply the application of forward-mode automatic differentiation to recurrent learning (Equation 5 in the paper):
$$
\pderiv{s_{t+1}}{\theta} = \overset{|s|\times |\theta|}{\pderiv{F_{state}}{\theta}(x_{t+1}, s_t, \theta)} + \overset{|s|\times |s|}{\pderiv {F_{state}}{s_t}(x_{t+1}, s_t, \theta)} \overset{|s|\times |\theta|}{\pderiv {s_t}{\theta}}
$$
The problem is: THIS IS REALLY EXPENSIVE. Storing $\pderiv{s_t}{\theta}$ takes $\mathcal O(|s|\cdot|\theta|)$ memory, and the matrix product in the update costs $\mathcal O(|s|^2\cdot|\theta|)$ per step.
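A minimal NumPy sketch of this update for a tanh RNN whose only parameter is the recurrent matrix $W$ (a simplifying assumption), with a finite-difference check that the carried Jacobian is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.normal(size=(n, n)) * 0.5   # theta = W, flattened to n*n parameters
x_seq = rng.normal(size=(5, n))

s = np.zeros(n)
dsdW = np.zeros((n, n * n))         # the O(|s| x |theta|) object RTRL carries

for x in x_seq:
    s_new = np.tanh(x + W @ s)
    d = 1 - s_new ** 2              # tanh'
    # dF_state/dtheta: ds_new[k]/dW[i,j] = delta_{ki} * d[i] * s[j]
    dF_dW = np.einsum('ki,i,j->kij', np.eye(n), d, s).reshape(n, n * n)
    dF_ds = d[:, None] * W          # dF_state/ds_t
    dsdW = dF_dW + dF_ds @ dsdW     # the RTRL update (Equation 5)
    s = s_new

# Finite-difference sanity check on one weight
eps = 1e-6
i, j = 1, 2
W2 = W.copy()
W2[i, j] += eps
s2 = np.zeros(n)
for x in x_seq:
    s2 = np.tanh(x + W2 @ s2)
num = (s2 - s) / eps                # numerical ds_T/dW[i,j]
```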
## The Trick
The trick is to make a stochastic rank-1 approximation of $\pderiv{s_{t+1}}{\theta}$,
i.e. to say that
$$
\pderiv{s_{t+1}}{\theta}\approx \tilde G_t \triangleq\tilde s_t \otimes \tilde \theta_t
$$
where $\tilde s_t \in \mathbb R^{|s|}$ and $\tilde \theta_t \in \mathbb R^{|\theta|}$,
such that
$$
\mathbb E[\tilde G_t] = \pderiv{s_{t+1}}{\theta}
$$
So instead of keeping track of the Jacobian with $|s|\times |\theta|$ numbers, we only store $|s| + |\theta|$.
The rest of the paper shows:
1. How to do an efficient update $(\tilde s_t, \tilde \theta_t) \rightarrow (\tilde s_{t+1}, \tilde \theta_{t+1})$
2. How to efficiently approximate $\lderiv \theta$ given $\tilde s_t, \tilde \theta_t$
## Rank 1 Trick
Suppose $A\in\mathbb R ^{M\times N}$ decomposes as:
$$
A = \sum_i^k v_i \otimes w_i
$$
Sample independent random signs $\nu_i \sim U(\{-1, +1\})$ for $i \in [1..k]$.
Now define the rank-1 matrix
\begin{align}
\tilde A &\triangleq \Big(\sum_i^k \nu_i v_i\Big) \otimes \Big(\sum_j^k \nu_j w_j\Big) = \sum_i^k \sum_j^k \nu_i \nu_j v_i \otimes w_j \\
&= \sum_i^k \nu_i^2\, v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j \\
&= \sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j
\end{align}
Then, since $\mathbb E[\nu_i \nu_j] = 0$ for $i \neq j$,
\begin{align}
\mathbb E_\nu[\tilde A] &= \mathbb E\left[\sum_i^k v_i \otimes w_i + \sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j\right] \\
&= \sum_i^k v_i \otimes w_i + \mathbb E\left[\sum_i^k \sum_{j\neq i}^k \nu_i \nu_j v_i \otimes w_j\right] \\
&= \sum_i^k v_i \otimes w_i \\
&= A
\end{align}
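A quick empirical check of this unbiasedness (the random $v_i, w_i$ and the Monte-Carlo averaging are illustrative choices): draw many sign vectors $\nu$, form the rank-1 sample $\tilde A$ for each, and average:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k = 4, 5, 3
vs = rng.normal(size=(k, M))        # rows are v_i
ws = rng.normal(size=(k, N))        # rows are w_i
A = vs.T @ ws                       # = sum_i v_i (outer) w_i

S = 100_000                         # number of independent sign draws
nus = rng.choice([-1.0, 1.0], size=(S, k))
v_tilde = nus @ vs                  # each row: sum_i nu_i v_i
w_tilde = nus @ ws                  # each row: sum_j nu_j w_j (same nu draw!)
est = np.einsum('sm,sn->mn', v_tilde, w_tilde) / S   # average of rank-1 samples
```

The average converges to $A$ even though each sample is only rank 1, which is exactly what lets UORO store $|s|+|\theta|$ numbers instead of $|s|\times|\theta|$.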