petered/low-var-online-learning

## low-var-online-learning
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\numel}[1]{|#1|}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
\newcommand{\softmax}{\operatorname{softmax}}
\newcommand{\Bern}{\operatorname{Bern}}
\newcommand{\Cat}{\operatorname{Cat}}
\newcommand{\sigm}{\operatorname{sigm}}
\newcommand{\logfrac}[2]{\log \left( \frac{#1}{#2} \right)}
$$


# Lower Variance online Learning

## Why are we here

Efficient online recurrent learning is basically a pre-requisite for useful trainable spiking networks.  The alternative is some form of backpropagation through time, which involves storing an amount of memory proportional to the time-horizon we wish to include in our learning.

## UORO does this!

[Unbiased Online Recurrent Optimization](https://arxiv.org/abs/1702.05043) is an efficient $\mathcal O(N_{hidden}^2)$ algorithm for training online networks, using an unbiased estimate of the gradient $\pderiv{\mathcal L_t}{\theta}$

We *know* that if we were to have some kind of "oracle" that gave us accurate estimates of this gradient, we could converge *much* faster.  We know this because when we use the (hugely inefficient) RTRL, we converge ~50x faster:

![](https://drive.google.com/uc?id=1f2HIPMBxzWMnH8artJ2C8tBX-zD5fjve)
**Results on $a^Nb^N(1, 32)$ dataset, where the task is to predict the next character when the input is a string of $N$ a's followed by $N$  b's, where $N\sim \mathcal U\{1..32\}$.  Top: Overall Loss, Bottom: Fraction of correct predictions at end of string of b's.  Green is RTRL, Blue is UORO.  Top is overall loss, bottom is success rate at predicting end of sequence.**

## What's the problem with UORO?

UORO gives us cheap estimates of $\pderiv{\mathcal L_t}{\theta}$, which we'll call $g_t^{uoro} :\approx \pderiv{\mathcal L_t}{\theta}$ which are unbiased - i.e. $\mathbb E[g_t^{uoro}]=\pderiv{\mathcal L_t}{\theta}$, but evidently, the variance of these estimates substantially slows down learning.  Note that these estimates can be viewed as a deterministic function of a series of past-draws of a random variables $\epsilon_\tau$, inputs $x_\tau$ and parameters values $\theta_\tau$, and the current target $y_t$: $g_t^{uoro} = g((\epsilon, x, \theta)_{[1..t]}, y_t)$.

# Backpropagation Through the void

Since our problem is basically a problem of variance reduction, it seems sensible to explore literature on reducing the variance of estimators.  The paper [BackPropagation Through the Void](https://arxiv.org/pdf/1711.00123.pdf) proposes a general method for finding low-variance and estimator $\hat g$ of the gradient of the expectation of a stochastic loss:

$$
\hat g\approx \nabla_\theta \mathcal L(\theta) := \nabla_\theta \mathbb E_{p_\theta (b) }[f(b)]
$$

This is a bit different from our situation.  In UORO, $\hat g^{uoro}$ is an estimate of the gradient of a deterministic loss.

$$
\hat g ^{uoro} \approx \nabla_\theta \mathcal L(\theta)
$$

We can compute this gradient exactly, but it is costly.  $\hat g^{uoro}$ is an unbiased approximation, but it has high variance.

# Control Variates

A standard way to reduce the variance of an unbiased estimate is with control varience.  Suppose we were to find some stochastic variable $g^{cv}$ with a known mean $\mu^{cv} = \mathbb E[g^{cv}]$, and some scaling factor $\lambda\in \mathbb R^+$.  $g^{cv}$ may be correlated with $g^{uoro}$.  We can create a variable $g'$ with the same mean as $g^{uoro}$ but a lower variance:

$$
g' = g^{uoro} -  \lambda (g^{cv} - \mu)
$$

\begin{align}
\mathbb E[g'] &= \mathbb E[g^{uoro} -  \lambda (g^{cv} - \mu)] \\
&= \mathbb E[g^{uoro}] -  \lambda (\mathbb E[g^{cv}] - \mu) \\
&= \mathbb E[g^{uoro}]
\end{align}

So while the mean remains unchanged, the variance becomes

\begin{align}
\mathrm{Var}[g'] &= \mathrm{Var}[g^{uoro} -  \lambda (g^{cv} - \mu)] \\
&= \mathrm{Var}[g^{uoro}] + \lambda^2 \mathrm{Var}[g^{cv}] - 2\lambda  \mathrm{Cov}[g^{uoro}, g^{cv}]
\end{align}

Which is minimized for $\lambda = \frac{\mathrm{Cov}[g^{uoro}, g^{cv}]}{\mathrm{Var}[g^{cv}]}$

**How can we create a good covariate?**: We could learn a recurrent linear model $g_t^{cv} = g_\phi(g_{t-1}^{cv}, x_t, \epsilon_t)$ that correlates with $g^{uoro}$, and $\mathbb E[g_t^{cv}]=0$.  It needs to be linear in $\epsilon$ and $h$ so that if $\mathbb E[h]=0$ and $\mathbb E[\epsilon_t]=0$, then $g_t^{cv}, h_t = g_\phi(h_{t-1}, x_t, \epsilon_t)$

Suppose we train it to minimize $\mathcal L_t^{cv} := \| g^{uoro}_t - g_t^{cv} \|^2$.  We of course are faced with updating our auxillary parameters $\phi$, which we could do by BPTT, RTRL or our old friend UORO.

**Problem**: $\hat g_t ^{cv}\in \mathbb R^{\dim(\Theta)}$.  This is big, and if it's an output, we'd have a fully connected parameter matrix $\phi_{hg} \in \mathbb R ^{\dim(h) \times \dim(g)}$... which is makes this not all that much cheaper than RTRL unless $\dim(h)<<\dim(s)$

**Possible Solution:** Inspired by [Hypernetworks](https://arxiv.org/pdf/1609.09106.pdf), we
we can reshape $\theta_r$ (the recurrent part of $\theta$) into a $\dim(S) \times D$ matrix (e.g. for a GRU it would be $\theta_r \in \mathbb R^{\dim(S) \times ((\dim(S)\cdot 3 + \dim(X)\cdot 3)}$,

Then have another linear, recurrent network output our predicted gradients:
$$
h_t^{(i)} = \omega_{hh} h_{t-1}^{(i)} + \omega_{xh} x_t + \omega_{\epsilon h} \epsilon_t  \\
\hat g_r^{(i)} := \omega_{h\theta} h_t^{(i)}
$$
Where $(\cdot)^{(i)}$ denodes the $i'th$ row of a matrix, and $h \in \mathbb R^{\dim(S) \times D}$ matrix, where $D$ is some arbitrarily defined size of the hidden state.

This Covariate-network also needs to be trained somehow.  For that we could also use UORO.
	$$
	\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
	\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
	\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
	\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right\|_{#3}}
	\newcommand{\norm}[1]{\frac12\\| #1 \\|_2^2}
	\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
	\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
	\newcommand{\blue}[1]{\color{blue}{#1}}
	\newcommand{\red}[1]{\color{red}{#1}}
	\newcommand{\numel}[1]{\|#1\|}
	\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
	\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
	\newcommand{\softmax}{\operatorname{softmax}}
	\newcommand{\Bern}{\operatorname{Bern}}
	\newcommand{\Cat}{\operatorname{Cat}}
	\newcommand{\sigm}{\operatorname{sigm}}
	\newcommand{\logfrac}[2]{\log \left( \frac{#1}{#2} \right)}
	$$


	# Lower Variance online Learning

	## Why are we here

	Efficient online recurrent learning is basically a pre-requisite for useful trainable spiking networks. The alternative is some form of backpropagation through time, which involves storing an amount of memory proportional to the time-horizon we wish to include in our learning.

	## UORO does this!

	[Unbiased Online Recurrent Optimization](https://arxiv.org/abs/1702.05043) is an efficient $\mathcal O(N_{hidden}^2)$ algorithm for training online networks, using an unbiased estimate of the gradient $\pderiv{\mathcal L_t}{\theta}$

	We know that if we were to have some kind of "oracle" that gave us accurate estimates of this gradient, we could converge much faster. We know this because when we use the (hugely inefficient) RTRL, we converge ~50x faster:

	![](https://drive.google.com/uc?id=1f2HIPMBxzWMnH8artJ2C8tBX-zD5fjve)
	Results on $a^Nb^N(1, 32)$ dataset, where the task is to predict the next character when the input is a string of $N$ a's followed by $N$ b's, where $N\sim \mathcal U\{1..32\}$. Top: Overall Loss, Bottom: Fraction of correct predictions at end of string of b's. Green is RTRL, Blue is UORO. Top is overall loss, bottom is success rate at predicting end of sequence.

	## What's the problem with UORO?

	UORO gives us cheap estimates of $\pderiv{\mathcal L_t}{\theta}$, which we'll call $g_t^{uoro} :\approx \pderiv{\mathcal L_t}{\theta}$ which are unbiased - i.e. $\mathbb E[g_t^{uoro}]=\pderiv{\mathcal L_t}{\theta}$, but evidently, the variance of these estimates substantially slows down learning. Note that these estimates can be viewed as a deterministic function of a series of past-draws of a random variables $\epsilon_\tau$, inputs $x_\tau$ and parameters values $\theta_\tau$, and the current target $y_t$: $g_t^{uoro} = g((\epsilon, x, \theta)_{[1..t]}, y_t)$.

	# Backpropagation Through the void

	Since our problem is basically a problem of variance reduction, it seems sensible to explore literature on reducing the variance of estimators. The paper [BackPropagation Through the Void](https://arxiv.org/pdf/1711.00123.pdf) proposes a general method for finding low-variance and estimator $\hat g$ of the gradient of the expectation of a stochastic loss:

	$$
	\hat g\approx \nabla_\theta \mathcal L(\theta) := \nabla_\theta \mathbb E_{p_\theta (b) }[f(b)]
	$$

	This is a bit different from our situation. In UORO, $\hat g^{uoro}$ is an estimate of the gradient of a deterministic loss.

	$$
	\hat g ^{uoro} \approx \nabla_\theta \mathcal L(\theta)
	$$

	We can compute this gradient exactly, but it is costly. $\hat g^{uoro}$ is an unbiased approximation, but it has high variance.

	# Control Variates

	A standard way to reduce the variance of an unbiased estimate is with control varience. Suppose we were to find some stochastic variable $g^{cv}$ with a known mean $\mu^{cv} = \mathbb E[g^{cv}]$, and some scaling factor $\lambda\in \mathbb R^+$. $g^{cv}$ may be correlated with $g^{uoro}$. We can create a variable $g'$ with the same mean as $g^{uoro}$ but a lower variance:

	$$
	g' = g^{uoro} - \lambda (g^{cv} - \mu)
	$$

	\begin{align}
	\mathbb E[g'] &= \mathbb E[g^{uoro} - \lambda (g^{cv} - \mu)] \\
	&= \mathbb E[g^{uoro}] - \lambda (\mathbb E[g^{cv}] - \mu) \\
	&= \mathbb E[g^{uoro}]
	\end{align}

	So while the mean remains unchanged, the variance becomes

	\begin{align}
	\mathrm{Var}[g'] &= \mathrm{Var}[g^{uoro} - \lambda (g^{cv} - \mu)] \\
	&= \mathrm{Var}[g^{uoro}] + \lambda^2 \mathrm{Var}[g^{cv}] - 2\lambda \mathrm{Cov}[g^{uoro}, g^{cv}]
	\end{align}

	Which is minimized for $\lambda = \frac{\mathrm{Cov}[g^{uoro}, g^{cv}]}{\mathrm{Var}[g^{cv}]}$

	How can we create a good covariate?: We could learn a recurrent linear model $g_t^{cv} = g_\phi(g_{t-1}^{cv}, x_t, \epsilon_t)$ that correlates with $g^{uoro}$, and $\mathbb E[g_t^{cv}]=0$. It needs to be linear in $\epsilon$ and $h$ so that if $\mathbb E[h]=0$ and $\mathbb E[\epsilon_t]=0$, then $g_t^{cv}, h_t = g_\phi(h_{t-1}, x_t, \epsilon_t)$

	Suppose we train it to minimize $\mathcal L_t^{cv} := \\| g^{uoro}_t - g_t^{cv} \\|^2$. We of course are faced with updating our auxillary parameters $\phi$, which we could do by BPTT, RTRL or our old friend UORO.

	Problem: $\hat g_t ^{cv}\in \mathbb R^{\dim(\Theta)}$. This is big, and if it's an output, we'd have a fully connected parameter matrix $\phi_{hg} \in \mathbb R ^{\dim(h) \times \dim(g)}$... which is makes this not all that much cheaper than RTRL unless $\dim(h)<<\dim(s)$

	Possible Solution: Inspired by [Hypernetworks](https://arxiv.org/pdf/1609.09106.pdf), we
	we can reshape $\theta_r$ (the recurrent part of $\theta$) into a $\dim(S) \times D$ matrix (e.g. for a GRU it would be $\theta_r \in \mathbb R^{\dim(S) \times ((\dim(S)\cdot 3 + \dim(X)\cdot 3)}$,

	Then have another linear, recurrent network output our predicted gradients:
	$$
	h_t^{(i)} = \omega_{hh} h_{t-1}^{(i)} + \omega_{xh} x_t + \omega_{\epsilon h} \epsilon_t \\
	\hat g_r^{(i)} := \omega_{h\theta} h_t^{(i)}
	$$
	Where $(\cdot)^{(i)}$ denodes the $i'th$ row of a matrix, and $h \in \mathbb R^{\dim(S) \times D}$ matrix, where $D$ is some arbitrarily defined size of the hidden state.

	This Covariate-network also needs to be trained somehow. For that we could also use UORO.