Skip to content

Instantly share code, notes, and snippets.

@petered
Last active February 15, 2018 15:22
Show Gist options
  • Save petered/79f6c35da72b86c94845f632986ab8a7 to your computer and use it in GitHub Desktop.
Save petered/79f6c35da72b86c94845f632986ab8a7 to your computer and use it in GitHub Desktop.
2018-01-17 Lower-Variance Online Gradient Estimates
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\numel}[1]{|#1|}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
\newcommand{\softmax}{\operatorname{softmax}}
\newcommand{\Bern}{\operatorname{Bern}}
\newcommand{\Cat}{\operatorname{Cat}}
\newcommand{\sigm}{\operatorname{sigm}}
\newcommand{\logfrac}[2]{\log \left( \frac{#1}{#2} \right)}
$$
# Lower Variance online Learning
## Why are we here
Efficient online recurrent learning is basically a pre-requisite for useful trainable spiking networks. The alternative is some form of backpropagation through time, which involves storing an amount of memory proportional to the time-horizon we wish to include in our learning.
## UORO does this!
[Unbiased Online Recurrent Optimization](https://arxiv.org/abs/1702.05043) is an efficient $\mathcal O(N_{hidden}^2)$ algorithm for training online networks, using an unbiased estimate of the gradient $\pderiv{\mathcal L_t}{\theta}$
We *know* that if we were to have some kind of "oracle" that gave us accurate estimates of this gradient, we could converge *much* faster. We know this because when we use the (hugely inefficient) RTRL, we converge ~50x faster:
![](https://drive.google.com/uc?id=1f2HIPMBxzWMnH8artJ2C8tBX-zD5fjve)
**Results on $a^Nb^N(1, 32)$ dataset, where the task is to predict the next character when the input is a string of $N$ a's followed by $N$ b's, where $N\sim \mathcal U\{1..32\}$. Top: Overall Loss, Bottom: Fraction of correct predictions at end of string of b's. Green is RTRL, Blue is UORO. Top is overall loss, bottom is success rate at predicting end of sequence.**
## What's the problem with UORO?
UORO gives us cheap estimates of $\pderiv{\mathcal L_t}{\theta}$, which we'll call $g_t^{uoro} :\approx \pderiv{\mathcal L_t}{\theta}$ which are unbiased - i.e. $\mathbb E[g_t^{uoro}]=\pderiv{\mathcal L_t}{\theta}$, but evidently, the variance of these estimates substantially slows down learning. Note that these estimates can be viewed as a deterministic function of a series of past-draws of a random variables $\epsilon_\tau$, inputs $x_\tau$ and parameters values $\theta_\tau$, and the current target $y_t$: $g_t^{uoro} = g((\epsilon, x, \theta)_{[1..t]}, y_t)$.
# Backpropagation Through the void
Since our problem is basically a problem of variance reduction, it seems sensible to explore literature on reducing the variance of estimators. The paper [BackPropagation Through the Void](https://arxiv.org/pdf/1711.00123.pdf) proposes a general method for finding low-variance and estimator $\hat g$ of the gradient of the expectation of a stochastic loss:
$$
\hat g\approx \nabla_\theta \mathcal L(\theta) := \nabla_\theta \mathbb E_{p_\theta (b) }[f(b)]
$$
This is a bit different from our situation. In UORO, $\hat g^{uoro}$ is an estimate of the gradient of a deterministic loss.
$$
\hat g ^{uoro} \approx \nabla_\theta \mathcal L(\theta)
$$
We can compute this gradient exactly, but it is costly. $\hat g^{uoro}$ is an unbiased approximation, but it has high variance.
# Control Variates
A standard way to reduce the variance of an unbiased estimate is with control varience. Suppose we were to find some stochastic variable $g^{cv}$ with a known mean $\mu^{cv} = \mathbb E[g^{cv}]$, and some scaling factor $\lambda\in \mathbb R^+$. $g^{cv}$ may be correlated with $g^{uoro}$. We can create a variable $g'$ with the same mean as $g^{uoro}$ but a lower variance:
$$
g' = g^{uoro} - \lambda (g^{cv} - \mu)
$$
\begin{align}
\mathbb E[g'] &= \mathbb E[g^{uoro} - \lambda (g^{cv} - \mu)] \\
&= \mathbb E[g^{uoro}] - \lambda (\mathbb E[g^{cv}] - \mu) \\
&= \mathbb E[g^{uoro}]
\end{align}
So while the mean remains unchanged, the variance becomes
\begin{align}
\mathrm{Var}[g'] &= \mathrm{Var}[g^{uoro} - \lambda (g^{cv} - \mu)] \\
&= \mathrm{Var}[g^{uoro}] + \lambda^2 \mathrm{Var}[g^{cv}] - 2\lambda \mathrm{Cov}[g^{uoro}, g^{cv}]
\end{align}
Which is minimized for $\lambda = \frac{\mathrm{Cov}[g^{uoro}, g^{cv}]}{\mathrm{Var}[g^{cv}]}$
**How can we create a good covariate?**: We could learn a recurrent linear model $g_t^{cv} = g_\phi(g_{t-1}^{cv}, x_t, \epsilon_t)$ that correlates with $g^{uoro}$, and $\mathbb E[g_t^{cv}]=0$. It needs to be linear in $\epsilon$ and $h$ so that if $\mathbb E[h]=0$ and $\mathbb E[\epsilon_t]=0$, then $g_t^{cv}, h_t = g_\phi(h_{t-1}, x_t, \epsilon_t)$
Suppose we train it to minimize $\mathcal L_t^{cv} := \| g^{uoro}_t - g_t^{cv} \|^2$. We of course are faced with updating our auxillary parameters $\phi$, which we could do by BPTT, RTRL or our old friend UORO.
**Problem**: $\hat g_t ^{cv}\in \mathbb R^{\dim(\Theta)}$. This is big, and if it's an output, we'd have a fully connected parameter matrix $\phi_{hg} \in \mathbb R ^{\dim(h) \times \dim(g)}$... which is makes this not all that much cheaper than RTRL unless $\dim(h)<<\dim(s)$
**Possible Solution:** Inspired by [Hypernetworks](https://arxiv.org/pdf/1609.09106.pdf), we
we can reshape $\theta_r$ (the recurrent part of $\theta$) into a $\dim(S) \times D$ matrix (e.g. for a GRU it would be $\theta_r \in \mathbb R^{\dim(S) \times ((\dim(S)\cdot 3 + \dim(X)\cdot 3)}$,
Then have another linear, recurrent network output our predicted gradients:
$$
h_t^{(i)} = \omega_{hh} h_{t-1}^{(i)} + \omega_{xh} x_t + \omega_{\epsilon h} \epsilon_t \\
\hat g_r^{(i)} := \omega_{h\theta} h_t^{(i)}
$$
Where $(\cdot)^{(i)}$ denodes the $i'th$ row of a matrix, and $h \in \mathbb R^{\dim(S) \times D}$ matrix, where $D$ is some arbitrarily defined size of the hidden state.
This Covariate-network also needs to be trained somehow. For that we could also use UORO.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment