2017-09-26 Iterated Matrix Decomposition
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\oversize}[3]{\overset{\big[#2 \times #3 \big]}{#1}}$
$\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
# Forward Gradient Estimates for Online Learning
## Motivation
In Real-Time Recurrent Learning (RTRL), we maintain the Jacobian of the state with respect to the parameters using forward-mode automatic differentiation:
$$
\pderivdim{s_t}{\theta}{|S|}{|\Theta|} = \oversize{\pderiv{s_t}{s_{t-1}}}{|S|}{|S|} \oversize{\pderiv {s_{t-1}}{\theta}}{|S|}{|\Theta|} + \oversize{ \left.\pderiv{s_t}{\theta}\right|_{s_{t-1}}}{|S|}{|\Theta|}
$$
Where $\pderiv{s_t}{\theta}|_{s_{t-1}}$ denotes "gradient of $s_t$ with respect to $\theta$ while holding $s_{t-1}$ constant".
The [UORO paper](https://arxiv.org/abs/1702.05043) makes a Rank-1 approximation:
$$
\pderiv{s_t}{\theta} \approx \tilde s \otimes \tilde \theta
$$
Where $\tilde s\in \mathbb R^{|S|}$, $\tilde \theta \in \mathbb R^{|\Theta|}$ are random variables defined such that $\mathbb E[\tilde s \otimes \tilde \theta] = \pderiv{s_t}{\theta}$. In other words, they give an unbiased estimate of the gradient.
We want to create an algorithm that also generates these $(\tilde s, \tilde \theta)$ pairs, but in a deterministic way, so that they converge faster (at a rate of $1/t$ rather than $1/\sqrt t$) to the true gradient.
In the following, to simplify the problem, we reformulate the variables to remove references to gradients. We will rename variables as follows:
\begin{align}
\pderiv {s_t}{\theta}\in \mathbb R^{|S|\times |\Theta|} &\rightarrow A_t \in \mathbb R ^{M \times N}\\
\tilde s_t \in\mathbb R^{|S|} &\rightarrow u_t \in \mathbb R ^M\\
\tilde \theta_t \in\mathbb R^{|\Theta|} &\rightarrow v_t \in \mathbb R ^N\\
\end{align}
## The problem
Given a sequence of matrices:
$$
A_t \in \mathbb R^{M\times N}: t\in \mathbb N
$$
We are looking for a process that produces a sequence of Rank-1 approximations:
\begin{align}
A_t \approx u_t \otimes v_t : u_t \in \mathbb R^M , v_t \in \mathbb R^N
\end{align}
Such that:
1. The $(u_t, v_t)$ pairs are generated *online* (i.e. $(u_t, v_t)$ is computed before $A_{t+1}$ becomes available)
2a. The mean error converges for any $A_t: t\in \mathbb N$:
\begin{align}
\lim_{T\rightarrow \infty} \frac1T \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\| = 0
\end{align}
2b. The mean error converges faster than $1/\sqrt T$ (a stricter requirement):
\begin{align}
\lim_{T\rightarrow \infty} \frac1{\sqrt{T}} \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\| = 0
\end{align}
3. Our process uses less than $\mathcal O(M\cdot N)$ memory between iterations. Our end goal is to find a cheaper (in computation and memory) version of RTRL, and this goal is undermined if we must store large matrices. TODO: Be a little clearer on our actual requirement here.
## Solutions
### 1. UORO:
Algorithm:
\begin{align}
\nu &\leftarrow U(\{-1, +1\})^M \\
u_t &\leftarrow \nu \\
v_t &\leftarrow \nu \cdot A_t
\end{align}
Conditions:
1. **Pass**: Clearly online
2a. **Pass**: Must converge, as $\mathbb E[u_t\otimes v_t] = A_t$ (as shown in [the paper](https://arxiv.org/abs/1702.05043))
2b. **Fail Always**: See experiments; convergence is $1/\sqrt t$, which is too slow.
3. **Pass**, no memory between iterations.
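For concreteness, here is a minimal numpy sketch of one step of the rank-1 sampling above (the function name `uoro_rank1_step` and the numpy details are ours; this is just the sampling step as written, not the paper's full algorithm):

```python
import numpy as np

def uoro_rank1_step(A, rng=np.random):
    """One step of the rank-1 estimator above.

    A: the (M, N) matrix to approximate at this timestep.
    Returns (u, v) with E[outer(u, v)] = A, since E[nu nu^T] = I
    when nu is drawn uniformly from {-1, +1}^M.
    """
    M = A.shape[0]
    nu = rng.choice([-1.0, 1.0], size=M)  # nu ~ U({-1, +1})^M
    u = nu                                # u_t <- nu
    v = nu @ A                            # v_t <- nu . A_t
    return u, v
```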
### 2. Orthogonal Sequences
This is a small variation on UORO. The intuition is that UORO draws random rank-1 samples of $A_t$ whose noise averages out over time; by being more clever about our $\nu$ sequence, we can explicitly cancel out the noise in these samples, leading to $1/t$ (instead of $1/\sqrt t$) convergence.
We generate a random orthonormal basis $B\in \mathbb R^{M\times M}$. We then generate our Rank-1 approximations as:
\begin{align}
\nu &\leftarrow \sqrt M \cdot B_{\bullet, mod(t, M)}\\
u_t &\leftarrow \nu \\
v_t &\leftarrow \nu \cdot A_t
\end{align}
Conditions:
1. **Pass**: clearly online
2a. **Fail sometimes**: Fails with periodic $A_t$ (see experiment)
2b. **Fail sometimes**: Fails with periodic $A_t$ (see experiment)
3. **Pass** (though we do have to store a fixed $M\times M$ matrix $B$)
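A minimal numpy sketch of this scheme (the class name `OrthogonalRank1` is ours, and we build $B$ via a QR decomposition of a Gaussian matrix, one way to obtain a random orthonormal basis; the note does not specify how $B$ is generated):

```python
import numpy as np

class OrthogonalRank1:
    """Rank-1 estimator that cycles through the columns of a fixed
    random orthonormal basis instead of drawing fresh signs each step."""

    def __init__(self, M, rng=np.random):
        # Random orthonormal basis B in R^{M x M} via QR of a Gaussian matrix.
        self.B, _ = np.linalg.qr(rng.randn(M, M))
        self.M = M
        self.t = 0

    def step(self, A):
        nu = np.sqrt(self.M) * self.B[:, self.t % self.M]  # nu <- sqrt(M) * B[:, t mod M]
        self.t += 1
        return nu, nu @ A                                   # (u_t, v_t)
```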
### 3. Iterated Convergence
The intuition behind this one is that we keep a running residual $A'$ and alternate between greedily choosing $u$ to minimize the residual given $v$, and choosing $v$ given $u$.
Initialization:
\begin{align}
v_0 &\leftarrow \mathcal N(0, 1)^N \\
A' &\leftarrow 0^{M\times N}
\end{align}
Then, for each step $t$:
\begin{align}
A' &\leftarrow A'+ A_t \\
u_t &\leftarrow \argmin{u_t}\left\|A'-u_t\otimes v_{t-1}\right\| = \frac{A'\cdot v_{t-1}}{\|v_{t-1}\|^2}\\
u_t &\leftarrow \frac{u_t}{\|u_t\|} \text{ ... for numerical stability}\\
v_t &\leftarrow \argmin{v_t}\left\|A'-u_t\otimes v_t\right\| = \frac{u_t\cdot A'}{\|u_{t}\|^2} = u_t\cdot A' \\
A' &\leftarrow A' - u_t\otimes v_t
\end{align}
(Note the similarity to herding / Sigma-Delta modulation):
\begin{align}
\phi &\leftarrow \phi + x \\
s &\leftarrow \argmin{s\in\mathbb Z} |\phi-s| \\
\phi &\leftarrow \phi - s\\
\end{align}
Conditions:
1. **Pass**: clearly online
2a. **Pass, Probably**: See experiment
2b. **Pass, Probably**: Converges at a rate of $1/T$
3. **Fail**: Requires storing $A'\in \mathbb R^{M\times N}$
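A minimal numpy sketch (the class name `IteratedRank1` is ours; note that it stores the full $M\times N$ residual $A'$, which is exactly the memory cost flagged in condition 3):

```python
import numpy as np

class IteratedRank1:
    """Rank-1 estimator that keeps a running residual A' and greedily
    fits (u, v) to it at each step (stores the full M x N residual)."""

    def __init__(self, M, N, rng=np.random):
        self.v = rng.randn(N)              # v_0 ~ N(0, 1)^N
        self.residual = np.zeros((M, N))   # A' <- 0

    def step(self, A):
        self.residual += A                                  # A' <- A' + A_t
        u = self.residual @ self.v / (self.v @ self.v)      # argmin_u ||A' - u (x) v_{t-1}||
        u /= np.linalg.norm(u)                              # normalize u for stability
        v = u @ self.residual                               # argmin_v ||A' - u (x) v|| (||u|| = 1)
        self.residual -= np.outer(u, v)                     # A' <- A' - u_t (x) v_t
        self.v = v
        return u, v
```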
# Experiments:
We try four experiments with different settings of $A_t\in \mathbb R ^{50\times 100}$.
We plot the norm of the mean error as a function of $T$:
$$
(\text{Norm of Mean Error}) = \frac1T \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\|
$$
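As a reference for how these curves are computed, here is a sketch of the evaluation loop, assuming each method is wrapped in an object with a `step(A)` method returning $(u_t, v_t)$, as in the class sketches above:

```python
import numpy as np

def mean_error_curve(As, estimator):
    """Feed each A_t to the estimator and record (1/T) ||sum_t (A_t - u_t (x) v_t)||."""
    err_sum = np.zeros_like(As[0])
    curve = []
    for T, A in enumerate(As, start=1):
        u, v = estimator.step(A)
        err_sum += A - np.outer(u, v)
        curve.append(np.linalg.norm(err_sum) / T)
    return curve
```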
## 1: Fixed A
Fixed Matrix: $A_t \leftarrow A_0 \sim \mathcal N (0, 1)^{M\times N}$
![Norm of mean error vs t, fixed A](https://drive.google.com/uc?id=0B4IfiNtPKeSAVGRkOUZfcGtsdUE&export=download)
## 2: Slowly changing A
If $A_t$ corresponds to $\pderiv{s_t}{\theta}$, this is the setting that most closely resembles the realistic scenario.
\begin{align}
\alpha &\leftarrow 0.01\\
A_0 &\sim \mathcal N (0, 1)^{M\times N} \\
A_t &\leftarrow (1-\alpha)\cdot A_{t-1} + \alpha\cdot \mathcal N(0, 1)^{M\times N}
\end{align}
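This generator is simple enough to sketch directly (a hypothetical helper, matching the update rule above):

```python
import numpy as np

def slowly_changing_A(M=50, N=100, alpha=0.01, rng=np.random):
    """Yield an exponentially smoothed random walk of matrices."""
    A = rng.randn(M, N)                                # A_0 ~ N(0, 1)^{M x N}
    while True:
        yield A
        A = (1 - alpha) * A + alpha * rng.randn(M, N)  # A_t from A_{t-1}
```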
![Norm of mean error vs t, slowly changing A](https://drive.google.com/uc?id=0B4IfiNtPKeSAVlAyZmxuRVMtdDg&export=download)
## 3: Completely Changing A
$$
A_t \sim \mathcal N(0, 1)^{M\times N}
$$
![Norm of mean error vs t, completely changing A](https://drive.google.com/uc?id=0B4IfiNtPKeSAa0dsQldLYjBoTjA&export=download)
(Note that the UORO and Orthogonal curves overlap each other here.)
## 4: Periodic A
To highlight a shortcoming of the orthogonal approach, we run an experiment where $A_t$ is periodic with period 100 (note that $M=50$):
$$
A_t \sim \begin{cases}\mathcal N(0, 1)^{M\times N} & \text{if } t<100\\A_{t-100} & \text{otherwise}\end{cases}
$$
![Norm of mean error vs t, periodic A](https://drive.google.com/uc?id=0B4IfiNtPKeSARl9ZQTd1UHh2STQ&export=download)
We see that our Orthogonal approximation is "biased", because of aliasing between the 50-periodic $\nu_t$ and the 100-periodic $A_t$.
# So... Now what?
Clearly our iterated convergence solution is pretty good. But it has the annoying drawback of requiring $\mathcal O(M\cdot N)$ memory ... which can be a lot. Can we find some way to implement the iterated algorithm more efficiently? Is there maybe some way we can exchange some bias for faster convergence?