@petered
Created October 20, 2017 08:13
# Temporal Networks
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
# The idea
Let
$(x, y)$ be the input and target data,
$u_1, \ldots, u_L$ be the pre-nonlinearity activations of a neural network,
$w_1, \ldots, w_L$ be the parameters, where $\cdot w(x) \triangleq x \cdot w$ denotes right-multiplication by $w$,
$h_l(\cdot)$ be the nonlinearity of the $l^{th}$ layer, and
$\mathcal{L}$ be the loss.
So our network function is:
$$
f(x) = (h_L \circ \cdot w_L \circ ... \circ h_1 \circ \cdot w_1)(x)
$$
Normally, we optimize parameters using the gradients $\pderiv{\mathcal{L}}{w_l}$, causing the parameters to step in a direction to reduce the cost.
What if our inputs are temporally redundant? By recalculating these derivatives at every frame, we are repeating a similar computation.
What if we take an alternative approach: let us define "fast" parameters $v_1, \ldots, v_L$. These variables are added to each layer's pre-nonlinearity activation on the forward pass, so the new network function is:
$$
f(x) = (h_L \circ +v_L \circ \cdot w_L \circ ... \circ h_1 \circ +v_1 \circ \cdot w_1)(x)
$$
where $+v(x) \triangleq x+v$.
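As a concrete illustration, here is a minimal numpy sketch of this modified forward pass, assuming fully-connected layers and a tanh nonlinearity (both are my illustrative choices, as are all names below):

```python
import numpy as np

def forward(x, ws, vs, h=np.tanh):
    """Forward pass with fast additive parameters: a_l = h(a_{l-1} @ w_l + v_l)."""
    a = x
    us, activations = [], []
    for w, v in zip(ws, vs):
        u = a @ w      # u_l = a_{l-1} . w_l
        a = h(u + v)   # the "+v_l" step, applied before the nonlinearity
        us.append(u)
        activations.append(a)
    return us, activations
```

With all $v_l = 0$ this reduces to the original network function $f$.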
On every training iteration, $v_1, \ldots, v_L$ are updated according to the following rule:
$$
\Delta v_L = -\eta_{fast} \cdot \pderiv{\mathcal{L}}{v_L} \\
\Delta v_l = -\eta_{fast} \cdot \pderiv{\lVert v_{l+1}\rVert_{L2}}{v_l}
$$
We can think of the $v$'s as measuring how far the network is "out of tune" with the desired state. The slow parameters simply try to minimize the $v$'s: if $a_{l}$ is the post-nonlinearity activation of layer $l$ and $v_l$ is the fast parameter, our update is
$$
\Delta w_l = \eta \cdot a_{l-1} \odot v_l
$$
The result of this update is that on the next iteration (with the same input), $u_l = a_{l-1} \cdot w_l$ will be closer to the value $u_l+v_l$ reached at the convergence point of the last iteration, so $v_l$ will not need to grow as large.
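A sketch of this slow update for a single fully-connected layer, treating $a_{l-1}$ and $v_l$ as vectors so that the product becomes an outer product (an assumption on my part):

```python
import numpy as np

def slow_update(w, a_prev, v, eta=0.01):
    """One slow step: Delta w_l = eta * outer(a_{l-1}, v_l).

    The updated pre-activation a_prev @ w moves in the direction of v,
    i.e. toward the 'tuned' value u_l + v_l, so v_l can shrink.
    """
    return w + eta * np.outer(a_prev, v)
```

After this step the pre-activation changes by $\eta \, (a_{l-1}\cdot a_{l-1})\, v_l$, i.e. exactly in the direction of $v_l$.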
# Alternative: v just for hidden
An alternative is to have v's just for the hidden layers (and not the output). Then our updates are
$$
\Delta v_{L-1} = -\eta_{fast} \cdot \pderiv{\mathcal{L}}{v_{L-1}} \\
\Delta v_l = -\eta_{fast} \cdot \pderiv{\lVert v_{l+1}\rVert_{L2}}{v_l} \qquad l\in[1 .. L-2]
$$
# What is it good for?
We want to have a method for training the network in the setting where:
1) Data changes smoothly with time
2) Distant parts of the network are "decoupled": unlike backprop, we don't need to wait for some computation elsewhere to finish before making our update.
3) The amount of computation is proportional to the amount of change in the data.
# And how exactly does this help?
We remove the "locking" by only doing forward passes and computing gradients based on the $v$ state of the next layer. This does mean that the information about the target is slightly stale.
# Relation to Backprop:
Let's analyze the update for $w_{L-1}$.
We first do $N_{steps}$ steps of computation for the $v$'s (the fast parameters):
$$
\begin{align}
\Delta v_{L-1} &\propto -\pderiv{}{v_{L-1}} \left( \mathcal{L} + \lambda \lVert v_{L-1}\rVert_{L2}\right) \\
&= -\pderiv{\mathcal{L}}{u_L} \cdot w_L^T \odot h'(u_{L-1}+v_{L-1}) - \lambda v_{L-1}\\
\Delta v_{L-2} &\propto -\pderiv{}{v_{L-2}} \left(\left(z_{L-1}-u_{L-1}\right)^2+\lambda \lVert v_{L-2}\rVert_{L2}\right) \Big|_{z_{L-1} = const}\\
&= v_{L-1}\cdot w_{L-1}^T \odot h'(u_{L-2} + v_{L-2}) - \lambda \cdot v_{L-2} \\
\end{align}
$$
And then update the w's (the slow parameters):
$$
\begin{align}
\Delta w_L &\propto -\pderiv{\mathcal{L}}{w_L} \\
\Delta w_{L-1} &\propto a_{L-2} \odot v_{L-1} \\
\Delta w_{L-2} &\propto a_{L-3} \odot v_{L-2} \\
\end{align}
$$
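Putting the fast and slow phases together, one hedged end-to-end sketch (fully-connected tanh layers, squared-error loss, a single sample; `train_step`, `dh`, and all step sizes are my illustrative choices, not from the note):

```python
import numpy as np

def dh(u):
    # derivative of tanh, used for h'(u_l + v_l)
    return 1.0 - np.tanh(u) ** 2

def train_step(x, y, ws, vs, eta_fast=0.1, eta=0.01, lam=0.5, n_steps=5):
    L = len(ws)
    for _ in range(n_steps):
        # forward pass with the fast parameters added pre-nonlinearity
        a, us, prevs = x, [], []
        for w, v in zip(ws, vs):
            prevs.append(a)
            u = a @ w
            us.append(u)
            a = np.tanh(u + v)
        # fast updates, top layer first: the top v descends the loss,
        # lower v's chase the v of the layer above
        err = a - y                              # dL/da_L for L = 0.5|a_L - y|^2
        grad_top = err * dh(us[-1] + vs[-1])     # dL/dv_L
        vs[-1] = vs[-1] - eta_fast * (grad_top + lam * vs[-1])
        for l in range(L - 2, -1, -1):
            msg = vs[l + 1] @ ws[l + 1].T        # v_{l+1} . w_{l+1}^T
            vs[l] = vs[l] + eta_fast * (msg * dh(us[l] + vs[l]) - lam * vs[l])
    # slow updates: Delta w_l = eta * outer(a_{l-1}, v_l)
    for l in range(L):
        ws[l] = ws[l] + eta * np.outer(prevs[l], vs[l])
    return ws, vs
```

The $v$'s are deliberately not reset between calls, matching the temporal setting where consecutive inputs are similar.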
If the $v$'s are reset between training examples, $N_{steps}=1$, and we compute the $v$'s sequentially, backwards, our updates are proportional to the gradients, because:
$$
\begin{align}
z_{L-2}-u_{L-2} &= \Delta v_{L-2} \\
&\propto z_{L-1}-u_{L-1} \\
&\propto \Delta v_{L-1}
\end{align}
$$
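This equivalence can be checked numerically. The sketch below assumes a two-layer tanh network with squared-error loss, $v$'s starting at zero (so the $\lambda$-damping terms vanish), and one sequential backward sweep; all variable names are illustrative:

```python
import numpy as np

def dh(u):
    return 1.0 - np.tanh(u) ** 2  # derivative of tanh

rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))
x, y = rng.standard_normal(4), rng.standard_normal(2)
eta = 0.1

# forward pass with v's reset to zero
u1 = x @ w1;  a1 = np.tanh(u1)
u2 = a1 @ w2; a2 = np.tanh(u2)

# standard backprop gradients for L = 0.5*|a2 - y|^2
d2 = (a2 - y) * dh(u2)           # dL/du2
d1 = (d2 @ w2.T) * dh(u1)        # dL/du1
g1, g2 = np.outer(x, d1), np.outer(a1, d2)

# one sequential backward sweep of fast updates
v2 = -eta * d2                   # Delta v_2 = -eta * dL/dv_2
v1 = eta * (v2 @ w2.T) * dh(u1)  # Delta v_1 from the layer above's v

# slow updates Delta w_l = outer(a_{l-1}, v_l) are proportional
# to the negative backprop gradients (with per-layer constants)
assert np.allclose(np.outer(a1, v2), -eta * g2)
assert np.allclose(np.outer(x, v1), -eta**2 * g1)
```

Note the proportionality constant differs per layer ($\eta$ vs. $\eta^2$), but within each layer the slow update is an exact rescaling of the gradient.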
If $\lambda=0$, training fails. It appears that the $v$'s get stuck at certain values and maintain them between samples.
$\lambda=\frac12$ is a bit of a strange case, because $\pderiv{}{v_l}\frac12 \lVert v_l\rVert_{L2} = v_l$, so the old value of $v_l$ is completely removed from the update.
# Alternative view: combining the influence of forward/backward pass
We can also look at our rule as follows:
Our $z$, the neural state variable, is a compromise between what the forward pass and the backward pass each want the state of the network to be.
The forward pass communicates a target for $z$, and the backward pass communicates a desired change in $z$. Thus, if we have:
$$
u_l = a_{l-1}\cdot w_l \\
\mathcal{L}_l = \lVert u_l-z_l\rVert_{L2} \\
\Delta z_l \propto -\left(\pderiv{\mathcal{L}_l}{z_l} + \lambda \pderiv{\mathcal{L}_{l+1}}{z_l}\right)
$$
And consider $z$ to be our fast variable instead of $v$, we get the same rule (I think).
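A tiny sketch of this $z$-based view for one layer, with `grad_from_above` standing in for $\pderiv{\mathcal{L}_{l+1}}{z_l}$ (all names and step sizes are my illustrative choices):

```python
import numpy as np

def z_step(z, u, grad_from_above, step=0.1, lam=0.5):
    """Relax the state z_l toward the forward-pass target u_l while also
    following the change requested by the layer above."""
    local = 2.0 * (z - u)  # d/dz of |u - z|^2
    return z - step * (local + lam * grad_from_above)
```

With no signal from above, repeated steps simply pull $z$ onto the forward-pass value $u$.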
Now, we have a problem with delays. If we are updating layers independently, it takes a few time-steps before new information from the ends of the network propagates down, and we achieve stable values for all variables. If we want our network to settle as fast as possible, we should be trying to predict *past* z's.
# Relation to Equilibrium propagation.
The [Equilibrium Propagation](https://arxiv.org/abs/1602.05179) paper