# dmurfet/opt_alg.md

Last active May 9, 2019
Optimisation algorithms

# Optimisation algorithms for deep RL

The optimisation algorithm used in most of DeepMind's deep RL papers is RMSProp (e.g. in the Mnih et al. Atari paper, in the IMPALA paper, in the RL experiments of the PBT paper, and in the Zambaldi et al. paper). I have seen speculation online that this is because RMSProp may be well-suited to deep learning on non-stationary distributions. In this note I examine the RMSProp algorithm and specifically the significance of the `epsilon` hyperparameter.

Often in the literature RMSProp is presented as a variation of AdaGrad (e.g. in the deep learning textbook and in Karpathy's class). However, I think this is misleading, and that the explanation in Hinton's lecture is (not surprisingly) more comprehensible. In the following `g` always means the gradient, with components `g_i`; `lr` is the global learning rate; `epsilon` is some (supposedly small) parameter; and the `x_i` are the weights.

## The algorithms

```
AdaGrad
=======

r_i <- r_i + g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i + epsilon}
```

Ignoring `epsilon`, which is set to `1e-7` in the deep learning textbook and is included in principle only for numerical stability of the division, AdaGrad works by gradually suppressing the learning rate for the weight `x_i` according to how much it has already been updated by gradient descent. This is often undesirable in deep learning, as the textbook explains. We include it here because people often explain RMSProp by reference to AdaGrad.
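To make the suppression effect concrete, here is a toy single-weight sketch (the function name and constants are ours, purely for illustration):

```python
import math

def adagrad_step(x, r, g, lr=0.1, epsilon=1e-7):
    """One AdaGrad update for a single weight (illustrative sketch)."""
    r = r + g ** 2                          # accumulated squared gradient, never decays
    x = x - lr * g / math.sqrt(r + epsilon)
    return x, r

# With a constant gradient of 1.0 the accumulator r grows linearly,
# so the effective step lr/sqrt(r) shrinks like lr/sqrt(t).
x, r = 0.0, 0.0
step_sizes = []
for t in range(100):
    x_new, r = adagrad_step(x, r, 1.0)
    step_sizes.append(x - x_new)
    x = x_new

print(step_sizes[0], step_sizes[-1])  # roughly 0.1 and 0.01
```

Even though the gradient never changes, the update size has fallen by a factor of ten after a hundred steps.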

To understand RMSProp you must first understand Rprop. This is how it is explained in Hinton's lecture notes.

```
Rprop
=====

r_i <- g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i + epsilon}
```

The idea here is to use only the sign of the gradient. So every weight is updated by the same absolute amount `lr` in each step; only the sign `g_i / |g_i|` differs. The problems this causes for stochastic gradient descent are obvious, and the lecture notes describe them, leading to

```
RMSprop
=======

r_i <- p r_i + (1-p) g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i + epsilon}
```
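As a sanity check, here is a minimal single-weight sketch of this update (our own toy code, with the `epsilon` under the square root):

```python
import math

def rmsprop_step(x, r, g, lr=0.001, p=0.9, epsilon=1e-10):
    """One RMSprop update for a single weight (illustrative sketch)."""
    r = p * r + (1 - p) * g ** 2            # moving average of the squared gradient
    x = x - lr * g / math.sqrt(r + epsilon)
    return x, r

# With a constant gradient, r converges to g^2 and so the step size
# approaches lr, regardless of the magnitude of the gradient.
x, r = 0.0, 0.0
for _ in range(200):
    x_prev = x
    x, r = rmsprop_step(x, r, 0.5)

print(x_prev - x)  # close to lr = 0.001
```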

There are some slight differences between implementations. In the deep learning textbook the `epsilon` is under the square root, and that is also the convention here. In the C++ implementation of RMSProp in TensorFlow we see

```cpp
ms_ = ms_ * ms_.constant(rho_scalar) + ...
mom_ = mom_ * mom_.constant(momentum_scalar) +
       (ms_ + ms_.constant(epsilon_scalar)).rsqrt() * ...
```

so clearly here the `epsilon` is also under the square root. However, the Keras implementation and other references online put the `epsilon` outside (which also makes sense). Since our code actually uses the TensorFlow implementation, we will use that convention here. There is some terminological confusion, with "decay rate" used to refer to several different things. In the deep learning textbook, Karpathy's class and the TensorFlow code, "decay rate" refers to the `p` above, which decays the running total of the squared gradients. According to Karpathy this usually takes a value in `[0.9, 0.99, 0.999]`. For example, the TensorFlow defaults for RMSProp are:

```python
__init__(
    learning_rate,
    decay=0.9,
    momentum=0.0,
    epsilon=1e-10,
    use_locking=False,
    centered=False,
    name='RMSProp'
)
```

The RLlib defaults for RMSprop in IMPALA are

```python
"decay": 0.99,
"momentum": 0.0,
"epsilon": 0.1,
```

We do not consider momentum, since it is turned off in Zambaldi et al. and in the IMPALA paper, and by default also in RLlib. Note the large discrepancy in the value of `epsilon`; we'll come back to that. In Keras's RMSprop the defaults are `lr=0.001, rho=0.9, epsilon=1e-7, decay=0.0`. From these values, and from the implementation below, we can see that what Keras calls `rho` is the decay rate `p` above, while Keras's `decay` is something else entirely: a global decay schedule for the learning rate, independent of RMSProp. We will ignore the latter, and keep in mind that what Keras calls `rho` is our decay rate:

```python
new_a = self.rho * a + (1. - self.rho) * K.square(g)
new_p = p - lr * g / (K.sqrt(new_a) + self.epsilon)
```
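The choice of convention is not entirely cosmetic. A quick numerical comparison (a toy sketch with made-up values; `step_eps_inside` follows the TensorFlow convention, `step_eps_outside` the Keras one):

```python
import math

def step_eps_inside(g, r, lr=0.001, epsilon=1e-7):
    # TensorFlow-style: epsilon under the square root
    return lr * g / math.sqrt(r + epsilon)

def step_eps_outside(g, r, lr=0.001, epsilon=1e-7):
    # Keras-style: epsilon added after taking the square root
    return lr * g / (math.sqrt(r) + epsilon)

# For gradients much larger than sqrt(epsilon) the two conventions agree...
print(step_eps_inside(1.0, 1.0), step_eps_outside(1.0, 1.0))

# ...but for tiny gradients they diverge by orders of magnitude.
tiny = 1e-6
print(step_eps_inside(tiny, tiny**2), step_eps_outside(tiny, tiny**2))
```

So for weights whose gradients are persistently small relative to `sqrt{epsilon}`, the two conventions produce very different step sizes, which is worth keeping in mind when comparing hyperparameters across codebases.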

Note that in our experiments using Ray/RLlib/IMPALA we are calling the underlying TF `RMSPropOptimizer`, not Keras's version, as the following code in `VTracePolicyGraph` shows:

```python
@override(TFPolicyGraph)
def optimizer(self):
    ...
    return tf.train.RMSPropOptimizer(self.cur_lr, self.config["decay"],
                                     self.config["momentum"],
                                     self.config["epsilon"])
```

This verifies that what we are setting as `decay` in our notebooks is indeed the `p` above.

## Understanding RMSprop

The idea of RMSprop, as presented in Hinton's lectures, is a mini-batch version of Rprop: instead of dividing by a different number in every mini-batch (namely, the absolute value of the gradient), we force this number to be similar across adjacent mini-batches by keeping a moving average `r_i` of the squared gradient. That makes sense, and the `p -> 0` limit is Rprop. I do not think it makes sense to compare RMSprop to AdaGrad.
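The `p -> 0` limit is easy to check numerically (a toy sketch with `epsilon = 0`):

```python
import math

def rprop_limit_step(x, g, lr=0.01):
    # RMSprop with p = 0 and epsilon = 0: r = g^2, so g/sqrt(r) is just sign(g)
    r = g ** 2
    return x - lr * g / math.sqrt(r)

# Gradients differing by sixteen orders of magnitude give identical updates:
print(rprop_limit_step(0.0, 1e-8))  # -0.01
print(rprop_limit_step(0.0, 1e+8))  # -0.01
```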

Here's how I think about RMSprop. Let us take `epsilon = 0` for a moment, and imagine ourselves in the RL setting, with `x_i` a weight in the policy network. We can distinguish roughly three cases:

* **stable case:** in recent episodes the sensitivity of the discounted reward to the weight `x_i` has been stable, that is, `g_i` has had a stable value, so `sqrt{r_i}` is very close to `|g_i|` and approximately `x_i <- x_i - lr` or `x_i <- x_i + lr` depending on the sign of `g_i`.

* **unstable case, increasing importance:** in recent episodes the sensitivity of the discounted reward to the weight `x_i` has increased, that is, `g_i^2 > r_i`, so `|g_i| > sqrt{r_i}` and hence `x_i` is changed by an amount more than `lr`.

* **unstable case, decreasing importance:** in recent episodes the sensitivity of the discounted reward to the weight `x_i` has decreased, that is, `g_i^2 < r_i`, so `|g_i| < sqrt{r_i}` and hence `x_i` is changed by an amount less than `lr`.
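A toy numerical illustration of the three cases (with `epsilon = 0`; the values of `r` and `g` are made up):

```python
import math

def multiplier(r, g):
    """The factor |g|/sqrt(r) scaling the base step size lr (epsilon = 0)."""
    return abs(g) / math.sqrt(r)

# stable: g has hovered around 0.2, so r ~ g^2 = 0.04 and the step is ~lr
print(multiplier(r=0.04, g=0.2))   # 1.0

# increasing importance: g jumps well above its recent history
print(multiplier(r=0.04, g=2.0))   # 10.0

# decreasing importance: g falls well below its recent history
print(multiplier(r=0.04, g=0.02))  # 0.1
```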

From this point of view the factor multiplying the learning rate acts a bit like the TD error in RL: it measures how "surprising" recent changes to the relevance of the weight are. Suppose some weight in the network has been irrelevant to training until a certain point (so `g_i = 0` and hence `r_i = 0`), at which the agent suddenly figures out a new behaviour and this weight becomes very important (so `g_i >> r_i`). Then RMSprop will prioritise learning of this weight by bestowing on it a large learning rate.

This kind of arrangement seems particularly useful in RL, because the scale of the gradients `g_i` involves multiplicative factors (from e.g. the policy-gradient term in the loss) such as the TD error (to which that term is proportional) and the probability of taking certain actions in certain states (to which it is inversely proportional), and these can easily change by orders of magnitude during training as the agent discovers new behaviours.

NOTE: in, say, RLlib/IMPALA we are not directly implementing the gradient steps you would see for an actor-critic method with policy gradient in Sutton & Barto. Those equations presume straightforward `x_i <- x_i - g_i` gradient descent. Instead, in `VTracePolicyGraph` the loss that RMSProp is optimising is

```python
# The summed weighted loss
self.total_loss = (self.pi_loss + self.vf_loss * vf_loss_coeff -
                   self.entropy * entropy_coeff)
```

with the meaning of these terms as elaborated in the IMPALA paper. So the gradient vectors you see in Section 4.2 of the paper contribute to the `g_i`'s above, with the stated coefficients.

## Mystery of the large epsilon

The main mystery here is the role of the hyperparameter `epsilon`. It is not in Hinton's notes, and appears to have been introduced purely as a way to avoid divide-by-zero problems (that is how it is presented in the deep learning textbook and the Keras source code, for instance). That suggests we should pick a small number (e.g. `1e-6` in the deep learning textbook, `1e-7` in Keras, or `1e-10` in TF) and forget about it. But then we look at its default value in IMPALA and think... wait, `0.1` is not a small number. Here are some papers along with their choices of `epsilon` for RMSprop:

```
Zambaldi et al.: epsilon = 0.1
PBT paper:       epsilon in [1e-1, 1e-3, 1e-5, 1e-7]
```

Along with the learning rate and entropy cost, the RMSprop `epsilon` is highlighted in the IMPALA paper as a key hyperparameter for variation by PBT (although the PBT paper itself does not mention `epsilon`). So `epsilon` has clearly come to play a more significant role than just avoiding division by zero. In the TensorFlow documentation for AdamOptimizer (where `epsilon` plays a very similar role) we find:

> The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.

Apparently this trend originated in the Inception v3 paper, one of whose authors notes that:

> Setting eps=1 was essential for training to work well using async SGD, and was the only way to get gains from RMSProp over simple momentum. I agree with you though, it's very unsatisfying that we have to use this high a number for a parameter that should in theory merely be a way to not divide by zero. It might be that you don't need such a high epsilon when using sync SGD. Someone should try :)

So the use of `epsilon = 0.1` is a sort of emerging community consensus, at least for some tasks (which appear to include some deep RL problems). Note that for Adam everyone puts the `epsilon` outside the square root, for some reason.

To understand what is going on, let us rewrite the equations as follows:

```
RMSprop (large epsilon)
=======================

r_i <- p r_i + (1-p) g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i} * 1/sqrt{1 + epsilon/r_i}
     = x_i - lr * g_i/sqrt{r_i} * S(sqrt{r_i}/sqrt{epsilon})
```

where `S` is a sigmoid-shaped function (i.e. a monotonic increasing function which is concave for `u > 0` and increases from `0` at `u = 0` to asymptote horizontally to `1`):

```
S(u) = 1/sqrt{1 + 1/u^2}
```

Written in this way, we see a new multiplicative factor in the optimisation algorithm, namely `S(sqrt{r_i}/sqrt{epsilon})`. Note that `sqrt{r_i}` is a moving average of `|g_i|`. Recall that the original purpose of Rprop was to use only the sign of the gradient, via `lr * g_i/sqrt{r_i}`, and in the stable case this is what happens. The new factor, however, reinserts the size of the gradient, scaled by the sigmoid to lie in the unit interval.

In the limit `epsilon -> 0` the outputs of the sigmoid are squashed up near `1`, so this factor disappears. But as we take `epsilon -> 1` we see something like the sigmoid `S(sqrt{r_i})`, which means that for large stable gradients we get updates of size `lr`, while for small stable gradients we get small updates. The fact that our RL algorithms start with `epsilon = 0.1` means that we start in this regime, with an effective sigmoid applied to the running average of the gradient norm. Decreasing `epsilon` from this value (as PBT sometimes does) has the effect of returning us to the original RMSprop idea: the new factor is approximately one everywhere, and the gradient steps depend only on the sign of the gradient, not its magnitude.
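To see the interpolation numerically (a toy sketch; the values of `r` are made up):

```python
import math

def S(u):
    # the sigmoid-shaped factor S(u) = 1/sqrt(1 + 1/u^2)
    return 1.0 / math.sqrt(1.0 + 1.0 / u ** 2)

def factor(r, epsilon):
    # the extra factor 1/sqrt(1 + epsilon/r) = S(sqrt(r/epsilon))
    return S(math.sqrt(r / epsilon))

# Tiny epsilon: the factor is ~1 whatever the gradient scale (plain RMSprop).
print(factor(r=1e-4, epsilon=1e-10))  # ~1.0

# epsilon = 0.1: small stable gradients are smoothly suppressed,
# while large stable gradients pass through almost unchanged.
print(factor(r=1e-4, epsilon=0.1))    # ~0.03
print(factor(r=1.0, epsilon=0.1))     # ~0.95
```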

CONCLUSION: RMSprop with varying epsilon interpolates between ordinary RMSprop as `epsilon -> 0` and RMSprop with smoothed gradient clipping as `epsilon -> 1`.