The optimisation algorithm used in most of DeepMind's deep RL papers is RMSProp (e.g. in the Mnih et al Atari paper, in the IMPALA paper, in the RL experiments of the PBT paper, and in the Zambaldi et al paper). I have seen speculation online that this is because RMSProp may be well-suited to deep learning on non-stationary distributions. In this note I try to examine the RMSProp algorithm and specifically the significance of the epsilon hyperparameter. The references are
- Karpathy's CS231n notes
- Hinton's lecture notes
- The deep learning textbook Chapter 8
Often in the literature RMSProp is presented as a variation of AdaGrad (e.g. in the deep learning textbook and in Karpathy's class). However, I think this is misleading, and that the explanation in Hinton's lecture is (not surprisingly) more comprehensible. In the following g always means the gradient with components g_i, lr is the global learning rate, epsilon is some (supposedly small) parameter, and x_i are the weights.
AdaGrad
=======
r_i <- r_i + g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i + epsilon}
Ignoring epsilon, which is set to 1e-7 in the deep learning textbook and is inserted in principle only for numerical stability of the division, AdaGrad works by gradually suppressing the learning rate for the weight x_i according to how much it has already been updated by gradient descent. This is often undesirable in deep learning, as the textbook explains. We include it here because people often explain RMSProp by reference to AdaGrad.
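As a sanity check, here is a minimal numpy sketch of the AdaGrad update above (the function name and constants are my own), showing the effective step shrinking as the accumulator r_i grows:

```python
import numpy as np

def adagrad_step(x, g, r, lr=0.1, eps=1e-7):
    # Accumulate the squared gradient, then divide the step by its square root
    r = r + g ** 2
    x = x - lr * g / np.sqrt(r + eps)
    return x, r

# With a constant gradient the effective step decays like lr/sqrt(t)
x, r = 0.0, 0.0
steps = []
for t in range(1, 5):
    x_new, r = adagrad_step(x, 1.0, r)
    steps.append(x - x_new)
    x = x_new
print(steps)  # approximately [0.1, 0.0707, 0.0577, 0.05]
```

This is the learning-rate suppression the textbook warns about: the denominator only ever grows, so a weight that has seen large gradients in the past learns slowly forever after.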
To understand RMSProp you must first understand Rprop. This is how it is explained in Hinton's lecture notes.
Rprop
=====
r_i <- g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i + epsilon}
The idea here is to use only the sign of the gradient. So every weight is updated by the same absolute amount lr in each step; only the sign g_i / |g_i| is different. The problems with this in stochastic gradient descent are obvious and the lecture notes describe them, leading to
RMSprop
=======
r_i <- p r_i + (1-p) g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i + epsilon}
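A minimal numpy sketch of this update (naming is mine, with epsilon under the square root), illustrating the Rprop-like behaviour: once r_i converges, a steady gradient produces steps of size lr regardless of how small the gradient is.

```python
import numpy as np

def rmsprop_step(x, g, r, lr=0.01, p=0.9, eps=1e-10):
    # Exponential moving average of the squared gradient
    r = p * r + (1 - p) * g ** 2
    # Divide the gradient by the root of the moving average (epsilon under the root)
    x = x - lr * g / np.sqrt(r + eps)
    return x, r

# With a steady gradient, r converges to g^2 and each step has size ~ lr,
# no matter how tiny the gradient itself is
x, r = 0.0, 0.0
for _ in range(200):
    x_prev = x
    x, r = rmsprop_step(x, g=1e-3, r=r)
step = abs(x - x_prev)
print(step)  # close to lr = 0.01
```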
There are some slight differences between implementations. In the deep learning textbook the epsilon is under the square root, and this is also the convention here. In the C++ implementation of RMSProp in TensorFlow we see
ms_ = ms_ * ms_.constant(rho_scalar) +
      grad_.square() * grad_.constant(T(1) - rho_scalar);
mom_ = mom_ * mom_.constant(momentum_scalar) +
       (ms_ + ms_.constant(epsilon_scalar)).rsqrt() *
       ms_.constant(lr_scalar) * grad_;
so clearly here the epsilon is also under the square root. However, the Keras implementation and other references online have the epsilon outside (which also makes sense). Since our code actually uses the TensorFlow implementation, we will use that convention here. There is some terminological confusion, with "decay rate" used to refer to several different things. In the deep learning textbook, Karpathy's class and the TensorFlow code, "decay rate" refers to the p above, which decays the running total of the squared gradients. This usually has a value in {0.9, 0.99, 0.999} according to Karpathy. For example the TensorFlow defaults for RMSProp are:
__init__(
    learning_rate,
    decay=0.9,
    momentum=0.0,
    epsilon=1e-10,
    use_locking=False,
    centered=False,
    name='RMSProp'
)
The RLlib defaults for RMSprop in IMPALA are
"decay": 0.99,
"momentum": 0.0,
"epsilon": 0.1,
We do not consider momentum, since it is turned off in Zambaldi et al and in the IMPALA paper, and by default also in RLlib. Note the large discrepancy in the value of epsilon; we'll come back to that. In Keras RMSProp the defaults are lr=0.001, rho=0.9, epsilon=1e-7, decay=0.0. From the values and also the implementation below we can see that what Keras calls rho is the decay rate p above, while Keras's decay is something else, namely a global decay schedule for the learning rate, independent of RMSProp. So we will ignore this, and keep in mind that what Keras calls rho is our decay rate:
new_a = self.rho * a + (1. - self.rho) * K.square(g)
new_p = p - lr * g / (K.sqrt(new_a) + self.epsilon)
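The two conventions (epsilon inside vs outside the square root) only diverge when r_i is small relative to epsilon, which is exactly the regime that matters for large epsilon. A quick numerical check with illustrative values:

```python
import numpy as np

eps = 0.1
r = 1e-6  # second-moment estimate far below epsilon

inside = 1.0 / np.sqrt(r + eps)     # TF convention: ~ 1/sqrt(eps)
outside = 1.0 / (np.sqrt(r) + eps)  # Keras convention: ~ 1/eps

print(inside, outside)  # ~3.16 vs ~9.90: quite different effective scalings

# For r >> eps the two conventions roughly agree
r_big = 1.0
print(1.0 / np.sqrt(r_big + eps), 1.0 / (np.sqrt(r_big) + eps))  # ~0.95 vs ~0.91
```

So for tiny epsilon the choice is immaterial, but with epsilon = 0.1 the placement changes the update scale for quiet weights by a constant factor of about 1/sqrt(eps) vs 1/eps.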
Note that in our experiments using Ray/RLlib/IMPALA we are calling the underlying TF RMSPropOptimizer, not Keras's version, according to the following code in VTracePolicyGraph:
@override(TFPolicyGraph)
def optimizer(self):
    if self.config["opt_type"] == "adam":
        return tf.train.AdamOptimizer(self.cur_lr)
    else:
        return tf.train.RMSPropOptimizer(self.cur_lr, self.config["decay"],
                                         self.config["momentum"],
                                         self.config["epsilon"])
This verifies that what we are setting as decay in our notebooks is indeed the p above.
The idea of RMSprop, as presented in Hinton's lectures, is a mini-batch version of Rprop where instead of dividing by a different number in every mini-batch (namely, the absolute value of the gradient) we force this number to be similar for adjacent mini-batches by keeping a moving average of the squared gradient r_i. That makes sense, and the p -> 0 limit is Rprop. I do not think it makes sense to compare RMSprop to AdaGrad.
Here's how I think about RMSprop. Let us take epsilon = 0 for a moment. We imagine ourselves in the RL setting, with x_i a weight in the policy network. We can distinguish roughly three cases:
- stable case: in recent episodes the sensitivity of the discounted reward to the weight x_i has been stable, that is, g_i has had a stable value, so sqrt{r_i} is very close to |g_i| and approximately x_i <- x_i - lr or x_i <- x_i + lr depending on the sign of g_i.
- unstable case, increasing importance: in recent episodes the sensitivity of the discounted reward to the weight x_i has increased, that is, g_i^2 > r_i, so |g_i| > sqrt{r_i} and hence x_i is changed by an amount more than lr.
- unstable case, decreasing importance: in recent episodes the sensitivity of the discounted reward to the weight x_i has decreased, that is, g_i^2 < r_i, so |g_i| < sqrt{r_i} and hence x_i is changed by an amount less than lr.
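The three cases can be checked numerically; a sketch (my own naming) with epsilon = 0 and an assumed fixed running average r_i:

```python
import numpy as np

def update_magnitude(g, r, lr=0.01):
    # |step| = lr * |g| / sqrt(r), taking epsilon = 0
    return lr * abs(g) / np.sqrt(r)

lr, r = 0.01, 0.25  # running average of squared gradients, sqrt(r) = 0.5

stable = update_magnitude(g=0.5, r=r, lr=lr)      # g^2 == r: step == lr
increasing = update_magnitude(g=1.0, r=r, lr=lr)  # g^2 > r: step > lr
decreasing = update_magnitude(g=0.1, r=r, lr=lr)  # g^2 < r: step < lr
print(stable, increasing, decreasing)  # 0.01, 0.02, 0.002
```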
From this point of view the factor we multiply with the learning rate acts a bit like the TD error in RL, in the sense that it measures how "surprising" recent changes to the relevance of the weight are. Suppose some weight in the network has been irrelevant to training until a certain point (so g_i = 0 and so r_i = 0), at which the agent suddenly figures out a new behaviour and this weight is now very important (so g_i >> r_i). Then RMSprop will prioritise learning of this weight by bestowing on it a large learning rate.
This kind of arrangement seems particularly useful in RL, because the scale of the gradients g_i involves multiplicative factors (e.g. from the term in the loss that corresponds to the policy gradient) such as the TD error (to which that term is proportional) and the probability of taking certain actions in certain states (to which it is inversely proportional), which can easily change by orders of magnitude during training as the agent discovers new behaviours.
NOTE: in say RLlib/IMPALA we are not directly implementing the gradient steps you would see for an actor-critic method with policy gradient, say in Sutton & Barto. Those equations presume straightforward x_i <- x_i - g_i gradient descent. Instead, in VTracePolicyGraph the loss that RMSProp is optimising is
# The summed weighted loss
self.total_loss = (self.pi_loss + self.vf_loss * vf_loss_coeff -
                   self.entropy * entropy_coeff)
with the meaning of these terms as elaborated in the IMPALA paper. So the gradient vectors you see in Section 4.2 of the paper contribute to the g_i's above, with the stated coefficients.
The main mystery here is the role of the hyperparameter epsilon. It is not in Hinton's notes, and appears to have been introduced purely as a way to avoid division-by-zero problems (e.g. that is how it is presented in the deep learning textbook and the Keras source code). That suggests we should pick a small number (e.g. 1e-6 in the deep learning textbook, 1e-7 in Keras, or 1e-10 in TF) and forget about it. But then we look at its default value in IMPALA and think... wait, 0.1 is not a small number. Here are some papers along with their choices of epsilon for RMSProp:
- Zambaldi et al: epsilon = 0.1
- PBT paper: epsilon in [1e-1, 1e-3, 1e-5, 1e-7]
Along with the learning rate and entropy cost, the RMSprop epsilon is highlighted as a key hyperparameter for variation by PBT in the IMPALA paper (although in the PBT paper itself they do not mention epsilon). So epsilon has clearly come to play a more significant role than just avoiding division by zero. In the TensorFlow documentation for AdamOptimizer (where epsilon plays a very similar role) we find
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
Apparently this trend originated in the Inception v3 paper where one of the authors notes that
Setting eps=1 was essential for training to work well using async SGD, and was the only way to get gains from RMSProp over simple momentum. I agree with you though, it's very unsatisfying that we have to use this high a number for a parameter that should in theory merely be a way to not divide by zero. It might be that you don't need such a high epsilon when using sync SGD. Someone should try :)
So the use of epsilon = 0.1 is a sort of emerging community consensus, at least for some tasks (which appear to include some deep RL problems). Note that everyone puts the epsilon outside the square root for Adam, for some reason.
To understand what is going on, let us rewrite the equations as follows:
RMSprop (large epsilon)
=======
r_i <- p r_i + (1-p) g_i^2
x_i <- x_i - lr * g_i/sqrt{r_i} * 1/sqrt{1 + epsilon/r_i}
= x_i - lr * g_i/sqrt{r_i} * S(sqrt{r_i}/sqrt{epsilon})
where S is a sigmoid-shaped function (i.e. a monotonic increasing function which is concave and increases from 0 at u = 0 to asymptote horizontally to 1):
S(u) = 1/sqrt{1 + 1/u^2}
Written in this way we see a new multiplicative factor in the optimisation algorithm, given by S(sqrt{r_i}/sqrt{epsilon}). Note that sqrt{r_i} is roughly a moving average of |g_i|. Recall that the original purpose of Rprop was to use only the sign of the gradient, via lr * g_i/sqrt{r_i}, and in the stable case this is what happens. However, this new factor reinserts the size of the gradient, but scaled by the sigmoid to lie in the unit interval.
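Evaluating S at a few points confirms the sigmoidal shape (the helper below is my own):

```python
import numpy as np

def S(u):
    # S(u) = 1/sqrt(1 + 1/u^2): monotonically increasing from 0 towards 1
    return 1.0 / np.sqrt(1.0 + 1.0 / u ** 2)

print(S(0.1), S(1.0), S(10.0))  # ~0.0995, ~0.707, ~0.995
```

So for sqrt{r_i} well below sqrt{epsilon} the factor is roughly linear in sqrt{r_i} (suppressing the step), and for sqrt{r_i} well above sqrt{epsilon} it saturates at 1 (no effect).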
In the limit where epsilon -> 0 we squash the outputs of the sigmoid up near 1, and so this factor disappears. But as we take epsilon -> 1 we see something like the sigmoid S(sqrt{r_i}), which means that for large stable gradients we get updates of size lr and for small stable gradients we get small updates. The fact that in our RL algorithms we start with epsilon = 0.1 means that we are starting in this regime, with an effective sigmoid on the running average of the gradient norm. Decreasing epsilon from this value (as PBT sometimes does) has the effect of returning us to the original RMSprop idea, where this new factor is basically eliminated, as it is always approximately one, and we get gradient steps that depend only on the sign of the gradient and not its magnitude.
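This interpolation can be seen numerically. In the stable case r_i = g_i^2, so the update is lr * sign(g_i) times the factor computed below (a sketch with my own naming):

```python
import numpy as np

def stable_factor(g, eps):
    # In the stable case r = g^2, the update is lr * sign(g) * |g|/sqrt(g^2 + eps)
    return abs(g) / np.sqrt(g ** 2 + eps)

# epsilon -> 0: the factor is ~1 at every gradient scale (pure sign steps)
print(stable_factor(1e-3, 1e-10), stable_factor(1.0, 1e-10))  # ~1.0, ~1.0

# epsilon = 0.1: small stable gradients are smoothly suppressed
print(stable_factor(1e-3, 0.1), stable_factor(1.0, 0.1))  # ~0.003, ~0.95
```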
CONCLUSION: RMSprop with varying epsilon interpolates between ordinary RMSprop as epsilon -> 0 and RMSprop with smoothed gradient clipping as epsilon -> 1.