When using policy gradients (PG) with discrete actions, a softmax distribution is typically used to represent the output distribution. This document describes how the softmax+PG combination can be problematic, even with perfect sampling and a linear model.
Suppose we have a multi-armed bandit problem. We can define our agent as a vector `v`, where actions are distributed according to `softmax(v)`. If we know beforehand the mean reward vector `r` (one reward per arm), then the expected reward is `softmax(v) * r`. It turns out that, if we initialize `v` randomly, doing gradient descent on this objective can converge to a suboptimal arm, despite the use of the exact gradient.
I will not prove the above claim here, but this repository demonstrates it empirically. EDIT: this claim may actually be false. A suboptimal asymptote may appear to be reached, but it may be escaped after an infinite amount of training. The general phenomenon is that, if the output for a suboptimal arm is higher than the output for the optimal arm, then the suboptimal arm can have a larger gradient, and the process perpetuates itself ad infinitum.
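As a minimal sketch of the plateau behavior (the reward vector, initialization, and step count below are illustrative choices, not taken from the repository), we can run exact gradient ascent on the expected reward of a two-armed bandit. Starting from logits that favor the worse arm, the policy is still concentrated on the suboptimal arm after a thousand exact-gradient steps, consistent with the edit above: the plateau is eventually escaped, but only very slowly.

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

# Mean rewards per arm: arm 0 is optimal (illustrative values).
r = np.array([1.0, 0.5])

# A bad init that heavily favors the suboptimal arm (hypothetical).
v = np.array([-3.0, 3.0])
lr = 0.1

for _ in range(1000):
    p = softmax(v)
    J = p @ r            # expected reward under the current policy
    grad = p * (r - J)   # exact gradient of J with respect to v
    v += lr * grad       # exact gradient ascent step

p = softmax(v)
print(p)  # still concentrated on the suboptimal arm 1
```

The gradient formula follows from `dJ/dv_k = p_k * (r_k - J)` for a softmax policy, which can be checked by differentiating `J = sum_i softmax(v)_i * r_i` directly.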
The above isn't a problem if we initialize our policy parameters `v` to zero. However, without perfect sampling, the parameters may be changed randomly at the beginning of training such that the same issue arises. Also, zero initialization can lead to symmetry-breaking problems for multi-discrete action spaces or POMDPs. Symmetry breaking is crucial when using gradient-based optimization--it's why we use random inits, after all.
This is relevant for softmax+PG because we typically initialize the output layer to 0. Suppose we have a bandit with action space `(Discrete<2>, Discrete<2>)`. For this problem, let's define the reward as `action[0] ^ action[1]` (XOR). We parameterize our policy as a vector `v` of size 4, and actions are distributed as `(softmax(v[:2]), softmax(v[2:]))`. If we initialize `v` as the zero vector, as is common in RL, then the agent will never solve the environment because it cannot break symmetry.
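To make the symmetry problem concrete, here is a sketch (function names are my own, not from the repository) that computes the exact policy gradient for this XOR bandit by enumerating all four joint actions. At the zero init, the gradient is exactly zero, so gradient ascent can never move away from the uniform policy.

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

def exact_pg(v):
    """Exact policy gradient of expected reward for the XOR bandit
    with action space (Discrete<2>, Discrete<2>)."""
    p0, p1 = softmax(v[:2]), softmax(v[2:])
    grad = np.zeros(4)
    for a in (0, 1):
        for b in (0, 1):
            prob = p0[a] * p1[b]   # joint probability of (a, b)
            reward = a ^ b         # XOR reward
            # grad of log P(a, b): one-hot minus probabilities,
            # stacked for the two independent softmax heads
            glog = np.concatenate([np.eye(2)[a] - p0,
                                   np.eye(2)[b] - p1])
            grad += prob * reward * glog
    return grad

v = np.zeros(4)     # zero init, common for RL output layers
print(exact_pg(v))  # all zeros: the uniform policy is a stationary point
```

Any asymmetric perturbation of `v` produces a nonzero gradient, which is why random initialization (or sampling noise) lets the agent escape.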
The above problem can be solved by randomly initializing `v` or by using imperfect sampling. However, these techniques can be problematic because of the suboptimal asymptote problem.
Symmetry breaking can also be a problem in partially observable MDPs, even with a single discrete action space. Suppose there are two actions and an episode lasts 10 timesteps. In this environment, action 0 subtracts 1 from a counter, and action 1 adds 1 to it. At the end of the episode, the reward is the absolute value of the counter. To maximize reward, the agent needs to pick one action and repeat it for the entire episode. This cannot be learned from a zero init with a softmax policy and perfect sampling.
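This can be checked exactly without any sampling. Since the agent observes nothing, a memoryless softmax policy uses the same distribution at every step, so the number of `+1` actions is binomial and both the expected reward and its exact gradient can be enumerated. The sketch below (helper names are my own) shows the gradient is exactly zero at the zero init, even though the uniform policy is far from optimal (always repeating one action earns reward 10).

```python
import numpy as np
from math import comb

T = 10  # episode length

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

def expected_reward_and_grad(v):
    """Exact expected reward E|counter| and its exact policy gradient
    for a memoryless softmax policy over {-1, +1} steps."""
    p = softmax(v)  # p[1] = P(+1 action), p[0] = P(-1 action)
    J = 0.0
    grad = np.zeros(2)
    for k in range(T + 1):  # k = number of +1 actions taken
        prob = comb(T, k) * p[1] ** k * p[0] ** (T - k)
        reward = abs(2 * k - T)  # |counter| at episode end
        # score function summed over the T steps
        glog = np.array([(T - k) - T * p[0], k - T * p[1]])
        J += prob * reward
        grad += prob * reward * glog
    return J, grad

J, g = expected_reward_and_grad(np.zeros(2))
print(J, g)  # gradient is exactly zero at the zero init
```

By the symmetry `k <-> T - k`, every term's gradient contribution is cancelled by its mirror image, so the zero init is a stationary point that perfect-sampling gradient ascent can never leave.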