
@unixpickle
Last active November 10, 2017 02:23
Annoying properties of Softmax + PG

Intro

When using policy gradients (PG) with discrete actions, the distribution over actions is typically represented with a softmax. This document describes how the softmax+PG combination can be problematic, even with perfect sampling and a linear model.

Suboptimal asymptotes

Suppose we have a multi-armed bandit problem. We can define our agent as a vector v, where actions are distributed according to softmax(v). If we know beforehand the mean reward vector r (one reward per arm), then the expected reward is the dot product softmax(v) · r. It turns out that, if we initialize v randomly, gradient ascent on this objective can converge to a suboptimal arm, despite using the exact gradient.
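As a concrete illustration, here is a minimal sketch (not the repository mentioned below) that runs exact-gradient ascent on a 3-arm bandit. The reward values, the initialization (chosen to favor a suboptimal arm), the learning rate, and the step count are all illustrative choices of mine:

```python
# Minimal sketch: exact-gradient ascent on a 3-arm softmax bandit whose
# initialization happens to favor a suboptimal arm. All numbers are illustrative.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

r = np.array([1.0, 0.9, 0.0])   # arm 0 is optimal
v = np.array([0.0, 3.0, 0.0])   # init favors the suboptimal arm 1

print("initial probabilities:", softmax(v))
for _ in range(1000):
    p = softmax(v)
    expected_reward = p @ r
    # Exact gradient of the expected reward: dE/dv_j = p_j * (r_j - E).
    grad = p * (r - expected_reward)
    v += 0.1 * grad
print("final probabilities:  ", softmax(v))
# Despite using the exact gradient, probability mass moves further onto the
# suboptimal arm 1 rather than toward arm 0. (It can eventually escape, but
# only after far more training; see the EDIT below.)
```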

I will not prove the above claim here, but this repository empirically demonstrates it. EDIT: this claim may actually be false. A suboptimal asymptote may appear to be reached, but it may be escaped after an infinite amount of training. The general phenomenon is that, if the output for a suboptimal arm is higher than the output for the optimal arm, then the suboptimal arm can have a larger gradient, and the process perpetuates itself ad infinitum.
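Since E = softmax(v) · r, the exact gradient is ∂E/∂v_j = p_j (r_j − E) with p = softmax(v). A quick numeric check with made-up numbers (not taken from the repository) shows the dominant suboptimal arm receiving the larger positive gradient:

```python
# Illustrative check of dE/dv_j = p_j * (r_j - E) with made-up numbers.
import numpy as np

r = np.array([1.0, 0.9, 0.0])      # arm 0 is optimal
p = np.array([0.05, 0.90, 0.05])   # but arm 1 currently has most of the mass
expected = p @ r                   # 0.86
print(p * (r - expected))          # [ 0.007  0.036 -0.043]
# Arm 1's logit receives a gradient roughly 5x larger than arm 0's,
# so the gap between the two logits keeps widening.
```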

The above isn't a problem if we initialize our policy parameters v to zero. However, without perfect sampling, sampling noise can push the parameters in a random direction early in training, after which the same issue arises. Also, zero initialization can cause symmetry-breaking problems for multi-discrete action spaces and POMDPs.

Symmetry breaking

Symmetry breaking is crucial when using gradient-based optimization; it's why we use random initializations, after all.

This is relevant for softmax+PG because we typically initialize the output layer to 0. Suppose we have a bandit with action space (Discrete<2>, Discrete<2>). For this problem, let's define the reward as action[0] ^ action[1] (XOR). We parameterize our policy as a vector v of size 4, and the two actions are sampled from softmax(v[:2]) and softmax(v[2:]) respectively. If we initialize v as the zero vector, as is common in RL, then the agent will never solve the environment because it cannot break symmetry.
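Here is a minimal sketch of why the zero init fails here: the exact expected reward has a gradient of exactly zero at v = 0, so exact-gradient training never moves the parameters. The finite-difference check and variable names below are mine:

```python
# Minimal sketch: for the XOR bandit with a factorized softmax policy,
# the exact gradient at the zero init is identically zero.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def expected_reward(v):
    p, q = softmax(v[:2]), softmax(v[2:])
    return sum(p[a] * q[b] * (a ^ b) for a in range(2) for b in range(2))

v = np.zeros(4)
eps = 1e-5
# Finite-difference gradient of the exact expected reward at the zero init.
grad = np.array([
    (expected_reward(v + eps * np.eye(4)[i]) -
     expected_reward(v - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
print(expected_reward(v), grad)  # 0.5, and a gradient that is (numerically) zero
```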

The above problem can be solved by randomly initializing v or by using imperfect sampling. However, both techniques reintroduce the suboptimal-asymptote issue described above.

Symmetry breaking can also be a problem in partially observable MDPs, even with a single discrete action space. Suppose there are two actions and an episode lasts 10 timesteps. Action 0 subtracts 1 from a counter, and action 1 adds 1 to it. At the end of the episode, the reward is the absolute value of the counter. To do well, the agent must commit to one action and repeat it on every timestep, which a softmax policy with a zero init and perfect sampling can never learn to do.
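A minimal sketch, assuming a memoryless policy (it has no useful observation to condition on) parameterized by the difference between its two softmax logits; the environment follows the description above, and the code itself is mine:

```python
# Minimal sketch of the counter POMDP: under a memoryless softmax policy with a
# zero init, the exact gradient is zero, so it never learns to repeat one action.
from math import comb, exp

T = 10  # episode length from the text

def expected_return(logit_diff):
    p = 1.0 / (1.0 + exp(-logit_diff))  # probability of action 1 (add 1)
    # The counter ends at 2k - T where k ~ Binomial(T, p); the reward is |2k - T|.
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) * abs(2 * k - T)
               for k in range(T + 1))

eps = 1e-5
grad = (expected_return(eps) - expected_return(-eps)) / (2 * eps)
print(expected_return(0.0), grad)  # ~2.46 expected reward, gradient ~0 at zero init
```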
