Summary and Review of Universal Policies for Software-Defined MDPs

This is a summary and my thoughts about Universal Policies for Software-Defined MDPs. For me, this was a difficult paper to understand, so I hope this helps others. However, it is also possible I still don’t understand it and I’m totally missing the point.

This paper is about an abstraction, and like many abstractions (I’m looking at you, “monad”), it is on the one hand simple and on the other hand profound. Also, like many abstractions (still looking at you, “monad”), it is quite hard to wrap your head around at first.

The simplicity

The simple idea in this paper is a prototype toy programming language, Dodona, which is basically a version of Lisp/Scheme with an extra choose function. (I don’t feel like writing Scheme code, so my examples will be in a generic functional pseudo-code.) An expression like

let n := choose [1, 2, 3] in 
(n + 1)

basically means non-deterministically choose a number n from the list [1, 2, 3] and return n + 1. Unlike, say, Python’s random.choice function, choose [1, 2, 3] does not necessarily mean choose uniformly at random. Instead, the language deliberately leaves the meaning of choose unspecified, saying only that it is possibly non-deterministic. Here are a few possible interpretations (a code sketch of them follows this list):

  • Choose uniformly at random.
  • Perform all the choices. A choice point is a branching point in code execution; either in parallel or sequentially, execute the program with every possible choice. Since there is likely more than one choice point in the code, this becomes a tree that can be traversed in any order (e.g. DFS).
  • Have a human make the choice. It may seem silly to farm out decisions to a human in the middle of the execution of computer code (except for debugging), but it is at the very least a good way to wrap your head around how choose works.
  • Have an AI (likely neural-network-based) agent make the choice. This is one of the main interpretations in the paper. However, even in this case there are many scenarios. In one scenario, the agent could be specifically trained to make this very choice. In another scenario, the language compiler/interpreter would come with a default universal AI choice function that is good at making choices across arbitrary programs. If this seems too good to be true, read on…
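
To make these interpretations concrete, here is a minimal Python sketch (my own illustration, not the paper’s Scheme implementation; the names Fail, choose_random, and run_all are mine). A program takes a choose function as a parameter, and swapping in different implementations of choose gives the different interpretations above; a human-driven or AI-driven choose would plug in the same way.

import random

class Fail(Exception):
    """Raised by choose([]) -- the fail case."""

def choose_random(options):
    """Interpretation 1: pick uniformly at random."""
    if not options:
        raise Fail()
    return random.choice(options)

def run_all(program):
    """Interpretation 2: depth-first enumeration of every branch.

    program takes a choose function as its argument; each branch either
    returns a value or raises Fail.  We enumerate by re-running the program
    from the start with a longer and longer prefix of forced choices.
    """
    results = []

    def run(prefix):
        pos = 0

        def choose(options):
            nonlocal pos
            i, pos = pos, pos + 1
            if i < len(prefix):            # replay the choices made so far
                return options[prefix[i]]
            for j in range(len(options)):  # branch on each fresh option
                run(prefix + [j])
            raise Fail()                   # abandon this run; its branches were explored above

        try:
            results.append(program(choose))
        except Fail:
            pass

    run([])
    return results

# The example from above: choose n from [1, 2, 3] and return n + 1.
def example(choose):
    n = choose([1, 2, 3])
    return n + 1

print(example(choose_random))  # one of 2, 3, 4
print(run_all(example))        # [2, 3, 4]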

The major question looming over this paper is what sort of applications one would want this abstraction for. I’ll defer giving good examples for a while, but let’s talk about a silly example:

def f(x) := 
  let n := choose [1, 2, 3] in 
  if n != x then n else fail

Here fail is shorthand for choose [], i.e. make an impossible choice (since there are no options to choose from). So either my function finds and returns some n different from x or it fails. One can start to notice a few things:

  • To make a non-failing choice, the agent obviously needs to know the value of x.
  • To make a non-failing choice, the agent has to have some idea of the goal (in this case, finding a number different from x). This can be learned either through trial-and-error (the usual way for such AI agents to learn) or by reading the source code. 💡 Now we are starting to see the profoundness of this abstraction…

The Dodona interpreter is implemented so that when it reaches a choose in the code, or what the paper calls a choice point, it creates a data structure, a continuation, representing the current execution stack of the program as well as all the code yet to be executed. This can be fed to an artificial intelligence (or examined by a human) to make a choice, which is then fed back to the interpreter so it can continue executing the program until the next choice point.
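
To make the choice-point mechanism a little more tangible, here is a rough Python analogue (again my own sketch, not the paper’s implementation): the program is written as a generator that suspends at each choice point, yields the available options, and is resumed with whatever choice an external driver (random, human, or learned policy) makes. A Python generator is only a crude stand-in for Dodona’s continuations, since it cannot be copied or have its remaining code inspected, but it shows the suspend-and-resume control flow.

def f(x):
    # The f(x) example from above, written as a generator.
    n = yield [1, 2, 3]        # choice point: suspend, hand the options out
    if n != x:
        return n
    yield []                   # fail: there is nothing to choose from

def drive(program, pick):
    # Run a generator-based program, letting pick(options) make each choice.
    try:
        options = next(program)                    # run to the first choice point
        while True:
            if not options:
                return None                        # fail
            options = program.send(pick(options))  # resume with the chosen value
    except StopIteration as stop:
        return stop.value                          # the program finished

print(drive(f(2), lambda opts: opts[0]))  # picks 1, and 1 != 2, so returns 1
print(drive(f(1), lambda opts: opts[0]))  # picks 1 == x, so fails -> None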

The first profoundness: Markov Decision Problems

The academic theory of reinforcement learning (RL), i.e. learning through experience, is based on the concept of a Markov Decision Problem (MDP), which is made up of three things:

  • States.
  • Actions. These let you move from one state to another (possibly stochastically).
  • Rewards. You earn rewards for performing certain actions at particular states.

The goal in an MDP is to find a (near) optimal policy which, for any given state, chooses the action that maximizes future rewards.

For example, you can think of solving a Rubik’s cube. Each state is a configuration of the cube, and the actions are the 12 rotations. You get a reward when you solve the cube. This paper focuses on the special case of a deterministic MDP where there is only a reward at the end for solving the problem (and zero reward for failing at any point). So in essence we are considering a search problem.

It isn’t hard to see that if such an MDP is computable in some sense, then it can be encoded in the Dodona programming language. Each choose [a, b, c] expression specifies a state where we can take an action, chosen from [a, b, c]. Conversely, any policy can be used to implement the choose function.

For example, consider a fairly simple MDP where we start at the point (x, y) = (0, 0). A move action moves us to an adjacent grid cell (or keeps us in the same cell). After each move action we perform a color action to color the cell. The goal is to color the cells as a checkerboard, so the chosen color must have the same parity as x + y, or else we fail. If we survive four move-and-colors then we win.

let (x0, y0) := (0, 0) in
let x1 := x0 + (choose [-1, 0, 1]) in
let y1 := y0 + (choose [-1, 0, 1]) in
let color1 := choose [0, 1] in
if (x1 + y1) mod 2 != color1 then fail else
...
if (x4 + y4) mod 2 != color4 then fail else (x4, y4)

Notice that we broke the actions into three actions per turn: choose the new x, the new y, and the color. This ability to pick from different lists of actions at different choice points is a distinct advantage of this setup. Other common reinforcement learning environments assume the set of actions is the same across all states. Moreover, the language is composable, so it would be easy, say, to make a choose_pt_and_color function which returns a tuple (x, y, color), as in the sketch below.
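
Under the same choose-as-a-parameter convention as the Python sketches above (again, my own illustration; only the helper name comes from the paragraph above), such a composed helper might look like this:

def choose_pt_and_color(choose, x, y):
    # One turn of the move-and-color MDP: pick dx, dy, and a color.
    x = x + choose([-1, 0, 1])
    y = y + choose([-1, 0, 1])
    color = choose([0, 1])
    if (x + y) % 2 != color:
        choose([])             # fail: the color breaks the checkerboard pattern
    return x, y, color

def move_and_color_mdp(choose):
    x, y = 0, 0
    for _ in range(4):         # survive four move-and-color turns to win
        x, y, _ = choose_pt_and_color(choose, x, y)
    return x, y

Feeding move_and_color_mdp to the run_all driver from the earlier sketch would enumerate every winning trajectory; feeding it to a learned choose is the RL setting the paper has in mind.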

The second profoundness: Zero-shot task solving through specification-based RL

There are classically two types of reinforcement learning:

  • Model-free RL. The only knowledge of the environment the agent gains is through direct interaction with the environment (usually by starting at some initial state and performing a sequence of actions using the agent’s current policy until it hits a timeout or end condition). This is the setting of robotics and OpenAI’s Gym (including the well-known Atari learning environments). Common model-free RL algorithms include deep Q-learning and actor-critic.
  • Model-based RL. The RL agent has direct access to a model (i.e. a simulator) for the environment. The agent can perform hypothetical what-if actions before making a true action. This lets the agent perform a neural-guided tree search over the next few actions to find the best action. (In robotics, this is called planning.) This can speed up learning and improve the policy. This is the setting of, say, theorem proving and board games like Go and Chess. MCTS is one of the most well-known model-based RL algorithms.

Now we have a third option:

  • Specification-based RL. (This is my term. In the paper, they use the term “software-defined MDP”.) The agent has direct access to the code for the environment, which in turn provides a specification of the action dynamics and rewards. Contrast this to both model-free and model-based RL, where the agent has to perform an action (or simulated action in model-based RL) to know if it will get a reward. Here the agent can see ahead by just reading the code which specifies how it will get rewards.

For example, consider the move-and-color MDP I described above. In model-free (resp. model-based) RL, the agent would have to experiment (resp. search) to get the lay of the land, whereas an agent with access to the code can just know which color to choose without any experimentation. This is known as zero-shot task solving.
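
To pin down what “direct access to the code” buys the agent, here is a hedged sketch of the interface a specification-based policy might have (all names here are mine, not the paper’s): besides the options at the current choice point, the policy sees the continuation, i.e. the code still to be run and the current variable bindings, so in principle it can read off which choices lead to fail before trying anything.

from typing import Any, Callable, List

# options:      the values offered at this choice point
# continuation: some serialization of the code yet to be executed
# bindings:     the current values of in-scope variables
Policy = Callable[[List[Any], str, dict], Any]

def spec_policy(options: List[Any], continuation: str, bindings: dict) -> Any:
    # Placeholder body: a real universal policy would be a learned model that
    # scores each option by reading the continuation and bindings, e.g.
    #   scores = [model(continuation, bindings, o) for o in options]
    # Here we only illustrate the signature.
    return options[0]

A model-free policy, by contrast, would only see some opaque observation of the current state plus whatever rewards it has already experienced.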

The third profoundness: A universal policy

Recall, a method of choosing actions in an MDP is called a policy. A good policy can be learned through RL, but this takes both time and computing power. Also, it is specific to the MDP at hand and won’t generalize. Wouldn’t it be great if we could have a universal policy which can just read the MDP specification and spit out an optimal action for a given state? While it may sound too good to be true (and the authors agree it wouldn’t always work), there is good precedent for believing it is possible in certain “natural” situations.

But let’s back up and discuss universality. The two hottest types of neural networks today are Convolutional Neural Networks (ConvNets) for vision and Transformers for text and sequence information. They are each universal in three distinct ways. The authors argue it is reasonable to expect that we could have a third type of network which is universal for specification-based RL in the same three ways.

Problem domain universality

In the old days of image recognition one had circle detectors, line detectors, mouth detectors, nose detectors, etc. Now, if one wants to, say, screen X-rays for cancer or classify species of trees, one just goes to TensorFlow or PyTorch, loads up a ConvNet model, and trains it on a dataset. The ConvNet architecture is domain agnostic and works for almost any image data. It is the same with Transformers. Most language modeling or sequence-to-sequence tasks can be solved by training a Transformer if one has enough data and compute. Again, the steps to train it are basically the same whether you are modeling French or protein sequences.

Similarly, the Dodona programming language provides a universal way to train an MDP policy. It is still too early to determine the right neural network architecture for this application, but it would probably be some advanced graph neural network with attention, taking as input information about the execution of the program (including the variable values) as well as the program code.

Universal pretraining and meta-learning

It is well-known that a ConvNet trained on the ImageNet dataset not only does really well on ImageNet, but, if fine-tuned, can do well on a totally different domain, like X-rays. The pretraining helps the network learn universal image tasks, like detecting lines and circles, in the early layers. Then one just needs to retrain the last few layers to do well on the domain in question. A recent paper has shown the same is possible with Transformers: a Transformer pretrained on English text can quickly learn a completely different task, like image recognition, by retraining just a few weights.

It seems very reasonable to me that the Dodona networks could quickly learn new MDPs with only a minimal amount of RL or search. Even in cases where the network doesn’t have good access to the full specification (because say the code specifying the MDP is obfuscated or because it calls some external simulator or robot for the result of performing each action), it may well still be trainable via RL much faster than a network trained from scratch.

This is also related to the concept of meta-learning (learning to learn). The idea is that when humans encounter a new situation, we can quickly adapt because of all the prior experience we have. (This is in stark contrast to, say, an AI trained on a 19x19 Go board, which struggles to adapt to a 13x13 board or to a small variation in the rules.) There are many strategies to implement meta-learning, but in Dodona, since the specification is given, the network can (theoretically) adapt quickly to new situations.

Universal (zero-shot) model

The OpenAI GPT-3 Transformer (trained on a diverse assortment of publicly available text) has shown itself to be capable of few-shot task performance. Give it the prompt “Name the capital city of the country: France => Paris, England =>” and it will fill in “London”. (And this is just the tip of the iceberg.) More recently, CLIP (also from OpenAI, trained on a diverse array of image and text data) has shown zero-shot task performance in image classification.

Trained humans are reasonably good at reading code (and specs and rules), and it makes sense that there could be a universal AI which is good at solving MDPs zero-shot given only the code describing the MDP. Indeed, the experiments in the paper (are trying to) suggest that this is the case (but I think the evidence is still lacking).

Experiments

I consider this an idea paper, and a good one, but the truth is that idea papers have a hard time getting accepted without experiments. As for the experiments, I have to admit that I think this is the part of the paper most lacking, both in design and presentation. It is however possible I'm just being too picky.

Even though the paper is about MDPs, all the experiments are supervised learning tasks. All of the tasks are of the form “predict the output of a function f on a given input”. More specifically, here is an example test case (or what I think one looks like):

if choose_output == (list_remove_first_odd [1, 3, 4]) then
  true
else
  fail

where choose_output uses the choose function to construct a list of digits using the usual cons and nil construction, something like

def choose_output :=
  if choose [true, false] then
    []
  else
    (choose [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) :: choose_output

and list_remove_first_odd is a function removing the first odd digit from the list.
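
Rendered in Python under the same choose-as-a-parameter convention as my earlier sketches (hedged: this is my reconstruction of the author’s pseudo-code above, not code from the paper), such a test case looks roughly like the following, where the agent’s choices build a digit list that must match the true output of the task function:

def list_remove_first_odd(xs):
    # The task function: drop the first odd digit from the list.
    for i, v in enumerate(xs):
        if v % 2 == 1:
            return xs[:i] + xs[i + 1:]
    return xs

def test_case(choose, xs):
    out = []
    while not choose([True, False]):         # True = stop extending the list
        out.append(choose(list(range(10))))  # pick the next digit
    if out != list_remove_first_odd(xs):
        choose([])                           # wrong output: fail
    return True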

For each task, e.g. list_remove_first_odd, they plug in 500 random inputs, e.g. [1, 3, 4], and then they use that task for either testing or training.

There are some important details to consider:

  • Each task, e.g. list_remove_first_odd, is used for either testing or training, but not both. However, it is possible that a related task, like list_remove_first_even, could be in the other.
  • The reported score seems to be the difference in cross entropy between a random baseline model and their trained model. I think the scores are averages over every choice, not over completion of the whole path. It definitely shows when the model is and isn’t learning something, but it also obfuscates the performance of the model in practice. For example, it is not clear to me that the model can fully solve any of the list tasks, even if its cross entropy is better than a random guess, because it only takes one wrong choice to give a wrong answer.

I wouldn’t have such an issue with the testing approach or the way the results were presented if it weren’t for the claim of “zero-shot” learning. Sure, the results show some potential for greater generalization and meta-learning, but zero-shot learning is one of those magical terms in AI research, and I personally think a lot more evidence is needed to claim the prize. (For example, I think Lean and Metamath GPT-f show similar, if not better, ability to generalize to unseen theorem-proving tasks, but I wouldn’t call that a “zero-shot setting”.)

I know that testing is expensive and hard (and reviewers are never satisfied), but if this is further developed and, say, put in Lean, I’d love to see the following tests:

  • Pick tasks that actually require search and/or RL to solve, e.g. small grid environments or small puzzles like the Tower of Hanoi. Show that having a model pre-trained on a variety of tasks allows for solving new environments either with no additional training or search, with some search (but faster than from scratch), or with RL (but much faster than solving the environment from scratch). With such experiments this could be a really high-quality meta-learning paper.
  • You could also show faster generalization to related tasks. For example, train a network from scratch to solve the Tower of Hanoi for 3 discs and retrain it to generalize to 5. Now do the same, but using Dodona’s continuation input, to show much faster generalization (or zero-shot learning).
  • If you still focus on computation tasks, I would want to know that the extra overhead was worth it. For example, does your model do better at finding the value of list_remove_first_odd [1, 3, 4] than a similar transformer that was directly trained to predict the output of a function? I honestly don’t know if breaking up the list computation into lots of little steps, each having access to the continuation, is better than doing it all at once in a transformer. (Even if it is better, I imagine it is much slower.)

Application to Lean 4

It’s no secret that Daniel has theorem proving in mind, particularly solving the IMO Grand Challenge in Lean 4. One obvious application of something like Dodona in Lean is symbolic rewriting, which is probably one of the most important tools in theorem proving. Moreover, many challenging puzzles (e.g. the Rubik’s Cube) can be considered rewriting problems. (I’m not saying one needs to solve the Rubik’s cube in Lean, but it provides some insight, I think.)

Rewriting already brings up a number of challenges to implementing Dodona in practice:

  • The list of available lemmas is huge (and even larger if one considers that one might have to specialize a lemma before applying it as a rewrite rule).
  • Would one train a rewriting agent through RL or human data?
  • Would RL work for Dodona (see Nature’s Rubik’s Cube AI article)? I worry that, with the large networks needed to support the policy, the usual RL methods would be a challenge to implement.
  • Would there be a universal policy or a separate one for each use case?
  • The rewriting tactics in Lean probably have a lot of code. Would all that code get passed to the neural network as part of the continuation?
  • For some rewriting problems, including standard domains like rings as well as special domains like the Rubik’s cube, the best solutions are algorithms. This is exactly the use case for program synthesis, which I think will get more popular in AI. Currently, RL algorithms optimize for the shortest path, not the path with the lowest description complexity. (See again the Nature Rubik’s cube AI, where the AI finds short paths but takes a lot of search to get there.) Projects like DreamCoder seem to be bridging that gap by taking a minimal DSL (like the moves of a Rubik’s cube) and learning reusable subroutines. If one thinks of a DSL as an MDP, then DreamCoder is constantly modifying the MDP by adding in new actions which are combinations of primitive actions. I wonder whether program synthesis could go well with Dodona or not.
  • On a related note, in Dodona the neural network agent makes a series of choices. It can see past choices (as in an autoregressive model), but it can’t see its past hidden states. A recent paper showed that transformers modified to see past hidden states (similar to an RNN) do better on the sorts of tasks, like computing the output of a long program, that Dodona would need to handle.

Final Thoughts

I know the paper tries to contrast Dodona as an alternative to using something like GPT-3 directly in code, but the more I think about it, the more I think we really need general-purpose GPT-style models to create Dodona’s universal policy. It seems like way too much work to put all the interesting math problems into Dodona’s continuation framework. Moreover, even if one does that, one still needs to design the network underlying Dodona’s universal policy, which the paper suggests is not at all fleshed out. Instead, I think if this universal policy ever gets off the ground, it needs to be built on top of a more flexible GPT-like model which supports

  1. Huge amounts of diverse human data, covering both natural and formal topics, in multiple modalities.
  2. Extremely flexible input/prompt modes and formatting.
  3. Refining one sub-aspect of the network through both data addition and reinforcement learning.

GPT-3 has shown that aspect (2) would probably arise out of aspect (1). Aspect (3) is a challenge, but I also can’t imagine building Dodona’s universal policy without something like it. Aspects (2) and (3) would also allow Dodona to be built on top of such a pretrained network. Right now, I think transformers are the closest we have to this, even though they are not close to perfect.

In short, I don’t think one should just work on building Dodona’s policy from scratch (as suggested in this paper and in Daniel’s talks), but instead work on gathering a large dataset of formal mathematics, which can include natural language, formal math, and other sources, in addition to Dodona MDP continuation data when that becomes available. (As a thought experiment, I bet that if the current Dodona experiments weren’t trained from scratch on a custom GNN/Transformer architecture, but were instead trained on a comparable natural-language transformer already pre-trained on large amounts of English text and computer code, then the results of the experiments would be much better. While it is probably not reasonable to expect that the authors would have had access to such a model when writing their paper, I think it is reasonable to build one on the way to building a practical universal MDP policy for Dodona.)
