Mario asked what sort of interface an interactive theorem prover (ITP) needs for machine learning (ML) research. I’ll try to give a detailed system design, but also leave a lot of flexibility to account for different research directions.
Proving a theorem in a theorem prover is solving a puzzle. Like a Rubik’s cube, there is a current state, a desired state, and some actions that the user can perform to move to another state. The input to an ML model is the current state, and the output is an action to apply to that state.
In theorem proving, the current state is made up of two things:
- An environment. This encompasses everything which has come before the theorem. Most importantly, it encompasses all the previous lemmas and definitions in the library. But more generally the environment can also include notation, tactic settings (like `simp` lemmas in Lean), namespace information, and the proofs of previous lemmas. This is a lot of information, and to the extent it is passed to the ML agent, it has to be done in an efficient way.
- The current state of the proof. For a tactic proof, this is a list of sequents, each made up of a local context and a goal. Other systems, like tableau proofs, have their own ways to represent the intermediate state of a proof. (While it is not necessary, one could also include information about what came before in the proof, such as the partial proof term/tree so far, or the current partial list of tactics. This is not common but could be useful.)
The goal is always the same in all theorem provers: finish the proof. Usually this involves closing all open sequents.
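To make this concrete, here is a minimal sketch (in Python, with hypothetical field and type names) of what these two pieces of state might look like on the wire; a real interface would carry more environment information than this.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sequent:
    """One open goal: a local context of hypotheses plus a target."""
    local_context: List[str]   # e.g. ["x y : nat", "h : x < y"]
    goal: str                  # e.g. "y > x"

@dataclass
class Environment:
    """Everything that comes before the theorem being proved."""
    premises: List[str] = field(default_factory=list)         # previous lemmas and definitions
    open_namespaces: List[str] = field(default_factory=list)  # notation/namespace information
    simp_lemmas: List[str] = field(default_factory=list)      # tactic settings, if relevant

@dataclass
class ProofState:
    """The current state of the proof: the list of open sequents."""
    goals: List[Sequent]

    @property
    def solved(self) -> bool:
        # the proof is finished when there are no open sequents left
        return not self.goals
```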
The actions in an ITP are more complicated than the 12 actions of a Rubik’s cube. An action, or proof step, is some instruction to modify the proof state. Here are some examples:
- In tactic theorem proving, a proof step is a tactic.
- In simple logic systems, like propositional logic, a proof step could be an inference rule in some calculus.
- In Metamath, a proof step is a theorem from the environment along with a list of substitutions for the variables in the theorem.
- For connection tableau provers (as in the xCoP family of provers), a proof step is a clause from a theorem in the library or from the goal.
- For Isar proofs, a proof step is an intermediate goal (possibly along with its justification).
- In hammer systems (which are a bit different from the others since there is no tree search), a proof step is just a list of premises from the environment to send to the automatic theorem prover.
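Because proof steps look so different across systems, one flexible option is to pass them over the interface as a small tagged record. The variants below are purely illustrative and only cover a few of the cases above.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class TacticStep:
    """A tactic given as text, e.g. 'rw [nat.add_comm]'."""
    tactic_text: str

@dataclass
class TheoremApplication:
    """Metamath-style step: a library theorem plus substitutions for its variables."""
    theorem_name: str
    substitutions: Dict[str, str]   # variable -> term (as text)

@dataclass
class PremiseList:
    """Hammer-style step: just the premises to hand to the automatic prover."""
    premise_names: List[str]

ProofStep = Union[TacticStep, TheoremApplication, PremiseList]
```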
There are a few things which make proof steps hard to synthesize.
- Diversity of proof steps. Even if one uses tactics, the number of tactics is quite large, and every tactic has a different set of parameters. Older systems created a detailed API for every tactic. New systems just synthesize the tactic as text (possibly using the same syntax as human proofs).
- Arbitrary math terms. For many proof systems, like Metamath, Isar, Lean, and Coq, it is important to be able to dream up arbitrary math terms. These are needed to instantiate theorems already in the library, such as `apply (lem p)`. Also, almost all theorem provers (save for some cut-free calculi) need term synthesis to prove existential statements. Luckily, with the advent of state-of-the-art language models, term synthesis is now a reality (but it does come with some costs I’ll outline below).
- Premise selection and dependence on the environment. Much of theorem proving is applying previously proved facts already in the library (environment) or assumptions in the local context. This is called premise selection. Many tactics, such as rewrite or simp tactics, are conduits for listing useful premises. For connection tableaux and hammers, a proof step is entirely premise selection. For Metamath proofs, premise selection is half the work (the other half is instantiating the variables in the theorem). One tricky thing about premise selection is that the list of available premises is different for each theorem, so there needs to be a guard when evaluating to make sure, say, one doesn’t use theorem X to prove theorem X, and moreover that the ML model has access to the premises it can use; a minimal guard is sketched after this list. (Also, definitions and notation are another form of premise selection. It doesn’t make sense to use integrals before they are defined in the library, but at the same time, this isn’t usually as large of a concern as theorem dependencies.)
- Tactic-specific parameters. In modern tactic libraries, every tactic has special parameters which control all sorts of little things. The ML model has to either learn how to control those parameters or ignore them.
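Here is a minimal sketch of the premise-selection guard mentioned above. It assumes the library can be viewed as a simple ordered list of theorem names; real libraries have a dependency graph, but the idea is the same.

```python
from typing import List, Set

def allowed_premises(theorem_name: str, library_order: List[str]) -> Set[str]:
    """Premises usable when proving `theorem_name`: everything strictly before it."""
    idx = library_order.index(theorem_name)
    return set(library_order[:idx])

def check_proof_step(theorem_name: str,
                     used_premises: List[str],
                     library_order: List[str]) -> bool:
    """Reject a proof step that uses the theorem itself or anything after it."""
    allowed = allowed_premises(theorem_name, library_order)
    return all(p in allowed for p in used_premises)

# Example: we cannot use `add_comm` (or anything later) while proving `add_comm`.
order = ["zero_add", "add_zero", "add_comm", "mul_comm"]
assert check_proof_step("add_comm", ["zero_add"], order)
assert not check_proof_step("add_comm", ["add_comm"], order)
assert not check_proof_step("add_comm", ["mul_comm"], order)
```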
I’ll talk more below about representing proof steps, but the biggest choice is whether to build your own machine-readable API or to use the same human-written syntax.
There are also some more subtleties. In most proof systems, proofs are not linear. In Lean one can put tactics inside other tactics, as well as chain tactics together, try a tactic on all the open goals, or focus on a single goal. One has to decide how, if at all, to support this. Usually in tactic systems one of two methodologies is used:
- Apply a proof step (tactic) to the first goal only. This effectively linearizes the proof state. This is what HOList does.
- Apply tactics to the whole goal state. If the tactic only acts on the first goal, fine. If it acts on all the goals (or swaps goals), then also fine. This is what Lean GPT-f does, and it is easier to extract data using this methodology.
I’m making the assumption here that the ML agent and the theorem prover are written as separate systems, possibly in different languages, and they communicate over some interface, like JSON.
It is not obvious if the ITP agent is the server or the ML agent is the server (or if they are both servers and there is a third client controlling everything). For most research purposes, it makes sense for the ITP agent to be a server. The one important exception to this is when you want the ITP (say via a tactic) to use the ML agent to suggest proof steps or whole proofs to a user (like the Lean `gptf` tactic). Then it acts more like a client.
Also, since there is a tree search going on, sometimes it is helpful to think of the ML agent as two parts: the pure machine learning agent, which takes in information about the proof state and environment and outputs the next proof action (discussed more in depth below), and a tree search agent, which decides when and how to backtrack.
Here is a communication workflow between a prover and an ML agent.
- A theorem is chosen to prove and that theorem comes with an environment.
- The ITP agent generates three types of data:
- Data about the environment (only needs to be sent once per theorem at most).
- A serialization or identifier for the current proof state (used for tree searching).
- Data about the current proof state (to be passed to the ML agent)
- The tree search part of the ML agent dictates which proof state to go to (either by identifier or serialization).
- The ITP agent loads that proof state.
- The ML agent specifies a proof step to apply.
- The ITP agent applies this proof step (if valid) and sends information about the next state.
- Loop until a solved state is reached (or until a timeout).
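As an illustration, here is one way the ML side of this loop could look, assuming the ITP runs as a newline-delimited JSON server and the ML agent exposes the (hypothetical) methods used below. The message shapes are made up for the sketch; they are not any existing protocol.

```python
import json
import socket

def prove_theorem(sock: socket.socket, theorem_name: str, ml_agent, max_steps: int = 100) -> bool:
    """Client-side sketch of the prover/ML-agent loop described above."""
    reader = sock.makefile("r")

    def call(msg: dict) -> dict:
        sock.sendall((json.dumps(msg) + "\n").encode())
        return json.loads(reader.readline())

    # 1. Load the theorem; the ITP replies with environment data and the initial state.
    reply = call({"cmd": "start", "theorem": theorem_name})
    ml_agent.load_environment(reply["environment"])   # only needs to be sent once per theorem
    frontier = [reply["state_id"]]                    # identifiers of states left to expand

    for _ in range(max_steps):
        if not frontier:
            return False                              # search space exhausted
        # 2. The tree-search part of the agent decides which proof state to visit.
        state_id = ml_agent.choose_state(frontier)
        state = call({"cmd": "load_state", "state_id": state_id})
        # 3. The ML part proposes a proof step for that state.
        step = ml_agent.suggest_step(state["goals"])
        # 4. The ITP applies the step (if valid) and describes the resulting state.
        result = call({"cmd": "apply", "state_id": state_id, "proof_step": step})
        if result.get("solved"):
            return True
        if result.get("ok"):
            frontier.append(result["state_id"])
    return False                                      # timeout
```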
The big question with tree searches is how to handle backtracking. Does the ML agent or the ITP agent decide when to backtrack and what state to visit next? Ideally the ML agent would be able to handle whatever sort of tree search it likes, e.g. DFS, BFS, or most likely some (machine learned) heuristic-guided search.
There are many ways to set up and pass this information especially if tree searching is involved and both agents need to keep track of visited states.
There are two extremes here:
- The ITP is a stateless server. After each proof step, the ITP agent hands the ML agent a full serialization of the current proof state. Then before each proof step, the ML agent sends both that serialization and the proof step to apply to that state. This makes sense for some simple logics where it is easy to serialize the middle of a proof just as a sequence of sequents. HOList uses this approach since an HOL-Light sequent is relatively simple. (Note, for any prover, one could just rerun the proof from the beginning until that point, but that likely isn’t performant.) One also still needs to consider the environment. It doesn’t make sense to serialize an entire environment. However, if the environment is just a list (or tree) of theorems and definitions, that could be stored separately and the agents could just pass between them an index for the current position in that environment. (This is useful for training and testing an ML agent on a fixed library. It is less useful for using that agent for proof guidance on new proofs.) For example, Lean has a debugging tactic which lets you recreate the environment to look exactly like the environment at the start of any declaration in the library. Overall, a stateless ITP server approach is great for parallelization, but it also comes with other performance issues (and maybe assumes a fixed, unchanging set of environments).
- The ML agent is a stateless server. Since, within a single proof, the environment won’t change and the proof state is likely a persistent data structure, it may make sense to have the prover keep track of all the states. Instead of serializing everything, let the prover hold onto the current environment and all the visited proof states. (For the ML agent to be truly stateless, the prover, or a third tree search agent, needs to control the tree search.) Overall, this approach is likely faster since less information is serialized. It also works well for tactic-based systems like GPT-f. (Lean GPT-f takes this approach.)
There are, of course, many middle grounds. The prover and the ML agent can both be stateful servers (or clients) which each keep track of the proof states visited. The prover stores the states along with an index for each state, while the ML agent keeps track of the tree search bookkeeping. The ML agent can tell the server which state to visit next by providing the index. This can be parallelized across proofs but not within a proof.
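A sketch of the prover-side bookkeeping for this middle ground might look as follows. The index is the only thing that ever crosses the interface; the names here are illustrative, and it assumes the proof-state object can report whether it is solved (as in the earlier sketch).

```python
class ProofStateStore:
    """Prover-side store: keep every visited proof state and hand out integer indices,
    so the ML agent only ever refers to states by index."""

    def __init__(self, initial_state):
        self._states = [initial_state]        # index 0 is the initial state

    def get(self, index: int):
        return self._states[index]

    def add(self, state) -> int:
        """Store a newly reached state and return its index for the ML agent."""
        self._states.append(state)
        return len(self._states) - 1

    def apply_step(self, index: int, proof_step, apply_fn) -> dict:
        """Apply `proof_step` (via the prover's own `apply_fn`) to a stored state.
        Returning the new index lets the agent later backtrack to either state."""
        new_state = apply_fn(self.get(index), proof_step)
        if new_state is None:                 # the proof step failed to apply
            return {"ok": False}
        return {"ok": True, "state_id": self.add(new_state), "solved": new_state.solved}
```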
It is also important for one or both of the agents to keep track of the whole proof since the point of a proof search is to find a (human readable?) proof in the end.
Also, if there is premise selection (or other important information in the environment), then likely one side or the other of this interface needs to keep track of the available premises along with any information computed about those premises, such as vector embeddings. This information only needs to be computed once per theorem usually.
I see three main use cases.
- Proof guidance for a human user. This will happen inside a tactic or human-facing interface which, given a proof state, uses the ML agent to suggest next proof steps or whole proofs. This use case isn’t needed for ML research (e.g. HOList doesn’t have any way to interface with HOL Light for human users), but an ML theorem prover is not useful without something like this. It is a really nice way to showcase your ML agent and to provide real-world testing. In this case the prover will likely be the client and the ML agent the server. If there is a full proof search, then we have to decide if the prover or the ML agent controls the proof search. Also, it would be desirable in this use case for the ITP system to be able to provide a human-style proof to the user instead of something that looks more like computer code, hence it is useful to think about this use case when designing the system.
- Evaluation. In this case, we are evaluating the ML agent over all the test theorems so we can, say, claim that our system solves X% of the theorems in `mathlib`. (Before we start, we should separate theorems in the library into testing and training. Testing can be much smaller than training. Also, I won’t get into it, but there are lots of views on how this should be done. Some like to keep a separate set of test theorems which are unrelated to the training theorems. For example, they should be in another project or be new to the library.) In this case, the client is likely some external testing code. (Although some write this loop in the metalanguage of the prover.) We likely want to store the resulting proof for inspection later. Also, it is important here to enforce that when proving theorem X, the agent only uses theorems which come before X. (We don’t want it, say, using X to prove X!) A sketch of such an evaluation loop follows this list.
- Reinforcement learning. This is similar to evaluation, but with some key differences. Here we will run our prover over the much larger set of training theorems (instead of the test theorems). We will also likely run it multiple times in a row, gathering data on each run. Hence, for this to be practical, we have to be sure that our system is fast (although for larger ML models, the bottleneck might be the neural network instead of the prover). We also need to make sure that for each proof step, we can record the same sort of data as generated in the data gathering phase (see below).
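Here is a minimal sketch of such an evaluation loop, with the before-X guard made explicit. `prove_fn` stands in for the whole search described earlier and is assumed to return the found proof or `None`.

```python
def evaluate(test_theorems, library_order, prove_fn):
    """Evaluate the ML agent over the test theorems, restricting each proof attempt
    to premises that appear strictly before the theorem in the library."""
    results = {}
    for name in test_theorems:
        # Only premises that occur strictly before this theorem may be used.
        allowed = set(library_order[:library_order.index(name)])
        proof = prove_fn(name, allowed)
        results[name] = proof                 # store the proof for later inspection
    solved = sum(1 for p in results.values() if p is not None)
    print(f"solved {solved}/{len(test_theorems)} "
          f"({100.0 * solved / len(test_theorems):.1f}%) of test theorems")
    return results
```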
The theorem prover doesn’t have to choose the machine learning model, but it has to make choices which will constrain what sort of machine learning models are available to use. Basically, either the theorem prover needs to give the ML model an input in some standardized format, or it needs to provide an interface by which the ML model can ask the prover for the type of data it needs. (Again, I’m making the assumption here that the agent does not have direct access to the metalanguage. If it did, this would provide a lot more flexibility, but it would also make things a lot harder for ML specialists who don’t know much about the metatheory of a prover.)
In short, the input to the ML model is information about the proof state and environment. The output is the next proof step to apply. Some environment inputs, like the list of premises, might be only read in once per theorem or even just preprocessed offline during training.
The simplest option is to just support a pure text-to-text ML model. For example, in Lean GPT-f the input to the ML model is the pretty-printed goal state (all the goals as a single string with some whitespace cleanup). The output is a tactic string like `rw [nat.add_comm]` (trained to mimic human-written tactics in the code). In Lean GPT-f, there is no premise selection. Instead, during training the model memorized all the theorems in the library. (So Lean GPT-f would just output tactic strings which reference those theorems by name.) This means that the ML model can be treated as a stateless server where it is just fed the pretty-printed goal string and returns a tactic string. We had to modify the Lean source code to be able to parse that tactic string and convert it into a Lean tactic. (This option might not be available to all provers.) Our text-based model, however, can hit certain snags:
- It may get the namespace information wrong; for example, it may write `rw [add_comm]` in a setting where the `nat` namespace isn’t open.
- It may try to use theorems which really are theorems, but which are not yet in the environment. In that case, the tactic will fail.
- Also, when testing on new libraries, our model will not know about any new theorems in the library, so it likely won’t use them.
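Setting these snags aside, the whole ML side in this text-to-text setup reduces to a single stateless function from a goal string to candidate tactic strings. Here is a sketch, where `model.sample` stands in for whatever sampling interface the language model provides and the prompt format is only illustrative:

```python
from typing import List

def suggest_tactics(model, pretty_printed_goals: str, n: int = 8) -> List[str]:
    """The entire ML side of a GPT-f-style setup: map a pretty-printed goal string
    to candidate tactic strings. `model.sample` is a hypothetical sampling interface."""
    prompt = "GOAL " + pretty_printed_goals + " PROOFSTEP "
    return model.sample(prompt, num_samples=n, max_tokens=128)

# The prover then tries each candidate in turn, keeping those that parse and apply:
#   for tac in suggest_tactics(model, "x y : nat, h : x < y |- y > x"):
#       if try_tactic(state, tac): ...
```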
In Metamath GPT-f the idea is similar. I don’t recall how the input is specified, but it is just a string of some sort. The output again is a string specifying the proof step. In this case, premise selection was handled via the following mechanic: the model was trained to output a proof step, except instead of referencing a theorem in the library by name, it would reference it by the statement of that theorem. Then the prover would do a library search to look up that theorem. (This approach is more feasible in Metamath than in a lot of other provers, and it also had to be designed into the ITP agent.)
This sort of text-to-text model works fairly well, but here are some considerations. The largest consideration is whether to use pretty printed inputs or more structured inputs (like fully elaborated strings, or s-expressions, or JSON). Modern text-to-text models only accommodate fairly short inputs. (The details are a bit tricky, but the input+output for a GPT-f model is capped at 2048 tokens where a token is on average something like 4 characters.) So if one starts expanding certain goal states, it could easily exceed the input length limit. Another thought is that pretty printed strings, since they are good enough for humans, are good enough for AI systems, and maybe better since they get to the point and avoid superfluous information.
Text models also have another important characteristic. Since they just take in text as input and output, they are easy to pretrain on a large chunk of the internet including arXiv and GitHub, as well as on whatever random data one can access from the prover. This pretraining provides both an economy of scale (the same pretrained model can be used for many projects) and greatly improves the results. Other models aren’t so flexible in pretraining.
Instead of human-readable text, many AI models use a more structured representation. For example, HOList for HOL-Light supplies an input as an s-expression representing a sequent. (This expression is basically the internal raw expression format, including all type information.) Graph models take apart that s-expression and turn it into a graph which is then fed into a graph neural network. (There is nothing important about s-expressions compared to, say, JSON; this is just a convenient representation for terms, though it also seems to slightly annoy some ML researchers.) Other models which retain the tree structure of an input are text models with special tree encodings. The advantage of this sort of structured input is that it doesn’t throw away important information. Also, if one wants to use a plain text input model, that is fine too: they can just plug the s-expression into that model. However, as discussed above, this input might be on the verbose side for a text model, and one shouldn’t expect an ML researcher to take an s-expression and be able to turn it into a pretty-printed expression.
For the output side, HOList has a fixed tactic API covering all the available tactics. Their model works by first choosing the tactic. If that tactic has term parameters, these are either chosen by hard-coded logic (like when a variable name is needed), or such tactics are skipped altogether, like `EXISTS_TAC`. Next, if the tactic has theorem parameters, like `MESON_TAC`, then another premise selection component selects the premises from the available theorems in the environment. If I recall correctly, the ML agent actually knows the list of premises and moreover knows which premises are associated with which test/training theorem. This means they could precompute special data about the premises which they could use to select them. This premise selection strategy has a consequence: the list of premises is effectively hard-coded into the model. While it is a bit more flexible than Lean GPT-f, it won’t accommodate new projects with new premises in an online manner. It will only know about the premises it was told about ahead of time. But this is still pretty good. (If there was a standard ML model associated with your prover, you could even compute the information about each premise ahead of time as part of CI.)
Last, this tactic output is sent back to the prover via protobuf (sort of like JSON). Theoretically, one could train a text-to-text model to output this protobuf representation or something similar.
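To contrast with the text-to-text output, here is an illustrative sketch of this two-stage structured output: pick a tactic from a fixed list, then, for tactics that take theorem parameters, rank precomputed premise embeddings against the goal. The names, shapes, and the set of parameterized tactics are assumptions for the sketch, not HOList's actual API or protobuf schema.

```python
from typing import List
import numpy as np

def choose_proof_step(goal_embedding: np.ndarray,      # shape (d,)
                      tactic_scores: np.ndarray,        # one score per tactic in tactic_names
                      tactic_names: List[str],
                      premise_names: List[str],
                      premise_embeddings: np.ndarray,   # shape (num_premises, d), precomputed
                      k: int = 8) -> dict:
    """Two-stage structured output: pick a tactic, then rank premises for it."""
    tactic = tactic_names[int(np.argmax(tactic_scores))]
    step = {"tactic": tactic, "premises": []}
    if tactic in {"MESON_TAC", "REWRITE_TAC"}:          # tactics assumed to take theorem parameters
        scores = premise_embeddings @ goal_embedding    # dot-product relevance scores
        top = np.argsort(-scores)[:k]
        step["premises"] = [premise_names[i] for i in top]
    return step
```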
The Tactician (Coq) and TacticToe (HOL4) provers take an online approach where they don’t necessarily do any training ahead of time. Obviously they have to use a fast ML model to accommodate this. Here is how it works. As the library is being compiled, every time a proof is added to the library, each tactic step is recorded and processed. For each tactic state, they compute a vector summarizing that state. Then, when trying to prove a new theorem, they compute a vector v representing the current tactic state and search all the previously recorded tactic states in the environment for the ones with the closest vector representation to v. They then try to apply the same tactics which were applied to those goals. (They also do some simple modifications of those tactics to, say, swap local variables with those in the current local context.)
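A toy version of this online nearest-neighbour approach, where `featurize` stands in for their hand-crafted (TF-IDF-like) embedding, might look like this:

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

class OnlineTacticKNN:
    """Record (goal features, tactic) pairs while the library compiles, then suggest
    the tactics used on the most similar previously seen goals."""

    def __init__(self, featurize: Callable[[str], Counter]):
        self.featurize = featurize
        self.memory: List[Tuple[Counter, str]] = []

    def record(self, goal: str, tactic: str) -> None:
        # called for every tactic step as the library is compiled
        self.memory.append((self.featurize(goal), tactic))

    def suggest(self, goal: str, k: int = 10) -> List[str]:
        # rank all recorded goals by similarity and return their tactics
        v = self.featurize(goal)
        ranked = sorted(self.memory, key=lambda item: -cosine(v, item[0]))
        return [tactic for _, tactic in ranked[:k]]
```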
Such a system is much much faster than anything involving neural networks. It can run on a laptop easily. It is also very flexible. It instantly scales to new projects and to new refactored versions of the library.
However, such an AI system requires a very tight interaction with the theorem prover. The theorem prover has to process the tactic proof of every theorem in the environment during compilation using a particular embedding model. (They use a hand-crafted vector embedding similar to TF-IDF for this part, which makes it easier to build directly into the theorem prover.)
I still think, however, that the ability to incorporate new proofs and theorems may be an important step in theorem proving, so the ability to access this information is a nice-to-have, but it involves a lot of work.
Many machine learning models need large amounts of data. With the exception of Tactician and TacticToe, this data is captured ahead of time. For some AI agents, like rlCoP and DeepHOL Zero, all the data they need is the theorem proving environment since they learn purely with reinforcement learning. (This theorem proving reinforcement learning environment does however already contain a list of all theorems and their premises, as well as a list of all available tactics, so it isn’t fair to say that the model learns math from zero knowledge.)
Most other agents learn by mimicking human proofs (and then some are additionally fine-tuned with reinforcement learning). Therefore, one needs to gather human proof data that others can use to train their models.
NB: One needs to gather the human proof data in exactly the same input-output form they would like the model to use. This is definitely worth testing thoroughly, since it would suck if there was a mismatch. (Subtle differences in the preprocessing of data between testing and training can be the death of production ML systems.)
Moreover, it is often helpful to provide other data as well. In PACT for Lean GPT-f we showed that there was a lot of data that we could use to help train our model, like converting from pretty-printed to elaborated terms, or predicting the name of a theorem from its statement. (Such data is more useful in text models.)
Also, while it is not clear if structured expressions are helpful for model input anymore (since we have really good text-to-text models), it is still probably true that structured representations help a lot in training. If we know the structure of `x y : nat, h : x < y |- y > x`, we can do training tricks like masking out a complete subexpression, such as `x y : nat, h : [MASK] |- y > x`, and asking our model to predict the missing subexpression. We need structured data to be able to do this.
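For instance, here is a toy sketch of generating such masked training examples from a term tree (the term representation here is made up for the example):

```python
import random
from typing import List, Tuple

# A tiny term type: a head symbol plus a list of child terms.
Term = Tuple[str, list]

def subterms(t: Term, path=()) -> List[tuple]:
    """All positions in the term, as paths of child indices."""
    head, children = t
    out = [path]
    for i, c in enumerate(children):
        out.extend(subterms(c, path + (i,)))
    return out

def mask_at(t: Term, path: tuple) -> Term:
    """Replace the subterm at `path` with a [MASK] leaf."""
    if not path:
        return ("[MASK]", [])
    head, children = t
    i, rest = path[0], path[1:]
    return (head, [mask_at(c, rest) if j == i else c for j, c in enumerate(children)])

def make_masking_example(t: Term) -> Tuple[Term, Term]:
    """One self-supervised example: (masked term, the hidden subterm).
    This only works because we have the tree structure, not just a string."""
    path = random.choice(subterms(t))
    masked = mask_at(t, path)
    hidden = t
    for i in path:
        hidden = hidden[1][i]
    return masked, hidden

# Example with the hypothesis `x < y` from the goal above:
lt = ("lt", [("x", []), ("y", [])])
print(make_masking_example(lt))
```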
In the above, I tried not to be too committal. I think the best approach for any theorem prover designer would be collaboration with a variety of ML researchers so they can express their needs. Also given the shifting winds of ML research, a flexible and modular system is best.
Nonetheless, here are practical first steps (none of them trivial).
- Decide on a tentative way for an ML agent to communicate with your prover considering the discussion above.
- Build that interface.
- Gather training data from your theorem proving library using the same input and output format as your interface.
- Make a mock ML agent which regurgitates the training data. Specifically, it takes in the input, uses it as a key to look up the recorded output, and returns that (a minimal sketch of this is given after the list).
- Loop over all theorems in your library and use that mock ML agent to reprove all theorems in the library. (If you perfectly recorded the proofs and if you perfectly built your interface, you should be able to just replay all proofs using your interface and the training data.)
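A minimal sketch of that mock agent, assuming the training data was exported as JSON lines of input/output pairs:

```python
import json

class MockMLAgent:
    """Mock agent for replay testing: it looks up the recorded human proof step
    for each proof-state input it saw in the gathered training data."""

    def __init__(self, training_data_path: str):
        # Assumes one JSON object per line, e.g. {"input": "<proof state>", "output": "<proof step>"}.
        self.lookup = {}
        with open(training_data_path) as f:
            for line in f:
                example = json.loads(line)
                self.lookup[example["input"]] = example["output"]

    def suggest_step(self, proof_state_input: str) -> str:
        # If the interface and the recorded data match exactly, every lookup succeeds
        # and replaying the library should reprove every theorem.
        return self.lookup[proof_state_input]
```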