Yoav Golderg, February 2024.
Researchers at Google DeepMind released a paper about a learned systems that is able to play blitz-chess at a grandmaster level, without using search. This is interesting and imagination-capturing, because up to now computer-chess systems that play at this level, either based on machine-learning or not, did use a search component.1
Indeed, my first reaction when reading the paper was to tweet
wow, crazy and interesting. I still find it crazy and interesting, but upon a closer read, it may not be as crazy and as interesting as I initially thought. Many reactions on twitter, reddit, etc, were super-impressed, going into implications about projected learning abilities of AI systems, the ability of neural networks to learn semantics from observations, etc, which are really over-the-top. The paper does not claim any of them, but they are still perceived and talked about by many readers.
On the other side of being utterly impressed and making far-reaching assertions about the abilities of machine-learning, manu reactions turned to diminishing the result based on the fact that it was only achieved for blitz-chess, a fast moving game with a strict time limit, where players don't have much time think about each move, and not for a full-on chess game.
I am also not utterly impressed, but I will tackle it from a different angle than focusing on the blitz nature.
In short, a naive reader (and I was such a naive reader on my first read) may conclude the following:
DeepMind demonstrated a machine learning model which satisfies all of:
- Learned and is able to follow the rules of chess
- Learned and is able to play the game at a grandmaster level
- Learned purely from observing chess games and their outcomes
- Learned using a purely-ML system, without any game specific biases encoded in the training procedure
In particular, believing points 1-4 lead people to conclusions about other learning systems, such as large language models, along the lines of "If a transformer network can learn the rules/semantics of chess, including the ability to perform search, just from looking at games, then it is further evidence that we might be able to learn the true semanttics of language just from observing written texts".
However, none of these points are true! And to be fair, the paper doesn't claim them. It is well written, and doesn't over-claim (expcept for claiming a resulting system with no search... we'll get to that). Upon some more careful reading and thinking, one can actually realize that:
- Points (1) and (2) were not achieved, and in fact are not possible at all with the learning setup used in the paper.
- Moreover, point (1) and (2) are likely not possible in any setup that follows points (3) and (4).
- Points (3) and (4) are not supported by the setup used in the paper.
So, the paper and its results does not support any further claims or hypotheses about other learning scenarios and the ability of transformers to learn semantics from form.
Still, many readers did in fact have points (1-4) above in mind when reading the paper or its abstract (if you didn't, you can stop reading now). In what follows, I'll dig in into why points (1-4) do not hold. Beyond better understanding of the current paper, I think it also serves as a useful demonstration as to how seemingly neutral and natural modeling choices and data representation in fact pack a lot of punch w.r.t to what's actually being learned and why.
The researchers took 10M human-human chess games from an online chess arena. These correspond to roughly 500M board positions. For each board position and legal move (a total of somewhat above 15 billion pairs), they asked the Stockfish 16 engine for its estimate of the "winning probability" for this board+move, based on Stockfish's analysis of up to 0.05 seconds per move. They then trained a transformer-based model whose input is the board position+move, and the target output is the winning probability of the move.2 This is called action-value prediction.3
Conceptually4 the resulting model was then used to play games by the following, simple procedure: at each turn, use the trained model by having it look at the board position + each possible move, and predict the winning probability for this board+move pair. Then play the move with the highest probability of winning. That's it. Only model predictions, no looking-ahead search.
This procedure proved to be remarkably strong, achieving grand-master level when playing against humans in a blitz-chess setting.
This is impressive. The AI system was able to only look at board positions and their outcomes, and use this to learn a policy that enable it to pick a strong move for each position, strong enough to actually win against very strong human players when applied repeatedly. This was not done by mere memorization: the games against humans included many board positions that were not seen by the system during its training. The system learned to generalize from seen position to these novel ones, and still suggest strong moves leading to wins. And all this without "looking ahead".
One possible interpretation of this result, which I've seen many people make online (and which also crossed my own mind) is:
The system learned to play chess from observations.
It learned the rules of the games and how to play strong moves
just by looking at (many) positions from chess games,
without any extra domain knowledge,
and without self-play or mechanisms for thinking-ahead.
Points (1-4) above are another variant of the same interpretation. However, this interpretation is wrong. None of it is true.
Before understanding why, let's take a small detour to discuss the scale involved in learning chess patterns from observing 10M games, and how it relates to how humans may learn to play.
The system learned from 10M games. That's a lot. Let's assume a hypothetical extremely dedicated professional human player, who studies chess 16 hours a day, and is able to analyze a game per minute, 365 days a year. Such a player will go over 10M games in about 28.5 years. If this player takes not 1 minute but 5 minutes to analyze a game, they will need 142 years to go over these 10M games. There are grandmaster-level chess players who are 13 years old. So clearly human learn differently and much more efficiently. They likely have a very different kind of knowledge of the game, at least in the process of learning it. We'll return to this calculation later.
The authors seemingly chose a neutral representation and task: input is board position, output is the winning probability of a move. Simple, natural, "neutral". Note however, that this is still a choice, and more over, this choice (as discussed in the paper's section 5, "Discussion") resulted in a system that did not, in fact, learn the rules of the game (point 1), and which cannot learn the rules of the game, no matter how strong the learner is. This modeling choice also prevents any learner from playing "like a grandmaster" (point 2). Let's see why, using two failure modes discussed in the paper. The first demonstrates inability to learn the rules of the game. The second demonstrates inability to learn to play like a grandmaster. These are very nice examples of consequences of modeling choices.
The paper mentions that the model could not follow the "don't repeat the same board position three times in a row" rule (aka "threefold repetition"). Indeed, when looking only at single board positions and predicting the next move, it is impossible to learn to not repeat past board positions three times in a row, simply because you do not have access to the previous board positions. There is insufficient signal.
In order to really learn chess "from observations", including this rule, you will an input representation that provides access to move history.5 To encode the history, the "natural" option is a sequence of board states.6 Now, compare learning chess by looking at a single board state and having to learn the next move, compared to learning chess by looking at a sequence of board positions and learning to predict the next move. If you do not know the games, the second one is significantly harder to do. At the minimum you need to learn to ignore all previous boards when predicting the next move (except for a few special cases, like the threefold-repetition rule or castling rules, where the history is extremely important). This is not trivial to do, and a learning model without any knowledge of the game may very well pick up on spurious patterns that involve previous board positions and which are incorrect. By restricting the input to a single board position, the model designer is using their own knowledge of the game to inject game knowledge to the learning process: the knowledge that previous moves are irrelevant in most situations. This does not seem much, but in terms of learning without any knowledge, it is significant. So, focusing on learning to map from a single board position to the best move is a modeling choice that makes the problem much easier to learn but also make it impossible to learn some aspects of it. It's a tradeoff, and it injects important knowledge and "inductive bias" into the learned model. This is not an innocent, neutral decision. No decision is.
Can it be learned from observations at all? Another interesting issue that arise here is that even if you had access to the entire game history (so no missing signal), and a perfect learner (that don't get distracted by spurious patterns), it would be very challenging to learn the threefold-repetition rule only from observing human-human games, and even harder to learn when restricting yourself to looking at games by strong human players: the players know this rule, know that such repetitions will lead to a draw, and hence avoid threefold repetitions in their games. There will be very few game examples where this behavior is manifested, and the few examples that do exist, will be in situations where both players realize the game is "stuck", and wish to draw. It will be very hard, if not impossible, to correctly infer the general threefold rule in this setup, and learn that you should not repeat the same board position three times in a row in a strong board position. (This means capable players who do know the rule, and this model's blindspot, can use it to manipulate the model into drawing by mistake when it is in a very strong leading position).
I find this to be a great demonstration of a semantic rule that cannot be learned from observations, in this case because the semantics affect the observed states. It is worth noting that large language models also learn from observations alone, and reflect on possible semantics they cannot learn, for various reasons.
Grandmaster level human players know the game well. They won't be caught exhibiting behaviors of very weak players. In a sense we already saw such a blind-spot: not knowing the threefold repetition rule, the model can be manipulated into a draw when it's in fact leading. However, the next example is even more problematic, and stems directly from modeling choices:
In cases where the position of one player is much stronger than the others, the "winning probability" can be 100% for several different moves: each of them will lead to winning the game down the line. However, some moves will achieve victory using fewer future moves than others. In our case, the model cannot choose between these moves: it only knows to maximize winning probability of a single move, it does not even "know" that the move will be followed by future moves, doesn't know how winning will be achieved, and doesn't even known what it means to win.
Hence, the model may choose a poor move that takes more future moves to lead to winning. After taking such a move, it may again select a poor move that will take a long time to achieve victory. Indeed, it has no notion of "a plan" nor a notion of making a move that will get it closer to victory. In some extreme cases, the model may get stuck forever in situation where winning is guaranteed, yet the model never reaches it. Consider, for example, the model having two rooks, while the opponent has a king. This is a textbook example of check-mating, so every move will be a guaranteed 100% probability of winning. However, in order to actually win, one has to make a specific series of moves with their rooks, not just move them aimlessly around. But the board is large enough to allow rooks moving aimlessly around. This will be an absurd behavior, associated with very weak chess players, and which will never ever happen at grand-master level. Yet, the model, as trained in this paper's setup, has no way of avoiding it, and has a real chance of exhibiting it.
The modeling decision crippled the learner and prevented it from exhibiting grandmaster level play. This modeling decision in question is learning to associate moves (or boards) with their "winning probability", without factoring in the estimated number of steps that will lead to the win. The paper proposes and ad-hoc solution, that actually contradicts it main claim: the resulting chess program does use search.7
Modeling choices matter.
So, we saw how modeling choices prevented the model from learning the rules of the games (and that some rules will be hard to learn from observing games under any setup). This handles points (1) and (2) above, and to some extent also point (4): there certainly are game-specific information embedded in the modeling choices, and other choices could make learning much harder.
Now let's tackle point 3: the paper's setup is not learning from observing games. This means, in my view, that any attempts to infer from this paper anything about learning via language models is futile. I will now describe some of the reasons making learning from observing games very hard, and what the paper actually does instead.
Consider what it means to learn chess from observations. You see a chess game, and at the end, you get told who won. You can watch as many games as you want, but you can only watch. What can you learn from that about the rules of the game? What can you learn about the game's semantics and strategy? What does it take to learn to play well?
Learning that players alternate in their turns will likely be an easy pattern to pick up.8 Learning the legal moves for each player is likely also going to be easy (with the possible exception of castling, which may be harder). But how about learning the semantics of the game? or, let's pick an easier target, learning to make good moves, which will maximize your chances of winning?
To learn to pick up good moves, you could, for example, focus on only the moves of the winning player in each game. Now, at each board position you note what move this player made. And you try to find patterns that associate the board state and the move. There is a big problem, though: this will not necessarily (and will likely not) result in learning to play well and make good moves: not all the moves a winning player made in their game are good moves. Some of them might actually be mistakes, that the player managed to recover from. Other might just be "meh" moves. Learning to distill the actual good moves in a sequence that eventually lead to victory, simply by observing the sequence and its outcome, is very hard. In Reinforcement Learning ("RL") terminology, this translates to a situation of "sparse reward", and identifying the actual good moves that led to the sparse reward is called the "credit assignment problem". Credit assignment is perhaps the biggest challenge in RL, and is far from being solved. And whatever solutions exist for it in RL, assume you can interact, propose moves and observe their outcomes. Situations when you are only allowed to observe and now allowed to interact are much, much harder.
Ignoring credit-assignment and just learning to predict the move the winning player made in each board position will likely lead to very poor, and confused, strategy, especially when you do not have any knowledge of the players in the game you observe. (Maybe the player you observed is a weak one, and only managed to win because they played against other weak player? maybe many games are from such weak players? In this case, you will learn to associate board-positions with very non-impressive moves. Likewise, maybe most moves in a game are "meh" in the sense that they can lead to either winning or defeat depending on the future, but are very indecisive, and the patterns you will learn will be dominated by such meh moves rather than by strong winning moves? etc).
So, learning from observation is far from trivial. The authors of course realized that, and did not even attempt to learn from observations. They did something very different. Did you spot what it was?
The setup in the paper is not learning from observations. It didn't look at game positions and tried to learn and predict the next move of the game at certain positions. Rather, it looked at a game position, then used a strong chess-engine to expand a tree of possible game continuations from this position spanning hundreds of thousands if not millions of moves into the future. The game engine then used its internal knowledge of chess to assess the percentage of winning board positions within the possible continuations, which is the "probability of win" for a position. Then, the learner tried to learn this number. This is not learning from observations, it is learning from a super-human expert.
Knowing the "win percentage" or "win probability" of a move in a position is very hard. Grandmaster-level players can estimate this number for some board positions based on their past experience (and based on their history of interactions with chess engines, including querying them for this precise number for many different positions, and analyzing both the number and the some select winning and losing paths). But this is virtually impossible to give a good estimate of this "win probability" of a move without either being exposed to many correct estimates of other moves, knowing the rules of the game and being able to "play them out", or observing someone else doing such playing-out. In any case, this is well beyond what is available by just observing played games.
So, the learner certainly did not learn from observation, it learned from a significantly reacher signal, one that cannot be obtained without very strong knowledge of the game (the authors call it "distilling the knowledge of stockfish 16". This is an apt name).
Above, we said that a human learner is much more efficient than the learning method of this paper, because observing and analyzing 10M games in a rate of 1 game per minute 16 hours a day will require 28.5 years, and humans are much more efficient than that.
But in fact, the story is even more extreme. For every game, the machine-learner looked not only at the game, but also at the win-probability estimates of each of the 15B board-move pairs in these games, and tried to learn and replicate these win-probabilities. The win-probability estimates are not only time-consuming for human-learners to produce, they are also impossible to derive without very strong knowledge of the rules of the game (or access to a chess engine that encodes these rules). Human learners clearly learn very differently than that, and much more efficiently, at least in the middle parts of their learning. (Once beyond some level, they may as well be learning win probabilities. But this will build on a very different foundation than a system that was trained only on such probabilities, and will also look at significantly fewer board positions.)
We saw that the way in which the problem is encoded is never natural or neutral, it is always a modeling choice among various alternatives, and these choices have benefits and consequences.
In this paper, the prominent modeling choices were:
- representing the input as a single board position, rather than a sequence of positions or moves.
- representing the output as a winning probability.
While not explicitly discussed in the paper, these two choices encode game-specific knowledge, and make the learning problem significantly easier. However, the first choice also prohibited the model from properly learning some rules of the game, while the second choice prevented the model for ever achieving real grandmaster level play by prohibiting it from concisely and efficiently winning in some very easy-to-win situations.
Moreover, the second choice introduces a huge amount of game-knowledge into the training process, and resulted in a process which is very different from learning-by-observing. While there is no search at playing time, there is a lot of search going on in training, and a lot of fine-grained chess knowledge as well. (While the authors don't deny it, and even say it up-front, this is not the focus, and easy to miss on a first read by a casual reader. Or at least, I missed it.)
While the end result is indeed impressive, it is also much less impressive than many readers portray it to be: the model learned to pattern match from a given position to a strong move, in a generalizable way. But it did not learn from observations, did not learn the actual rules of the game, and likely learned in a very different way than any human would. In many ways, it is telling much more about properties of the game of chess, than it does about the learning abilities of ML-models. We can certainly not support claims of potential strong abilities of langage models based on this work.
But if you take one message from this piece, I urge that that is not "this result is meh" or the specifics of learning to play chess without search, but rather the more general one: "everything is a design choice. design choices have big implications, and are often glossed over in papers. looking for important but hidden design choices and implications is a critical paper reading skill."
Roughly speaking, a "search component" means a system that "looks ahead" in the game: At each board position, it unfolds many moves ahead, for both the player and the opponent, building a tree of possible moves, and then choosing the move that leads to the best outcome over all possible continuations that start from this move. Details vary, but that's the gist of it. Systems with a search component evaluate many board positions for each move. For example, Stockfish, the leading chess engine, on a 4-core CPU, can evaluate roughly 5 million board positions per second. Considering 50ms "thinking time" per move corresponds to evaluating about 250,000 boards for each move. ↩
They actually don't predict the probability, but divide the probability range into 128 bins, and predict the identity of the bin the probability falls into. This transforms the problem from regression to classification. But it does not really matter for the rest of the discussion. ↩
They also experimented with two other variants: (1) for each position, predict the win probability of the position (state-value prediction); (2) for each position, predict the move with the best probability of winning (behavior-cloning). The differences between these are irrelevant for the rest of our discussion, so I'll stick to the action-value prediction variant. The translation to other variant should be trivial. ↩
I write "conceptually" because it wasn't exactly what was done, there are some extra hard-coded quirks and kvetches that are only mentioned later in the paper, and which I will get into later. ↩
Another rule that requires access to game history is "castling", whose availability depends on previous moves. In this work, though, castling is handled by encoding "castling availability" as an additional variable in the "board state", which is tracked externally to the learned system. If the goal of the authors was indeed to demonstrate learning to play chess only from observations, this would have been cheating (but they don't claim to learn from observation, so it is a legitimate choice). A system that claims to learn to play only from demonstrations, should also tackle the challenge of learning the castling rules. ↩
We can also think of encoding the history as a sequence of moves, as chess games are often communicated, or as a combination of board-states and moves. The gist of the argument transfers between these choices. ↩
The authors "solve" this in a very non-elegant way, in my opinion. In their own words: "we check whether the predicted scores for all top five moves lie above a win percentage of 99% and double-check this condition with Stockfish, and if so, use Stockfish’s top move (out of these) to have consistency in strategy across time-steps." They essentially hard-code the identification of this situation (domain specific knowledge) and then ask the chess engine to tie-break among model predictions, which actually introduces search to the player strategy, in contrary to the paper's main claim!! ↩
Note that the setup in the current paper does not even attempt to learn this pattern. It is just assumed to be known, is baked into the playing procedure, and the model is only invoked in its own turns. ↩