This post-mortem details my entry in the Winter Challenge 2024 at CodinGame, which ended up winning the competition. Once again I could count on a strong showing from the University of Wrocław participants, making me sweat until the very end, with special kudos to @MSz, @aCat and @marwar22. 😅 And of course many other strong participants also showed up with great strategies.
And while I am at it, I must also give some praise to CodinGame, as this game was well designed, challenging, interesting to watch, and without a doubt the best one of recent years in my mind. Bravo!
My bot can be summarized as an actor-critic neural network trained with reinforcement learning and backed by a search algorithm at runtime. I suppose at this point it is no longer a surprise to anyone, as this is my 4th challenge in a row doing so, and I am still just as dedicated to further honing and improving my machine learning skills. I will skip some details to avoid repetition, so you may want to check two of my previous post-mortems [1] [2] to get a fuller picture.
The model used in the final submission had around ~98k parameters, with the bot itself a few hundred characters short of the 100k limit. At this rate I may have to write a full C++ minifier next...
The input is a 2D grid of 12 rows by 24 columns (the maximum map size). For smaller maps, the unused cells are zero-masked. For the blue player, symmetry is exploited by mirroring the map to put its starting position in the top-left, so the network does not waste capacity learning mirrored versions of the same positions.
Each cell of the grid has 93 channels:
- One-hot encoding over 34 categories:
  - 6 for empty, wall and the 4 protein types.
  - 14 per player for each organ type, with 4 direction sub-types for harvesters, tentacles and sporers.
- For each player (2x29):
  - For non-root player organs, one-hot encoding over the 4 possible parent organ directions.
  - Possible grow types on the next turn (the same 14 categories as the player organ types above).
  - Scalars splatted over the map:
    - For each protein type, the stored amount and the income per turn.
    - "Criticality": the number of organs that would be destroyed as a result of this cell being attacked.
    - The score and tiebreak score.
- The turn number splatted over the map.
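To make the channel arithmetic concrete, here is a tiny Python sketch of how those 93 channels break down per cell. The grouping follows the list above; the exact ordering within each group is my own guess rather than anything stated.

```python
# Channel budget per cell, following the breakdown above.
# The ordering inside each group is an assumption for illustration only.

N_ORGAN_TYPES = 14                       # root, basic, 4x harvester, 4x tentacle, 4x sporer
CELL_CATEGORIES = 6 + 2 * N_ORGAN_TYPES  # empty, wall, 4 protein types + organs per player

PER_PLAYER = (
    4                   # parent organ direction (non-root organs)
    + N_ORGAN_TYPES     # possible grow types on the next turn
    + 4 * 2             # per protein type: stored amount, income per turn
    + 1                 # "criticality" of this cell
    + 2                 # score, tiebreak score
)

TOTAL_CHANNELS = CELL_CATEGORIES + 2 * PER_PLAYER + 1  # +1 for the turn number
assert (CELL_CATEGORIES, PER_PLAYER, TOTAL_CHANNELS) == (34, 29, 93)
```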
Each input cell is embedded with a 1x1 convolution, and the grid embedding then goes through a residual network (ResNet) of 4 blocks, giving the network a total receptive field of 17x17. I tried smaller networks, but their sharp drop in quality was not compensated by the extra search they allowed, while bigger networks would hardly fit in 100k characters alongside bot code that is already a fairly minified mess.
On this note, dear CodinGame if you are reading this, 200k characters in 2025 would make for a terrific present. 😁
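For illustration, here is a minimal PyTorch-style sketch of such a trunk. The channel width of 32 and the two 3x3 convolutions per residual block are assumptions; the description above only fixes the 1x1 embedding, the 4 blocks, the 93 input channels and the 17x17 receptive field.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with two 3x3 convolutions; each conv adds 2 cells of receptive field."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        y = torch.relu(self.conv1(x))
        y = self.conv2(y)
        return torch.relu(x + y)

class Trunk(nn.Module):
    """1x1 embedding of the 93 input channels followed by 4 residual blocks.
    Eight 3x3 convolutions in total give the 1 + 8*2 = 17x17 receptive field."""
    def __init__(self, in_channels=93, ch=32, blocks=4):
        super().__init__()
        self.embed = nn.Conv2d(in_channels, ch, 1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])

    def forward(self, x):          # x: (batch, 93, 12, 24)
        return self.blocks(torch.relu(self.embed(x)))
```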
In typical shared actor-critic architecture fashion, the final output is split into two heads, one for value and the other for policy.
The value head starts with a 1x1 convolution embedding, with the output flattened and fed to a 2-layer MLP to obtain the value scalar.
The policy head is a 1x1 convolution embedding that outputs 15 channels over the entire map. 14 of them indicate the grow type as specified above, while the extra channel is flattened and fed to a single-layer MLP to obtain the wait logit, which is concatenated to the rest of the policy to represent 4033 possible actions. Naturally, most of these are illegal on any given turn, so action masking is used to keep only the valid ones, after which softmax is applied to obtain the policy.
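A matching sketch of the two heads, again with hidden sizes and the intermediate value channel count as assumptions; only the 1x1 embeddings, the 15 policy channels, the wait logit and the masked softmax over 4033 actions come from the description above.

```python
import torch
import torch.nn as nn

class Heads(nn.Module):
    """Value and policy heads on top of the trunk output.
    Hidden sizes and the 2-channel value embedding are assumptions."""
    def __init__(self, ch=32, rows=12, cols=24, grow_types=14, hidden=64):
        super().__init__()
        self.grow_types = grow_types
        self.value_embed = nn.Conv2d(ch, 2, 1)                 # 1x1 value embedding
        self.value_mlp = nn.Sequential(
            nn.Linear(2 * rows * cols, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.policy_embed = nn.Conv2d(ch, grow_types + 1, 1)   # 14 grow channels + 1 wait channel
        self.wait_mlp = nn.Linear(rows * cols, 1)              # single-layer MLP for the wait logit

    def forward(self, trunk_out, legal_mask):
        # Value: 1x1 embedding -> flatten -> 2-layer MLP (squashed to [-1, 1] here).
        value = torch.tanh(self.value_mlp(self.value_embed(trunk_out).flatten(1)))

        # Policy: 14 grow-type logits per cell plus one wait logit for the whole map.
        p = self.policy_embed(trunk_out)                       # (batch, 15, rows, cols)
        grow_logits = p[:, :self.grow_types].flatten(1)        # (batch, 14 * rows * cols) = 4032
        wait_logit = self.wait_mlp(p[:, self.grow_types].flatten(1))  # (batch, 1)
        logits = torch.cat([grow_logits, wait_logit], dim=1)   # (batch, 4033)

        # Mask out illegal actions before the softmax.
        logits = logits.masked_fill(~legal_mask, float('-inf'))
        return value, torch.softmax(logits, dim=1)
```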
If you are following closely, you may notice this does not actually fully describe an action. For instance, a target cell could be grown from 4 different parent cells, and "growing a root" (i.e. sporing) could also come from 4 different sources. After some testing though, it turns out the extra complexity is just not worth it; I guess the situations where it both happens and matters are too few. So arbitrarily picking a valid parent is good enough.
But what about handling actions for multiple roots? Wait for it.
Handling multiple roots deserves its own section, as for me it was the key to making this bot work, and it took me a while to figure it out. For the sake of brevity I will not bother listing out the things I thought of and tried, pros and cons, blah blah, so here's what worked best: just ignore it.
No, really. There is no explicit knowledge of multiple root identities anywhere in the network or in the policy. Here is the insight: what if the game only allowed one action per turn? Then the policy would focus on the most important area. Now, what if the organism that just played were no longer allowed to act? Then you could use the rest of the same policy to determine the next best action. Do you see where I am going with this?
I have not really seen this concept elsewhere before, so I would call this small novelty "iterative policy masking". After an action is taken, the same policy is reused by further masking the actions that are no longer possible as a result. This includes accounting for remaining resources, growing on the same cell, double harvesting the same resource, etc. This goes on until all possible actions are exhausted, or whenever the wait action is chosen, so it is never forced to take action if it does not want to.
And the best part is that it all operates from a single inference, so it is a big win on performance.
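In code terms, the inference-time loop could look like the Python sketch below. `state.legal_mask()` and `state.apply()` are hypothetical helpers standing in for the masking and bookkeeping described above (resources, occupied cells, double harvesting, and so on), and the layout of the wait index is an assumption.

```python
import numpy as np

WAIT = 14 * 12 * 24   # index of the wait action in the 4033-long policy (assumed layout)

def pick_turn_actions(policy, state):
    """Iterative policy masking: reuse one policy vector from a single inference
    to pick actions for every organism this turn.

    `policy` is the 4033-long probability vector. `state.legal_mask()` and
    `state.apply(action)` are hypothetical helpers for the masking and bookkeeping.
    """
    actions = []
    mask = state.legal_mask()                  # boolean array of shape (4033,)
    while mask.any():
        masked = np.where(mask, policy, -1.0)  # illegal actions can never be picked
        action = int(masked.argmax())          # greedy here; sampled during training
        if action == WAIT:
            break                              # never forced to act if it prefers waiting
        actions.append(action)
        state.apply(action)                    # spend resources, occupy the cell, ...
        mask = state.legal_mask()              # fewer actions remain possible
    return actions
```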
I am still relying on fairly straightforward PPO to train the network. This time I had 512 games running in parallel at around 16k turns per second while training was happening. The final network was trained for 2600 minutes, but it most likely stopped improving before that point.
The reward was straight win/loss, but I added "win shaping", where the win reward was doubled if the game ended by completely eliminating or starving the opponent, so drawn-out stalemates received less. This was mainly done to accelerate training, as I noticed a sizeable number of interactions were wasted on dead games. I suppose I could have also written better endgame detection code to address this, but I also liked the aggressive playstyle that evolved out of it.
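A minimal sketch of that terminal reward assignment; apart from the doubling, the +/-1 baseline and the 0 for a draw are illustrative values.

```python
def terminal_reward(i_won: bool, i_lost: bool, by_elimination: bool) -> float:
    """Win shaping: the win reward is doubled when the game ends by completely
    eliminating or starving the opponent, so drawn-out stalemates earn less."""
    if i_won:
        return 2.0 if by_elimination else 1.0
    if i_lost:
        return -1.0
    return 0.0  # draw
```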
The policy network by itself plays pretty well, but as the old saying goes, if you are not fully using your 50ms, you might be doing it wrong. Or was it use the compute, Luke? Something like that.
Anyway, the problem with convolutional networks is that they are a real CPU killer, so even with fairly decent optimizations I could only squeeze out ~100 evaluations on a regular turn. Normally you can plug an actor-critic network fairly easily into Monte Carlo Tree Search, but that seemed like too few evaluations to give something useful, especially in a game full of traps.
So I figured a minimax-like approach would be a better friend here. Alpha-beta pruning works well with good move ordering, which I happen to get for free from the policy. I could also use the same policy to generate only a few likely candidate moves for each player, which I ended up fixing at 4. Simultaneous moves were handled with a pessimistic approach, where the opponent acts with knowledge of what the first player did. This is definitely not ideal, but I did not find a nice way to resolve a payoff matrix within that framework without penalizing performance. Another time perhaps.
With a few rare exceptions, depth 2 was reached on most turns, and sometimes depth 3 with some luck or on a small map.
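Here is a rough Python sketch of that search, under a hypothetical interface: `net(state)` returns a value from my point of view plus candidate policies for both players, `state.candidates(policy, k)` returns the k most likely whole-turn moves, and `state.step(my, opp)` resolves both simultaneously. The pessimistic part is the inner loop, where the opponent replies knowing my move.

```python
def search(state, net, depth, alpha=-1e9, beta=1e9):
    """Depth-limited minimax over policy-generated candidate moves, with
    alpha-beta pruning and pessimistic resolution of simultaneous moves.
    The interface (net, candidates, step) is hypothetical, not the actual bot's.
    """
    value, my_policy, opp_policy = net(state)
    if depth == 0 or state.game_over():
        return value, None

    best_value, best_move = -1e9, None
    for my_move in state.candidates(my_policy, k=4):          # ordered by policy
        worst = 1e9
        for opp_move in state.candidates(opp_policy, k=4):    # opponent sees my_move
            child = state.step(my_move, opp_move)             # resolve both moves at once
            child_value, _ = search(child, net, depth - 1, alpha, min(beta, worst))
            worst = min(worst, child_value)
            if worst <= alpha:                                # this line cannot beat alpha
                break
        if worst > best_value:
            best_value, best_move = worst, my_move
        alpha = max(alpha, best_value)
        if alpha >= beta:                                     # parent's bound exceeded
            break
    return best_value, best_move
```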
I also had an NN evaluation cache, useful if the opponent happens to play as predicted, or for making the most of the 1000ms on the first turn. In fact, according to my bot, most games are effectively decided by turn 10 at the latest, if not much earlier, so planning ahead on the first turn is quite vital.
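The cache itself can be as simple as a dictionary keyed by a hash of the game state; `state.key()` below is a hypothetical helper, not something from the actual bot.

```python
class EvalCache:
    """Tiny NN evaluation cache: if a position was already evaluated during a
    previous search (e.g. the opponent played the predicted move), the stored
    (value, policy) result is reused instead of re-running the network."""
    def __init__(self):
        self.table = {}

    def evaluate(self, state, net):
        key = state.key()                  # hypothetical hash of the full game state
        if key not in self.table:
            self.table[key] = net(state)   # expensive convolutional inference
        return self.table[key]
```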
My bot still does a few things I am not happy with. It sometimes does not like finishing games, playing dubious moves instead when it thinks it no longer matters because it has won anyway. I tried forcing its hand by artificially boosting the evaluation, but I could not be bothered to do much more either, I guess. 😅 It is also completely helpless at understanding tiebreak scenarios; it will happily lose them by turning off harvesters too early. Again, I gambled it would not matter too much in practice, but I may have been a bit wrong to ignore it.
With this game I actually had a slight symbiotic feeling with my bot, as for once the game also felt very natural for a human to play. It basically taught me how to play the game as it was learning itself. Then I would see something dumb happen, so I had to fix things and teach it not to do that as well. It was an interesting feedback loop. Of course I am anthropomorphizing a lot here, but it was still a kind of cool experience for me.
I would like to take this opportunity to give some quick publicity to CG-related tools powered by the community. I cannot emphasize enough the importance of local testing, and for this I must recommend PsyLeague by @Psyho, which has provided me (and many other participants) with invaluable help in figuring out the relative performance of different bot versions. There is also a new similar tool called CGArena by @aangairbender, which I have not personally tried but which looks like an interesting alternative. And of course, CGStats by @Magus is as always the go-to for tracking leaderboard submissions, especially on mobile.
And at the risk of sounding like a broken record, thank you and thanks to the community for bringing some much needed humanity while the bots are busy clashing for our sake. Happy New Year to all of you!
-reCurse
well done reCurse! :-)