This post-mortem details my entry in the Summer Challenge 2024 at CodinGame. It ended up finishing in 2nd place, which is obviously off by one from what I would have preferred, but as always the competition is really tough! Congrats to MSz for winning it.
I have once again been dedicated (and maybe a bit stubborn) about leveraging Reinforcement Learning in this challenge, so as a result there is a lot in common here with what I did in the previous one. If you want all the details, I would first recommend reading my previous post-mortem if you have not done so already, as I will only skim over what has not changed since then.
I also have to apologize, as I will again go straight to the details with little introduction to the concepts, so if you are not versed in ML or RL it will likely be (more of) an unreadable mess. Sorry!
Once I finished porting the game referee into my training "framework" (or what I would affectionately call frankenwork), I decided to start simple with a policy network that does not make use of any search. Because search tends to make things a lot slower, I figured skipping it would help me iterate faster on the right hyperparameters, reward, architecture, etc. I am glad I did, as it allowed me to do a lot of experimentation.
So without enumerating all of the iterations, here is the architecture I ended up with.
Each minigame has its own dedicated embedding or encoding network, with an extra one dedicated to the global state (score, medals, etc). In order to exploit some symmetry and help the network not waste expressiveness on permutations, the first player is always the one that plays, and the two opponents are sorted by their current score.
Unless otherwise mentioned, all inputs are manually normalized over their expected range (see the sketch after the lists below for a concrete example).
Hurdle race:
- For each player:
- A list for the future track starting from their current position, 1 for obstacle and 0 otherwise
- Distance left over the entire track length
- Stun duration
- Resulting rank if the minigame ended right now (1 for 1st place and 0.5 for 2nd place)
- Whether the game is inactive (game over)
- Whether the game can finish before the match is over
Archery:
- For each player:
- X, Y, X^2, Y^2, X^2+Y^2
- Resulting rank if the minigame ended right now
- Future wind strengths, 0 when past the end
- Rounds left
- Whether the game can finish before the match is over
Roller skating:
- For each player:
- Position modulo 10, one-hot encoded
- Absolute continuous position without modulo
- Risk (including negative for stun, in hindsight I should have split them)
- Resulting rank if the minigame ended right now
- For each action:
- One-hot encoded input assigned to the action
- Rounds left
- Whether the game can finish before the match is over
Diving:
- For each player:
- Current points
- Current combo
- One-hot encoded input for each future turn
- Rounds left
- Whether the game can finish before the match is over
Global state:
- For each player:
- For each minigame:
- The number of medals won in that minigame, one per type
- For both current state and if all minigames ended right now (but only if they can finish):
- The score of medals won in that minigame
- How much each point won in that minigame is worth (i.e. the scores of all the other minigames multiplied together)
- Total score for current state
- Total score if all minigames ended right now (but only if they can finish)
- For each minigame:
- Turns left in match
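To make the normalization a bit more concrete, here is a minimal sketch of how the hurdle race features for one player could be assembled. The constants (track length, look-ahead window, maximum stun) and the helper itself are placeholders for illustration, not my actual code.

```python
import numpy as np

# Illustrative constants, not the exact values used in my bot.
TRACK_LEN = 30    # assumed track length
LOOKAHEAD = 10    # assumed number of future cells fed to the network
MAX_STUN = 3.0    # assumed upper bound on stun duration, used only for normalization

def hurdle_features(track, position, stun, rank_if_ended, inactive, can_finish):
    # Future track cells from the current position: 1 for an obstacle, 0 otherwise,
    # padded with 0 past the finish line.
    future = [1.0 if cell == '#' else 0.0 for cell in track[position:position + LOOKAHEAD]]
    future += [0.0] * (LOOKAHEAD - len(future))
    return np.array(
        future + [
            (TRACK_LEN - position) / TRACK_LEN,  # distance left over the track length
            stun / MAX_STUN,                     # stun duration
            rank_if_ended,                       # 1.0 for 1st, 0.5 for 2nd (presumably 0.0 for 3rd)
            1.0 if inactive else 0.0,            # game over
            1.0 if can_finish else 0.0,          # can finish before the match ends
        ],
        dtype=np.float32)
```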
Each encoding network starts with a linear projection of the features to the hidden size with no activation, along with layer normalization. The result is then fed through 3 residual blocks made of a fully-connected linear layer, swish activation and layer normalization.
The encodings of all five networks are then concatenated into a single vector and fed through 2 linear layers, the first one downsizing to another hidden size with a swish activation.
The output is then split into two heads, one for value (projection to a single scalar representing the expected return from the current state) and one for policy (projection to 4 logits, giving a probability distribution over the 4 possible inputs).
The model used in the final submission has around 39k parameters.
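For a rough picture, here is a minimal PyTorch sketch of that shape. The hidden sizes are arbitrary placeholders (they are not tuned to reproduce the 39k parameter count), and the exact ordering of the residual addition versus the normalization is my own guess.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Fully-connected layer + swish + layer normalization, with a residual connection.
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + F.silu(self.fc(x)))

class Encoder(nn.Module):
    # Linear projection (no activation) + layer norm, then 3 residual blocks.
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(3)])

    def forward(self, x):
        return self.blocks(self.norm(self.proj(x)))

class ActorCritic(nn.Module):
    # feature_dims: one input size per minigame plus one for the global state.
    def __init__(self, feature_dims, enc_hidden=32, trunk_hidden=64):
        super().__init__()
        self.encoders = nn.ModuleList([Encoder(d, enc_hidden) for d in feature_dims])
        self.fc1 = nn.Linear(enc_hidden * len(feature_dims), trunk_hidden)
        self.fc2 = nn.Linear(trunk_hidden, trunk_hidden)
        self.value_head = nn.Linear(trunk_hidden, 1)   # expected return from the state
        self.policy_head = nn.Linear(trunk_hidden, 4)  # logits over the 4 inputs

    def forward(self, features):
        z = torch.cat([enc(f) for enc, f in zip(self.encoders, features)], dim=-1)
        h = self.fc2(F.silu(self.fc1(z)))
        return self.value_head(h), self.policy_head(h)
```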
Once again the training was done with PPO in a similar fashion to before, running 2048 games in parallel at around 2M steps per second, with the usual run lasting between 8 and 12 hours for good convergence. Because the game length is fixed, it made more sense to have no discount factor this time (gamma = 1.0): there is no real point in favoring getting to the end earlier, and since the game always ends on the 100th turn there is no infinite horizon to worry about, so it allows optimizing for the final score with less noise. Local testing confirmed this was the case.
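As an aside, gamma = 1.0 keeps the usual machinery intact; a generic advantage computation just degenerates into undiscounted sums. A minimal sketch, where the lambda value is a placeholder rather than the one I actually used:

```python
import numpy as np

def gae(rewards, values, last_value, gamma=1.0, lam=0.95):
    # Generalized advantage estimation; with gamma = 1.0 the target is simply the
    # undiscounted sum of future rewards (here, a single terminal reward).
    adv = np.zeros(len(rewards), dtype=np.float32)
    running, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv, adv + np.asarray(values, dtype=np.float32)  # advantages, value targets
```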
After testing a few reward schemes, the one that gave the most consistent results was to give only one reward at the end of the episode, summing up the score difference between self and each opponent, but giving 0.5 weight to the opponent with the lowest score. In practice it gave a nice balance between maximizing its own score and denying a rival, prioritizing the one with the highest score. I also tried only looking at the opponent with the highest score, which ended up winning more but also finishing 3rd more often, so favoring consistency I discarded that one.
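In code form, that terminal reward boils down to something like this (assuming two opponents; any scaling is omitted):

```python
def final_reward(my_score, opp_scores):
    # Score difference against each opponent, with the weaker opponent weighted at 0.5.
    strong = max(opp_scores)
    weak = min(opp_scores)
    return (my_score - strong) + 0.5 * (my_score - weak)
```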
Once I had figured out most of the details, I also trained as a test a mega network with 6M parameters, which confirmed there was still a lot of room left to learn, as it easily crushed my smaller networks. Fortunately I was able to leverage it in practice by performing actor distillation, using it as a teacher to train another smaller network, which ended up performing better than a small network trained only on its own.
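The distillation itself is nothing fancy; roughly, the student policy is pulled towards the teacher's with a KL loss along these lines. Treat this as an illustration rather than my exact setup; the temperature is a placeholder.

```python
import torch.nn.functional as F

def policy_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence from the teacher's action distribution to the student's,
    # averaged over the batch.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')
```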
It unfortunately became obvious at some point during the competition that even a very good policy network could not be as competitive on its own anymore (unlike the previous challenge), so I started looking into search. A natural choice for me in games with simultaneous actions is DUCT, which has already been extensively discussed in other post-mortems, so I will not reiterate the details here.
Fortunately for me, it is very straightforward to repurpose the networks I have trained into search, as the actor-critic framework gives all that is needed to plug into MCTS (value and policy), following in the footsteps of AlphaZero, which uses PUCT for action selection. A tricky part was still finding the right balance between model size and the number of visits possible with it, so a lot of experimentation with local testing had to be done.
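For reference, the PUCT selection rule looks roughly like this sketch, applied independently for each player in DUCT; the node fields and the exploration constant are placeholders of mine.

```python
import math

def puct_select(children, c_puct=1.5):
    # children: objects with .visits, .value_sum and .prior (policy probability).
    # Pick the child maximizing Q + U, where U favors high-prior, rarely-visited moves.
    total_visits = sum(c.visits for c in children)
    best, best_score = None, -float('inf')
    for c in children:
        q = c.value_sum / c.visits if c.visits > 0 else 0.0
        u = c_puct * c.prior * math.sqrt(total_visits + 1) / (1 + c.visits)
        if q + u > best_score:
            best, best_score = c, q + u
    return best
```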
However there is still the open question of how to handle the randomness in the states. A pure version of DUCT would not only branch on every player-action combination, but also on every possible random state. Not only do these combinatorics explode absurdly quickly, but using neural networks I could only afford between 1k and 3k search iterations with my chosen model depending on the VM CodinGame used, despite the AVX2 optimizations I quickly wrote for inference. That purity would unfortunately not make for a meaningful search.
So my first version was to more or less ignore the randomness. To avoid biasing too much on a single random outcome, every descent of the tree would start with a different random seed, so hopefully over time it would converge to something more generally good regardless of the randomness. It worked out well, but there were still problems I found as I kept investigating.
Because of the way the NN policy strongly directs the search through PUCT, the first visit on any node could add a lot of bias depending on the random state that first expanded it. That can be really bad, because recovering from that bias may need a lot more visits, and the policy from the other random states would not be taken into account, so it could become an uphill battle. Maybe the policy could have been updated and averaged on each descent, but that would have required more costly inferences which I did not have any budget for.
As an experiment, I took some uncertain states and locally ran the MCTS search thousands of times, each run using its own fixed seed, recording the final action taken each time. This gave me a nice probability distribution of what should be, overall, the best action to take by sampling the possible futures. It could maybe be a little biased because in theory each search could start exploiting the guaranteed future it should not know about, but that is likely mitigated by the low number of iterations in each search. In any case, this gave me what I considered to be a more objective target, which I could then compare to the "real" search which used a different seed on each descent.
So if UP was the action to take in 60% of futures, then it is the one that should be taken 100% of the time (minus some (probably?) irrelevant details). However in the real search, because of the bias I described before, there were situations where, for example, 5% of the time it would take an action that was only good in 1% of futures. Clearly not great!
With this insight, I tried another (novel?) approach to keep my "DUCTs-in-a-row". Instead of spending all my budget on one search that could be clearly wrong, I would search using a fixed random seed, then start over with a different seed after reaching a fixed number of visits, recording the best action found each time. When the timer was up, the action with the most "votes" would be taken. I was not sure it would work, but surprisingly the results were very convincing both locally and on the leaderboard. It was better to act from seven (on average) shallower searches than from a single deeper one.
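In pseudocode it is as simple as it sounds; run_duct and time_left are hypothetical helpers here, and the visit budget per search is a placeholder.

```python
import random
from collections import Counter

def vote_search(root_state, time_left, visits_per_search=400):
    votes = Counter()
    while time_left():
        seed = random.getrandbits(64)                         # one sampled "future" per search
        best = run_duct(root_state, seed, visits_per_search)  # hypothetical fixed-seed DUCT
        votes[best] += 1
    return votes.most_common(1)[0][0]                         # action with the most votes
```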
I briefly tried incorporating search directly into training, but the initial attempts were not convincing at all and I just did not have time to invest any further, so I gave up on that.
Another detail that helped with "real world" conditions was to inject a little uniform noise into the policy during search, which lessens the assumptions about what the opponents will do and makes the search a bit more robust to the unexpected. It was tricky to tune, as it always tests worse locally, but it was definitely necessary to give an extra edge against real opponents on the leaderboard.
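The noise itself is just a small uniform mixture on top of the network policy, something along these lines (the epsilon is a placeholder that needed careful tuning):

```python
import numpy as np

def smooth_policy(probs, eps=0.05):
    # probs: numpy array of action probabilities from the network.
    # Blend with a uniform distribution to avoid trusting the policy too blindly.
    return (1.0 - eps) * probs + eps / len(probs)
```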
Once more, thanks to CodinGame and all the participants for another fun challenge. This is a truly unique place that has no equivalent elsewhere, and I enjoy these times where everyone wakes from their slumber to banter and duke it out in the arena.
I do not make use of it personally as I have my own tooling, but still, here is a free advertisement and shoutout to Psyho, who has been working hard on providing the community with PsyLeague, which many have found incredibly useful for running local testing. Make sure to use it if you have no alternative! And of course I also have to thank the one and only CGStats site provided by Magus, always useful to keep track of everyone's submits, especially on mobile.
Thanks again for reading this wall of text! I hope I did not bore you.
Looking forward to the next challenge!
-reCurse