Created January 8, 2024 18:34
reCurse's Codingame Fall Challenge 2023 Postmortem


This post-mortem details my entry in the Fall Challenge 2023 at CodinGame, which ended up winning the competition in a very close call. There were heavyweight competitors participating: MSz showed up with an impressively strong entry (really, wow), and there were other fierce showings from the regulars, too many to name for fear of missing someone. It's really great to see you all again after the couple of years I was away. :)

Coincidentally, I have been itching to try doing some reinforcement learning with neural networks again lately. Re-reading the OpenAI Five and AlphaStar papers, which are among my favorites, made me want a project to experiment with model-free policies on partial-information games, which is very different from my previous tinkering on AlphaZero-like training. I was pleasantly surprised to see this challenge ended up a perfect fit for finally trying it out, even if I found the game itself rather unappealing, to be honest. If it weren't for this learning motivation, I would probably have skipped it.

Neural network

Only describing the final model here; listing all of the other attempts would be too much. The final network contains 156k parameters.


  • 4 categories of inputs:
    • Drone: position, number of turns to reach the surface, battery, light used last turn, emergency, one entry for each fish indicating if it is scanned, and value.
    • Fish: position, speed, visible, active (i.e. still in the map), one-hot vector for types, one-hot vector for colors, one-hot vector for radar X and Y per allied drone, values for each team.
    • Monster: position, speed, visible, active (i.e. unused in this map if < 6 are present), one-hot vector for radar X and Y per allied drone, one-hot vector for id.
    • Team: One entry for each fish indicating if it is reported, score, possible score and drones value.
  • Positions are normalized X: [-1,1] Y: [0,1], with the origin at (5000,0) to (potentially?) help figuring out symmetry.
  • All observations and actions are mirrored on X when playing as player 2. This actually helped a lot with stability and convergence speed.
  • Early training showed the model had a lot of problems figuring out the scoring mechanics accurately, ironically. It was clearly the case when it was sometimes predicting a win when a very simple score computation could tell the game was lost. I obtained a big boost by adding some preprocessing for it, this was the best variant:
    • Assuming all drones go straight to surface, compute what the score would end up like for each turn in the future.
    • When a drone gets to the surface, compute how many points it scored and set it as its drone value. Split the bonus points evenly if both drones surface at the same time.
    • The increase in score by both drones surfacing is set as the team's combined drones value.
    • Fish value: if that fish is scanned or reported, set to zero. Otherwise, its value is the score it would grant if it was immediately reported, bonus included. If a bonus is not yet secured but can still be completed (the needed fishes are not missing and not already in scans), the bonus is divided by the number of fishes needed to complete it. In other words, at game start each fish has typeBonus / 4 and colorBonus / 3 added to its own base value.
    • Possible score: straightforward, same as the game end condition, what score is still possible to get per team by scanning and reporting.
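The fish-value rule above can be sketched as a small function. This is a minimal illustration with hypothetical names (base_value, type_needed, etc.), not the author's actual preprocessing code; type_needed / color_needed stand for how many fish the team still needs to scan to complete that bonus, or None if the bonus can no longer be completed.

```python
def fish_value(base_value, scanned, reported,
               type_bonus, type_needed,
               color_bonus, color_needed):
    """Per-fish value as described above (illustrative sketch).

    type_needed / color_needed: fish still required to complete that
    bonus, or None if the bonus is no longer achievable.
    """
    # A fish already scanned or reported contributes nothing.
    if scanned or reported:
        return 0.0
    value = float(base_value)
    # Spread each still-achievable bonus over the fish needed to complete it.
    if type_needed:
        value += type_bonus / type_needed
    if color_needed:
        value += color_bonus / color_needed
    return value

# At game start nothing is scanned, so each fish carries
# typeBonus / 4 and colorBonus / 3 on top of its base value.
```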


  • An embedding network is used to generate an embedding vector for each of: self drones, enemy drones, fishes, monsters, self team and enemy team.
  • Each embedding network is made of 3 fully connected layers with swish activation.
  • Enemy drones, fishes and monsters are max-pooled to generate an embedding set for each. A lot of literature prefers average pooling, but curiously it was performing worse in my case. Self drones are not pooled to avoid complexity with having to implement a drone selection layer before inferring its action.
  • Each category is concatenated to a vector which is then fed to a Gated Recurrent Unit (GRU) layer, acting as memory to help with retaining information over time to handle the fog of war. I also tested with a Long Short Term Memory (LSTM) layer, but it did not seem to perform much better, with the added cost of more parameters and slower training.
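The encoder described above could look roughly like the following PyTorch sketch. All layer sizes and the observation layout are guesses for illustration, not the author's actual dimensions; only the overall structure (per-category 3-layer swish MLPs, max-pooling of sets, concatenation into a GRU memory) follows the description.

```python
import torch
import torch.nn as nn

class Embed(nn.Module):
    """One embedding network per input category: 3 fully connected
    layers with swish (SiLU) activation."""
    def __init__(self, in_dim, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, dim), nn.SiLU(),
            nn.Linear(dim, dim), nn.SiLU(),
            nn.Linear(dim, dim), nn.SiLU())

    def forward(self, x):
        return self.net(x)

class Encoder(nn.Module):
    """Sketch of the encoder; sizes are illustrative guesses."""
    def __init__(self, drone_dim, fish_dim, monster_dim, team_dim,
                 dim=32, hidden=128):
        super().__init__()
        self.self_drone = Embed(drone_dim, dim)
        self.enemy_drone = Embed(drone_dim, dim)
        self.fish = Embed(fish_dim, dim)
        self.monster = Embed(monster_dim, dim)
        self.self_team = Embed(team_dim, dim)
        self.enemy_team = Embed(team_dim, dim)
        # 2 self drones kept separate (flattened) + 3 max-pooled sets
        # + 2 team vectors = 7 * dim inputs to the GRU memory.
        self.gru = nn.GRUCell(7 * dim, hidden)

    def forward(self, obs, h):
        parts = [
            self.self_drone(obs['self_drones']).flatten(1),  # (B, 2*dim)
            self.enemy_drone(obs['enemy_drones']).amax(1),   # max-pool set
            self.fish(obs['fishes']).amax(1),
            self.monster(obs['monsters']).amax(1),
            self.self_team(obs['self_team']),
            self.enemy_team(obs['enemy_team']),
        ]
        return self.gru(torch.cat(parts, dim=-1), h)
```

Max-pooling over the fish/monster/enemy-drone sets makes the encoder invariant to their ordering, while keeping the two self drones unpooled preserves which drone each action head controls.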


  • The GRU output is split between different heads.
  • 4 heads for actor (policy): 2 per drone, one for movement and one for light usage.
  • 1 head for critic, not needed for playing but still included as it was helpful for debugging.
  • Each head is a single fully connected layer with no activation. Adding complexity there gave no improvements.
  • Movement is 32 angles at full speed, +1 for staying in place and +1 for waiting. I ended up disabling the last two later on as they were never used in practice. Big improvement over 8 and 16 angles, but no improvement at 64 angles.
  • Movement is clamped at the surface Y=500 to get that extra edge the network could not do otherwise.
  • For training, both actions are probabilistically combined (prob(movement) * prob(light)), but each drone is treated independently.
  • Action masking was used to remove invalid actions (going left on left side, using light with no battery, etc.).
  • Early training showed the model had sometimes trouble avoiding easy collisions, possibly because those cases are too rare in training and get lost in the noise. An easy fix was to treat such movements as invalid actions too if it would result in a collision, which improved results a bit and training convergence a lot.
  • There were still cases of collision with monsters that had been seen before and could have been tracked. Again, I am a bit disappointed the network memory does not seem to handle those rare cases. So once a monster is seen and then not visible, I predict its next positions with a simple simulation and invalidate movement based on that. Also feeding that information to inputs, but not sure it helped.
  • During evaluation, the policy is played with the action having the highest probability. Using action masking here caused an issue in rare cases where the policy outputs 99% probability for a movement colliding with a monster, sometimes leaving the rest of the policy susceptible to noise and going in directions that were not intended. So I replaced masking with adjusting the movement vector to the smallest delta in either direction that was clear of collisions. Also added 1 degree to the delta to account for potential rounding errors with referee. Tried to also do this during training but it made things worse.
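The evaluation-time movement selection with the collision nudge might look like this. It is a sketch under assumptions: `collides` is a caller-supplied predicate on the angle (backed by the monster-tracking simulation mentioned above), the 32 bins and the extra 1 degree follow the description, everything else is illustrative.

```python
import math

def choose_movement(probs, collides, speed=600.0):
    """Pick the highest-probability of the 32 angle bins, then nudge the
    angle by the smallest delta (searched in both directions) that clears
    predicted collisions, plus 1 degree against referee rounding.
    Illustrative sketch, not the author's code."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    angle = best * 2.0 * math.pi / len(probs)
    if collides(angle):
        step = math.radians(1.0)
        for k in range(1, 181):            # widen delta 1 degree at a time
            for sign in (1.0, -1.0):       # try both rotation directions
                if not collides(angle + sign * k * step):
                    a = angle + sign * (k + 1) * step  # +1 degree margin
                    return speed * math.cos(a), speed * math.sin(a)
    return speed * math.cos(angle), speed * math.sin(angle)
```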


  • I wrote a multi-threaded C++ referee for maximum speed, directly linked against the bot code to skip input/output parsing. Validating it against the official Java referee was cumbersome (oh hello, fearsome replay format), and it also needed a viewer (not too bad, but again, time consuming), but ultimately it was 100% worth it.
  • 32k games were played in parallel against the current network version (self-play) for 20 steps before updating the network.
  • Relatively straightforward Proximal Policy Optimization (PPO) was implemented to train the network in actor-critic fashion, keeping with typical discount gamma factor at 0.99 and Generalized Advantage Estimation (GAE) lambda factor at 0.95. Adam optimizer was used for the parameters. No improvement found tweaking any of it, really.
  • Reward:
    • +1 for win, -1 for loss. Added a small reward for draw later as during training it had no motivation to end tied games otherwise.
    • After so many different attempts and ablations, the only survivor of reward shaping: the difference between the possible score for each team. A little depressing for the effort put in to be honest, but I am still impressed it generated so much behavior on its own.
    • All rewards with the exception of draw are zero-sum, so a positive reward for one team is a negative reward for the other.
  • I made it a point to keep every best model always functional in a local leaderboard to estimate elo gains and preserve variety.
  • Including the backpropagation phase, the best optimizations yielded approximately 300k steps per second on a 32-core Ryzen and an RTX 3080.
  • Learning rate and entropy annealing were used to fine-tune the training, but ended up worsening the overfit problem (see later below).
  • Made use of Pytorch, Tensorboard and WSL.
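With gamma 0.99 and lambda 0.95 as quoted above, the Generalized Advantage Estimation part of the training loop follows the standard recursion. This is the textbook formulation, not the author's actual code; `values` carries one extra bootstrap entry past the last step.

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (standard textbook sketch).
    values must have len(rewards) + 1 entries (bootstrap value at the end).
    """
    adv, last = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0          # cut the trace at episode end
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last = delta + gamma * lam * mask * last  # exponentially weighted sum
        adv[t] = last
    return adv
```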

Runtime on CG

Or how to fit 156k parameters in 100k characters. Probably not news for most veterans.

  • CG counts the number of characters (max 100k) in a submission, not bytes. So to store as much information as possible, you can exploit a multi-byte character encoding such as UTF-8. Using 32k code points per character yields 15 bits of data each.
  • Furthermore, CG servers have gzip available from shell, so compression is very easy to add on top.
  • I reused a makeshift "code minifier" I did for the previous challenge, where I replace common occurrences of names and keywords by shortened versions using #define or straight up string replacement. A smarter take would probably use libclang, curious to try and learn it at some point.
  • Each network parameter was quantized to 8 bits, with the exception of the first layer dealing with the inputs that used 16 bits, as it caused too much inaccuracy otherwise.
  • My final submission used 99264 characters, 65714 of which were for the network weights.
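The 15-bits-per-character trick could be sketched as follows. The base offset here is an arbitrary illustrative choice (a real scheme must also dodge surrogates and any characters CG mangles); this is not the author's encoder.

```python
def encode15(data, base=0x1000):
    """Pack raw bytes into characters carrying 15 bits each
    (2^15 = 32k code points). Illustrative sketch."""
    bits = ''.join(f'{b:08b}' for b in data)
    bits += '0' * (-len(bits) % 15)  # pad to a multiple of 15 bits
    return ''.join(chr(base + int(bits[i:i + 15], 2))
                   for i in range(0, len(bits), 15))

def decode15(text, nbytes, base=0x1000):
    """Inverse of encode15; nbytes is the original payload length."""
    bits = ''.join(f'{ord(c) - base:015b}' for c in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, nbytes * 8, 8))
```

Roughly 15 bits per character instead of 8 per byte is what lets 65714 characters hold the quantized network weights, with gzip on top for the code itself.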


Now as for what maybe nearly killed my contest run. A typical system to implement in self-play is to keep playing against past versions of the network, to avoid phenomena known as catastrophic forgetting, or strategy collapse. To make a bad analogy: in a rock-paper-scissors game, if at some point the network ends up favoring paper too much, it may end up with a policy where it stops using rock as a result of exploitation over exploration. That would lead to the policy favoring scissors, and consequently only ever playing scissors. So keeping variety in self-play helps to not overfit to a particular strategy, to keep up with the analogy, by having "rock" bots in the self-play pool even (and especially) if the current version favors scissors.

All of that is good, except I kept putting off implementing it, and then never did. It would have been a bit of a pain to do, and it drastically affects training speed because of having to use multiple models with smaller batches, and I prioritized speed above all else. I also thought having so many games in parallel would mitigate collapse, along with the game being infinitely simpler than the likes of Starcraft or Dota, where it is completely necessary. My prior experience toying with AlphaZero-like learning showed it was not needed in practice either, even with complex games like chess, but I did not pay enough attention to how different model-based is from model-free. Keeping all past model attempts in a leaderboard also showed strictly monotonic improvement, with no model ever demonstrating a rock-paper-scissors effect against any other despite having a lot of internal differences. I was also cross-validating by playing some games in the IDE against top bots every day, and again no signs of overfitting, so I thought I was in the clear.

But I wasn't. The last Friday I started seeing weird behaviors and many more losses than I was used to. At first I thought maybe competitors had come up with strategies my model couldn't cope with. Fortunately I save a checkpoint every minute during training, so I thought maybe there was overfit after all and tried an earlier checkpoint. Which started winning more consistently again. Uh oh.

It was too late for me to do anything about it though, as I had family coming over the last weekend and no time to code, make sure things work, launch another training run, etc. But submitting the earlier checkpoint seemed to indicate it would still be fine, as it was winning consistently against the top. But then MSz arrived and my bot was quite vulnerable to his. I ended up rewinding almost all training before I found checkpoints that dealt better with it, but then they were not nearly as strong against the rest anymore. Worse, because these versions were undercooked and early in training, there was still a lot of noise in the policy and no refinement. So I spent the last evening spamming checkpoints in the leaderboard to find one that seemed to work better than the others. It ended up being a very close call, and honestly I would have entirely deserved to lose after making such a mistake.

So lesson learned, don't be dumb and always respect overfitting. Next time I do self-play RL, I will look into nice league systems using AlphaStar as inspiration.

What (else) did not work

So much. Some highlights:

  • Continuous actions. It seems like every time I try using Gaussian distributions as outputs, the result is so much worse than a discretized space that it's not even close.
  • Building a network to explicitly predict the hidden state using the ground truth as target and attention mechanism as inspiration for the architecture. I think it was actually predicting well, but the network performance did not care one bit for it, so I trimmed the fat.
  • Splitting the actor and critic in separate models. Much slower training, but I thought it would give a better policy free from the noise of the critic. Nope. Cut.
  • Reward shaping. Except for the one that remained, trying to interfere to penalize collisions, or reward good battery usage, or use more complex score predictions... Nope, the machine knows better. If it doesn't prove to be more signal than noise, cut it for simplicity.
  • Batch normalization. Still surprised by that one; it almost always improves things, but here it gave much worse results. Maybe I did it wrong.
  • Separate network per drone OpenAI style. Only slower, no benefits here.
  • Adding some residual inputs to the network. I guess GRU already takes care of that.
  • No performance gains using lower precision training (!), unlike my previous works. Have not figured out that one yet. Maybe it's mainly good for convolutions?
  • Not entirely belonging here, but given some of the failures I encountered above, I have no idea how good the internal tracking actually is. I suspect using one of the many approaches described in other post-mortems, keeping it fast and optimized, and skipping the memory part may have turned out as good or maybe even better. But that would have gone too much against what I wanted to learn and experiment with initially. And hey, less code to maintain too!

Time to end this wall of text. Thanks for reading!



reCurs3 commented Jan 16, 2024


  1. Yes, the policy is directly used with no search involved, as I did not think it would help. The limited interactions and unknown information would make it difficult to predict future states to search on. For training, the move is sampled randomly from the policy to ensure there is enough exploration, while for playing it always uses the move with highest probability.

  2. I used no SDK for runtime, the inference code is relatively straightforward to write and it allows me to minify it as much as needed to fit the model in the submission.


Thanks!
I tried the "AlphaZero approach" for a contest a few years ago (the Life N Death game), but I did not have the time to investigate why the training did not clearly improve play. Not knowing why was particularly frustrating.
Your successful attempt makes me want to try it again ^^
