
LoL Minimap

[Video: short clip showing the minimap model's performance]

The Goal

PandaScore is the provider of static and real-time data for eSports. We cover a range of video games and tournaments, converting live in-game action into usable data for our customers.

A core part of our work involves deep learning and computer vision: we take video streams of live eSports matches and convert them into data describing what is happening in the game.

The League of Legends (LoL) minimap is a great example of this work. For this particular task, our specific goal was to build an algorithm that can 'watch' the minimap, and output the (x, y) coordinates of each player on the minimap.

The Problem

In the deep learning literature, the problem of locating or following objects in images is generally referred to as object detection or tracking.

On the surface, our particular minimap problem appears as though it could be easily solved with detection models such as YOLO or SSD. We would just need to label a large dataset of minimap crops with the positions of each champion, and then pass this dataset to one of these algorithms.

Indeed, this was the approach we tried first. Drawing on previous work on the LoL minimap problem done by Farzain Majeed in his DeepLeague project, we trained an SSD-style model on Farza's DeepLeague100K dataset, and found it to work exceptionally well on a held-out test set from his dataset.

There was one major problem with this approach, however: the model did not generalise to champions absent from the DeepLeague100K dataset it was trained on. We needed a model that would work for any champion a player happens to choose, so we identified three possible routes to resolving this issue. The options were:

  1. Manually annotate a lot more training data, using manually-created video frames covering all champions.

  2. Train a model to detect the positions of any champion on the minimap, then feed the detections from this model to a classifier model covering all champions.

  3. Train some sort of model on the raw champion 'portraits' - the raw portrait images of each champion that the icons on the minimap are based on - and somehow transfer this model to work in detecting the champions on real minimap frames.

We ruled out approach 1 early on, as it would be time-consuming and would require a lot of extra work each time a new champion is released.

We experimented with approach 2 for some time, but found that it posed its own challenges: the detections needed to be very reliable, since they were fed to a classifier downstream, and building a model that makes exactly ten unique, non-overlapping detections proved difficult in itself.

As a result (omitting quite some detail here for brevity) we were left with approach 3. We describe this approach in more detail in the next section.

The Approach

The final approach we arrived at relied on a classifier trained only on the raw champion portraits. Since portraits exist for every champion, training exclusively on them meant we could be (more or less) certain the classifier would not favour the champions that happen to appear in our minimap training data.

The general idea here is to train a classifier on heavily-augmented versions of the raw champion portraits. We could then 'slide' this trained classifier over minimap frames, resulting in a grid of predictions. At each square in this grid, we could extract the detection probabilities for each of the 10 champions we know are being played in the game. These detection grids could then be fed to a second, champion-agnostic model that would learn to clean them up and output the correct (x, y) coordinates (or perhaps just the argmax position) for each detected champion (more on this later).
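To make this flow concrete, here is a minimal, hypothetical sketch of the pipeline; the names (`classifier`, `cleanup_model`, `champion_ids`) are illustrative assumptions rather than our actual code.

```python
def detect_positions(minimap_frame, champion_ids, classifier, cleanup_model):
    """Slide the portrait classifier over a minimap frame, then clean up the grid."""
    # Fully-convolutional classifier: one probability vector per grid position.
    prob_grid = classifier(minimap_frame)          # (H_grid, W_grid, n_champions + 1)
    # Keep only the channels for the ten champions known to be in this game.
    detection_maps = prob_grid[..., champion_ids]  # (H_grid, W_grid, 10)
    # A second, champion-agnostic model turns the noisy grids into coordinates.
    coords = cleanup_model(detection_maps)         # (10, 2): one (x, y) per champion
    return coords
```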

For the classifier, however, we found that standard (albeit heavy) augmentation was insufficient to train a model on raw champion portraits that could reliably generalise to the champions as they appear on the minimap. We needed augmentations that could transform the raw portraits so that they looked like they do on the minimap.

** TODO: ADD HERE AN IMAGE SHOWING A RAW PORTRAIT VERSUS WHAT A HERO LOOKS LIKE ON THE MINIMAP **

On the minimap, LoL champions appear with a blue or red circle around them. Explosions, pings, and other artefacts can also obscure the portraits. We experimented with crudely adding such artefacts by hand, but found that the most effective approach was to learn a model that could add them for us. We achieved this with a Generative Adversarial Network (GAN). In short, GANs are a neural-network-based approach for learning a model that can generate data from a desired distribution (in our case, we essentially want to generate explosions, pings, and other artefacts to add to the raw champion portraits). More info on GANs can be found here.

Training the GAN

Our particular use of GANs differs somewhat from the usual setup. Rather than generating the champion image in its minimap environment, in our case we were interested in generating masks to add to our raw champion portrait. The discriminator of the GAN would thus see the raw champion portrait plus the mask, and the generator would have to learn to change its mask such that this combination looks real. This is illustrated in the diagram below.

Diagram showing our GAN setup

Our generator outputs two things:

  • The main output is a mask of the same dimensions (HxWx3) as the raw champion portrait, which we add to that portrait.

  • To improve the generator's ability to make the raw portraits look like varied minimap crops, it also outputs a 3x1 vector w_c, and a 3x1 vector b_c. The raw champion portrait has its colour channels multiplied by w_c and has b_c added to it. Through this, the generator can also learn to generate some colour shifting/scaling artefacts.
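To make this concrete, here is a minimal PyTorch-style sketch of how the generator's outputs might be applied to a portrait before it is shown to the discriminator. The shapes and names (`generator`, `augment_portrait`, 24x24 portraits) are illustrative assumptions, not our exact implementation.

```python
import torch

def augment_portrait(generator, portrait, z):
    """Apply the generator's mask and per-channel colour transform to a raw portrait.

    portrait: (3, 24, 24) tensor in [0, 1]; z: latent noise vector (assumed).
    """
    mask, w_c, b_c = generator(z, portrait)                         # mask: (3, 24, 24); w_c, b_c: (3,)
    recoloured = portrait * w_c.view(3, 1, 1) + b_c.view(3, 1, 1)   # colour scaling/shifting
    return recoloured + mask                                        # additive mask on top

# The discriminator then compares this combination against real minimap crops:
# d_fake = discriminator(augment_portrait(generator, portrait, torch.randn(latent_dim)))
# d_real = discriminator(real_minimap_crop)
```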

Training GANs is a notoriously unstable process and our case was no exception. We found the Least Squares GAN loss to improve training stability, along with weight normalisation, and the two tricks described in section 3 and section 4.2 of the Progressive GANs paper.
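For reference, the least-squares GAN objective replaces the usual cross-entropy terms with squared errors. A minimal sketch (using the common 0/1 target labels) looks like:

```python
def lsgan_discriminator_loss(d_real, d_fake):
    # Push the discriminator's outputs towards 1 on real samples and 0 on fakes.
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_generator_loss(d_fake):
    # Push the discriminator's outputs on generated samples towards 1.
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```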

Training the Classifier

We now had a trained generator that was capable of producing masks that, when added to any raw champion portrait, would take us to a distribution of images that look (somewhat) like how that champion might appear on the minimap. We could thus train a classifier on this distribution, in the hopes that it would also work for detecting champions on real minimap frames.

The below diagram illustrates the training setup for this classifier:

Diagram showing our classifier setup

This step is quite simple: we train an ordinary convolutional neural network classifier C on our raw champion portraits, augmented with the GAN-generated masks. We use a shallow, wide classifier network with lots of dropout to prevent overfitting to the GAN-style data.
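A rough sketch of one such training step, assuming the frozen generator from the previous section and a standard cross-entropy objective (batch shapes and the latent dimension are illustrative):

```python
import torch
import torch.nn.functional as F

def classifier_training_step(classifier, generator, optimiser, portraits, labels):
    """One step of training C on GAN-augmented raw portraits."""
    with torch.no_grad():                                   # generator weights stay frozen
        z = torch.randn(portraits.size(0), 64)              # assumed latent dimension
        mask, w_c, b_c = generator(z, portraits)
        batch = portraits * w_c.view(-1, 3, 1, 1) + b_c.view(-1, 3, 1, 1) + mask
    logits = classifier(batch).flatten(1)                   # (B, n_champions + 1)
    loss = F.cross_entropy(logits, labels)                  # labels include the background class
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```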

Calculating the detection maps

Our classifier is a fully-convolutional network that takes 24x24x3 champion images as input and outputs a 1x1x(NumChampions + 1) tensor, which we pass through a softmax nonlinearity to estimate class probabilities (the additional output channel is for a background class; we trained our classifier to also detect random patches of minimap with no champion and output a high 'background' probability).

If we pass an entire minimap crop of size 296x296x3 to this classifier, the output shape grows from 1x1x(NumChampions+1) to 12x12x(NumChampions+1). We can increase the spatial granularity of this output by reducing the stride of the classifier's final two layers (a conv layer followed by an average pooling layer) from 2 to 1 (we later found that this trick had already been applied, e.g. in this work). After reducing the stride, we get an output tensor of size 70x70x(NumChampions+1). Taking the softmax over the final dimension gives a tensor representing the detection probabilities of each champion at each of these 70x70 positions on the minimap.
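As a rough sketch of the idea (our real network's layer counts and widths differ), a fully-convolutional classifier whose last conv and average-pooling layers take a configurable stride could look like this; the same weights are trained with stride 2 on 24x24 crops and then run with stride 1 over full minimap crops:

```python
import torch.nn as nn

def build_classifier(n_champions, final_stride=2):
    """Fully-convolutional classifier: a 24x24x3 crop maps to 1x1x(n_champions + 1)."""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # 24 -> 12
        nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 12 -> 6
        nn.ReLU(),
        nn.Dropout2d(0.5),
        # The final conv + average pool are the two layers whose stride we reduce.
        nn.Conv2d(128, n_champions + 1, kernel_size=3, stride=final_stride, padding=1),
        nn.AvgPool2d(kernel_size=3, stride=final_stride),
    )

# Train with build_classifier(n, final_stride=2) on 24x24 portrait crops, then copy the
# weights into build_classifier(n, final_stride=1) and run it over a 296x296 minimap
# crop to obtain a much denser grid of per-position class logits.
```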

Diagram showing the procedure for producing the detection maps

We slice out these 'detection maps' - as shown above - for each of the ten champions present in the current game. We also take the detection map for the background class. This 70x70x11 tensor then serves as the input to the final stage in our minimap model - a convolutional LSTM sequence model.

Training the sequence model

Very often, when champions are close to one another, one champion's icon on the minimap will cover another's. This poses a problem for the classifier from the previous step, which cannot detect a champion whose icon is covered. To address this, we enlisted a sequence model. The idea is that the sequence model has some 'memory' of where each champion was last seen: if a champion suddenly disappears while another champion is nearby, it can 'assume' that the missing champion has probably just moved behind the nearby one.

Diagram illustrating the sequence model architecture

The above diagram presents the architecture of our sequence model. We take the 11 detection maps extracted per the previous section, and pass each independently through the same convnet, which reduces their resolution and extracts relevant information. A low resolution copy of the minimap crop itself is also passed through a separate convnet, the idea being that some low-resolution features about what is going on in the game might also be useful (e.g. if there is a lot of action, then non-detected champions are likely just hidden amongst that action).

The minimap and detection map features extracted from these convnets are then stacked into a single tensor of shape 35x35xF, where F is the total number of features. We call this tensor r_t in the above diagram, as we have one of these tensors at each time step. These r_t are then fed sequentially into a convolutional LSTM (see this paper for conv-LSTM implementation details). We found switching from a regular LSTM to a convolutional LSTM to be hugely effective. Presumably, this was because the regular LSTM needed to learn the same 'algorithm' for each location on the minimap, whereas the conv-LSTM allowed this to be shared across locations.
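A condensed sketch of how these pieces might fit together at each time step; `conv_lstm_cell` stands in for a convolutional LSTM implementation along the lines of the paper linked above, and all sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SequenceFeatureExtractor(nn.Module):
    """Builds r_t from the 11 detection maps and a low-resolution minimap crop."""

    def __init__(self, conv_lstm_cell, map_feats=4, minimap_feats=8):
        super().__init__()
        # Shared convnet applied independently to each 70x70 detection map.
        self.map_net = nn.Sequential(
            nn.Conv2d(1, map_feats, kernel_size=3, stride=2, padding=1),      # 70 -> 35
            nn.ReLU(),
        )
        # Separate convnet for the low-resolution minimap crop itself.
        self.minimap_net = nn.Sequential(
            nn.Conv2d(3, minimap_feats, kernel_size=3, stride=2, padding=1),  # 70 -> 35
            nn.ReLU(),
        )
        self.conv_lstm_cell = conv_lstm_cell  # e.g. a conv-LSTM cell per the linked paper

    def step(self, detection_maps, minimap_lowres, state):
        # detection_maps: (B, 11, 70, 70); minimap_lowres: (B, 3, 70, 70)
        per_map = [self.map_net(detection_maps[:, i:i + 1])
                   for i in range(detection_maps.size(1))]
        minimap_features = self.minimap_net(minimap_lowres)
        r_t = torch.cat(per_map + [minimap_features], dim=1)  # (B, F, 35, 35)
        return self.conv_lstm_cell(r_t, state)                # new hidden/cell state
```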

At each time step, each of the convolutional LSTM's 10 output channels (one for each champion) is passed through the same dense (fully-connected) layer. This then outputs x and y coordinates for each champion. The mean squared error (MSE) between the output and target coordinates is then backpropagated to the weights of this network. The model converges after 6 or so hours of training on a single GPU.
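The coordinate head could then be sketched as below, with the same dense layer reused for every champion channel; shapes and names are again illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class CoordinateHead(nn.Module):
    """Maps each champion's 35x35 conv-LSTM output channel to an (x, y) pair."""

    def __init__(self, spatial=35):
        super().__init__()
        self.fc = nn.Linear(spatial * spatial, 2)  # shared weights for all ten champions

    def forward(self, lstm_out):
        # lstm_out: (B, 10, 35, 35) -> coords: (B, 10, 2)
        b, n, h, w = lstm_out.shape
        return self.fc(lstm_out.reshape(b * n, h * w)).reshape(b, n, 2)

def coordinate_loss(pred_coords, target_coords):
    # Mean squared error between predicted and ground-truth (x, y) positions.
    return F.mse_loss(pred_coords, target_coords)
```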

Results

We are still evaluating our network more rigorously before moving it into production. However, results on our in-house test set suggest that more than 95% of all detections fall within a 20-pixel radius of the target. Out of interest, we also tested the necessity of the GAN augmentation, and found performance to be substantially degraded when using standard augmentation alone, as opposed to augmenting with the GAN-generated masks. So it seems all our GAN training was not for nothing :)

This article is quite light on implementation details, and we're sure some of our more technical readers will want to know more. Please don't hesitate to ask questions in the comments, or on the /r/MachineLearning subreddit.
