@SannaPersson
Created March 21, 2021 10:47

Implementing and training YOLOv3 from scratch in PyTorch

For such a popular paper, there are still surprisingly few explained implementations of the YOLOv3 architecture completely from scratch. I'll do my best to add something useful to the list. The code is written in collaboration with Aladdin Persson and can be found on GitHub. You can also download pretrained weights for the implementation below, trained on Pascal VOC and obtaining 78.1 mAP, here.

Prerequisites:

  • Understanding the major parts of YOLOv1
  • Coding in PyTorch
  • Familiarity with convolutional networks and their training

With this article I hope to convey:

  • Understanding of the key ideas necessary for implementing and training YOLOv3 from scratch in PyTorch
  • Complete code to use for training of YOLOv3
  • The relevant details of the algorithm to succeed if you choose to make your own implementation of YOLOv3

The code is completely runnable if you download the utils.py and config.py files from the GitHub repository above, which contain a few supporting functions and constants not specific to the YOLOv3 model.

Disclaimer: there are minor differences between this implementation and the original, and I will point them out when we get to them. If you find anything else that seems sketchy or like a bug, let me know!

Understanding the model

The YOLO algorithm consists of three main parts: the data loading, the Darknet model and the YOLO loss function. It is based on the idea that we divide the image into a grid with side $S$. Depending on the YOLO version as well as the image size, this grid size will differ. Each grid cell is responsible for predicting the bounding boxes of the objects whose midpoints fall in that cell. This means that only one grid cell is responsible for each object's bounding box in the image. One drawback of this is that there can only be one bounding box in each grid cell. From YOLOv2 onward this issue is mitigated by the introduction of anchor boxes, an idea also seen in previous object detection papers such as Faster R-CNN.

An anchor box is essentially a pair of a width and a height chosen to represent a segment of the training data. For example, a tall rectangle may suit a human while a wide rectangle is a better fit for a car. In YOLO, the anchor boxes are found with K-means clustering, which yields better results than hand-picking them. As the name suggests, the anchors give the model something to anchor its predictions to: the model predicts how much the true bounding box is offset relative to the anchor, which is a difference from YOLOv1. Each grid cell now has several anchor boxes, and each anchor box can make one bounding box prediction. Each bounding box is coupled with an object score as well as class predictions. The object score should reflect the product of the probability that there is an object in the bounding box and the intersection over union between the predicted bounding box and the actual object. That means that if there is no object in the grid cell corresponding to the specific anchor, the target is zero; otherwise it is the intersection over union between the predicted box and the target bounding box.

The predictions from the model $t_i$ are offsets to the anchors and will be converted to bounding boxes according to the following equations

$$
\begin{aligned}
b_{x} &= \sigma\left(t_{x}\right) + c_{x} \\
b_{y} &= \sigma\left(t_{y}\right) + c_{y} \\
b_{w} &= p_{w} e^{t_{w}} \\
b_{h} &= p_{h} e^{t_{h}}
\end{aligned}
$$

where $p_w$ and $p_h$ are the width and height of the corresponding anchor box, $(c_x, c_y)$ is the offset of the grid cell from the top-left corner, and $(b_x, b_y, b_w, b_h)$ is the resulting bounding box.
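To make this concrete, here is a minimal sketch of how such a conversion could be written in PyTorch. The function name and argument shapes are my own choices, and it assumes the anchor widths and heights have already been scaled to grid-cell units; the conversion actually used later lives in utils.py and may differ in details.

```python
import torch

def decode_offsets(t, anchors, S):
    """Convert raw offsets t = (t_x, t_y, t_w, t_h) of shape (batch, num_anchors, S, S, 4)
    into boxes (b_x, b_y, b_w, b_h) in grid-cell coordinates.
    `anchors` has shape (num_anchors, 2) with widths/heights in grid-cell units."""
    # c_x / c_y: the column / row index of each grid cell
    cell_indices = torch.arange(S).repeat(t.shape[0], t.shape[1], S, 1).unsqueeze(-1)
    b_x = torch.sigmoid(t[..., 0:1]) + cell_indices                          # sigma(t_x) + c_x
    b_y = torch.sigmoid(t[..., 1:2]) + cell_indices.permute(0, 1, 3, 2, 4)   # sigma(t_y) + c_y
    b_wh = anchors.reshape(1, -1, 1, 1, 2) * torch.exp(t[..., 2:4])          # p * e^t
    return torch.cat([b_x, b_y, b_wh], dim=-1)
```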

In YOLOv3 the backbone network is updated to Darknet-53, and its structure can easily be understood from the following table. This network was pretrained on ImageNet and is used as a feature extractor in the YOLOv3 model. The paper, however, completely skips detailing the following 53 convolutional layers of the YOLOv3 model, where the actual prediction of bounding boxes takes place.

[Table: the Darknet-53 architecture]

The prediction of bounding boxes happens at three different places in the network, on three different scales. In this context, a scale means the grid size $S$ that we divide the image into. We predict on three different grid sizes in three different parts of the model. The intuition behind this is that larger objects are more easily detected on a coarser grid and, vice versa, smaller objects on finer grids. We therefore also divide the anchor boxes we have found, assigning the smallest anchors to the last and finest scale and the largest anchor boxes to the first, coarsest grid. In YOLOv3 the grid sizes used are [13, 26, 52] for an image size of 416. If you use another image size, the first grid size will be the image size divided by 32 and each of the following ones will be double the previous one. The intricacies of the model will become clear when we implement it, but the following image by Ayoosh Kathuria (check out his Medium) gives great insight into the model architecture.

[Figure: YOLOv3 architecture diagram by Ayoosh Kathuria]

The backbone network is a standard convolutional network, quite similar to previous Darknet versions with the addition of residual connections. It is really after layer 53 that the interesting parts happen. As the image visualizes, there are three downward paths corresponding to predictions on the three different grid scales. After each prediction path, the network continues forward from the point it was at before the prediction path. After the first and second scale prediction paths there is an upsampling layer that doubles the size of the feature map, and the result is concatenated with a route from a previous layer along the channel dimension. The image details which convolutional layers the routes come from, but we will instead use a trick to find them in our implementation.

We are now ready to start actually coding the model. All model details are found in the configuration file for YOLOv3 on the GitHub of Joseph Redmon, the author of the paper.

Coding the model

This is the part of the YOLOv3 implementation that I spent the least time writing and the most time debugging. I found it manageable to make the model work, but it took some time to get the details right so that the original weights could be loaded.
Everything in this section will be in a model.py file on Github. Let's start with the imports:

https://gist.github.com/5258e7895c1a1c63ff331425fbe9f244

First we will define the architecture's building blocks in a list, as a way of parsing the original config file that greatly increases the readability and gives a better grasp of the complete model.

https://gist.github.com/485ccf4712775dfa855ada6be9edeb21

Defining the building blocks

We will now define the most common building blocks of the architecture as separate classes to avoid repeating code over and over again. Each tuple in the config list signifies a convolutional block with batch normalization and leaky ReLU added to it.

https://gist.github.com/78e88914f22e0fda3963f13536cff74e

This layer also allows us to toggle bn_act to False and skip the batch normalization and activation function, which we will use in the last layer before the output. When we use batch normalization, the bias term of the convolutional layer has no effect but still occupies VRAM, so we disable it in that case.
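If you prefer to see the idea before opening the gist, a minimal sketch of such a block could look as follows (the linked gist above is the version actually used and may differ slightly):

```python
import torch.nn as nn

class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, bn_act=True, **kwargs):
        super().__init__()
        # the bias is redundant when batch norm follows, so only use it when bn_act=False
        self.conv = nn.Conv2d(in_channels, out_channels, bias=not bn_act, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)
        self.leaky = nn.LeakyReLU(0.1)
        self.use_bn_act = bn_act

    def forward(self, x):
        if self.use_bn_act:
            return self.leaky(self.bn(self.conv(x)))
        return self.conv(x)
```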

We then define the residual block, which is essentially a combination of two convolutional blocks with a residual connection. The number of channels is halved in the first convolutional layer and then doubled again in the second, so the input size is maintained through the residual block. As in the CNNBlock, we have an argument that allows us to skip the residual connection, which we will use in parts of the architecture.

https://gist.github.com/ae9850499e199ffb804a785684032db7
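A corresponding sketch of the residual block, reusing the CNNBlock sketched above (again, defer to the gist for the exact version):

```python
class ResidualBlock(nn.Module):
    def __init__(self, channels, use_residual=True, num_repeats=1):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(num_repeats):
            # channels are halved and then restored, so input and output shapes match
            self.layers.append(
                nn.Sequential(
                    CNNBlock(channels, channels // 2, kernel_size=1),
                    CNNBlock(channels // 2, channels, kernel_size=3, padding=1),
                )
            )
        self.use_residual = use_residual
        self.num_repeats = num_repeats

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x) if self.use_residual else layer(x)
        return x
```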

The last predefined block we will use is the ScalePrediction, which consists of the last two convolutional layers leading up to the prediction for each scale. Here the image above is a bit misleading: this block actually includes the downward path except for the loss function computation. We will reshape the output such that it has the shape (batch size, anchors per scale, grid size, grid size, 5 + number of classes), where 5 refers to the object score and the four bounding box coordinates.

https://gist.github.com/58fd829bb48969f31b7694272daa89a0
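For reference, a sketch of what this prediction head can look like, again building on the CNNBlock above and assuming three anchors per scale:

```python
class ScalePrediction(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pred = nn.Sequential(
            CNNBlock(in_channels, 2 * in_channels, kernel_size=3, padding=1),
            # last layer: no batch norm / activation, 3 anchors * (5 + num_classes) channels
            CNNBlock(2 * in_channels, 3 * (num_classes + 5), bn_act=False, kernel_size=1),
        )
        self.num_classes = num_classes

    def forward(self, x):
        return (
            self.pred(x)
            .reshape(x.shape[0], 3, self.num_classes + 5, x.shape[2], x.shape[3])
            .permute(0, 1, 3, 4, 2)  # -> (batch, anchors per scale, S, S, 5 + num_classes)
        )
```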

Putting it together in YOLOv3

We will now put it all together into the YOLOv3 model for the detection task. Most of the action takes place in the _create_conv_layers function, where we build the model using the blocks defined above. Essentially we just loop through the config list we created above and add the blocks in the correct order. The trickiest part is the case where there is an "S" in the config list, which means that we are at the last layers leading up to a prediction on a specific scale. In these cases we have three convolutional layers (one residual block and one convolutional block) following the same pattern on all prediction scales. To avoid creating a mess in the config list, it is easiest to just add them here.

It should also be noted that we triple the in_channels after we add the upsampling layer. This is due to the route that we concatenate in the forward propagation, which has twice as many channels as the output from the upsampling layer.

This leads us into the structure of the forward function. When the next layer is a ScalePrediction block, we append the output to a list and later compute the loss for each of the predictions on the different scales separately. The second if-statement takes care of finding the specific route layers shown in the image above without us keeping track of unnecessarily complicated indices: the two routes are the outputs from the residual blocks in the config list that have 8 repeats. When we encounter an upsampling layer, we concatenate the output with the last route previously found, following the figure above.

https://gist.github.com/5fc1043764ef6e1d6a99de7f068e48df

Before we move on to the data loading I'll add a test function below that acts as a sanity check that the model at least outputs the correct shapes.

https://gist.github.com/cb17429c919fac0239a5b18f5082bbfc

Loading the data

In the dataset class we will load an image and the corresponding bounding boxes, perform augmentation using the Albumentations library, and then create the matrix form of the target that will be used to compute the loss. We mentioned earlier that each scale has anchor boxes associated with it, and in the data loading we have to compute which anchor should be responsible for each target bounding box. Everything in this section will be in a dataset.py file.

Imports

Most of the imports we use are standard for a dataset class in PyTorch, with the addition of the albumentations package for data augmentation. The imports from utils, however, require some explanation. In a utils.py file that you can find on GitHub, we store some important functions for handling bounding box conversions, non-max suppression and mean average precision. The only function we use in the data loading is the intersection over union function, which takes as input two tensors with the widths and heights of bounding boxes and outputs the corresponding intersection over union. The other utils imports are only for checking that the data loading actually works: plotting images and bounding boxes each time you modify the dataset class or augmentations can save you a lot of debugging time.
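As a rough sketch of what that width/height intersection over union looks like (the function in utils.py may differ in name and details), the boxes are compared as if they shared the same midpoint:

```python
import torch

def iou_width_height(boxes1, boxes2):
    """IoU between boxes given only as (width, height) pairs,
    i.e. assuming they share the same midpoint."""
    intersection = torch.min(boxes1[..., 0], boxes2[..., 0]) * torch.min(
        boxes1[..., 1], boxes2[..., 1]
    )
    union = (
        boxes1[..., 0] * boxes1[..., 1] + boxes2[..., 0] * boxes2[..., 1] - intersection
    )
    return intersection / union
```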

https://gist.github.com/54370f8d5a088c816ccd8f7f014d737f

Data format

The part of the data loading that differs from image classification is the way we process the bounding boxes and format them such that they can be input to the model. This dataset class assumes that the data is formatted such that you have a folder with all images, a folder with a text file for each image detailing its bounding boxes, and one or several csv files for the train, development and test sets. The text file should be formatted such that each row corresponds to one bounding box in the image, with class label, x coordinate, y coordinate, width and height in that specific order. The bounding box coordinates should be relative to the image, so if an object has its midpoint in the middle of the image and covers half of it in both width and height, we would specify: class_label 0.5 0.5 0.5 0.5 on a row in the text file. In the csv file you specify the image file name and the text file name in two different columns.

If you just want to get started without having to format the data you can download the Pascal-VOC dataset from here (link to Kaggle dataset) where the data is already formatted.

Even if your dataset is not formatted this way it should be manageable to modify the data loading such that you can still make the training labels the same way.

Dataset class overview

In a PyTorch dataset there are three building blocks: the init-method, the dataset length and the getitem-method.

The important part of the init-method is how we handle the anchor boxes. We will specify the anchor boxes in the following manner https://gist.github.com/dc65120b8fc1d1506e7f6183596d59c0

where each tuple corresponds to the width and the height of an anchor box relative to the image size, and each list grouping together three tuples corresponds to the anchors used on a specific prediction scale. The first list contains the largest anchor boxes, which will be used for prediction on the coarsest grid, where it is presumably easier to predict larger bounding boxes. The following lists, containing medium and small anchor boxes, will be used for the medium and finest grids following the same reasoning. The anchors above are the ones used in the original paper but have been scaled to be relative to the image size.

If your dataset is very different from MS COCO you would probably generate your own anchor boxes, and then it is wise to assign the anchor boxes to the different scales by their size, as was done in the paper. In this case you would collect the widths and heights of the bounding boxes in your dataset and run these through K-means clustering with the intersection over union as the distance measure. The resulting centroids would be your anchor boxes. A sketch of this clustering is shown below.
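This clustering is not part of the linked code, but a rough sketch of the idea, using 1 - IoU as the distance between a box and a centroid, could look like this:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs, relative to the image, into k anchors
    using 1 - IoU as the distance measure (boxes are assumed to share midpoints)."""
    wh = np.asarray(wh, dtype=np.float64)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]   # random initial centroids
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * np.minimum(
            wh[:, None, 1], anchors[None, :, 1]
        )
        union = wh[:, None, 0] * wh[:, None, 1] + anchors[None, :, 0] * anchors[None, :, 1] - inter
        assignment = np.argmax(inter / union, axis=1)     # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assignment == j):
                anchors[j] = wh[assignment == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]      # sorted by area, smallest first
```

The resulting anchors would then be split into three groups of three and assigned to the scales by size, with the largest group going to the coarsest grid.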

In the init-method we just combine the lists above into a tensor of shape (9, 2) corresponding to the anchor boxes on all scales. We also specify an ignore threshold, which is used when building the targets, as explained below.

The second challenging part of the data loading is the getitem-method, where we load the image and the corresponding text file with the bounding boxes and process them such that we can input them to the model. For data augmentation we use the Albumentations library, which requires the image and bounding boxes to be numpy arrays. The bounding boxes are also expected to be in the format [x, y, width, height, class label], which differs from how we formatted them in the text files, so we use np.roll to change this. The reason for this inconsistency is that the text files are structured the same way as in the original implementation; if you are formatting a custom dataset you may consider changing this if you are also using Albumentations.
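Concretely, the loading and reordering step might look like this (the file name is only an example):

```python
import numpy as np

# each row in the label file is: class_label, x, y, width, height
bboxes = np.loadtxt("000001.txt", delimiter=" ", ndmin=2)
# Albumentations expects [x, y, width, height, class_label], so roll the class label to the end
bboxes = np.roll(bboxes, shift=4, axis=1).tolist()
```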

Here it should be noted that if you download the Pascal VOC or MS COCO dataset from the official sites or from Joseph Redmon's website, you may run into some out-of-range issues when using Albumentations, depending on how you convert the labels to the format x, y, width, height, where (x, y) is the object's midpoint. If you do, make sure you have converted the labels as specified in this GitHub issue: link, and you will save a couple of hours of debugging.

Building targets

When we load the labels for a specific image, they are only an array with all the bounding boxes, and to be able to calculate the loss we want to format the targets similarly to the model output. The model outputs predictions on three different scales, so we also build three different targets. Each target for a particular scale and image has shape (number of anchors // 3, grid size, grid size, 6), where 6 corresponds to the object score, the four bounding box coordinates and the class label. We make two assumptions: that there is only one label per bounding box and that there is an equal number of anchor boxes on each scale. We start by initializing the three target tensors to zeros with targets = [torch.zeros((self.num_anchors // 3, S, S, 6)) for S in self.S], where self.S is a list with the different grid sizes, e.g. for an image size of 416x416 we have S = [13, 26, 52], or more generally S = [image_size // 32, image_size // 16, image_size // 8], since at the prediction stage the feature map will have been downscaled by the factors in the denominators.

The next step is to loop through all the bounding boxes in this particular image. If you have a lot of bounding boxes this will be quite expensive, but I haven't yet figured out a way to remove this step without taking shortcuts when assigning the anchor boxes. Let me know if you have any ideas for how to optimize it! We then compute the intersection over union between the target's width and height and all the anchor boxes, and sort the result such that the index of the anchor with the largest intersection over union with the target box appears first in the list. https://gist.github.com/a5274f34bf0c843cc2afe589b10b0505

We then loop through the nine indices to assign the target to the best anchors. Our goal is to assign each target bounding box to an anchor on each scale, i.e. in total to one anchor in each of the target matrices we initialized above. In addition, we check whether an anchor that is not the most suitable for the bounding box still has an intersection over union higher than 0.5, as specified in ignore_iou_thresh; if so, we mark this target such that no loss is incurred for the prediction of this anchor box. From my understanding, the reasoning behind this is that during inference this anchor could also make valid predictions on similar objects, and non-max suppression will remove surplus bounding boxes. We first compute which cell the bounding box belongs to with i, j = int(S * y), int(S * x) and then check whether the anchor we are currently at is already taken in this cell with anchor_taken = targets[scale_idx][anchor_on_scale, i, j, 0]. As you can probably imagine, it is relatively uncommon for most datasets to have two objects with midpoints in the same cell of such similar size that they fit the same anchor box; however, if you run this through a couple of hundred examples you'll notice it occurs several times on, for example, the Pascal VOC dataset. In addition to checking whether the particular anchor is taken, we also check whether the current bounding box already has an anchor on this particular prediction scale. We only want one target on each scale to allow for specialization between the anchor boxes, such that they focus on predicting different kinds of objects.

If we find an anchor that is unoccupied and our current bounding box does not yet have an anchor on the scale which the anchor belongs to, we assign this anchor to the bounding box. First we set the object score of this anchor to 1 with targets[scale_idx][anchor_on_scale, i, j, 0] = 1 to indicate that there is an object in this cell. We then compute the box coordinates relative to the cell, such that the midpoint (x, y) states where in the cell the object is and the width and height correspond to how many cells the bounding box covers. This is computed by: https://gist.github.com/0660be0880179bf4a12b8ac7dc2840c9

We then add the bounding box coordinates as well as the class label to the cell and anchor box indicated by i, j and anchor_on_scale respectively. Lastly we update the flag has_anchor[scale_idx] to True to indicate that the particular prediction scale now has an anchor for this bounding box. Missing this line will lead to assigning each bounding box even to the worst possible anchors on all scales.

Doing the data loading only in the way described above would be sufficient. In the YOLOv3 paper, however, they also check whether the anchor we are currently at has an intersection over union greater than ignore_iou_thresh = 0.5, and if so they do not incur loss for this anchor's prediction. We do this by setting the object score of the anchor in the object's cell to -1, i.e. targets[scale_idx][anchor_on_scale, i, j, 0] = -1. In the loss function we will later make sure that no loss is incurred for these anchors.

Below is the complete dataset class.

https://gist.github.com/53a7b2db4c3551436c46187c436b622d

To make sure that the data loading works, it is beneficial to plot a few examples with the augmentations and bounding boxes applied to them. The code below should do the trick, possibly with some modifications depending on how you structure the data.

https://gist.github.com/931e30ba7c7c9881630230a4ccac0f7e

YOLOv3 loss function

In the original YOLO paper the author states the loss function, and the same expression circulates in articles on YOLOv2 and v3, which is something of a simplification compared to the actual implementation. If you are familiar with the original YOLO loss you will recognize all the parts below, but they are tweaked to match the idea of anchor boxes. The loss function can be divided into four parts; I will go through each separately and then combine them at the end.

First we form two binary tensors signifying in which cells, and for which anchors, an object is assigned and where it is not. https://gist.github.com/917bef5d2c3b78ca2782f41d26bfd40e

The reason for using two masks rather than the complement of one is that in the data loading we set the anchors we should ignore to -1. Indexing only with the masks above in all parts of the loss function makes sure that we do not incur any loss on these anchors. Don't worry if the formulas below confuse you; I have just translated the code for those who find it easier to understand the loss in that format.

No object loss

For the anchors in all cells that do not have an object assigned to them, i.e. all indices that are set to one in noobj, we want to incur loss only for their object score. The target is all zeros, since we want these anchors to predict an object score of zero; we apply a sigmoid function to the network outputs and use a binary cross-entropy loss. In code we have that https://gist.github.com/88be8181592bc2855b8b4d5b264109e9

where self.bce refers to an instance of PyTorch's BCEWithLogitsLoss(), which applies the sigmoid function and then calculates the binary cross-entropy loss.

In mathematics we have that

$$
\begin{aligned}
L_{noobj} &= \frac{1}{N \sum_{a,i,j} \mathbb{1}_{a,i,j}^{\text{noobj}}} \sum_{n=1}^{N} \sum_{a,i,j \,\in\, \mathbb{1}_{a,i,j}^{\text{noobj}}} \mathrm{BCE}\left( y_{n,a,i,j}^{obj}, \sigma\left(t_{n,a,i,j}^{obj}\right)\right) \\
&= \frac{1}{N \sum_{a,i,j} \mathbb{1}_{a,i,j}^{\text{noobj}}} \sum_{n=1}^{N} \sum_{a,i,j \,\in\, \mathbb{1}_{a,i,j}^{\text{noobj}}} -\left[ y_{n,a,i,j}^{obj} \cdot \log \sigma\left(t_{n,a,i,j}^{obj}\right) + \left(1 - y_{n,a,i,j}^{obj}\right) \cdot \log\left(1 - \sigma\left(t_{n,a,i,j}^{obj}\right)\right) \right]
\end{aligned}
$$

where $N$ is the batch size, $i, j$ signify the cell, $a$ is the anchor index, and $\mathbb{1}_{a,i,j}^{\text{noobj}}$ is a binary tensor with ones for anchors not assigned to an object. The output from the network is denoted $t$, the target $y$, and $\sigma$ is the sigmoid function given by

$$ \sigma(x) = \frac{1}{1+e^{-x}}. $$

Object loss

For the anchors that have an object assigned to them, we want the model to predict an appropriate bounding box for the object. When building the target tensors we set the object score of these anchors to 1. One idea is to then do similarly as in the no object loss and train the network to output large values in the cells and anchors for which we have assigned a target bounding box. This would, however, mean that no matter how poor a bounding box prediction the network makes, it would still try to predict a high object score. During inference we are guided by the object score when choosing which bounding boxes to output, and if we did as proposed, the object score would not actually reflect how likely it is that there is an object in the outputted bounding box. The idea in the YOLOv3 paper is instead that the object score the model predicts should reflect the intersection over union with the target bounding box. It is slightly unclear how this is actually implemented in the original, and I have seen several different versions in others' code. In our implementation we calculate, during training, the intersection over union between the target bounding boxes and the predicted bounding boxes in the output. This does not seem to slow down training noticeably.

In the code we convert the model predictions to bounding boxes according to the formulas in the paper

$$
\begin{aligned}
b_{x} &= \sigma\left(t_{x}\right) \\
b_{y} &= \sigma\left(t_{y}\right) \\
b_{w} &= p_{w} e^{t_{w}} \\
b_{h} &= p_{h} e^{t_{h}}
\end{aligned}
$$

where $p_w$ and $p_h$ are the anchor box dimensions and $(b_x, b_y, b_w, b_h)$ is the resulting bounding box relative to the cell. We then calculate the intersection over union with the target that we defined in the dataset class and lastly, as in the no object loss above, apply the binary cross-entropy loss between the object score predictions and the calculated intersection over union. Note that the loss is only applied to the anchors assigned to a target bounding box, which is ensured by indexing with obj.

https://gist.github.com/008f516f6885c8c97bba84f0a505f569

The mathematical formula will be similar to the one above

$$
L_{obj} = \frac{1}{N \sum_{a,i,j} \mathbb{1}_{a,i,j}^{\text{obj}}} \sum_{n=1}^{N} \sum_{a,i,j \,\in\, \mathbb{1}_{a,i,j}^{\text{obj}}} \mathrm{BCE}\left( \hat{y}_{n,a,i,j}^{obj}, \sigma\left(t_{n,a,i,j}^{obj}\right)\right)
$$

with

$$
\hat{y}^{obj} = \mathrm{IOU}\left(y^{box}, b\right)
$$

where $b$ is the bounding box computed above and $\mathbb{1}_{a,i,j}^{\text{obj}}$ corresponds to the binary tensor with ones for the anchors assigned to a target bounding box.

Box coordinates loss

For the box coordinates we simply use a mean squared error loss at the positions where there actually are objects. All predictions with no corresponding target bounding box are ignored. We apply a sigmoid function to the $x$ and $y$ coordinates to make sure that they are between 0 and 1, but instead of converting the predicted widths and heights as above, we compute the ground truth values $\hat{t}$ that the network should predict, which we find by

$$
\begin{aligned}
\hat{t}_w &= \log\left(y_w / p_w\right) \\
\hat{t}_h &= \log\left(y_h / p_h\right)
\end{aligned}
$$

where $y_w$ and $y_h$ are the target width and height. We then apply the mean squared error loss between the targets and predictions.
https://gist.github.com/f4f3163ca300d5888b094a90f92588ce

The equivalent formula is given by

$$
L_{box} = \frac{1}{N \sum_{a,i,j} \mathbb{1}_{a,i,j}^{\text{obj}}} \sum_{n=1}^{N} \sum_{a,i,j \,\in\, \mathbb{1}_{a,i,j}^{\text{obj}}} \left(\sigma(t^{x}_{n,a,i,j}) - y^{x}_{n,a,i,j}\right)^2 + \left(\sigma(t^{y}_{n,a,i,j}) - y^{y}_{n,a,i,j}\right)^2 + \left(t^{w}_{n,a,i,j} - \hat{t}^{w}_{n,a,i,j}\right)^2 + \left(t^{h}_{n,a,i,j} - \hat{t}^{h}_{n,a,i,j}\right)^2
$$

where $\hat{t}^{*}$ are the ground truth labels for the values the model should actually predict.

Class loss

We only incur loss for the class predictions where there actually is an object. Our implementation differs slightly from the paper's in the case of the class loss: we use a cross-entropy loss, which assumes that each bounding box has only one label. The YOLOv3 paper argues against this limitation and instead uses a binary cross-entropy loss, such that several labels can be assigned to a single object, e.g. woman and person. https://gist.github.com/c4e230917cebda8f48157414409374e2

where self.entropy refers to an instance of PyTorch's CrossEntropyLoss(), which combines the softmax function and the negative log-likelihood loss. This corresponds to

$$
L_{class} = \frac{1}{N \sum_{a,i,j} \mathbb{1}_{a,i,j}^{\text{obj}}} \sum_{n=1}^{N} \sum_{a,i,j \,\in\, \mathbb{1}_{a,i,j}^{\text{obj}}} -\log\left(\frac{\exp\left(t_{n,a,i,j}^{c}\right)}{\sum_{k} \exp\left(t_{n,a,i,j}^{k}\right)}\right)
$$

where $t_{n,a,i,j}^{c}$ is the prediction for the correct class $c$.

Total loss

I will not attempt to put the entire loss function into a single formula, as this would only create an unnecessarily complicated expression when each part can be understood and computed separately. The total loss is computed by https://gist.github.com/9e6320b3b3c24385816ce7728ec30e6d

or equivalently

$$
L = \lambda_{noobj} L_{noobj} + \lambda_{obj} L_{obj} + \lambda_{box} L_{box} + \lambda_{class} L_{class}
$$

where each $\lambda$ is a constant signifying the importance of that part of the loss. It seems that the original implementation uses $\lambda = 1$ for all constants, but during training we found better convergence by modifying them.

The complete code for the loss function is found below and the code is placed in a separate loss.py file.

https://gist.github.com/b045892f42a96274902bb349126d8c5c

Training the model

The training configuration is completely contained in the config.py file that can be found on GitHub. This is where we specify the image size, dataset paths, augmentations, learning rate and all other constants. I will not include it here; if you implement YOLOv3 you can just copy it from above or write your own training configuration.

What we will focus on instead is building the training loop, which should be quite straightforward. Everything from here on is placed in a train.py file, which we can then run to train the model. First we define the imports, where we import our previously defined functions and, in addition, a couple of helper functions from the utils.py file you can find on GitHub.

https://gist.github.com/e199c9fb7a44a88b6ec59bf15301e642

We then define a training function that trains the network for one epoch. It takes as input the model, the data loader, the optimizer, the loss function, a scaler for mixed precision training and scaled anchors, such that each anchor is relative to its prediction scale. Originally the anchors are relative to the entire image, but for the loss we want to input them relative to the cell, which is accomplished by scaling them with the grid size of the prediction scale.
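As an illustration, the scaling can be done once before training; here ANCHORS, S and DEVICE are assumed to be defined in config.py as described earlier.

```python
import torch
import config  # assumed to define ANCHORS (relative to the image), S = [13, 26, 52] and DEVICE

# ANCHORS has shape (3 scales, 3 anchors, 2); multiplying each scale's anchors by its
# grid size expresses the widths and heights in grid-cell units instead of image units
scaled_anchors = (
    torch.tensor(config.ANCHORS)
    * torch.tensor(config.S).unsqueeze(1).unsqueeze(2).repeat(1, 3, 2)
).to(config.DEVICE)
```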

We calculate the total loss as the sum of the losses for each prediction scale, three of them in total. We use mixed precision training to train the model.

https://gist.github.com/f496588355a689a781e955bf373bc31f

We have now come to the part where we are ready to actually train the model. The main function takes care of setting up the model, loss function, data loaders etc., and in each epoch we run the train function defined above. Once every ten epochs we evaluate the model by checking the mean average precision on the test loader. Note that this can be costly if your model's performance is poor, because there may be many false positives that the non-max suppression and mean average precision functions have to loop through.

https://gist.github.com/b6d8c261d30a1fb299a92766b66c13c2

We have now reached the end of this YOLOv3 implementation, and if you feel that everything is crystal clear then: wow, I've really outdone myself. It is more likely that you will have to revisit this and possibly others' implementations if your goal is to implement YOLOv3 yourself. Anyhow, I hope you take with you some key implementation details of YOLOv3 from this article, and if you have any lingering thoughts, leave a comment!
