YimianDai/train_ssd.md

## train_ssd.md

      
          
Train Script

    python train_tiny_ssd.py --gpus 0,1 -j 16 --lr 0.0001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240 --data-shape 512 --train-split train --val-split val --nms-thresh 0.01 --batch-size 12 --resume best_TinySSD.params

Debug

    python train_tiny_ssd.py --gpus 0 -j 6 --lr 0.0001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240 --data-shape 512 --train-split train --val-split val --nms-thresh 0.5 --batch-size 12 --dataset DENTIST
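
The `--lr`, `--lr-decay` and `--lr-decay-epoch` flags above describe a step-decay schedule. The sketch below shows which learning rate would be in effect at a given epoch, assuming the script multiplies the rate by `--lr-decay` at the start of each listed epoch (this is how the GluonCV reference `train_ssd.py` behaves; `train_tiny_ssd.py` itself is not shown here, so treat it as an assumption). `lr_at_epoch` is a hypothetical helper, not part of the script.

```python
def lr_at_epoch(epoch, base_lr=1e-4, decay=0.1, decay_epochs=(160, 200)):
    """Learning rate in effect during `epoch` (0-based) under step decay.

    Defaults mirror --lr 0.0001 --lr-decay 0.1 --lr-decay-epoch 160,200.
    """
    # Count how many decay milestones have been reached by this epoch.
    num_decays = sum(1 for milestone in decay_epochs if epoch >= milestone)
    return base_lr * decay ** num_decays


if __name__ == "__main__":
    # Epochs just before and at each milestone, plus the last of 240 epochs.
    for epoch in (0, 159, 160, 199, 200, 239):
        print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.0e}")
```

Under this assumption the rate stays at 1e-4 for epochs 0 to 159, drops to 1e-5 for epochs 160 to 199, and to 1e-6 for the remaining epochs.
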
Other questions

How does train_ssd.py implement multi-GPU training?


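Data: after the DataLoader returns a batch, gluon.utils.split_and_load splits the images and labels evenly and loads the slices onto the GPUs.
Model: data is a list of MXNet NDArrays, and len(data) equals the number of GPUs in use, num_gpus. Each x in the list lives on a different GPU, so the cls_pred returned for each x also lives on a different GPU. Because of MXNet's lazy (asynchronous) evaluation, the forward pass for each x is dispatched to its own GPU, and this is how the computation runs in parallel.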

                for x in data:
                    cls_pred, box_pred, _ = net(x)

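Loss: the returned box_losses and sum_losses are still lists of MXNet NDArrays of shape (B // num_gpus, ), each on a different GPU, because every sub-batch is computed separately.
Optimization: MXNet handles optimization across devices; gluon.Trainer aggregates the gradients from all GPUs (through its kvstore) when trainer.step is called.
Metric: VOC07MApMetric has an as_numpy function that concatenates the list of MXNet NDArrays produced by the different GPUs and converts the result to a NumPy ndarray, so the metric is computed on the CPU.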
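
The Data and Model steps above can be pictured with a plain-Python stand-in. This is not MXNet code: `split_evenly` only imitates the even split that `gluon.utils.split_and_load` performs along the batch axis, and `run_per_device` imitates the `for x in data` loop. Both names are hypothetical, and no GPU placement or asynchronous execution takes place here.

```python
def split_evenly(batch, num_devices):
    """Split a list of samples into `num_devices` equal contiguous slices."""
    if len(batch) % num_devices != 0:
        # This sketch only handles batch sizes divisible by the device count.
        raise ValueError("batch size must be divisible by the number of devices")
    step = len(batch) // num_devices
    return [batch[i * step:(i + 1) * step] for i in range(num_devices)]


def run_per_device(slices, forward):
    """Imitate `for x in data: net(x)`: one forward call per device slice."""
    return [forward(x) for x in slices]


if __name__ == "__main__":
    batch = list(range(12))          # stand-in for a batch of 12 images
    data = split_evenly(batch, 2)    # stand-in for split_and_load on 2 GPUs
    preds = run_per_device(data, lambda x: [v * 2 for v in x])
    print(len(data), [len(x) for x in data])
    print(preds)
```

With `--batch-size 12` and two GPUs, each slice holds 6 samples, and the outputs stay grouped per device, just as cls_preds and box_preds do in the training loop.
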

Code walkthrough

Overall code structure


parse_args

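dali: NVIDIA DALI is a GPU-accelerated library for image loading and data augmentation; it handles data loading and preprocessing during training
amp: MXNet AMP; AMP stands for Automatic Mixed Precision and is used for mixed precision training
horovod: Uber's open-source distributed training tool; of little use to me, since I only have 2 GPUs.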


get_dataset

Returns train_dataset, val_dataset, val_metric


get_dataloader

use fake data to generate fixed anchors for target generation
Set batchify_fn and train_loader
Set val_batchify_fn and val_loader


get_dali_dataset

Returns train_dataset, val_dataset, val_metric


get_dali_dataloader

Returns train_loader, val_loader


save_params

Saves once every save_interval epochs
Saves once whenever the best metric so far is reached


validate
train
__main__
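
The two save_params rules listed above can be sketched as a small decision function. This is a hypothetical helper that only decides which checkpoints would be written; it does not call `net.save_parameters` or touch the file system, and the exact comparison used by the real script (strict `>` here) is an assumption.

```python
def save_decisions(current_map, best_map, epoch, save_interval):
    """Return (checkpoint_kinds, new_best_map) for one epoch.

    `best_map` is the best metric seen so far; a `save_interval` of 0
    disables the periodic checkpoint.
    """
    kinds = []
    # Rule 1: a new best metric triggers a "best" checkpoint.
    if current_map > best_map:
        best_map = current_map
        kinds.append("best")
    # Rule 2: every `save_interval` epochs triggers a periodic checkpoint.
    if save_interval and epoch % save_interval == 0:
        kinds.append("periodic")
    return kinds, best_map


if __name__ == "__main__":
    best = 0.0
    # Made-up validation mAP values, for illustration only.
    for epoch, current in enumerate([0.30, 0.28, 0.35, 0.35]):
        kinds, best = save_decisions(current, best, epoch, save_interval=2)
        print(epoch, kinds, best)
```
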

train

        for i, batch in enumerate(train_data):
            if args.dali:
                # dali iterator returns a mxnet.io.DataBatch
                data = [d.data[0] for d in batch]
                box_targets = [d.label[0] for d in batch]
                cls_targets = [nd.cast(d.label[1], dtype='float32') for d in batch]
            else:
                data = gluon.utils.split_and_load(batch[0], ctx_list=ctx, batch_axis=0)
                cls_targets = gluon.utils.split_and_load(batch[1], ctx_list=ctx, batch_axis=0)
                box_targets = gluon.utils.split_and_load(batch[2], ctx_list=ctx, batch_axis=0)

From Data Flow in SSD - 1.2 SSDDefaultTrainTransform, we know that a VOCDetection dataset whose transform is SSDDefaultTrainTransform returns img, cls_target and box_target as follows:

img is a (3, H, W) MXNet NDArray, normalized so that its values have zero mean and unit variance
cls_target is an MXNet NDArray of shape (H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, ), one class id per anchor
box_target: (H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, 4). For a positive anchor, the row is the normalized, center-encoded (center_x, center_y, width, height); otherwise the anchor is a background anchor and the row is all zeros
Here N =

With batchify_fn = Tuple(Stack(), Stack(), Stack()), the batch returned by the DataLoader is a list whose elements are:

batch[0] is a (B, 3, H, W) MXNet NDArray, the batch of img
batch[1] is a (B, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors) MXNet NDArray, the batch of cls ids
batch[2] is a (B, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, 4) MXNet NDArray, the batch of bboxes

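Therefore:

data is a list of MXNet NDArrays; with my two GPUs as an example, each element is a (B // 2, 3, H, W) MXNet NDArray
cls_targets is a list of MXNet NDArrays; with my two GPUs as an example, each element is a (B // 2, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors) MXNet NDArray
box_targets is a list of MXNet NDArrays; with my two GPUs as an example, each element is a (B // 2, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, 4) MXNet NDArray

How does multi-GPU work here? split_and_load turns one MXNet NDArray into a list of MXNet NDArrays, one per GPU, and the loop in the code below then dispatches the work for each element to its GPU through MXNet's lazy (asynchronous) evaluation. FCN, by contrast, uses DataParallelModel to hide the difference between single and multiple GPUs from the user.
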
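
To make the shapes above concrete, the following sketch computes the total anchor count and the per-GPU shapes. It assumes one class label per anchor, so cls_target and box_target share the same anchor count. The feature-map sizes and anchors-per-cell values are hypothetical, chosen only for illustration; they are not taken from TinySSD. `total_anchors` and `batch_shapes` are hypothetical helper names.

```python
def total_anchors(feature_maps, anchors_per_cell):
    """Sum H_i * W_i * A_i over all feature maps."""
    return sum(h * w * a for (h, w), a in zip(feature_maps, anchors_per_cell))


def batch_shapes(batch_size, height, width, num_anchors, num_gpus=1):
    """Shapes of (img, cls_target, box_target) on one device after an even split."""
    b = batch_size // num_gpus
    return (b, 3, height, width), (b, num_anchors), (b, num_anchors, 4)


if __name__ == "__main__":
    # Hypothetical six feature maps for a 512 x 512 input.
    feature_maps = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2)]
    anchors_per_cell = [4, 6, 6, 6, 4, 4]
    n = total_anchors(feature_maps, anchors_per_cell)
    print("anchors:", n)
    print("one GPU :", batch_shapes(12, 512, 512, n))
    print("two GPUs:", batch_shapes(12, 512, 512, n, num_gpus=2))
```

With these illustrative values the anchor count is 24528, and with `--batch-size 12` on two GPUs every per-device array has a leading dimension of 6.
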
            with autograd.record():
                cls_preds = []
                box_preds = []
                for x in data:
                    cls_pred, box_pred, _ = net(x)
                    # print("nd.contrib.isnan(cls_pred).sum(): ", nd.contrib.isnan(cls_pred).sum())
                    # print("nd.contrib.isnan(box_pred).sum(): ", nd.contrib.isnan(box_pred).sum())
                    cls_preds.append(cls_pred)
                    box_preds.append(box_pred)
                sum_loss, cls_loss, box_loss = mbox_loss(
                    cls_preds, box_preds, cls_targets, box_targets)
                # print("sum_loss: ", sum_loss[0].sum())
                if args.amp:
                    with amp.scale_loss(sum_loss, trainer) as scaled_loss:
                        autograd.backward(scaled_loss)
                else:
                    autograd.backward(sum_loss)
            # since we have already normalized the loss, we don't want to normalize
            # by batch-size anymore
            trainer.step(1)

With my two GPUs as an example, x is a (B // 2, 3, H, W) MXNet NDArray
cls_pred is a (B // 2, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, self.num_classes + 1) MXNet NDArray
box_pred is a (B // 2, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, 4) MXNet NDArray
cls_preds is a list whose elements are (B // 2, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, self.num_classes + 1) MXNet NDArrays
box_preds is a list whose elements are (B // 2, H_1 x W_1 x num_anchors + ... + H_6 x W_6 x num_anchors, 4) MXNet NDArrays
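
The comment before `trainer.step(1)` says the loss is already normalized. The sketch below illustrates one way that can work, assuming mbox_loss divides each per-sample loss by the number of positive anchors summed over all devices (to my understanding this is what GluonCV's SSDMultiBoxLoss does, but treat it as an assumption). `trainer.step(batch_size)` rescales gradients by 1 / batch_size, so passing 1 leaves them unchanged. Both function names are hypothetical and the numbers are made up.

```python
def normalize_losses(per_device_losses, per_device_num_pos):
    """Divide every per-sample loss by the positive-anchor count over all devices."""
    # Guard against a batch with no positive anchors at all.
    num_pos_all = max(1, sum(per_device_num_pos))
    return [[loss / num_pos_all for loss in losses]
            for losses in per_device_losses]


def rescale_grad(grad, step_batch_size):
    """Imitate the 1 / batch_size gradient rescaling done by trainer.step."""
    return grad / step_batch_size


if __name__ == "__main__":
    # Two devices: raw per-sample loss sums and positive-anchor counts.
    losses = normalize_losses([[6.0, 3.0], [9.0]], [2, 1])
    print(losses)
    # step(1) keeps the gradient as is; step(12) would shrink it a second time.
    print(rescale_grad(0.5, 1), rescale_grad(0.5, 12))
```

Because the normalization already happened inside the loss, dividing again by the batch size would shrink the effective learning rate, which is why the loop calls `trainer.step(1)`.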