YimianDai/YOLOOutputV3.md

## YOLOOutputV3.md

      
            
          
    YOLOOutputV3

The only component this Block actually learns is the Conv2D that turns the feature map into predictions; everything else is class / bbox decoding. Specifically, the forward pass of YOLOOutputV3 performs the following steps in order:

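1. Apply the Conv2D to the given feature map to produce the prediction. For each anchor, the prediction holds, in order, [cx, cy, w, h, objness, class_pred], where class_pred has num_class entries.
2. Slice cx, cy, w, h, objness, and class_pred out of that prediction.
3. Using the current offsets, anchors, and stride, map cx, cy, w, h from the feature map to [xmin, ymin, xmax, ymax] on the original image.
4. Turn objness into a confidence: confidence = sigmoid(objness).
5. Combine confidence and class_pred into class_score = sigmoid(class_pred) * confidence. Here sigmoid(class_pred) is the independent probability of each class, and only after multiplying by confidence does it become that class's score.
6. Concatenate ids, scores, and bboxes into detections (in training mode the forward pass returns before this step).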

### `__init__`

    def __init__(self, index, num_class, anchors, stride,
                 alloc_size=(128, 128), **kwargs):
        super(YOLOOutputV3, self).__init__(**kwargs)

num_class is the number of foreground object classes.
anchors is a flat list such as [10, 13, 16, 30, 33, 23].
stride is the downsampling factor of the current feature map relative to the original image; it is one of 8, 16, 32.

        anchors = np.array(anchors).astype('float32')
        self._classes = num_class
        self._num_pred = 1 + 4 + num_class  # 1 objness + 4 box + num_class
        self._num_anchors = anchors.size // 2
        self._stride = stride

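self._num_pred is the number of values predicted per anchor: 1 objness + 4 box + num_class.
self._num_anchors is anchors.size // 2 because the flat anchors list stores each anchor as a (w, h) pair, so [10, 13, 16, 30, 33, 23] describes 3 anchors.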

        with self.name_scope():
            all_pred = self._num_pred * self._num_anchors
            self.prediction = nn.Conv2D(all_pred, kernel_size=1, padding=0, strides=1)

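all_pred is the number of values per anchor multiplied by the number of anchors, that is, the number of values predicted at each position of the feature map, which is also the number of output channels of self.prediction.
self.prediction makes its prediction with a 1 x 1 convolution.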
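
As a quick check of the arithmetic, the sketch below recomputes these counts with plain NumPy. The class count of 80 is an assumption chosen for illustration; the anchor list is the example used in these notes.

```python
import numpy as np

# Assumed for illustration: 80 foreground classes.
num_class = 80
# Example anchor list from the notes: three (w, h) pairs stored flat.
anchors = np.array([10, 13, 16, 30, 33, 23]).astype('float32')

num_pred = 1 + 4 + num_class        # objness + box + class logits, per anchor
num_anchors = anchors.size // 2     # two numbers (w, h) per anchor
all_pred = num_pred * num_anchors   # output channels of the 1 x 1 Conv2D

print(num_pred, num_anchors, all_pred)
```

Under these assumptions the convolution would need 255 output channels.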

            # anchors will be multiplied to predictions
            anchors = anchors.reshape(1, 1, -1, 2)
            self.anchors = self.params.get_constant('anchor_%d'%(index), anchors)

With the input anchors [10, 13, 16, 30, 33, 23], the reshaped anchors have shape (1, 1, 3, 2):

    array([[[[10., 13.],
             [16., 30.],
             [33., 23.]]]], dtype=float32)

            # offsets will be added to predictions
            grid_x = np.arange(alloc_size[1])
            grid_y = np.arange(alloc_size[0])
            grid_x, grid_y = np.meshgrid(grid_x, grid_y)

With the default alloc_size=(128, 128), the resulting grid_x and grid_y are, respectively:

    array([[  0,   1,   2, ..., 125, 126, 127],
           [  0,   1,   2, ..., 125, 126, 127],
           [  0,   1,   2, ..., 125, 126, 127],
           ...,
           [  0,   1,   2, ..., 125, 126, 127],
           [  0,   1,   2, ..., 125, 126, 127],
           [  0,   1,   2, ..., 125, 126, 127]])

    array([[  0,   0,   0, ...,   0,   0,   0],
           [  1,   1,   1, ...,   1,   1,   1],
           [  2,   2,   2, ...,   2,   2,   2],
           ...,
           [125, 125, 125, ..., 125, 125, 125],
           [126, 126, 126, ..., 126, 126, 126],
           [127, 127, 127, ..., 127, 127, 127]])

            # stack to (n, n, 2)
            offsets = np.concatenate((grid_x[:, :, np.newaxis], grid_y[:, :, np.newaxis]), axis=-1)
            # expand dims to (1, 1, n, n, 2) so it's easier for broadcasting
            offsets = np.expand_dims(np.expand_dims(offsets, axis=0), axis=0)
            self.offsets = self.params.get_constant('offset_%d'%(index), offsets)

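grid_x[:, :, np.newaxis] appends a trailing axis, giving a (128, 128, 1) np.ndarray.
The np.concatenate call joins grid_x[:, :, np.newaxis] and grid_y[:, :, np.newaxis] along the last axis, giving shape (128, 128, 2); the last axis holds (x, y).
The two nested np.expand_dims calls each add a new axis at axis = 0, giving a (1, 1, 128, 128, 2) np.ndarray.
Finally, self.params.get_constant stores it as a constant, which arrives in hybrid_forward as an mx.ndarray.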
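
The offsets construction can be replayed in plain NumPy to see what each grid cell stores. This is a minimal sketch; the small alloc_size of (4, 6) is an assumption made so the result is easy to inspect.

```python
import numpy as np

alloc_size = (4, 6)  # (height, width); small assumed values for readability
grid_x = np.arange(alloc_size[1])
grid_y = np.arange(alloc_size[0])
grid_x, grid_y = np.meshgrid(grid_x, grid_y)  # both have shape (4, 6)

# Stack to (4, 6, 2), then add two leading axes to get (1, 1, 4, 6, 2).
offsets = np.concatenate(
    (grid_x[:, :, np.newaxis], grid_y[:, :, np.newaxis]), axis=-1)
offsets = np.expand_dims(np.expand_dims(offsets, axis=0), axis=0)

print(offsets.shape)
# The cell in row 2, column 5 stores (x, y) = (5, 2).
print(offsets[0, 0, 2, 5])
```

Note that the last axis is ordered (x, y), that is, (column, row), which matches the (cx, cy) order of the predicted box centers.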

### `hybrid_forward`

    def hybrid_forward(self, F, x, anchors, offsets):

x is an mx.ndarray of shape (B, C, H_i, W_i).
anchors is an mx.ndarray of shape (1, 1, 3, 2).
offsets is an mx.ndarray of shape (1, 1, 128, 128, 2).

        # prediction flat to (batch, pred per pixel, height * width)
        pred = self.prediction(x).reshape((0, self._num_anchors * self._num_pred, -1))
        # transpose to (batch, height * width, num_anchor, num_pred)
        pred = pred.transpose(axes=(0, 2, 1)).reshape((0, -1, self._num_anchors, self._num_pred))

The input x is (B, C, H_i, W_i), so self.prediction(x) is (B, _num_anchors x _num_pred, H_i, W_i), and after the first reshape it is an mx.ndarray of shape (B, _num_anchors x _num_pred, H_i x W_i). In reshape, 0 keeps that dimension unchanged and -1 infers it.
After the transpose it is (B, H_i x W_i, _num_anchors x _num_pred), and after the second reshape pred is (B, H_i x W_i, _num_anchors, _num_pred).
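
To make the channel layout concrete, here is a NumPy sketch of the same reshape / transpose / reshape sequence. The sizes are small assumed values, and NumPy's reshape is used as a stand-in for the MXNet call (explicit sizes replace the special value 0).

```python
import numpy as np

B, A, P, H, W = 2, 3, 7, 4, 5  # assumed sizes; P plays the role of _num_pred
conv_out = np.arange(B * A * P * H * W, dtype='float32').reshape(B, A * P, H, W)

pred = conv_out.reshape(B, A * P, -1)                # (B, A*P, H*W)
pred = pred.transpose(0, 2, 1).reshape(B, -1, A, P)  # (B, H*W, A, P)

# Channel a * P + p at pixel (row, col) lands at pred[b, row * W + col, a, p].
b, a, p, row, col = 1, 2, 4, 3, 1
assert pred[b, row * W + col, a, p] == conv_out[b, a * P + p, row, col]
print(pred.shape)
```

In other words, the convolution's channels are grouped anchor by anchor, with the _num_pred values of one anchor stored contiguously.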

        # components
        raw_box_centers = pred.slice_axis(axis=-1, begin=0, end=2)
        raw_box_scales = pred.slice_axis(axis=-1, begin=2, end=4)
        objness = pred.slice_axis(axis=-1, begin=4, end=5)
        class_pred = pred.slice_axis(axis=-1, begin=5, end=None)

_num_pred covers 1 objness + 4 box + num_class.
So along the last axis, the first 2 elements are box_centers, i.e. cx and cy; the 3rd and 4th are box_scales, i.e. w and h; the 5th is objness; and everything from the 6th element to the end is class_pred, where num_class counts foreground classes only.
raw_box_centers is an mx.ndarray of shape (B, H_i x W_i, _num_anchors, 2)
raw_box_scales  is an mx.ndarray of shape (B, H_i x W_i, _num_anchors, 2)
objness         is an mx.ndarray of shape (B, H_i x W_i, _num_anchors, 1)
class_pred      is an mx.ndarray of shape (B, H_i x W_i, _num_anchors, num_class)
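
The four slice_axis calls correspond to ordinary slices of the last axis. The NumPy sketch below reproduces the resulting shapes; the sizes and the random input are assumptions for illustration only.

```python
import numpy as np

B, HW, A, num_class = 2, 20, 3, 2  # assumed small sizes
pred = np.random.randn(B, HW, A, 1 + 4 + num_class).astype('float32')

raw_box_centers = pred[..., 0:2]  # raw cx, cy
raw_box_scales = pred[..., 2:4]   # raw w, h
objness = pred[..., 4:5]          # slice 4:5 keeps a trailing axis of size 1
class_pred = pred[..., 5:]        # one logit per foreground class

for name, arr in [('raw_box_centers', raw_box_centers),
                  ('raw_box_scales', raw_box_scales),
                  ('objness', objness),
                  ('class_pred', class_pred)]:
    print(name, arr.shape)
```

Keeping the size-1 trailing axis on objness is what later lets it broadcast against class_pred.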

        # valid offsets, (1, 1, height, width, 2)
        offsets = F.slice_like(offsets, x * 0, axes=(2, 3))
        # reshape to (1, height*width, 1, 2)
        offsets = offsets.reshape((1, -1, 1, 2))

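In the F.slice_like call, x is (B, C, H_i, W_i) and offsets is (1, 1, 128, 128, 2); slicing offsets to match x on axes 2 and 3 gives (1, 1, H_i, W_i, 2). x * 0 is passed only for its shape.
After the reshape, offsets is (1, H_i x W_i, 1, 2).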
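
The following NumPy sketch mimics the cropping and flattening of the offsets. Plain slicing is used here as a stand-in for F.slice_like, and the grid and feature map sizes are assumptions chosen to keep the output small.

```python
import numpy as np

alloc_h, alloc_w = 8, 8  # assumed pre-allocated grid, standing in for (128, 128)
H, W = 3, 5              # assumed feature map size
gx, gy = np.meshgrid(np.arange(alloc_w), np.arange(alloc_h))
offsets = np.stack((gx, gy), axis=-1)[np.newaxis, np.newaxis]  # (1, 1, 8, 8, 2)

valid = offsets[:, :, :H, :W, :]   # crop axes 2 and 3 to the feature map size
flat = valid.reshape(1, -1, 1, 2)  # (1, H*W, 1, 2)

i = 7  # flattened index of row 1, column 2, since 7 = 1 * W + 2
print(flat.shape, flat[0, i, 0])
```

Entry i of the flattened offsets therefore holds (x, y) = (i % W, i // W), which lines up with the H_i x W_i axis of pred.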

        box_centers = F.broadcast_add(F.sigmoid(raw_box_centers), offsets) * self._stride
        box_scales = F.broadcast_mul(F.exp(raw_box_scales), anchors)
        confidence = F.sigmoid(objness)
        class_score = F.broadcast_mul(F.sigmoid(class_pred), confidence)
        wh = box_scales / 2.0
        bbox = F.concat(box_centers - wh, box_centers + wh, dim=-1)

raw_box_centers is (B, H_i x W_i, _num_anchors, 2) and offsets is (1, H_i x W_i, 1, 2). The sigmoid squashes the raw centers into (0, 1), i.e. a position inside the grid cell; broadcast_add with offsets keeps the shape (B, H_i x W_i, _num_anchors, 2), and multiplying by stride maps cx and cy back to original-image coordinates.
raw_box_scales is (B, H_i x W_i, _num_anchors, 2) and anchors is (1, 1, 3, 2), where _num_anchors is 3; after broadcast_mul, box_scales is still (B, H_i x W_i, _num_anchors, 2). This is where the pairs of values in anchors get their meaning: they are each anchor's w and h, which is also why self._num_anchors = anchors.size // 2.
confidence is still (B, H_i x W_i, _num_anchors, 1).
class_pred is (B, H_i x W_i, _num_anchors, num_class) and confidence is (B, H_i x W_i, _num_anchors, 1), so class_score is (B, H_i x W_i, _num_anchors, num_class).
wh stores w / 2 and h / 2.
box_centers - wh is [cx - w/2, cy - h/2] and box_centers + wh is [cx + w/2, cy + h/2], so bbox = F.concat(box_centers - wh, box_centers + wh, dim=-1) converts [cx, cy, w, h] into [xmin, ymin, xmax, ymax]. The resulting bbox is an mx.ndarray of shape (B, H_i x W_i, _num_anchors, 4).
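
A worked example with a single grid cell and a single anchor may help here. The sketch below repeats the decoding in NumPy; the stride, anchor size, and cell position are all assumed values, and the raw outputs are set to zero so that sigmoid gives 0.5 and exp gives 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

stride = 32  # assumed stride
anchors = np.array([116, 90], dtype='float32').reshape(1, 1, 1, 2)  # assumed (w, h)
offsets = np.array([3, 2], dtype='float32').reshape(1, 1, 1, 2)     # cell x=3, y=2

# All-zero raw network outputs for one cell, one anchor, two classes.
raw_box_centers = np.zeros((1, 1, 1, 2), dtype='float32')
raw_box_scales = np.zeros((1, 1, 1, 2), dtype='float32')
objness = np.zeros((1, 1, 1, 1), dtype='float32')
class_pred = np.zeros((1, 1, 1, 2), dtype='float32')

box_centers = (sigmoid(raw_box_centers) + offsets) * stride  # (3.5, 2.5) * 32
box_scales = np.exp(raw_box_scales) * anchors                # anchor size unchanged
confidence = sigmoid(objness)                                # 0.5
class_score = sigmoid(class_pred) * confidence               # 0.5 * 0.5 per class
wh = box_scales / 2.0
bbox = np.concatenate((box_centers - wh, box_centers + wh), axis=-1)

print(bbox[0, 0, 0])         # [xmin, ymin, xmax, ymax] = [54, 35, 170, 125]
print(class_score[0, 0, 0])  # [0.25, 0.25]
```

The center lands at (112, 80) in image pixels, and the box extends half the anchor's width and height on each side.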

        if autograd.is_training():
            # during training, we don't need to convert whole bunch of info to detection results
            return (bbox.reshape((0, -1, 4)), raw_box_centers, raw_box_scales,
                    objness, class_pred, anchors, offsets)

bbox.reshape((0, -1, 4)) returns an mx.ndarray of shape (B, H_i x W_i x _num_anchors, 4).
To summarize each returned value:

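bbox.reshape((0, -1, 4)) is (B, H_i x W_i x _num_anchors, 4); its values are in original-image coordinates.
raw_box_centers is (B, H_i x W_i, _num_anchors, 2); these are the raw network outputs, before sigmoid, offsets, and stride are applied, so they are not in original-image coordinates.
raw_box_scales is (B, H_i x W_i, _num_anchors, 2); these are also raw network outputs, before exp and anchors are applied.
objness is (B, H_i x W_i, _num_anchors, 1); it is the raw value before the sigmoid.
class_pred is (B, H_i x W_i, _num_anchors, num_class); it is the raw value before the sigmoid.
anchors is an mx.ndarray of shape (1, 1, 3, 2).
offsets is (1, H_i x W_i, 1, 2); entry i holds the (x, y) grid-cell coordinates of the i-th position when the feature map is scanned in row-major order.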


        # prediction per class
        bboxes = F.tile(bbox, reps=(self._classes, 1, 1, 1, 1))
        scores = F.transpose(class_score, axes=(3, 0, 1, 2)).expand_dims(axis=-1)
        ids = F.broadcast_add(scores * 0, F.arange(0, self._classes).reshape((0, 1, 1, 1, 1)))
        detections = F.concat(ids, scores, bboxes, dim=-1)
        # reshape to (B, xx, 6)
        detections = F.reshape(detections.transpose(axes=(1, 0, 2, 3, 4)), (0, -1, 6))
        return detections  

bboxes: bbox is (B, H_i x W_i, _num_anchors, 4) and self._classes is num_class, so after tile bboxes is (num_class, B, H_i x W_i, _num_anchors, 4); bbox is simply repeated num_class times along a new leading axis.
scores: class_score is (B, H_i x W_i, _num_anchors, num_class); the transpose turns it into (num_class, B, H_i x W_i, _num_anchors), and after expand_dims scores is (num_class, B, H_i x W_i, _num_anchors, 1).
ids: F.arange(0, self._classes) is [0, 1, 2, ..., num_class-1], which the reshape turns into (num_class, 1, 1, 1, 1). scores * 0 is (num_class, B, H_i x W_i, _num_anchors, 1), so after broadcast_add ids is (num_class, B, H_i x W_i, _num_anchors, 1), with every entry equal to its class index.
detections: F.concat joins ids, scores, and bboxes along the last axis, giving (num_class, B, H_i x W_i, _num_anchors, 6). The transpose gives (B, num_class, H_i x W_i, _num_anchors, 6), and the reshape gives (B, num_class x H_i x W_i x _num_anchors, 6). The 6 elements of the last axis are id, score, and the 4 bbox coordinates.
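
The per-class expansion can be checked with a NumPy sketch that mirrors the MXNet calls one for one. The sizes and random inputs are assumptions for illustration; np.tile, like F.tile here, prepends a new leading axis when reps is longer than the array's rank.

```python
import numpy as np

B, HW, A, C = 2, 4, 3, 5  # assumed small sizes; C plays the role of num_class
bbox = np.random.rand(B, HW, A, 4).astype('float32')
class_score = np.random.rand(B, HW, A, C).astype('float32')

bboxes = np.tile(bbox, (C, 1, 1, 1, 1))                            # (C, B, HW, A, 4)
scores = np.transpose(class_score, (3, 0, 1, 2))[..., np.newaxis]  # (C, B, HW, A, 1)
ids = scores * 0 + np.arange(C, dtype='float32').reshape(C, 1, 1, 1, 1)
detections = np.concatenate((ids, scores, bboxes), axis=-1)        # (C, B, HW, A, 6)
detections = detections.transpose(1, 0, 2, 3, 4).reshape(B, -1, 6)

print(detections.shape)  # (B, C * HW * A, 6)
```

For each image, the rows are grouped by class: the first HW x A rows carry id 0, the next HW x A rows carry id 1, and so on, each block reusing the same boxes with that class's scores.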