YimianDai/SSDMultiBoxLoss.md

## SSDMultiBoxLoss.md

      
Some notes on SSDMultiBoxLoss, the loss function for SSD in GluonCV.
class SSDMultiBoxLoss(gluon.Block):
    """Single-Shot Multibox Object Detection Loss.
    
    Parameters
    ----------
    negative_mining_ratio : float, default is 3
        Ratio of negative vs. positive samples.
    rho : float, default is 1.0
        Threshold for trimmed mean estimator. This is the smooth parameter for the
        L1-L2 transition.
    lambd : float, default is 1.0
        Relative weight between classification and box regression loss.
        The overall loss is computed as :math:`L = loss_{class} + \lambda \times loss_{loc}`.

    """
    def __init__(self, negative_mining_ratio=3, rho=1.0, lambd=1.0, **kwargs):
        super(SSDMultiBoxLoss, self).__init__(**kwargs)
        self._negative_mining_ratio = max(0, negative_mining_ratio)
        self._rho = rho
        self._lambd = lambd
As the negative_mining_ratio parameter shows, SSDMultiBoxLoss uses a hard example mining strategy.
Code

The SSDMultiBoxLoss class in GluonCV implements the SSD loss function.
def forward(self, cls_pred, box_pred, cls_target, box_target):
Let us look at the inputs first.

As the SSD code shows, cls_pred, an input of SSDMultiBoxLoss, is the SSD output cls_preds. Its shape is (B, K_0*W_0*H_0 + K_1*W_1*H_1 + ... + K_5*W_5*H_5, C+1), or simply (B, N, C+1), where N is the total number of anchors over all layers.
As the SSD code shows, box_pred, an input of SSDMultiBoxLoss, is the SSD output box_preds. Its shape is (B, K_0*W_0*H_0 + K_1*W_1*H_1 + ... + K_5*W_5*H_5, 4), or simply (B, N, 4), where N is the total number of anchors over all layers.
Note that the outputs of SSD and TinySSD differ: SSD produces (B, N, C+1) and (B, N, 4), while TinySSD produces (B, N, C+1) and (B, N*4).
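To make the shape bookkeeping concrete, here is a small sketch that computes N from per-layer values of K_i, W_i and H_i. The six layer sizes and anchor counts are an SSD300-style configuration assumed purely for illustration, and B and C are made-up values; none of them come from these notes.

```python
# Hypothetical SSD300-style configuration, used only for illustration:
# (feature-map width W_i, feature-map height H_i, anchors per location K_i)
layers = [(38, 38, 4), (19, 19, 6), (10, 10, 6), (5, 5, 6), (3, 3, 4), (1, 1, 4)]

def total_anchors(layers):
    """N = sum over layers of K_i * W_i * H_i."""
    return sum(k * w * h for (w, h, k) in layers)

B, C = 2, 20                      # assumed batch size and number of object classes
N = total_anchors(layers)
cls_pred_shape = (B, N, C + 1)    # SSD class scores, background included
box_pred_shape = (B, N, 4)        # SSD box offsets
tiny_box_pred_shape = (B, N * 4)  # flattened layout described for TinySSD
print(N, cls_pred_shape, box_pred_shape, tiny_box_pred_shape)
```

With these assumed values the three shapes differ only in the last axis, which is why the TinySSD box output needs a reshape before it can be compared with a (B, N, 4) target.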
    """Compute loss in entire batch across devices."""
    # require results across different devices at this time
    cls_pred, box_pred, cls_target, box_target = [_as_list(x) \
        for x in (cls_pred, box_pred, cls_target, box_target)]
    # cross device reduction to obtain positive samples in entire batch
    num_pos = []
    for cp, bp, ct, bt in zip(*[cls_pred, box_pred, cls_target, box_target]):
        pos_samples = (ct > 0)
        num_pos.append(pos_samples.sum())
    num_pos_all = sum([p.asscalar() for p in num_pos])
    if num_pos_all < 1:
        # no positive samples found, return dummy losses
        return nd.zeros((1,)), nd.zeros((1,)), nd.zeros((1,))
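The part above is fairly trivial.
cls_pred, box_pred, cls_target, box_target = [_as_list(x) for x in (cls_pred, box_pred, cls_target, box_target)] only checks whether each input is already a list and wraps it in one if it is not, so that the loop can iterate over per-device arrays. The loop then counts the positive anchors (ct > 0) over the whole batch, and if there are none the function returns zero losses.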
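The wrapping and counting step can be sketched in plain Python. The as_list helper below is my own re-creation of what _as_list appears to do, and FakeArray is a hypothetical stand-in for a per-device label array; neither is GluonCV code.

```python
def as_list(x):
    """Return x unchanged if it is a list or tuple, otherwise wrap it in a list."""
    return x if isinstance(x, (list, tuple)) else [x]

class FakeArray:
    """Hypothetical stand-in for a per-device array of labels with shape (B, N)."""
    def __init__(self, rows):
        self.rows = rows

    def num_positive(self):
        # Positives are anchors whose label is > 0 (0 = background, < 0 = ignored).
        return sum(1 for row in self.rows for label in row if label > 0)

single = FakeArray([[0, 3, 0, 1], [0, 0, -1, 0]])           # targets on one device
multi = [single, FakeArray([[2, 0, 0, 0], [0, 0, 0, 0]])]   # targets on two devices

# The same loop works whether the caller passed one array or a list of arrays.
num_pos_single = sum(ct.num_positive() for ct in as_list(single))
num_pos_multi = sum(ct.num_positive() for ct in as_list(multi))
print(num_pos_single, num_pos_multi)
```

The point of the wrapper is only that the rest of forward can always loop over devices, even in the single-device case.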
SSDMultiBoxLoss

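Judging from everything below, both parts of the loss, the classification loss and the localization loss, are driven by the ground truth. The classification loss is computed over the ground-truth-labeled anchors: it is a cross entropy, so the only term actually used is the predicted probability of each anchor's ground-truth class. The localization loss is likewise driven by the ground-truth anchors, but only by the positive ones. This is natural, because negative ground-truth anchors have no matched bounding box (their box targets are all 0), so no box regression is needed for them.
As for the inputs: cls_pred is (B, N, C+1) and box_pred is (B, N, 4), where N is the total number of anchors; cls_target is (B, N) and box_target is (B, N, 4). Thanks to SSDTargetGenerator, used in SSDDefaultTrainTransform, the ground-truth labels come in the same per-anchor layout as the model predictions, which makes the loss easy to compute.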

Just Anchor Generation
Target Generation (Assign Label to Anchors)
Anchors in Training / Loss Function
Anchors in Validation / mAP

pred comes from cls_pred and has shape (B, N, C+1), where C is the number of object classes and C + 1 also counts the background class. This makes sense: the prediction has to carry a score for every possible class.
ct is cls_target, with shape (B, N).
pos = ct > 0 is the logical index of all positives. Its shape is still (B, N), and it is 1 wherever an anchor is matched to a positive class.
cls_loss = -nd.pick(pred, ct, axis=-1, keepdims=False): at first it may look odd to pick from pred with ct, but pred is (B, N, C+1) and ct is (B, N), so picking along the last axis works and the result is again (B, N).
pos = ct > 0: pos is a True/False mask matrix for the positive (non-background) anchors.
cls_loss = -nd.pick(pred, ct, axis=-1, keepdims=False): ct is (B, N) and stores the ground-truth label of every anchor. These labels are used as indices, so nd.pick(pred, ct, axis=-1, keepdims=False) extracts, for each anchor, the model's predicted value for that anchor's ground-truth class. cls_loss still has shape (B, N).
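The pick step can be mimicked with nested lists. This is a minimal sketch, assuming pred holds log-probabilities (a log-softmax of cls_pred), so that the negated pick is a cross entropy; log_softmax and pick_last_axis are illustrative helpers written for this note, not MXNet functions.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of one anchor's class scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def pick_last_axis(pred, ct):
    """Plain-Python analogue of picking along the last axis.

    pred has shape (B, N, C+1), ct has shape (B, N); the result has shape (B, N).
    """
    return [[pred[b][n][ct[b][n]] for n in range(len(ct[b]))]
            for b in range(len(ct))]

# B = 1 image, N = 2 anchors, C + 1 = 3 classes (index 0 = background).
cls_pred = [[[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]]]
ct = [[0, 2]]                                    # ground-truth label per anchor
pred = [[log_softmax(anchor) for anchor in img] for img in cls_pred]
cls_loss = [[-v for v in row] for row in pick_last_axis(pred, ct)]
print(cls_loss)
```

Each entry of cls_loss is the negative log-probability the model assigns to that anchor's ground-truth class, so it is small when the prediction is confident and correct.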
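rank = (cls_loss * (pos - 1)).argsort(axis=1).argsort(axis=1): here (pos - 1) turns the positive mask into a mask for negative (background) or ignored anchors, so positives become 0 while negatives and ignored anchors become -1. cls_loss * (pos - 1) is an element-wise product that zeroes the loss of the positives and keeps only the negated loss of the negative and ignored anchors; the shape is still (B, N). The first argsort(axis=1) returns, for each row, the indices that would sort that row in ascending order; for a 1-D array arr, index = arr.argsort() makes arr[index] the array sorted from smallest to largest. Because the values being sorted are negated losses, anchors with a larger loss come first. Applying argsort(axis=1) a second time gives the rank of each element of (cls_loss * (pos - 1)) within its row, where the smallest value gets rank 0. That is rank, so the code is correct.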
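The double argsort is easier to believe on a small row. The sketch below re-implements it with plain lists for a single image; argsort and rank_row are helper names invented for this note, and the loss values are made up.

```python
def argsort(values):
    """Indices that would sort values in ascending order (stable)."""
    return sorted(range(len(values)), key=lambda i: values[i])

def rank_row(cls_loss_row, pos_row):
    """Rank of each anchor by negated loss; positives are zeroed by (pos - 1)."""
    keyed = [loss * (p - 1) for loss, p in zip(cls_loss_row, pos_row)]
    return argsort(argsort(keyed))

cls_loss_row = [0.2, 1.5, 0.9, 0.4, 2.0]
pos_row = [0, 1, 0, 0, 0]          # anchor 1 is the only positive
rank = rank_row(cls_loss_row, pos_row)
print(rank)  # [3, 4, 1, 2, 0]
```

Anchor 4 has the largest negative-class loss and gets rank 0, while the positive anchor, whose value was zeroed, ends up last.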
pos.sum(axis=1) is the number of positive anchors in each row (each image), and pos.sum(axis=1) * self._negative_mining_ratio is the number of negative anchors to keep for that image. Since pos.sum(axis=1) collapses the (B, N) matrix into a 1-D array of length B, the trailing expand_dims(-1) in (pos.sum(axis=1) * self._negative_mining_ratio).expand_dims(-1) turns it back into a (B, 1) matrix. In rank < (pos.sum(axis=1) * self._negative_mining_ratio).expand_dims(-1), rank is ordered by negated loss, so the comparison selects the negatives with the largest loss, which is exactly hard negative mining. rank is still (B, N), so hard_negative is a (B, N) mask of the negative samples that will be used when computing the loss.
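A plain-Python sketch of the threshold comparison, with one limit per image applied across all N anchors of that image. hard_negative_mask is a hypothetical helper, and the rank values are made up for the example.

```python
def hard_negative_mask(rank, pos, ratio=3):
    """Per row, mark anchors whose rank is below ratio * (number of positives)."""
    mask = []
    for rank_row, pos_row in zip(rank, pos):
        limit = sum(pos_row) * ratio   # one threshold per image, shared by all N anchors
        mask.append([1 if r < limit else 0 for r in rank_row])
    return mask

# B = 2 images, N = 5 anchors; ranks as a double argsort would produce them.
rank = [[3, 4, 1, 2, 0], [0, 1, 2, 3, 4]]
pos = [[0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]   # the second image has no positives
hard_negative = hard_negative_mask(rank, pos, ratio=3)
print(hard_negative)  # [[0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]
```

An image without positives gets a limit of 0, so it contributes no hard negatives at all.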
(pos + hard_negative) > 0: pos + hard_negative is already a mask, and > 0 leaves it a mask. nd.where((pos + hard_negative) > 0, cls_loss, nd.zeros_like(cls_loss)) keeps the loss where this mask is 1, that is, for the positive samples and the selected hard negatives, and sets everything else to 0.
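The masking step, again as a nested-list sketch. keep_selected is an illustrative helper, and the mask values are chosen by hand for the example.

```python
def keep_selected(cls_loss, pos, hard_negative):
    """Keep the loss where (pos + hard_negative) > 0, otherwise write 0.0."""
    return [[loss if (p + h) > 0 else 0.0
             for loss, p, h in zip(loss_row, pos_row, neg_row)]
            for loss_row, pos_row, neg_row in zip(cls_loss, pos, hard_negative)]

cls_loss = [[0.2, 1.5, 0.9, 0.4, 2.0]]      # (B=1, N=5)
pos = [[0, 1, 0, 0, 0]]                     # anchor 1 is positive
hard_negative = [[0, 0, 1, 0, 1]]           # anchors 2 and 4 picked as hard negatives
masked_cls_loss = keep_selected(cls_loss, pos, hard_negative)
print(masked_cls_loss)  # [[0.0, 1.5, 0.9, 0.0, 2.0]]
```

Easy negatives (anchors 0 and 3 here) drop out of the classification loss entirely.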
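cls_losses.append(nd.sum(cls_loss, axis=0, exclude=True) / num_pos_all): a sum over axis=0 would normally give a vector of length N, but with exclude=True the result has length B. This is less strange than it looks. Without exclude=True, the sum runs over the given axis; with exclude=True, it runs over all the other axes and keeps the given one. Here cls_loss is (B, N) with axis=0 and exclude=True, so the result is a 1-D vector of length B, which is then divided by num_pos_all, the number of positives in the whole batch.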
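For a (B, N) input, summing with axis=0 and exclude=True amounts to a per-row sum. A small sketch with made-up numbers; sum_exclude_axis0 is a hypothetical helper name.

```python
def sum_exclude_axis0(matrix):
    """For a (B, N) input, sum over every axis except axis 0: one value per image."""
    return [sum(row) for row in matrix]

cls_loss = [[0.0, 1.5, 0.9, 0.0, 2.0],
            [0.5, 0.0, 0.0, 0.0, 0.0]]   # (B=2, N=5), already masked
num_pos_all = 2                          # assumed positive count over the whole batch
per_image = [s / num_pos_all for s in sum_exclude_axis0(cls_loss)]
print(per_image)
```

Note that every image is divided by the same batch-wide count, not by its own number of positives.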
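bp = _reshape_like(nd, bp, bt) reshapes bp to the shape of bt. It looks redundant here because both are already (B, N, 4); it presumably only matters when box_pred arrives in a flattened layout such as the (B, N*4) output noted for TinySSD.
box_loss = nd.abs(bp - bt) takes the element-wise absolute difference; box_loss is still a (B, N, 4) array.
box_loss = nd.where(box_loss > self._rho, box_loss - 0.5 * self._rho, (0.5 / self._rho) * nd.square(box_loss)) applies the smooth L1 definition: with x the absolute difference, the loss is x - 0.5 * rho when x > rho and (0.5 / rho) * x^2 otherwise. box_loss is still a (B, N, 4) array.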
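The piecewise rule is easy to check on scalars. A minimal sketch of the same formula for a single coordinate difference; smooth_l1 is a helper name chosen for this note.

```python
def smooth_l1(diff, rho=1.0):
    """Smooth L1 of one coordinate difference, with the transition at |diff| = rho."""
    a = abs(diff)
    if a > rho:
        return a - 0.5 * rho          # linear branch for large errors
    return (0.5 / rho) * a * a        # quadratic branch for small errors

for d in (0.5, 1.0, -2.0):
    print(d, smooth_l1(d))
```

The two branches meet at |diff| = rho, where both give 0.5 * rho, so the loss is continuous across the transition.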
box_loss = box_loss * pos.expand_dims(axis=-1): pos is the mask of positive samples, and pos.expand_dims(axis=-1) changes it from (B, N) to (B, N, 1). A (B, N, 1) array can be multiplied with a (B, N, 4) array thanks to broadcasting. The result keeps the localization loss of the positive anchors only and zeroes the rest.
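Broadcasting a (B, N, 1) mask over (B, N, 4) just means each anchor's mask value multiplies all four of its coordinates. A nested-list sketch with made-up values; mask_box_loss is a hypothetical helper.

```python
def mask_box_loss(box_loss, pos):
    """Multiply a (B, N, 4) loss by a (B, N) mask, one mask value per anchor."""
    return [[[v * p for v in anchor] for anchor, p in zip(img, pos_row)]
            for img, pos_row in zip(box_loss, pos)]

box_loss = [[[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0]]]   # (B=1, N=2, 4)
pos = [[1, 0]]                                               # only anchor 0 is positive
masked_box_loss = mask_box_loss(box_loss, pos)
print(masked_box_loss)  # [[[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]]]
```

The second anchor had a large box loss, but since it is not positive it contributes nothing.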
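box_losses.append(nd.sum(box_loss, axis=0, exclude=True) / num_pos_all) is the same kind of sum as for the classification loss: the result is a 1-D vector of length B that holds the loss of each sample.
sum_losses.append(cls_losses[-1] + self._lambd * box_losses[-1]) combines the confidence loss and the localization loss, weighting the latter by lambd.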
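The final combination matches the formula in the class docstring, L = loss_class + lambda * loss_loc, applied per sample. A short sketch with made-up per-image losses; combine_losses is an illustrative helper name.

```python
def combine_losses(cls_loss, box_loss, lambd=1.0):
    """Per-sample total loss: classification loss plus lambd times box loss."""
    return [c + lambd * b for c, b in zip(cls_loss, box_loss)]

cls_loss = [2.0, 0.25]     # per-image classification loss, length B
box_loss = [0.5, 0.0]      # per-image localization loss, length B
total = combine_losses(cls_loss, box_loss, lambd=2.0)
print(total)  # [3.0, 0.25]
```

With lambd = 1.0, the default, the two terms are simply added.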