YimianDai/SSD.md

## SSD.md

      
extra_spec = {
    300: [((256, 1, 1, 0), (512, 3, 2, 1)),
          ((128, 1, 1, 0), (256, 3, 2, 1)),
          ((128, 1, 1, 0), (256, 3, 1, 0)),
          ((128, 1, 1, 0), (256, 3, 1, 0))],

    512: [((256, 1, 1, 0), (512, 3, 2, 1)),
          ((128, 1, 1, 0), (256, 3, 2, 1)),
          ((128, 1, 1, 0), (256, 3, 2, 1)),
          ((128, 1, 1, 0), (256, 3, 2, 1)),
          ((128, 1, 1, 0), (256, 4, 1, 1))],
}
def __init__(self, layers, filters, extras, batch_norm=False, **kwargs):
    super(VGGAtrousExtractor, self).__init__(layers, filters, batch_norm, **kwargs)
    with self.name_scope():
        self.extras = nn.HybridSequential()
        for i, config in enumerate(extras):
            extra = nn.HybridSequential(prefix='extra%d_'%(i))
            with extra.name_scope():
                for f, k, s, p in config:
                    extra.add(nn.Conv2D(f, k, s, p, **self.init))
                    if batch_norm:
                        extra.add(nn.BatchNorm())
                    extra.add(nn.Activation('relu'))
            self.extras.add(extra)
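In `for i, config in enumerate(extras):`, each `config` is a tuple such as `((256, 1, 1, 0), (512, 3, 2, 1))`, which itself nests two inner tuples, one per convolution.

In `extra.add(nn.Conv2D(f, k, s, p, **self.init))`, the first four parameters of `nn.Conv2D` are `channels, kernel_size, strides=(1, 1), padding=(0, 0)`, so each inner tuple reads as `(channels, kernel_size, stride, padding)`.

It follows that the first two extra layers in `extra_spec[300]` each apply a convolution with stride 2.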
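To make the nesting concrete, here is a minimal sketch that unpacks `extra_spec[300]` the same way the loop in `__init__` does. It is plain Python with no MXNet dependency; the helper name `describe_extras` is hypothetical and not part of GluonCV.

```python
# The 300-input entry of extra_spec, copied from the listing above.
extra_spec_300 = [
    ((256, 1, 1, 0), (512, 3, 2, 1)),
    ((128, 1, 1, 0), (256, 3, 2, 1)),
    ((128, 1, 1, 0), (256, 3, 1, 0)),
    ((128, 1, 1, 0), (256, 3, 1, 0)),
]


def describe_extras(extras):
    """Hypothetical helper: one (block, channels, kernel, stride, pad) row per conv."""
    rows = []
    for i, config in enumerate(extras):
        # Each config holds two inner tuples: a 1x1 conv, then a 3x3 conv.
        for f, k, s, p in config:
            rows.append((i, f, k, s, p))
    return rows


rows = describe_extras(extra_spec_300)
for i, f, k, s, p in rows:
    print(f"extra{i}: channels={f} kernel={k} stride={s} pad={p}")

# Indices of the extra blocks that contain a stride-2 conv (the downsampling ones).
stride2_blocks = sorted({i for i, f, k, s, p in rows if s == 2})
print(stride2_blocks)
```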
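For a (480, 480) input, the six output feature maps have these shapes:

- 1x512x60x60
- 1x1024x30x30
- 1x512x15x15
- 1x256x8x8
- 1x256x6x6
- 1x256x4x4

With 4, 6, 6, 6, 4, and 4 anchors per position on the six layers, the total number of anchors is:

`60*60*4 + 30*30*6 + 15*15*6 + 8*8*6 + 6*6*4 + 4*4*4 = 21742`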
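The anchor total can be checked with a few lines of Python. The per-position anchor counts are read directly off the sum above; the variable names are only illustrative.

```python
# Feature map side lengths for a (480, 480) input, as listed above.
feature_sizes = [60, 30, 15, 8, 6, 4]
# Anchors per spatial position on each layer, read off the sum above.
anchors_per_position = [4, 6, 6, 6, 4, 4]

# Each layer contributes height * width * anchors_per_position anchors.
per_layer = [s * s * a for s, a in zip(feature_sizes, anchors_per_position)]
total_anchors = sum(per_layer)

print(per_layer)
print(total_anchors)
```

The first layer alone accounts for roughly two thirds of all anchors, because its feature map is the largest.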
For the `VGGAtrousBase` class:
with self.name_scope():
    # we use pre-trained weights from caffe, initial scale must change
    init_scale = mx.nd.array([0.229, 0.224, 0.225]).reshape((1, 3, 1, 1)) * 255
    self.init_scale = self.params.get_constant('init_scale', init_scale)
    self.stages = nn.HybridSequential()
    for l, f in zip(layers, filters):
        stage = nn.HybridSequential(prefix='')
        with stage.name_scope():
            for _ in range(l):
                stage.add(nn.Conv2D(f, kernel_size=3, padding=1, **self.init))
                if batch_norm:
                    stage.add(nn.BatchNorm())
                stage.add(nn.Activation('relu'))
        self.stages.add(stage)

    # use dilated convolution instead of dense layers
    stage = nn.HybridSequential(prefix='dilated_')
    with stage.name_scope():
        stage.add(nn.Conv2D(1024, kernel_size=3, padding=6, dilation=6, **self.init))
        if batch_norm:
            stage.add(nn.BatchNorm())
        stage.add(nn.Activation('relu'))
        stage.add(nn.Conv2D(1024, kernel_size=1, **self.init))
        if batch_norm:
            stage.add(nn.BatchNorm())
        stage.add(nn.Activation('relu'))
    self.stages.add(stage)

    # normalize layer for 4-th stage
    self.norm4 = Normalize(filters[3], 20)
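As shown, the network structure of `VGGAtrousBase` consists of two parts, `self.stages` and `self.norm4`. `self.stages` is itself split into an earlier part built from ordinary convolutions and a later part built from dilated convolutions.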
vgg_spec = {
    11: ([1, 1, 2, 2, 2], [64, 128, 256, 512, 512]),
    13: ([2, 2, 2, 2, 2], [64, 128, 256, 512, 512]),
    16: ([2, 2, 3, 3, 3], [64, 128, 256, 512, 512]),
    19: ([2, 2, 4, 4, 4], [64, 128, 256, 512, 512])
}
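Since we use VGG-16, `layers` is `[2, 2, 3, 3, 3]` and `filters` is `[64, 128, 256, 512, 512]`.

In the ordinary convolution stages, `kernel_size=3` with `padding=1` leaves the feature map size unchanged, so this part simply adds 13 conv layers (2 + 2 + 3 + 3 + 3), each followed by a ReLU. A BatchNorm can optionally be inserted before each ReLU; it is off by default.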
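As a quick check of the layer count, the sketch below expands the VGG-16 entry of `vgg_spec` into one output-channel value per conv layer. Only the `vgg_spec` values come from the listing above; the other names are illustrative.

```python
vgg_spec = {
    11: ([1, 1, 2, 2, 2], [64, 128, 256, 512, 512]),
    13: ([2, 2, 2, 2, 2], [64, 128, 256, 512, 512]),
    16: ([2, 2, 3, 3, 3], [64, 128, 256, 512, 512]),
    19: ([2, 2, 4, 4, 4], [64, 128, 256, 512, 512]),
}

layers, filters = vgg_spec[16]

# Total number of 3x3 conv layers across the five stages.
num_convs = sum(layers)

# Output channels of every conv layer, in order, mirroring the nested loop
# `for l, f in zip(layers, filters): for _ in range(l): ...`.
conv_channels = [f for l, f in zip(layers, filters) for _ in range(l)]

print(num_convs)
print(conv_channels)
```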
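The default is `dilation=(1, 1)`. As I understand it, dilation is the spacing between neighboring kernel taps: normally the next tap is 1 position away, while with `dilation=6` the next tap is 6 positions away. A 3x3 kernel with `dilation=6` therefore spans the same extent as an ordinary (13, 13) kernel, since 6 * (3 - 1) + 1 = 13. That makes `padding=6` correct, so the dilated conv does not change the feature map size either.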
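The arithmetic above can be written out as a small sketch using the standard convolution output-size formula. The function names are illustrative, not MXNet APIs.

```python
def effective_kernel(kernel_size, dilation):
    """Extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def conv_out_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """Standard conv output size along one axis (floor division)."""
    k_eff = effective_kernel(kernel_size, dilation)
    return (in_size + 2 * padding - k_eff) // stride + 1


# The dilated conv in VGGAtrousBase: kernel_size=3, padding=6, dilation=6.
k_eff = effective_kernel(3, 6)
dilated_out = conv_out_size(30, 3, padding=6, dilation=6)

# An ordinary 3x3 conv with padding=1, for comparison.
plain_out = conv_out_size(30, 3, padding=1)

print(k_eff, dilated_out, plain_out)
```

Both convolutions preserve the spatial size because in each case the padding equals half of the effective kernel extent minus one.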
For the output of the first layer:
for stage in self.stages[:3]:
    x = stage(x)
    x = F.Pooling(x, pool_type='max', kernel=(2, 2), stride=(2, 2),
                  pooling_convention='full')
x = self.stages[3](x)
norm = self.norm4(x)
outputs.append(norm)
The stages themselves do not change the feature map size, so only the three poolings matter: the width and height of the first layer's feature map are 1/8 of the input. For a (480, 480) input, layer 1 is (60, 60).
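A short sketch of the pooling arithmetic follows. In MXNet, `pooling_convention='full'` rounds the output size up (ceil), whereas `'valid'` rounds down (floor); the helper `pool_out_size` is a hypothetical name that mirrors that rule along one axis.

```python
import math


def pool_out_size(in_size, kernel, stride, pad=0, convention="full"):
    """Pooling output size along one axis; 'full' uses ceil, 'valid' uses floor."""
    span = in_size + 2 * pad - kernel
    if convention == "full":
        return math.ceil(span / stride) + 1
    return span // stride + 1


def after_three_pools(in_size, convention="full"):
    """Apply the three 2x2, stride-2 poolings that precede stage 3."""
    size = in_size
    for _ in range(3):
        size = pool_out_size(size, 2, 2, convention=convention)
    return size


layer1_480 = after_three_pools(480)
layer1_300_full = after_three_pools(300)
layer1_300_valid = after_three_pools(300, convention="valid")

print(layer1_480, layer1_300_full, layer1_300_valid)
```

For a 480 input the two conventions agree because every intermediate size is even. For a 300 input the third pooling sees an odd size (75), and only the `'full'` convention yields the familiar 38x38 map.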
For the output of the second layer:
x = F.Pooling(x, pool_type='max', kernel=(2, 2), stride=(2, 2),
              pooling_convention='full')
x = self.stages[4](x)
x = F.Pooling(x, pool_type='max', kernel=(3, 3), stride=(1, 1), pad=(1, 1),
              pooling_convention='full')
x = self.stages[5](x)
outputs.append(x)
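Although two max poolings are applied here, note that the second one has stride 1 and uses padding, so it does not shrink the feature map. The width and height of the second layer's feature map are therefore 1/2 of the previous layer and 1/16 of the input. For a (480, 480) input, layer 2 is (30, 30).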
Layers 3 to 6 are all extras.

From the values in `extra_spec[300]`, layer 3 applies a conv with stride = 2 and pad = 1, which halves the feature map again. The width and height of layer 3 are therefore 1/2 of the previous layer and 1/32 of the input. For a (480, 480) input, layer 3 is (15, 15).

Layer 4 applies another conv with stride = 2 and pad = 1, which again roughly halves the feature map, so the width and height of layer 4 are about 1/2 of the previous layer and about 1/64 of the input. For a (480, 480) input, layer 4 is (8, 8).

Layer 5 applies a conv with kernel size 3 and stride = 1 but pad = 0, so the feature map shrinks by 2. For a (480, 480) input, layer 4 is (8, 8), so layer 5 is (6, 6).

Layer 6, like layer 5, applies a conv with kernel size 3, stride = 1, and pad = 0, so the feature map shrinks by 2 again. For a (480, 480) input, layer 5 is (6, 6), so layer 6 is (4, 4).
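The whole size derivation for the six layers can be traced in one place. This is a minimal sketch under the assumptions stated in the notes: the VGG stages preserve spatial size, pooling uses the ceil-based `'full'` convention, and convolutions use the standard floor-based formula. The function names are illustrative, not GluonCV APIs.

```python
import math

# The 300-input entry of extra_spec: (channels, kernel, stride, pad) per conv.
extra_spec_300 = [
    ((256, 1, 1, 0), (512, 3, 2, 1)),
    ((128, 1, 1, 0), (256, 3, 2, 1)),
    ((128, 1, 1, 0), (256, 3, 1, 0)),
    ((128, 1, 1, 0), (256, 3, 1, 0)),
]


def pool_full(size, kernel, stride, pad=0):
    """Pooling output size with pooling_convention='full' (ceil)."""
    return math.ceil((size + 2 * pad - kernel) / stride) + 1


def conv(size, kernel, stride, pad):
    """Convolution output size (floor)."""
    return (size + 2 * pad - kernel) // stride + 1


def ssd_feature_sizes(input_size, extras):
    """Side length of each output feature map, layer 1 through layer 6."""
    sizes = []
    x = input_size
    for _ in range(3):              # stages 0-2, each followed by 2x2-s2 pooling
        x = pool_full(x, 2, 2)
    sizes.append(x)                 # layer 1: stage 3 + norm4, size unchanged
    x = pool_full(x, 2, 2)          # 2x2-s2 pooling before stage 4
    x = pool_full(x, 3, 1, pad=1)   # 3x3-s1 pooling with padding keeps the size
    sizes.append(x)                 # layer 2: dilated stage, size unchanged
    for config in extras:           # layers 3-6
        for f, k, s, p in config:
            x = conv(x, k, s, p)
        sizes.append(x)
    return sizes


sizes_480 = ssd_feature_sizes(480, extra_spec_300)
sizes_300 = ssd_feature_sizes(300, extra_spec_300)
print(sizes_480)
print(sizes_300)
```

Running the same trace with a 300 input gives the feature map sizes usually quoted for SSD300.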
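Why does `VGGAtrousBase` use a dilated conv with `dilation=6`? What does "use the atrous algorithm to fill the 'holes'" mean, and where do the holes come from? See the blog post "SSD 里的 atrous". The technique exists so that the VGG weights pretrained on ImageNet can be reused: once the 2×2-s2 pooling is removed, the following layer needs dilation to keep the same receptive field.

The relevant change in the paper is: change pool5 from 2 × 2 − s2 to 3 × 3 − s1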
The original receptive field is:

           g

       a   b   c      2*2 pooling, stride 2

      x y z p q e      3*3 conv, stride 1

    0 1 2 3 4 5 6 0

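`gluoncv/model_zoo/ssd/presets.py` defines a number of SSD networks. Their code is essentially identical: each one calls the `get_ssd` function and needs to specify:

- Backbone: `name`, `base_size` (used as `im_size` in `SSDAnchorGenerator`), `features`, `filters`
- Anchor: `sizes` (anchor sizes), `ratios` (anchor aspect ratios), `steps` (the stride at which anchors are laid out, which is also the downsampling factor)
- Predictor: `classes` (used to set the number of prediction classes)
- Parameters: `pretrained`, `pretrained_base`, `dataset` (determines the pretrained weight file name)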

...

- vgg16_atrous
  - 300
    - voc: `sizes=[30, 60, 111, 162, 213, 264, 315]`
    - coco: `sizes=[21, 45, 99, 153, 207, 261, 315]`
    - custom: `sizes=[21, 45, 99, 153, 207, 261, 315]`, which shows that custom follows the coco setting
  - 512
    - voc: `sizes=[51.2, 76.8, 153.6, 230.4, 307.2, 384.0, 460.8, 537.6]`
    - coco: `sizes=[51.2, 76.8, 153.6, 230.4, 307.2, 384.0, 460.8, 537.6]`
    - custom: `sizes=[51.2, 76.8, 153.6, 230.4, 307.2, 384.0, 460.8, 537.6]`
- resnet18_v1
  - 512
    - voc: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
    - coco: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
    - custom: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
- resnet50_v1
  - 512
    - voc: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
    - coco: `sizes=[51.2, 133.12, 215.04, 296.96, 378.88, 460.8, 542.72]`
    - custom: `sizes=[51.2, 133.12, 215.04, 296.96, 378.88, 460.8, 542.72]`
- resnet101_v2
  - 512
    - voc: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
- resnet152_v2
  - 512
    - voc: `sizes=[51.2, 76.8, 153.6, 230.4, 307.2, 384.0, 460.8, 537.6]`
- mobilenet1_0
  - 512
    - voc: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
    - coco: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
    - custom: `sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]`
- mobilenet0_25
  - 300
    - voc: `sizes=[21, 45, 99, 153, 207, 261, 315]`
    - coco: `sizes=[21, 45, 99, 153, 207, 261, 315]`
    - custom: `sizes=[21, 45, 99, 153, 207, 261, 315]`
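The vgg16_atrous 300 lists have seven entries while the network has six output layers. One plausible reading, stated here as an assumption rather than something these notes confirm, is that consecutive entries are paired so that each layer gets a (smaller, larger) size pair. A minimal sketch of that pairing:

```python
# sizes for vgg16_atrous, 300, voc, copied from the list above.
voc_300_sizes = [30, 60, 111, 162, 213, 264, 315]

# Assumption: layer i uses the pair (sizes[i], sizes[i + 1]).
size_pairs = list(zip(voc_300_sizes[:-1], voc_300_sizes[1:]))

for layer, (small, large) in enumerate(size_pairs, start=1):
    print(f"layer {layer}: sizes=({small}, {large})")
```

Under this reading, a list of N + 1 sizes serves N output layers, which matches the six layers traced earlier for the 300 configuration.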