Papers Proposing Deep Neural Network Models

Network settings (Caffe-style)

  • input dim: 3 x 16 x 128 x 171, jittered by random crops to 3 x 16 x 112 x 112.

  • 1st layer group
    conv1a: kernel: 64 * 3 x 3 x 3, stride: 1, padding: 1, output dim: 64 x 16 x 112 x 112. (All conv layers below use stride 1 and padding 1, as their unchanged output sizes imply.)
    relu1a: RELU.
    pool1: MAX, kernel: 1 x 2 x 2, stride: [1, 2, 2], output dim: 64 x 16 x 56 x 56.

  • 2nd layer group
    conv2a: kernel: 128 * 3 x 3 x 3, output dim: 128 x 16 x 56 x 56.
    relu2a: RELU.
    pool2: MAX, kernel: 2 x 2 x 2, stride: [2, 2, 2], output dim: 128 x 8 x 28 x 28.

  • 3rd layer group
    conv3a: kernel: 256 * 3 x 3 x 3, output dim: 256 x 8 x 28 x 28.
    relu3a: RELU.
    conv3b: kernel: 256 * 3 x 3 x 3, output dim: 256 x 8 x 28 x 28.
    relu3b: RELU.
    pool3: MAX, 2 x 2 x 2, stride: [2, 2, 2], output dim: 256 x 4 x 14 x 14.

  • 4th layer group
    conv4a: 512 * 3 x 3 x 3, output dim: 512 x 4 x 14 x 14.
    relu4a: RELU.
    conv4b: 512 * 3 x 3 x 3, output dim: 512 x 4 x 14 x 14.
    relu4b: RELU.
    pool4: MAX, kernel: 2 x 2 x 2, stride: [2, 2, 2], output dim: 512 x 2 x 7 x 7.

  • 5th layer group
    conv5a: kernel: 512 * 3 x 3 x 3, output dim: 512 x 2 x 7 x 7.
    relu5a: RELU.
    conv5b: kernel: 512 * 3 x 3 x 3, output dim: 512 x 2 x 7 x 7.
    relu5b: RELU.
    pool5: MAX, kernel: 2 x 2 x 2, stride: [2, 2, 2], spatial padding: 1, output dim: 512 x 1 x 4 x 4.

  • fc layers
    fc6-1: output dim: 4096
    relu6: RELU
    drop6: DROPOUT, 0.5
    fc7-1: output dim: 4096
    relu7: RELU
    drop7: DROPOUT, 0.5
    fc8-1: output dim: 487
    prob: SOFTMAX [accuracy: ACCURACY]

    Modification (no temporal padding in the convolutions, so each 3 x 3 x 3 conv shrinks the temporal dimension by 2): 3 x 16 x 112 x 112 -> conv1a -> 64 x 14 x 112 x 112 -> pooling (2 x 2 x 2) -> 64 x 7 x 56 x 56 -> conv2a -> 128 x 5 x 56 x 56 -> conv3a -> 256 x 3 x 56 x 56 -> conv3b -> 256 x 1 x 56 x 56.
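For concreteness, here is a minimal PyTorch sketch of the layer list above (the sketch assumes padding 1 on every conv and spatial padding 1 on pool5, which the listed output sizes imply; the class and variable names are ours, not part of the spec):

```python
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    """Sketch of the Caffe-style 3D ConvNet above; comments name the spec's layers."""
    def __init__(self, num_classes=487):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(inplace=True),    # conv1a / relu1a
            nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),                # pool1
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(inplace=True),  # conv2a / relu2a
            nn.MaxPool3d(2, stride=2),                                # pool2
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(inplace=True), # conv3a / relu3a
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True), # conv3b / relu3b
            nn.MaxPool3d(2, stride=2),                                # pool3
            nn.Conv3d(256, 512, 3, padding=1), nn.ReLU(inplace=True), # conv4a / relu4a
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True), # conv4b / relu4b
            nn.MaxPool3d(2, stride=2),                                # pool4
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True), # conv5a / relu5a
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True), # conv5b / relu5b
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),             # pool5: 2x7x7 -> 1x4x4
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc6-1
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),             # fc7-1
            nn.Linear(4096, num_classes),                                              # fc8-1
        )

    def forward(self, x):           # x: (N, 3, 16, 112, 112)
        x = self.features(x)        # -> (N, 512, 1, 4, 4)
        return self.classifier(x.flatten(1))

clip = torch.randn(1, 3, 16, 112, 112)
print(C3DSketch()(clip).shape)      # torch.Size([1, 487])
```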

2015

  • Neural Module Networks. Jacob Andreas, Marcus Rohrbach, Trevor Darrell and Dan Klein. CVPR 2016.
    Visual question answering is fundamentally compositional in nature: a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
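A toy sketch of the composition idea (not the paper's implementation: the Find and Describe module bodies below are placeholders we made up, and the layout is hard-coded rather than produced by a parser). A question like "where is the dog?" maps to a layout such as describe[where](find[dog](image)), assembled from reusable, jointly trained modules:

```python
import torch
import torch.nn as nn

class Find(nn.Module):            # find[dog]: image features -> attention map
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)
    def forward(self, feats):
        return torch.sigmoid(self.conv(feats))

class Describe(nn.Module):        # describe[where]: attended features -> answer logits
    def __init__(self, dim, num_answers):
        super().__init__()
        self.fc = nn.Linear(dim, num_answers)
    def forward(self, feats, attn):
        pooled = (feats * attn).flatten(2).mean(-1)   # attention-weighted pooling
        return self.fc(pooled)

# One shared instance per module type/argument; in the paper a parse of the
# question selects which modules to wire together for each example.
modules = {"find[dog]": Find(256), "describe[where]": Describe(256, 1000)}

feats = torch.randn(1, 256, 14, 14)               # CNN image features
attn = modules["find[dog]"](feats)                # layout: describe[where](find[dog](image))
logits = modules["describe[where]"](feats, attn)
print(logits.shape)                               # torch.Size([1, 1000])
```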

2016

2017

  • Depthwise Separable Convolutions for Neural Machine Translation. Lukasz Kaiser, Aidan N. Gomez, Francois Chollet.
    Depthwise separable convolutions reduce the number of parameters and computation used in convolutional operations while increasing representational efficiency. They have been shown to be successful in image classification models, both in obtaining better models than previously possible for a given parameter count (the Xception architecture) and considerably reducing the number of parameters required to perform at a given level (the MobileNets family of architectures). Recently, convolutional sequence-to-sequence networks have been applied to machine translation tasks with good results. In this work, we study how depthwise separable convolutions can be applied to neural machine translation. We introduce a new architecture inspired by Xception and ByteNet, called SliceNet, which enables a significant reduction of the parameter count and amount of computation needed to obtain results like ByteNet, and, with a similar parameter count, achieves new state-of-the-art results. In addition to showing that depthwise separable convolutions perform well for machine translation, we investigate the architectural changes that they enable: we observe that thanks to depthwise separability, we can increase the length of convolution windows, removing the need for filter dilation. We also introduce a new "super-separable" convolution operation that further reduces the number of parameters and computational cost for obtaining state-of-the-art results.
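For reference, a depthwise separable convolution factors a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) channel-mixing convolution, cutting the parameter count from c_in * c_out * k to roughly c_in * k + c_in * c_out. A minimal 1-D PyTorch sketch (1-D to match the sequence setting; the class name and sizes are ours, not SliceNet's):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise (per-channel) conv followed by a 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.depthwise = nn.Conv1d(c_in, c_in, kernel_size=k,
                                   padding=k // 2, groups=c_in)  # one filter per channel
        self.pointwise = nn.Conv1d(c_in, c_out, kernel_size=1)   # mixes channels
    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 512, 100)                    # (batch, channels, sequence length)
sep = DepthwiseSeparableConv1d(512, 512, k=9)   # wide window, as separability allows
full = nn.Conv1d(512, 512, kernel_size=9, padding=4)
print(sum(p.numel() for p in sep.parameters()))   # 267,776 params
print(sum(p.numel() for p in full.parameters()))  # 2,359,808 params
```

The roughly 9x parameter saving here is what lets such architectures afford longer convolution windows instead of filter dilation, as the abstract notes.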