Papers Proposing Deep Neural Network Models

Network settings (Caffe-style)

  • input dim: 3 x 16 x 128 x 171, jittered by random crops to 3 x 16 x 112 x 112.

  • 1st layer group
    conv1a: kernel: 64 * 3 x 3 x 3, stride: 1, padding: 1, output dim: 64 x 16 x 112 x 112. (All conv layers below use stride 1 and padding 1, as their unchanged output sizes imply.)
    relu1a: RELU.
    pool1: MAX, kernel: 1 x 2 x 2, stride: [1, 2, 2], output dim: 64 x 16 x 56 x 56.

  • 2nd layer group
    conv2a: kernel: 128 * 3 x 3 x 3, output dim: 128 x 16 x 56 x 56.
    relu2a: RELU.
    pool2: MAX, kernel: 2 x 2 x 2, stride: [2, 2, 2], output dim: 128 x 8 x 28 x 28.

  • 3rd layer group
    conv3a: kernel: 256 * 3 x 3 x 3, output dim: 256 x 8 x 28 x 28.
    relu3a: RELU.
    conv3b: kernel: 256 * 3 x 3 x 3, output dim: 256 x 8 x 28 x 28.
    relu3b: RELU.
    pool3: MAX, 2 x 2 x 2, stride: [2, 2, 2], output dim: 256 x 4 x 14 x 14.

  • 4th layer group
    conv4a: 512 * 3 x 3 x 3, output dim: 512 x 4 x 14 x 14.
    relu4a: RELU.
    conv4b: 512 * 3 x 3 x 3, output dim: 512 x 4 x 14 x 14.
    relu4b: RELU.
    pool4: MAX, kernel: 2 x 2 x 2, stride: [2, 2, 2], output dim: 512 x 2 x 7 x 7.

  • 5th layer group
    conv5a: kernel: 512 * 3 x 3 x 3, output dim: 512 x 2 x 7 x 7.
    relu5a: RELU.
    conv5b: kernel: 512 * 3 x 3 x 3, output dim: 512 x 2 x 7 x 7.
    relu5b: RELU.
    pool5: MAX, kernel: 2 x 2 x 2, stride: [2, 2, 2], spatial padding: 1, output dim: 512 x 1 x 4 x 4.

  • fc layers
    fc6-1: output dim: 4096
    relu6: RELU
    drop6: DROPOUT, 0.5
    fc7-1: output dim: 4096
    relu7: RELU
    drop7: DROPOUT, 0.5
    fc8-1: output dim: 487
    prob: SOFTMAX [accuracy: ACCURACY]

    Modification (no temporal padding in the convolutions, so each 3 x 3 x 3 conv shrinks the temporal dimension by 2): 3 x 16 x 112 x 112 -> conv1a -> 64 x 14 x 112 x 112 -> pooling (2 x 2 x 2) -> 64 x 7 x 56 x 56 -> conv2a -> 128 x 5 x 56 x 56 -> conv3a -> 256 x 3 x 56 x 56 -> conv3b -> 256 x 1 x 56 x 56.
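For concreteness, here is a minimal PyTorch sketch of the layer list above (the sketch assumes padding 1 on every conv and spatial padding 1 on pool5, which the listed output sizes imply; the class and variable names are ours, not part of the spec):

```python
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    """Sketch of the Caffe-style 3D ConvNet above; comments name the spec's layers."""
    def __init__(self, num_classes=487):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(inplace=True),    # conv1a / relu1a
            nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),                # pool1
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(inplace=True),  # conv2a / relu2a
            nn.MaxPool3d(2, stride=2),                                # pool2
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(inplace=True), # conv3a / relu3a
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True), # conv3b / relu3b
            nn.MaxPool3d(2, stride=2),                                # pool3
            nn.Conv3d(256, 512, 3, padding=1), nn.ReLU(inplace=True), # conv4a / relu4a
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True), # conv4b / relu4b
            nn.MaxPool3d(2, stride=2),                                # pool4
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True), # conv5a / relu5a
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True), # conv5b / relu5b
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),             # pool5: 2x7x7 -> 1x4x4
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc6-1
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),             # fc7-1
            nn.Linear(4096, num_classes),                                              # fc8-1
        )

    def forward(self, x):           # x: (N, 3, 16, 112, 112)
        x = self.features(x)        # -> (N, 512, 1, 4, 4)
        return self.classifier(x.flatten(1))

clip = torch.randn(1, 3, 16, 112, 112)
print(C3DSketch()(clip).shape)      # torch.Size([1, 487])
```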

2015

  • Neural Module Networks. Jacob Andreas, Marcus Rohrbach, Trevor Darrell and Dan Klein. CVPR 2016.
    Visual question answering is fundamentally compositional in nature: a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
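A toy sketch of the composition idea (not the paper's implementation: the Find and Describe module bodies below are placeholders we made up, and the layout is hard-coded rather than produced by a parser). A question like "where is the dog?" maps to a layout such as describe[where](find[dog](image)), assembled from reusable, jointly trained modules:

```python
import torch
import torch.nn as nn

class Find(nn.Module):            # find[dog]: image features -> attention map
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)
    def forward(self, feats):
        return torch.sigmoid(self.conv(feats))

class Describe(nn.Module):        # describe[where]: attended features -> answer logits
    def __init__(self, dim, num_answers):
        super().__init__()
        self.fc = nn.Linear(dim, num_answers)
    def forward(self, feats, attn):
        pooled = (feats * attn).flatten(2).mean(-1)   # attention-weighted pooling
        return self.fc(pooled)

# One shared instance per module type/argument; in the paper a parse of the
# question selects which modules to wire together for each example.
modules = {"find[dog]": Find(256), "describe[where]": Describe(256, 1000)}

feats = torch.randn(1, 256, 14, 14)               # CNN image features
attn = modules["find[dog]"](feats)                # layout: describe[where](find[dog](image))
logits = modules["describe[where]"](feats, attn)
print(logits.shape)                               # torch.Size([1, 1000])
```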

2016

2017

  • Depthwise Separable Convolutions for Neural Machine Translation. Lukasz Kaiser, Aidan N. Gomez, Francois Chollet.
    Depthwise separable convolutions reduce the number of parameters and computation used in convolutional operations while increasing representational efficiency. They have been shown to be successful in image classification models, both in obtaining better models than previously possible for a given parameter count (the Xception architecture) and considerably reducing the number of parameters required to perform at a given level (the MobileNets family of architectures). Recently, convolutional sequence-to-sequence networks have been applied to machine translation tasks with good results. In this work, we study how depthwise separable convolutions can be applied to neural machine translation. We introduce a new architecture inspired by Xception and ByteNet, called SliceNet, which enables a significant reduction of the parameter count and amount of computation needed to obtain results like ByteNet, and, with a similar parameter count, achieves new state-of-the-art results. In addition to showing that depthwise separable convolutions perform well for machine translation, we investigate the architectural changes that they enable: we observe that thanks to depthwise separability, we can increase the length of convolution windows, removing the need for filter dilation. We also introduce a new "super-separable" convolution operation that further reduces the number of parameters and computational cost for obtaining state-of-the-art results.
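For reference, a depthwise separable convolution factors a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) channel-mixing convolution, cutting the parameter count from c_in * c_out * k to roughly c_in * k + c_in * c_out. A minimal 1-D PyTorch sketch (1-D to match the sequence setting; the class name and sizes are ours, not SliceNet's):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise (per-channel) conv followed by a 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.depthwise = nn.Conv1d(c_in, c_in, kernel_size=k,
                                   padding=k // 2, groups=c_in)  # one filter per channel
        self.pointwise = nn.Conv1d(c_in, c_out, kernel_size=1)   # mixes channels
    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 512, 100)                    # (batch, channels, sequence length)
sep = DepthwiseSeparableConv1d(512, 512, k=9)   # wide window, as separability allows
full = nn.Conv1d(512, 512, kernel_size=9, padding=4)
print(sum(p.numel() for p in sep.parameters()))   # 267,776 params
print(sum(p.numel() for p in full.parameters()))  # 2,359,808 params
```

The roughly 9x parameter saving here is what lets such architectures afford longer convolution windows instead of filter dilation, as the abstract notes.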