Summary of "Conditional Image Generation with PixelCNN Decoders" paper

Conditional Image Generation with PixelCNN Decoders

Introduction

  • The paper explores conditional image generation by adapting and improving the PixelCNN architecture.
  • Link to the paper

Based on PixelRNN and PixelCNN

  • Both models generate an image pixel by pixel by decomposing the joint image distribution into a product of conditionals.
  • PixelRNN uses two-dimensional LSTM while PixelCNN uses convolutional networks.
  • PixelRNN gives better results but PixelCNN is faster to train.
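
The product-of-conditionals decomposition can be written out as follows. This is a sketch of the standard autoregressive factorization, assuming an n × n image whose pixels x_1, ..., x_{n^2} are taken in raster-scan order (row by row, left to right); the symbol names are chosen here for illustration.

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \dots, x_{i-1})
```

Each factor is the distribution of one pixel given all pixels above it and to its left.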

Gated PixelCNN

  • PixelRNN outperforms PixelCNN because it has a larger receptive field and because it contains multiplicative units (the LSTM gates), which allow it to model more complex interactions.
  • To close these two gaps, deeper models and gated activation units (equation 2 in the paper) can be used, respectively.
  • Masked convolutions can lead to blind spots in the receptive fields.
  • These can be removed by combining 2 convolutional network stacks:
    • Horizontal stack - conditions on the current row (the pixels to the left of the current pixel).
    • Vertical stack - conditions on all rows above the current row.
  • Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack.
  • Residual connections are used in the horizontal stack and not in the vertical stack (as they did not seem to improve results in the initial settings).
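
A minimal NumPy sketch of two ingredients mentioned above: a causal mask for a single square kernel, and a gated activation of the form tanh(f) * sigmoid(g). The function names `causal_mask` and `gated_activation` are hypothetical, the mask ignores the ordering of colour channels, and the gate assumes the two branches arrive as one feature map whose channels are split in half. It is an illustration, not the paper's implementation.

```python
import numpy as np


def causal_mask(kernel_size, include_center):
    """Return a k x k mask: 1 where the kernel may look, 0 elsewhere.

    Assumption: odd kernel_size, single channel (no colour ordering).
    """
    k = kernel_size
    c = k // 2
    mask = np.zeros((k, k), dtype=np.float32)
    mask[:c, :] = 1.0       # every row above the current pixel
    mask[c, :c] = 1.0       # pixels to the left in the current row
    if include_center:
        mask[c, c] = 1.0    # the pixel's own position, for later layers
    return mask


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_activation(features):
    """Split the last axis in half and combine as tanh(f) * sigmoid(g)."""
    f, g = np.split(features, 2, axis=-1)
    return np.tanh(f) * sigmoid(g)


if __name__ == "__main__":
    print(causal_mask(3, include_center=False))
    print(gated_activation(np.zeros((1, 2, 2, 4))).shape)
```

The sigmoid branch acts as a multiplicative gate on the tanh branch, which plays the role that the LSTM gates play in PixelRNN.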

Conditional PixelCNN

  • Models the conditional distribution of an image given a high-level description of the image, represented as a latent vector h (equation 4 in the paper).
  • This conditioning does not depend on the location of the pixel in the image.
  • To take the location into account as well, map h to a spatial representation s = m(h) (equation 5 in the paper).
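
The two conditioning variants can be sketched in NumPy as below. This assumes the masked-convolution outputs for the tanh and sigmoid branches are already computed, and that the conditioning enters as an extra term added inside each branch before the nonlinearity. All names (`conditional_gate`, `spatial_conditional_gate`, `V_f`, `V_g`) and array shapes are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conditional_gate(conv_f, conv_g, h, V_f, V_g):
    """Location-independent conditioning on a vector h.

    conv_f, conv_g: (H, W, p) masked-convolution outputs (assumed given).
    h: (d,) latent vector; V_f, V_g: (d, p) projection matrices.
    The projected h is broadcast, so every pixel gets the same bias.
    """
    return np.tanh(conv_f + h @ V_f) * sigmoid(conv_g + h @ V_g)


def spatial_conditional_gate(conv_f, conv_g, s, V_f, V_g):
    """Location-dependent conditioning on a spatial map s = m(h).

    s: (H, W, c) spatial representation; V_f, V_g: (c, p).
    s @ V applies the same projection at each pixel (a 1x1 convolution),
    so the added term can differ from pixel to pixel.
    """
    return np.tanh(conv_f + s @ V_f) * sigmoid(conv_g + s @ V_g)


if __name__ == "__main__":
    H, W, p, d = 2, 2, 3, 4
    zeros = np.zeros((H, W, p))
    h = np.eye(d)[0]  # one-hot class vector
    out = conditional_gate(zeros, zeros, h, np.ones((d, p)), np.zeros((d, p)))
    print(out.shape)
```

In the first function the term contributed by h is identical at every pixel, which matches the observation that the conditioning does not depend on location; the second function removes that restriction.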

PixelCNN Auto-Encoders

  • Start with a traditional auto-encoder architecture and replace the deconvolutional decoder with PixelCNN and train the network end-to-end.

Experiments

  • For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as well, and takes much less time to train.
  • When conditioning on ImageNet classes, the log-likelihood did not improve much, but the visual quality of the generated samples improved significantly.
  • The paper also includes sample images generated by conditioning on human portraits and by training a PixelCNN auto-encoder on ImageNet patches.