brannondorsey/pix2pix_paper_notes.md

## pix2pix_paper_notes.md

      
    Raw
  

              pix2pix_paper_notes.md
            
          
    Image-to-Image Translation with Conditional Adversarial Networks

Notes from arXiv:1611.07004v1 [cs.CV] 21 Nov 2016

Euclidean distance between predicted and ground truth pixels is not a good method of judging similarity because it yields blurry images.
GANs learn a loss function rather than using an existing one.
GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss.
Conditional GANs (cGANs) learn a mapping from observed image x and random noise vector z to y: y = f(x, z)
The generator G is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discrimintor, D which is trained to do as well as possible at detecting the generator's "fakes".
The discriminator D, learns to classify between real and synthesized pairs. The generator learns to fool the discriminator.
Unlike an unconditional GAN, both the generator and discriminator observe an input image z.
Asks G to not only fool the discriminator but also to be near the ground truth output in an L2 sense.
L1 distance between an output of G is used over L2 because it encourages less blurring.
Without z, the net could still learn a mapping from x to y but would produce deterministic outputs (and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise z as an input to the generator, in addition to x)
Either vanilla encoder-decoder or Unet can be selected as the model for G in this implementation.
Both generator and discriminator use modules of the form convolution-BatchNorm-ReLu.
A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid.
Input and output images differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
L1 loss does very well at low frequencies (I think this means general tonal-distribution/contrast, color-blotches, etc) but fails at high frequencies (crispness/edge/detail) (thus you get blurry images). This motivates restricting the GAN discriminator to only model high frequency structure, relying on an L1 term to force low frequency correctness. In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each NxNpatch in an image is real or fake. We run this discriminator convolutationally across the image, averaging all responses to provide the ultimate output of D.
Because PatchGAN assumes independence between pixels seperated by more than a patch diameter (N) it can be thought of as a form of texture/style loss.
To optimize our networks we alternate between one gradient descent step on D, then one step on G (using minibatch SGD applying the Adam solver)
In our experiments, we use batch size 1 for certain experiments and 4 for others, noting little difference between these two conditions.
To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation.
Evaluating the quality of synthesized images is an open and difficult problem. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.
FCN-Score: while quantitative evaluation of generative models is known to be challenging, recent works have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well.
cGANs seems to work much better than GANs for this type of image-to-image transformation, as it seems that with a GAN, the generator collapses into producing nearly the exact same output regardless of the input photograph.
16x16 PatchGAN produces sharp outputs but causes tiling artifacts, 70x70 PatchGAN alleviates these artifacts. 256x256 ImageGAN doesn't appear to improve the tiling artifacts and yields a lower FCN-score.
An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. This allows us to train on, say, 256x256 images and test/sample/generate on 512x512.
cGANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks.
When semantic segmentation is required (i.e. going from image to label) L1 performs better than cGAN. We argue that for vision problems, the goal (i.e. predicting output close to ground
truth) may be less ambiguous than graphics tasks, and reconstruction losses like L1 are mostly sufficient.

Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.
Misc


Least absolute deviations (L1) and Least square errors (L2) are the two standard loss functions, that decides what function should be minimized while learning from a dataset. (source)
How, using pix2pix, do you specify a loss of L1, L1+GAN, and L1+cGAN?

Resources


GAN paper