One of the most fundamental objects in machine learning is the probabilistic model, and two of the most fundamental questions are how to build such models (i.e. how do you represent probability distributions?) and how to train them.
The success of language models like GPT-3 stems in large part from the idea of representing probability distributions autoregressively. This allows computing the exact probability of a sequence/sentence by decomposing it as a product of conditional probabilities, each over a tractably small number of elements (the tokens).
$P(\mathbf{x}) = P(\mathbf{x}_0) P(\mathbf{x}_1|\mathbf{x}_0)P(\mathbf{x}_2|\mathbf{x}_0, \mathbf{x}_1) \cdots P(\mathbf{x}_N|\mathbf{x}_0,\ldots,\mathbf{x}_{N-1})$
This approach is powerful because it can in principle represent any probability distribution over sequences, as long as the NNs used to represent the conditional distributions are powerful enough. Furthermore, it can be trained by exact likelihood maximization, which ensures that, with enough data, the model will approximate the true distribution.
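As a minimal sketch of the factorization above, the snippet below computes the exact log-probability of a short token sequence by summing the conditional log-probabilities. The "model" is a hypothetical toy first-order Markov chain (`INITIAL`, `TRANSITION` and `toy_conditional` are made up for illustration); a real language model would use a neural network that conditions on the whole prefix.

```python
import numpy as np

def sequence_log_prob(tokens, conditional_probs):
    """Sum log P(x_t | x_0, ..., x_{t-1}) over all positions of the sequence."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        probs = conditional_probs(tokens[:t])  # distribution over the next token
        log_p += np.log(probs[token])
    return log_p

# Hypothetical toy model over a 3-token vocabulary (illustration only).
INITIAL = np.array([0.5, 0.3, 0.2])
TRANSITION = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
])

def toy_conditional(prefix):
    # This toy only looks at the last token; a real model would use the full prefix.
    if len(prefix) == 0:
        return INITIAL
    return TRANSITION[prefix[-1]]

log_p = sequence_log_prob([0, 1, 2], toy_conditional)
print(np.exp(log_p))  # equals INITIAL[0] * TRANSITION[0, 1] * TRANSITION[1, 2]
```

Because every conditional is a normalized distribution, the probabilities of all sequences of a given length sum to one, which is what makes exact likelihood training possible.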
The problem comes with higher-dimensional or continuous objects. One can in principle deal with these by treating them as sequences of lower-dimensional, discrete objects, and then using the autoregressive idea above. This is what is done, for example, with PixelCNN/PixelRNNs, where the image is modeled as a sequence of pixels, or patches, in some particular order. However, this has disadvantages, which stem from the fact that this autoregressive representation of the probability distribution may not be an efficient representation of the true generative process behind what we are trying to model (for example, it's likely better to model images in a more "parallel/hierarchical" way than autoregressively over pixels in an arbitrary order). This could cause:
- Bad inductive biases, and thus worse generalization
- Slower training and inference. This seems to be the main problem with autoregressive models in practice, for high-dimensional data.
Given the above problem, the natural solution is to consider more general ways of parametrizing a probability distribution. These have generally been studied under a formalism where we consider a noise source
How do we train these models? There are several ways. Here I discuss three common ones:
Variational autoencoders work with implicit generative models. We call
The challenge here is precisely how to make
- We write down the expectation as a sum over $z$. This is only possible when the space of $z$s is finite and small.
- We use the "reparametrization trick". We express $z=g(h(x),\epsilon)$, where $h(x)$ is a deterministic function and all the randomness is in $\epsilon$, whose distribution does not depend on the parameters of $Q$. The expectation can then be taken over the distribution of $\epsilon$ alone (as it's the only random variable), and because that distribution doesn't depend on the parameters of $Q$, the gradient can be moved inside the expectation. We can then estimate the gradient by averaging the gradients with respect to the parameters over samples of $\epsilon$.
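To make the reparametrization trick concrete, here is a small numerical sketch under simplifying assumptions that are not from the text: $Q$ is a one-dimensional Gaussian with parameters `mu` and `sigma` (playing the role of $h(x)$), $g$ is `mu + sigma * eps`, and the function inside the expectation is $f(z)=z^2$, chosen only because its expected value $\mu^2+\sigma^2$ has known gradients to compare against.

```python
import numpy as np

def reparam_gradients(mu, sigma, num_samples=200_000, seed=0):
    """Monte Carlo gradients of E[z**2] w.r.t. mu and sigma, with z = mu + sigma * eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(num_samples)  # noise whose distribution ignores (mu, sigma)
    z = mu + sigma * eps                    # z = g(h(x), eps)
    df_dz = 2.0 * z                         # derivative of f(z) = z**2
    grad_mu = np.mean(df_dz * 1.0)          # chain rule: dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)       # chain rule: dz/dsigma = eps
    return grad_mu, grad_sigma

grad_mu, grad_sigma = reparam_gradients(mu=1.5, sigma=0.7)
print(grad_mu, grad_sigma)  # should land close to 2*mu = 3.0 and 2*sigma = 1.4
```

In a real VAE the hand-written chain rule is replaced by automatic differentiation, but the structure is the same: sample $\epsilon$, build $z$ deterministically from the parameters, and differentiate through that construction.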
The only other requirement to compute the ELBO is that the KL divergence between
Classical VAEs satisfy these requirements by assuming
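As an illustration of a KL term that can be computed in closed form, the sketch below assumes the common textbook setup of a diagonal Gaussian $Q(z|x)$ and a standard normal prior. This setup, and the names `mu` and `log_var`, are assumptions made for the example rather than something stated above.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for a 2-dimensional latent.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, np.log(0.25)])
kl = kl_diag_gaussian_to_standard_normal(mu, log_var)
print(kl)
```

The KL is zero exactly when `mu` is zero and `log_var` is zero, i.e. when $Q$ coincides with the prior.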
Recently, more modern VAEs have begun playing with the freedom to learn the prior
Finally, Very Deep VAEs (https://arxiv.org/abs/2011.10650), have brought the power of hierarchical VAEs into prominence. One can parametrize
There's one detail to do with how the ELBO is optimized, when optimizing all of
I just recently learned about the idea of optimizing the ELBO w.r.t.
Advantages:
- Fast to train and to do inference with
- The generator can be quite flexible/general, which helps being able to faithfully learn complex probability distributions
Disadvantages:
- The training procedure is only approximate, and requires $Q(z|x)$ to be able to model the posterior distribution faithfully, while still allowing gradients to be estimated. Honestly, I thought this was more of a problem before, but having now read about (and better understood) the recent advances with VQ-VAEs, dVAEs, and VDVAEs, I think this may be less of a problem. It may still cause problems sometimes; I'm not sure. I still don't fully understand how prior-trained VAEs work.
Normalizing flows are a way to get an explicit generative model, where one can efficiently compute
- The forward flow is a function $f$ mapping $x$ to $z$. This flow has to be reversible, which in particular implies that the Jacobian is non-singular.
- The reverse flow $f^{-1}$ is the inverse of the above function. This is the flow that allows sampling from the model, by sampling a latent vector from the latent distribution $P(z)$ and sending it through the flow.
The model is trained by exact likelihood maximization. We can compute the likelihood of the data, by using the transformation of variables formula, to compute
Most of the ingenuity in designing normalizing flows has gone into designing expressive reversible functions. The Real NVP paper introduced reversible "coupling layers", which are a very flexible (and, I think, clever) class of reversible transformations that also have simple Jacobians. I recommend the Real NVP paper as a good introduction to NFs. The Glow paper introduced reversible 1x1 convolutions, which are also often used.
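A rough sketch of an affine coupling layer in the style of Real NVP may help show why these layers are both reversible and cheap to evaluate. The tiny linear maps `w_s` and `w_t` below are hypothetical stand-ins for the neural networks used in practice, and the standard normal latent distribution is an assumption of this example.

```python
import numpy as np

class AffineCoupling:
    """y1 = x1; y2 = x2 * exp(s(x1)) + t(x1), where s and t may be arbitrary functions."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.split = dim // 2
        # Hypothetical tiny linear "networks" for the scale and shift (illustration only).
        self.w_s = 0.1 * rng.standard_normal((self.split, dim - self.split))
        self.w_t = 0.1 * rng.standard_normal((self.split, dim - self.split))

    def _scale_shift(self, x1):
        return np.tanh(x1 @ self.w_s), x1 @ self.w_t

    def forward(self, x):
        """Map x to z and return log|det Jacobian| for each row of the batch."""
        x1, x2 = x[:, :self.split], x[:, self.split:]
        s, t = self._scale_shift(x1)
        y2 = x2 * np.exp(s) + t
        log_det = np.sum(s, axis=1)  # triangular Jacobian: log|det| is the sum of s
        return np.concatenate([x1, y2], axis=1), log_det

    def inverse(self, y):
        """Map z back to x; s and t never need to be inverted themselves."""
        y1, y2 = y[:, :self.split], y[:, self.split:]
        s, t = self._scale_shift(y1)
        x2 = (y2 - t) * np.exp(-s)
        return np.concatenate([y1, x2], axis=1)

def log_likelihood(layer, x):
    """Change of variables: log P(x) = log N(f(x); 0, I) + log|det df/dx|."""
    z, log_det = layer.forward(x)
    log_prior = -0.5 * np.sum(z**2, axis=1) - 0.5 * z.shape[1] * np.log(2 * np.pi)
    return log_prior + log_det

layer = AffineCoupling(dim=4)
x = np.random.default_rng(1).standard_normal((3, 4))
z, log_det = layer.forward(x)
x_rec = layer.inverse(z)
print(np.allclose(x_rec, x))  # the inverse recovers the input
```

A single coupling layer leaves half of the dimensions untouched, so in practice several layers are stacked with the roles of the two halves swapped or permuted between them.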
MoGlow made use of the coupling layers to make the NF conditional on the recent history, thus making the model autoregressive.
As an interesting recent development in NFs, there's FFJORD, which uses the fact that, since ODEs are reversible, one can use Neural ODEs to make NFs!
The main advantages of NFs are:
- They model the distribution very faithfully, avoiding problems like mode collapse. This typically means they are better at producing diverse solutions than models that suffer from mode collapse.
- They do exact inference, meaning that if your model is flexible enough, and your optimizer is reasonable, you shouldn't have stability problems like often found in GANs.
The main disadvantages of NFs are:
- They are slower at training and inference. I am not sure how much this is the case quantitatively, and I think it may also depend on different factors, like how good your optimizer is. But generally, a well-tempered GAN can train faster, and will also be faster at inference time. I think the slower inference of NFs is mostly because they are less parameter efficient (you need larger models to reach similar performance). Continuous NFs (like FFJORD) seem to be a lot more parameter efficient, but they are slow for other reasons, namely that Neural ODEs are slow.
- Because they are so good at modelling the distribution of the data, they can also model errors/flukes in the data, which show up (with the frequency at which they showed in the data), while some less powerful models may actually be unaffected by these outliers.
I'm not going to spend as much time on GANs because they are more well-known. But in short, they forgo the idea of approximating maximum likelihood estimation. Rather they use an implicit model
The main advantages of GANs are:
- The quality of samples produced by GANs tends to be quite high. This may be in part due to the flexibility in designing the generator, which has been exploited in some very clever architectures like StyleGAN.
- They are fast to run (inference), and to train (if the training can be made stable...).
The main problems of GANs are:
- Mode collapse. It is hard to keep the samples diverse, and to cover the full diversity of the true distribution. For real-world scenarios, there's no guarantee that GANs will approximate the true distribution well.
- Stability of training. The minimax objective tends to diverge or stall much more easily than the maximum likelihood objective.
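For reference, the minimax objective mentioned above is usually written, in the original GAN formulation, as follows. The symbols are assumptions of this note rather than notation defined in the text: $G$ is the generator, $D$ is the discriminator (outputting the probability that its input is real), and $P_{\text{data}}$ is the data distribution.

$$\min_G \max_D \; \mathbb{E}_{x \sim P_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim P(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

The generator and the discriminator pull this single quantity in opposite directions, so training looks for a saddle point rather than a minimum, which is one way to see why it can diverge or stall more easily than maximum likelihood.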
There's a recent review of the main different types of probabilistic deep generative models, including the ones above. However, note that it only focuses on images, so it is unclear how much its conclusions transfer to other domains (e.g. audio, video, motion, etc). I think the question of which generative models work well for different tasks is a very interesting and valuable one, as these models are so fundamental and useful for real-world applications, where stochasticity and uncertainty mean that we need to model probability distributions!
Also, in the sections above, I comment on the advantages and disadvantages.
Overall, at least in the image world (it is less clear for other modalities), NFs produce better quality results than classical VAEs, but worse than GANs and worse than modern VAEs. NFs have the advantage of exact likelihood maximization, which makes the training more stable. However, modern VAEs seem to approximate MLE well enough that this may not be an advantage any more. Furthermore, NFs have more architectural constraints than VAEs in terms of the generator, so it may be harder to fit a complex probability distribution with NFs (which often translates into NFs being less parameter/compute efficient; i.e. they need more parameters/compute to approximate a distribution with the same fidelity).
I decided to give normalizing flows a try, because they are state of the art for motion generation (MoGlow), but I now think it's also worth trying VQ-VAE/dVAEs!