@guillefix
Created May 17, 2021 18:18

One of the most fundamental objects in machine learning is the probabilistic model, and two of the most fundamental questions are how to build such models (i.e. how do you represent probability distributions?) and how to train them.

Intro to deep generative models

A general solution: autoregressive models

The success of language models like GPT-3 stems in large part from the idea of representing probability distributions autoregressively, which allows computing the exact probability of a sequence/sentence by decomposing it into a product of conditional probabilities over a tractably small number of elements (the tokens):

$P(\mathbf{x}) = P(\mathbf{x}_0)\, P(\mathbf{x}_1|\mathbf{x}_0)\, P(\mathbf{x}_2|\mathbf{x}_0, \mathbf{x}_1) \cdots P(\mathbf{x}_N|\mathbf{x}_0,\ldots,\mathbf{x}_{N-1})$

This approach is powerful because it can in principle represent any probability distribution over sequences, as long as the NNs used to represent the conditional distributions are powerful enough. Furthermore, it can be trained by exact likelihood maximization, which ensures that, with enough data, the model will approximate the true distribution.
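
As a minimal sketch of how this factorization is used in practice (the function and the `model` interface here are my own illustration, not a specific library API), the exact log-likelihood of a sequence is just the sum of per-token conditional log-probabilities:

```python
import torch.nn.functional as F

def sequence_log_prob(model, tokens):
    """Exact log P(x) of token sequences under an autoregressive model.

    `model` is assumed to be any causal model mapping token ids of shape
    (batch, T) to next-token logits of shape (batch, T, vocab_size);
    the first token is treated as a fixed start-of-sequence symbol.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                              # (batch, T-1, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    # log P(x_t | x_<t) for the token that actually occurred at each position.
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1)                  # (batch,)
```

Training by maximum likelihood then amounts to minimizing `-sequence_log_prob(model, batch).mean()` over the dataset.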

Problems with modeling higher dimensional stuff

The problem comes with higher-dimensional or continuous objects. One can in principle deal with these by treating them as sequences of lower-dimensional, discrete objects and then using the autoregressive idea above. This is what is done, for example, with PixelCNN/PixelRNN, where the image is modeled as a sequence of pixels, or patches, in some particular order. However, this has disadvantages, which stem from the fact that this autoregressive representation of the probability distribution may not be an efficient representation of the true generative process behind what we are trying to model (for example, it's likely better to model images in a more "parallel/hierarchical" way than autoregressively over pixels in an arbitrary order). This could cause:

  • Bad inductive biases, and thus worse generalization
  • Slower training and inference. This seems to be the main problem with autoregressive models in practice, for high dimensional stuff.

The solution: consider more general generators

Given the above problem, the natural solution is to consider more general ways of parametrizing a probability distribution. These have generally been studied under a formalism where we consider a noise source $z$, also called the latent variables or latent vector, which is then fed into a deterministic function. Even more generally, the latent may be fed into a conditional probability distribution $P(x|z)$. Our probabilistic model is now a latent-variable model $P(x)=\int P(x|z)P(z)\,dz$, where both $P(x|z)$ and $P(z)$ may in principle be learnable. As we will see, for some models, integrating over $z$ to compute the value $P(x)$ is possible. These are called explicit generative models. In cases where $P(x)$ is not computable, but we can still sample from it, the model is called an implicit generative model.

How do we train these models? There are several ways. Here I discuss three common ones:

VAEs

Variational autoencoders work with implicit generative models. We call $P(x|z)$ the decoder, and we extend the model with an encoder $Q(z|x)$. We train the parameters of $Q(z|x)$ together with the parameters of $P(x|z)$, and possibly $P(z)$, to maximize the ELBO. For a sufficiently expressive encoder and a good optimizer, this is equivalent to maximum likelihood estimation of our probabilistic model, without needing to calculate $P(x)$ explicitly!
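
For reference, the ELBO mentioned here is the standard lower bound on the log-likelihood,

$\log P(x) \geq \mathbb{E}_{Q(z|x)}\left[\log P(x|z)\right] - D_{\mathrm{KL}}\left(Q(z|x)\,\|\,P(z)\right),$

which is tight when $Q(z|x)$ equals the true posterior $P(z|x)$; that is why a sufficiently expressive encoder makes maximizing the ELBO equivalent to maximizing the likelihood.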

The challenge here is precisely how to make $Q(z|x)$ expressive enough. In principle, we would like it to be able to express any probability distribution. Furthermore, the ELBO involves an expectation with respect to $Q(z|x)$, which we need to differentiate with respect to the parameters of $Q$. There are at least two ways to do this:

  • We write down the expectation as a sum over $z$. This is only possible when the space of $z$ values is finite and small.
  • We use the "reparametrization trick". We express $z=g(h(x),\epsilon)$, where $h(x)$ is a deterministic function and all the randomness is in $\epsilon$, which follows a distribution that does not depend on the parameters of $Q$. The expectation can then be taken only over the distribution of $\epsilon$ (as it's the only random variable), and because that distribution doesn't depend on the parameters of $Q$, the gradient can be moved inside the expectation, and we can estimate gradients by taking gradients with respect to the parameters for a sample of $\epsilon$ (a minimal sketch is shown after this list).
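
Here is a minimal sketch of the reparametrization trick for a diagonal-Gaussian $Q(z|x)$ (the function name and tensor shapes are my own, not from any particular library):

```python
import torch

def sample_z(mu, log_var):
    """Reparametrized sample z = g(h(x), eps) = mu + sigma * eps.

    `mu` and `log_var` are the encoder outputs h(x); all the randomness
    is in eps ~ N(0, I), whose distribution does not depend on the
    encoder parameters, so gradients flow through mu and log_var by
    ordinary backpropagation.
    """
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps
```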

The only other requirement to compute the ELBO is that the KL divergence between $Q(z|x)$ and $P(z)$ can be computed efficiently.

Classical VAEs satisfy these requirements by assuming that $Q(z|x)$ is a Gaussian whose mean and variance are given by functions of $x$, and that $P(z)$ is also a Gaussian. One can also extend this to a mixture of Gaussians.
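
With these Gaussian assumptions, the KL term has a closed form, so the negative ELBO of a classical VAE can be written roughly as follows (a sketch assuming a Bernoulli decoder over pixels; the function names are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_negative_elbo(x, recon_logits, mu, log_var):
    """Negative ELBO = reconstruction term + KL(Q(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # Closed-form KL between N(mu, diag(exp(log_var))) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```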

Recently, more modern VAEs have begun playing with the freedom to learn the prior $P(z)$. This is done in VQ-VAEs, and in the discrete VAEs (dVAEs) of DALL-E. VQ-VAEs use another trick to estimate the gradient through their discrete latents, called the "straight-through" estimator, from an older paper by Bengio et al. (which is also used in some earlier Mixture-of-Experts work). dVAEs use an approximate reparametrization of a discrete categorical distribution (like that from a softmax), called the Gumbel-softmax distribution.
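
A minimal sketch of the straight-through trick as used for vector quantization (this shows only the gradient-copying part, not the full VQ-VAE loss with its codebook and commitment terms):

```python
import torch

def quantize_straight_through(z_e, codebook):
    """Nearest-neighbour quantization with straight-through gradients.

    z_e: encoder outputs of shape (batch, dim); codebook: (K, dim).
    The forward value is the quantized code z_q, but the gradient
    w.r.t. z_e is the identity, as if quantization were not there.
    """
    distances = torch.cdist(z_e, codebook)    # (batch, K)
    indices = distances.argmin(dim=-1)
    z_q = codebook[indices]
    return z_e + (z_q - z_e).detach()
```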

Finally, Very Deep VAEs (https://arxiv.org/abs/2011.10650) have brought the power of hierarchical VAEs into prominence. One can parametrize $P(z)$ and $Q(z|x)$ autoregressively over a hierarchy of latents, applying the reparametrization trick at every step. This hierarchy is shown to work quite well: in principle it can learn anything you could learn with an autoregressive model, while being able to learn a more efficient representation if there is one.
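
As a rough sketch of what parametrizing the prior autoregressively over a hierarchy of latents can look like (my own toy parametrization, not the actual VDVAE architecture), each level is sampled, via the reparametrization trick, from a Gaussian conditioned on the levels above it:

```python
import torch

def sample_hierarchical_prior(z_top, prior_nets):
    """Ancestral sampling from P(z) = P(z_1) P(z_2|z_1) ... P(z_L|z_<L).

    `z_top` is a sample from the top-level prior P(z_1); each net in
    `prior_nets` is assumed to map the previous latents to the
    (mu, log_var) of the next level's Gaussian.
    """
    latents = [z_top]
    for net in prior_nets:
        mu, log_var = net(torch.cat(latents, dim=-1))
        eps = torch.randn_like(mu)
        latents.append(mu + torch.exp(0.5 * log_var) * eps)
    return latents   # these are then fed to the decoder P(x | z)
```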

There's one detail to do with how the ELBO is optimized when optimizing all of $Q(z|x)$, $P(x|z)$ and $P(z)$. The papers I've seen doing this (VQ-VAE and DALL-E) optimize the encoder-decoder first, assuming a maximum-entropy $P(z)$, and only afterwards optimize $P(z)$. This two-step process seems necessary for the model not to collapse into a plain (pre-VAE) autoencoder, but I'm not sure what the justification is.

I only recently learned about the idea of optimizing the ELBO w.r.t. $P(z)$ and about the more powerful representations of $Q(z|x)$, and this has increased my confidence in VAEs (together with their empirical success!). But next I'll explain the model I've been playing with, which has other advantages.

Advantages:

  • Fast to train and to do inference with
  • The generator can be quite flexible/general, which helps in faithfully learning complex probability distributions

Disadvantages:

  • The training procedure is only approximate, and requires $Q(z|x)$ to be able to model the posterior distribution faithfully while still allowing gradients to be estimated. Honestly, I used to think this was more of a problem, but now, having read (and better understood) the recent advances with VQ-VAEs, dVAEs, and VDVAEs, I think it may be less of an issue. It may still cause problems sometimes, though; I'm not sure. I still don't fully understand how the VAEs with trained priors are working.

Normalizing flows

Normalizing flows are a way to get an explicit generative model, where one can efficiently compute $P(x)$, while keeping many of the advantages of VAEs. This is possible because the function mapping the stochastic latent $z$ to $x$ is reversible. We usually define the following:

  • The forward flow is a function $f$ mapping $x$ to $z$. This flow has to be reversible, which in particular implies that its Jacobian is non-singular.
  • The reverse flow $f^{-1}$ is the inverse of the above function. This is the flow that allows us to sample from the model, by sampling a latent vector from the latent distribution $P(z)$ and sending it through the flow.

The model is trained by exact likelihood maximization. We can compute the likelihood of the data by using the change-of-variables formula: $P(x)$ is the absolute value of the determinant of the Jacobian of the forward flow, times $P(f(x))$, where this $P$ (the distribution in latent space) is often a Gaussian, and $f(x)$ is the $z$ to which $x$ is mapped by the forward flow. By the chain rule, the Jacobian determinant of the flow can be decomposed into a product of the Jacobian determinants of each of the atomic operations in the flow. There are tricks to make these Jacobians more efficient to compute.
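
In symbols, with $f$ the forward flow and $P_z$ the latent distribution, the formula being used is

$\log P(x) = \log P_z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,$

and by the chain rule the log-determinant decomposes into a sum of per-layer log-determinants.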

Most of the ingenuity in designing normalizing flows has gone into designing expressive reversible functions. The Real NVP paper introduced reversible "coupling layers", which are a very flexible (and clever, I think) class of reversible transformations that also have simple Jacobians. I recommend the Real NVP paper as a good introduction to NFs. The Glow paper introduced reversible 1x1 convolutions, which are also often used.
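
A minimal sketch of an affine coupling layer in the Real NVP style (simplified to a split along the last dimension; `scale_net` and `shift_net` are placeholder networks): half the dimensions pass through unchanged, and the other half are scaled and shifted by functions of the first half, which makes both the inverse and the log-determinant cheap.

```python
import torch

class AffineCoupling(torch.nn.Module):
    """y1 = x1;  y2 = x2 * exp(s(x1)) + t(x1)  (Real NVP-style coupling)."""

    def __init__(self, scale_net, shift_net):
        super().__init__()
        self.scale_net = scale_net   # any network mapping x1 -> s, same shape as x2
        self.shift_net = shift_net   # any network mapping x1 -> t, same shape as x2

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.scale_net(x1), self.shift_net(x1)
        y2 = x2 * torch.exp(s) + t
        # The Jacobian is triangular, so log|det| is just the sum of s.
        log_det = s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.scale_net(y1), self.shift_net(y1)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)
```

Stacking such layers, with the two halves swapped (or permuted, e.g. with Glow's 1x1 convolutions) between layers, gives an expressive yet still reversible flow.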

MoGlow made use of the coupling layers to make the NF conditional on the recent history, thus making the model autoregressive.

As an interesting recent development in NFs, there's FFJORD, which exploits the fact that ODEs are reversible, so one can use Neural ODEs to build NFs!

The main advantages of NFs are:

  • They model the distribution very faithfully, avoiding problems such as mode collapse. This typically means they are better at producing diverse outputs than models that suffer from mode collapse.
  • They do exact maximum likelihood training, meaning that if your model is flexible enough and your optimizer is reasonable, you shouldn't have stability problems like those often found in GANs.

The main disadvantages of NFs are:

  • They are slower at training and inference. I am not sure how much this is the case quantitatively, and it may also depend on other factors, like how good your optimizer is. But generally, a well-tempered GAN will train faster and will also be faster at inference time. I think the slower inference of NFs is mostly because they are less parameter-efficient (you need larger models to reach similar performance). Continuous NFs (like FFJORD) seem to be a lot more parameter-efficient, but they are slow for other reasons, namely that Neural ODEs are slow.
  • Because they are so good at modelling the distribution of the data, they can also model errors/flukes in the data, which then show up in samples (with the frequency at which they appeared in the data), while some less powerful models may actually be unaffected by these outliers.

GANs

I'm not going to spend as much time on GANs because they are better known. In short, they forgo the idea of approximating maximum likelihood estimation. Instead, they use an implicit model $P(x|z)P(z)$ which can be sampled from, and where $P(x|z)$ is differentiable (usually deterministic, and called the generator $G(z)$), together with a discriminator $D(x)$, which is trained to distinguish real from generated $x$. The generator is trained to maximize the score $D(x)$ gives for being real. This minimax game has a Nash equilibrium where, for an infinitely expressive $D(x)$, the learned probability distribution matches the true data distribution.
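
A minimal sketch of this adversarial training (using the common non-saturating generator loss rather than the raw minimax form; all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x_real, latent_dim):
    """One alternating update of the discriminator D and the generator G.

    D is assumed to output a single real/fake logit per sample.
    """
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator: push D(x_real) towards "real" and D(G(z)) towards "fake".
    x_fake = G(torch.randn(n, latent_dim)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(D(x_real), ones)
              + F.binary_cross_entropy_with_logits(D(x_fake), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: push D(G(z)) towards "real" (non-saturating loss).
    g_loss = F.binary_cross_entropy_with_logits(D(G(torch.randn(n, latent_dim))), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()
```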

The main advantages of GANs are:

  • The quality of samples produced by GANs tends to be quite high. This may be in part due to the flexibility in designing the generator, which has been exploited in some very clever architectures like StyleGAN.
  • They are fast to run (inference), and to train (if the training can be made stable...).

The main problems of GANs are:

  • Mode collapse. It is hard to keep the samples diverse and to cover the full diversity of the true distribution. For real-world scenarios, there's no guarantee that GANs will approximate the true distribution well.
  • Stability of training. The minimax objective tends to diverge or stall much more easily than the maximum likelihood objective.

Comparing them

There's a recent review of the main types of probabilistic deep generative models, including the ones above. However, note that it focuses only on images, so it is unclear how much its conclusions transfer to other domains (e.g. audio, video, motion, etc.). I think the question of which generative models work well for which tasks is a very interesting and valuable one, as these models are so fundamental and useful for real-world applications, where stochasticity and uncertainty mean that we need to model probability distributions!

Also, in the sections above, I comment on the advantages and disadvantages.

Overall, at least in the image domain (it's less clear for other modalities), NFs produce better-quality results than classical VAEs, but worse than GANs and worse than modern VAEs. NFs have the advantage of exact likelihood maximization, which makes training more stable. However, modern VAEs seem to be approximating MLE well enough that this may no longer be an advantage. Furthermore, NFs have more architectural constraints on the generator than VAEs, so it may be harder to fit a complex probability distribution with NFs (which often translates into NFs being less parameter/compute-efficient; i.e. they need more parameters/compute to approximate a distribution with the same fidelity).

I decided to give normalizing flows a try, because they are state of the art for motion generation (MoGlow), but I now think it's also worth trying VQ-VAE/dVAEs!
