Stable Diffusion's VAE is a neural network that encodes and decodes images into a compressed "latent" format. The encoder performs 48x lossy compression (a 512×512×3 RGB image becomes a 64×64×4 latent), and the decoder generates new detail to fill in the gaps.
(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)
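As a quick sanity check on that 48x number, here's the shape arithmetic as a standalone sketch (in practice you'd see these shapes on the tensors coming out of e.g. diffusers' `AutoencoderKL`):

```python
# SD's VAE maps a 512x512 RGB image to a 64x64 latent with 4 channels:
# the encoder downsamples 8x in each spatial dimension.
image_shape = (3, 512, 512)             # channels, height, width
latent_shape = (4, 512 // 8, 512 // 8)  # (4, 64, 64)

def numel(shape):
    """Total number of values in a tensor of this shape."""
    n = 1
    for d in shape:
        n *= d
    return n

compression = numel(image_shape) / numel(latent_shape)
print(latent_shape, compression)  # (4, 64, 64) 48.0
```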
This document is a big pile of various links with more info.
- CompVis
- [2021] The original decoder training code (using a vq bottleneck instead of a kl bottleneck) is from the taming transformers paper https://github.com/CompVis/taming-transformers
- [2022] The original SD VAE is the kl-f8 model from the latent diffusion paper https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models
- Stability
- [2022-10] SD-VAE-FT: These widely-used VAEs are finetunes of the compvis ones https://huggingface.co/stabilityai/sd-vae-ft-ema https://huggingface.co/stabilityai/sd-vae-ft-mse (only the decoder is changed) https://twitter.com/StabilityAI/status/1586183361361428480
- [2023-06] SDXL-VAE is a retrained-from-scratch model with the same code / architecture as the original https://huggingface.co/stabilityai/sdxl-vae https://arxiv.org/abs/2307.01952 https://github.com/Stability-AI/generative-models/tree/main. This comes in two versions (0.9 and 1.0) but the 0.9 one is generally considered to look better. Not clear if the training code in SGM repo works yet.
- [2023-11] The SVD-VAE is a finetuned version of SD-VAE-FT with added temporal (3d) convolutions in the decoder, intended to decode smooth (non-flickery) videos from batches of SD latents.
- OpenAI
- [2023-11] OpenAI trained a consistency-model decoder for the original SD VAE latent space https://github.com/openai/consistencydecoder https://cdn.openai.com/papers/dall-e-3.pdf. This is like 10x the size of the standard VAE, but quality is supposed to be higher.
- madebyollin (me)
  - https://github.com/madebyollin/taesd - tiny distilled version of both the SD and SDXL autoencoders (also removes some annoying scaling stuff and the stochasticity)
- https://huggingface.co/madebyollin/sdxl-vae-fp16-fix - finetuned version of the SDXL (0.9) VAE that works in fp16 precision without NaNs
- https://gist.github.com/madebyollin/865fa6a18d9099351ddbdfbe7299ccbf - modified version of mrsteyk's consistency decoder code
- birchlabs
- https://birchlabs.co.uk/machine-learning#vae-distillation - tiny MLP decoder & training code
- city96
- https://github.com/city96/SD-Latent-Interposer - converter between SD and SDXL latent spaces (with some artifacts)
- cccntu
- https://github.com/cccntu/fine-tune-models/ - vae finetuning code
- mosaicml
- mosaicml/diffusion#79 - vae training code
- Various people are working to get VAE training into diffusers
- Meta's Emu model uses a 16-channel latent (~12x compression instead of 48x) and changes the adversarial loss a bit https://huggingface.co/papers/2309.15807
- GAN vs. Consistency VAE comparisons https://twitter.com/anotherjesse/status/1721754763149099246
- You can remove the TAESD upsampling layers to get lower-res RGB images https://twitter.com/madebyollin/status/1720847470245343631
- Seems likely that Bing uses original VAE decoder but ChatGPT uses the consistency model https://twitter.com/madebyollin/status/1715182160142111082
- The Retro Diffusion team have a special decoder for pixel art https://twitter.com/RealAstropulse/status/1674431288894459909
- SD VAE KL noise has very little effect
- Variances are really small https://twitter.com/Ethan_smith_20/status/1719768055902027840
- Taking the encoder mean instead of sampling causes no difference in most cases (even though sampling is the technically correct choice) https://twitter.com/Birchlabs/status/1721714156275933608
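To make the "sampling ≈ mean" point concrete, here's a toy illustration (the `logvar` value here is made up for the sketch; the linked tweets report that the real SD VAE variances are similarly tiny):

```python
import math
import random

random.seed(0)

# The encoder predicts a per-element mean and log-variance; the "correct" VAE
# behavior is to sample mean + std * noise. But when logvar is very negative,
# std is vanishingly small, so a sample is indistinguishable from the mean.
mean = 1.2345
logvar = -20.0                    # illustrative value for this sketch
std = math.exp(0.5 * logvar)      # ~4.5e-5
sample = mean + std * random.gauss(0, 1)
print(abs(sample - mean))         # tiny: sampling barely changes anything
```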
- There's an attention layer in the VAE (one in the encoder's mid-block, one in the decoder's) that also doesn't do very much; it can be disabled for some speedup
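A hedged sketch of disabling that attention layer, assuming diffusers' `AutoencoderKL` module layout (the attribute paths `encoder.mid_block.attentions` / `decoder.mid_block.attentions` are from diffusers and may differ across versions; verify against yours):

```python
import torch.nn as nn

class SkipAttention(nn.Module):
    """No-op stand-in for an attention block: passes hidden states through."""
    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states

def remove_vae_attention(vae):
    # Replace the single mid-block attention in both the encoder and decoder.
    for mid_block in (vae.encoder.mid_block, vae.decoder.mid_block):
        mid_block.attentions = nn.ModuleList(
            SkipAttention() for _ in mid_block.attentions
        )
    return vae
```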
- SDXL and SD VAE latents are totally incompatible https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/discussions/6
- The SD VAE encoder has an annoying bright spot that gets worse when encoding higher-resolution images (animation)
- original post https://www.reddit.com/r/StableDiffusion/comments/1ag5h5s/the_vae_used_for_stable_diffusion_1x2x_and_other/ / https://news.ycombinator.com/item?id=39215242
- commentary from ethan https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/a_recent_post_went_viral_claiming_that_the_vae_is/
- commentary from me https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/comment/koixp8d/?utm_source=reddit&utm_medium=web2x&context=3
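Re: the "annoying scaling stuff" TAESD removes: SD latents are conventionally multiplied by a fixed scale factor (0.18215 for SD 1.x/2.x, 0.13025 for SDXL; `vae.config.scaling_factor` in diffusers) so they're roughly unit-variance for the diffusion model, and divided back out before decoding. A minimal sketch:

```python
# Scale factors from the released model configs.
SD_SCALE = 0.18215    # SD 1.x / 2.x
SDXL_SCALE = 0.13025  # SDXL

def vae_to_unet(raw_latent, scale=SD_SCALE):
    """Scale raw VAE-encoder output into the space the diffusion UNet expects."""
    return raw_latent * scale

def unet_to_vae(unet_latent, scale=SD_SCALE):
    """Undo the scaling before handing a latent to the VAE decoder."""
    return unet_latent / scale

print(unet_to_vae(vae_to_unet(2.0)))  # round-trips back to ~2.0
```

TAESD bakes this scaling into its weights, so its latents can be used directly.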
*(Diagram of VAE)*

*(Animation of how the VAE decoder is used during SD generation)*