Stable Diffusion's VAE is a neural network that encodes images into a compressed "latent" format and decodes them back. The encoder performs 48x lossy compression, and the decoder generates new detail to fill in the gaps.
(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)
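For reference, the 48x figure falls out of the tensor shapes. Here is a quick sketch of the arithmetic, assuming the usual 8x spatial downsampling (the "f8" in kl-f8) and a 3-channel RGB input. The 512x512 size and the function name are just illustrative, and the ratio counts values only, ignoring bit depth.

```python
def compression_ratio(height, width, latent_channels, downsample=8, image_channels=3):
    """Ratio of image values to latent values (ignores dtype / bit depth)."""
    image_values = height * width * image_channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return image_values / latent_values

print(compression_ratio(512, 512, latent_channels=4))   # 48.0 (4-channel latents)
print(compression_ratio(512, 512, latent_channels=16))  # 12.0 (16-channel latents)
```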
This document is a big pile of various links with more info.
- CompVis
- [2021] The original decoder training code (using a vq bottleneck instead of a kl bottleneck) is from the taming transformers paper https://github.com/CompVis/taming-transformers
- [2022] The original SD VAE is the kl-f8 model from the latent diffusion paper https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models
- Stability
- [2022-10] SD-VAE-FT: These widely-used VAEs are finetunes of the CompVis ones (only the decoder is changed) https://huggingface.co/stabilityai/sd-vae-ft-ema https://huggingface.co/stabilityai/sd-vae-ft-mse https://twitter.com/StabilityAI/status/1586183361361428480
- [2023-06] SDXL-VAE is a retrained-from-scratch model with the same code / architecture as the original https://huggingface.co/stabilityai/sdxl-vae https://arxiv.org/abs/2307.01952 https://github.com/Stability-AI/generative-models/tree/main. This comes in two versions (0.9 and 1.0) but the 0.9 one is generally considered to look better. Not clear if the training code in SGM repo works yet. Some additional ablations on the SDXL VAE are mentioned in the OpenReview discussion.
- [2023-11] The SVD-VAE is a finetuned version of SD-VAE-FT with added temporal (3d) convolutions in the decoder, intended to decode smooth (non-flickery) videos from batches of SD latents.
- [2024-06] The SD3-VAE is a new VAE trained from scratch with a more modest 16 latent channels (12x compression) instead of the 4 channels (48x compression) used by SD and SDXL. The motivation for this VAE is documented in the SD3 paper. It also adds a `shift_factor` that must be applied to latents alongside the usual `scaling_factor`, and also removes the `quant_conv` and `post_quant_conv` layers.
- OpenAI
- [2023-11] OpenAI trained a consistency-model decoder for the original SD VAE latent space https://github.com/openai/consistencydecoder https://cdn.openai.com/papers/dall-e-3.pdf. This is like 10x the size of the standard VAE, but quality is supposed to be higher.
- BFL
- [2024-08-01] The FLUX.1 VAE uses the same config as the SD3 VAE but seems to have higher decoding quality in my testing. There are large activation values in the midblock attention (which can be disabled as usual)
- madebyollin (me)
- https://github.com/madebyollin/taesd - tiny distilled version of both the SD and SDXL autoencoder (also, removes some annoying scaling stuff & removes the stochasticity)
- https://github.com/madebyollin/seraena - example distilled VAE training code (showing a way to set up the adversarial losses)
- https://huggingface.co/madebyollin/sdxl-vae-fp16-fix - finetuned version of the SDXL (0.9) VAE that works in fp16 precision without NaNs
- https://gist.github.com/madebyollin/865fa6a18d9099351ddbdfbe7299ccbf - modified version of mrsteyk's consistency decoder code
- birchlabs
- https://birchlabs.co.uk/machine-learning#vae-distillation - tiny MLP decoder & training code
- city96
- https://github.com/city96/SD-Latent-Interposer - converter between SD and SDXL latent spaces (with some artifacts)
- cccntu
- https://github.com/cccntu/fine-tune-models/ - vae finetuning code
- mosaicml
- mosaicml/diffusion#79 - vae training code
- various people working to get VAE training in diffusers
- Meta's Emu model uses roughly 10x compression instead of roughly 50x and changes the adversarial loss a bit https://huggingface.co/papers/2309.15807
- GAN vs. Consistency VAE comparisons https://twitter.com/anotherjesse/status/1721754763149099246
- You can remove the TAESD upsampling layers to get lower-res RGB images https://twitter.com/madebyollin/status/1720847470245343631
- Seems likely that Bing uses original VAE decoder but ChatGPT uses the consistency model https://twitter.com/madebyollin/status/1715182160142111082
- The Retro Diffusion team have a special decoder for pixel art https://twitter.com/RealAstropulse/status/1674431288894459909
- SD VAE KL noise has very little effect
- Variances are really small https://twitter.com/Ethan_smith_20/status/1719768055902027840
- Taking the encoder mean instead of sampling causes no difference in most cases (even though sampling is the technically correct choice) https://twitter.com/Birchlabs/status/1721714156275933608
- There's an attention layer in the VAE that also doesn't do very much, can be disabled for some speedup
- SDXL and SD VAE latents are totally incompatible https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/discussions/6
- The SD VAE encoder has an annoying bright spot that gets worse when encoding higher-resolution images (animation)
- original post https://www.reddit.com/r/StableDiffusion/comments/1ag5h5s/the_vae_used_for_stable_diffusion_1x2x_and_other/ / https://news.ycombinator.com/item?id=39215242
- commentary from ethan https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/a_recent_post_went_viral_claiming_that_the_vae_is/
- commentary from me https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/comment/koixp8d/?utm_source=reddit&utm_medium=web2x&context=3
- The FAL team trained their own openly-licensed 16-channel VAE (weights, training code) which also disables the midblock attention layer
- Ostris also trained an openly-licensed 16 channel VAE (weights) with reduced channel count for faster performance
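To make the "KL noise has very little effect" note above concrete, here is a minimal sketch comparing a reparameterized sample against just taking the mean. Everything here is hypothetical: the function names, the latent values, and the log-variance are made up to stand in for "really small" variances, not measured from any real VAE.

```python
import math
import random

def sample_posterior(mean, logvar, rng):
    """Reparameterized sample: mean + std * eps, with eps ~ N(0, 1)."""
    std = math.exp(0.5 * logvar)
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]

def posterior_mode(mean):
    """Deterministic alternative: skip the noise and use the mean directly."""
    return list(mean)

rng = random.Random(0)
mean = [0.8, -1.3, 2.1, 0.0]  # made-up latent values
logvar = -20.0                # made-up log-variance, i.e. a tiny std

sampled = sample_posterior(mean, logvar, rng)
max_diff = max(abs(s - m) for s, m in zip(sampled, mean))
print(max_diff)  # tiny compared to the latent values themselves
```

With a variance this small the sampled latents and the mean are nearly identical, which is consistent with the observation that either choice decodes to about the same image.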
Diagram of VAE
Animation of how VAE (decoder) is used during SD generation
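Before the decoder runs on the final latents, the latent normalization has to be undone. A minimal sketch of the `scaling_factor` / `shift_factor` handling, assuming the convention used in diffusers-style pipelines (subtract the shift, then multiply by the scale after encoding; invert both before decoding). The function names are hypothetical, and the constants are assumed values that should be checked against the config of whichever VAE is actually in use.

```python
def to_diffusion_space(raw_latents, scaling_factor, shift_factor=0.0):
    """Normalize raw encoder latents before handing them to the diffusion model."""
    return (raw_latents - shift_factor) * scaling_factor

def to_vae_space(latents, scaling_factor, shift_factor=0.0):
    """Undo the normalization before calling the VAE decoder."""
    return latents / scaling_factor + shift_factor

# Assumed constants (verify against each model's published config).
SD_SCALE = 0.18215                      # SD-style VAE: scale only, no shift
SD3_SCALE, SD3_SHIFT = 1.5305, 0.0609   # SD3-style VAE: scale and shift

x = 0.5  # stands in for one raw latent value; arrays / tensors work the same way
z = to_diffusion_space(x, SD3_SCALE, SD3_SHIFT)
roundtrip = to_vae_space(z, SD3_SCALE, SD3_SHIFT)
print(abs(roundtrip - x) < 1e-9)  # the two functions are inverses
```

Since the functions only use arithmetic operators, the same code should apply elementwise to NumPy arrays or PyTorch tensors.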