Stable Diffusion's VAE is a neural network that encodes and decodes images into a compressed "latent" format. The encoder performs 48x lossy compression (a 512×512×3 RGB image becomes a 64×64×4 latent), and the decoder generates new detail to fill in the gaps.
(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)
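As a quick sanity check on that 48x number, here's the shape arithmetic as a standalone sketch (in practice you'd see these shapes on the tensors coming out of e.g. diffusers' `AutoencoderKL`):

```python
# SD's VAE maps a 512x512 RGB image to a 64x64 latent with 4 channels:
# the encoder downsamples 8x in each spatial dimension.
image_shape = (3, 512, 512)             # channels, height, width
latent_shape = (4, 512 // 8, 512 // 8)  # (4, 64, 64)

def numel(shape):
    """Total number of values in a tensor of this shape."""
    n = 1
    for d in shape:
        n *= d
    return n

compression = numel(image_shape) / numel(latent_shape)
print(latent_shape, compression)  # (4, 64, 64) 48.0
```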
This document is a big pile of various links with more info.
- CompVis
- [2021] The original decoder training code (using a vq bottleneck instead of a kl bottleneck) is from the taming transformers paper https://github.com/CompVis/taming-transformers
- [2022] The original SD VAE is the kl-f8 model from the latent diffusion paper https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models
- Stability
- [2022-10] SD-VAE-FT: These widely-used VAEs are finetunes of the compvis ones https://huggingface.co/stabilityai/sd-vae-ft-ema https://huggingface.co/stabilityai/sd-vae-ft-mse (only the decoder is changed) https://twitter.com/StabilityAI/status/1586183361361428480
- [2023-06] SDXL-VAE is a retrained-from-scratch model with the same code / architecture as the original https://huggingface.co/stabilityai/sdxl-vae https://arxiv.org/abs/2307.01952 https://github.com/Stability-AI/generative-models/tree/main. This comes in two versions (0.9 and 1.0) but the 0.9 one is generally considered to look better. Not clear if the training code in SGM repo works yet.
- [2023-11] The SVD-VAE is a finetuned version of SD-VAE-FT with added temporal (3d) convolutions in the decoder, intended to decode smooth (non-flickery) videos from batches of SD latents.
- OpenAI
- [2023-11] OpenAI trained a consistency-model decoder for the original SD VAE latent space https://github.com/openai/consistencydecoder https://cdn.openai.com/papers/dall-e-3.pdf. This is like 10x the size of the standard VAE, but quality is supposed to be higher.
- madebyollin (me)
  - https://github.com/madebyollin/taesd - tiny distilled version of both the SD and SDXL autoencoders (also removes some annoying scaling stuff and the stochasticity)
- https://huggingface.co/madebyollin/sdxl-vae-fp16-fix - finetuned version of the SDXL (0.9) VAE that works in fp16 precision without NaNs
- https://gist.github.com/madebyollin/865fa6a18d9099351ddbdfbe7299ccbf - modified version of mrsteyk's consistency decoder code
- birchlabs
- https://birchlabs.co.uk/machine-learning#vae-distillation - tiny MLP decoder & training code
- city96
- https://github.com/city96/SD-Latent-Interposer - converter between SD and SDXL latent spaces (with some artifacts)
- cccntu
- https://github.com/cccntu/fine-tune-models/ - vae finetuning code
- mosaicml
- mosaicml/diffusion#79 - vae training code
- Various people are working to get VAE training into diffusers
- Meta's Emu model uses a 16-channel latent (~12x compression instead of 48x) and changes the adversarial loss a bit https://huggingface.co/papers/2309.15807
- GAN vs. Consistency VAE comparisons https://twitter.com/anotherjesse/status/1721754763149099246
- You can remove the TAESD upsampling layers to get lower-res RGB images https://twitter.com/madebyollin/status/1720847470245343631
- Seems likely that Bing uses original VAE decoder but ChatGPT uses the consistency model https://twitter.com/madebyollin/status/1715182160142111082
- The Retro Diffusion team have a special decoder for pixel art https://twitter.com/RealAstropulse/status/1674431288894459909
- SD VAE KL noise has very little effect
- Variances are really small https://twitter.com/Ethan_smith_20/status/1719768055902027840
- Taking the encoder mean instead of sampling causes no difference in most cases (even though sampling is the technically correct choice) https://twitter.com/Birchlabs/status/1721714156275933608
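To make the "sampling ≈ mean" point concrete, here's a toy illustration (the `logvar` value here is made up for the sketch; the linked tweets report that the real SD VAE variances are similarly tiny):

```python
import math
import random

random.seed(0)

# The encoder predicts a per-element mean and log-variance; the "correct" VAE
# behavior is to sample mean + std * noise. But when logvar is very negative,
# std is vanishingly small, so a sample is indistinguishable from the mean.
mean = 1.2345
logvar = -20.0                    # illustrative value for this sketch
std = math.exp(0.5 * logvar)      # ~4.5e-5
sample = mean + std * random.gauss(0, 1)
print(abs(sample - mean))         # tiny: sampling barely changes anything
```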
- There's an attention layer in the VAE (one in the encoder's mid-block, one in the decoder's) that also doesn't do very much; it can be disabled for some speedup
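A hedged sketch of disabling that attention layer, assuming diffusers' `AutoencoderKL` module layout (the attribute paths `encoder.mid_block.attentions` / `decoder.mid_block.attentions` are from diffusers and may differ across versions; verify against yours):

```python
import torch.nn as nn

class SkipAttention(nn.Module):
    """No-op stand-in for an attention block: passes hidden states through."""
    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states

def remove_vae_attention(vae):
    # Replace the single mid-block attention in both the encoder and decoder.
    for mid_block in (vae.encoder.mid_block, vae.decoder.mid_block):
        mid_block.attentions = nn.ModuleList(
            SkipAttention() for _ in mid_block.attentions
        )
    return vae
```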
- SDXL and SD VAE latents are totally incompatible https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/discussions/6
- The SD VAE encoder has an annoying bright spot that gets worse when encoding higher-resolution images (animation)
- original post https://www.reddit.com/r/StableDiffusion/comments/1ag5h5s/the_vae_used_for_stable_diffusion_1x2x_and_other/ / https://news.ycombinator.com/item?id=39215242
- commentary from ethan https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/a_recent_post_went_viral_claiming_that_the_vae_is/
- commentary from me https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/comment/koixp8d/?utm_source=reddit&utm_medium=web2x&context=3
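Re: the "annoying scaling stuff" TAESD removes: SD latents are conventionally multiplied by a fixed scale factor (0.18215 for SD 1.x/2.x, 0.13025 for SDXL; `vae.config.scaling_factor` in diffusers) so they're roughly unit-variance for the diffusion model, and divided back out before decoding. A minimal sketch:

```python
# Scale factors from the released model configs.
SD_SCALE = 0.18215    # SD 1.x / 2.x
SDXL_SCALE = 0.13025  # SDXL

def vae_to_unet(raw_latent, scale=SD_SCALE):
    """Scale raw VAE-encoder output into the space the diffusion UNet expects."""
    return raw_latent * scale

def unet_to_vae(unet_latent, scale=SD_SCALE):
    """Undo the scaling before handing a latent to the VAE decoder."""
    return unet_latent / scale

print(unet_to_vae(vae_to_unet(2.0)))  # round-trips back to ~2.0
```

TAESD bakes this scaling into its weights, so its latents can be used directly.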
*(Diagram of VAE)*

*(Animation of how the VAE decoder is used during SD generation)*