Last active March 5, 2024 16:51
Waifu Diffusion 1.3 Release Notes

HuggingFace Page for model download:

Model Overview

The Waifu Diffusion 1.3 model is a Stable Diffusion model finetuned from Stable Diffusion v1.4. I would like to personally thank everyone involved with the development and release of Stable Diffusion; none of this work on Waifu Diffusion would have been possible without their original codebase and the pre-existing model weights from which Waifu Diffusion was finetuned.

The data used for finetuning Waifu Diffusion 1.3 consisted of 680k text-image samples downloaded from a booru site that provides high-quality tagging and links to the original sources of the artworks uploaded there. I want to personally thank them as well: without their hard work, acquiring data of this quality for training would have required financially extreme lengths. The booru in question would like to remain anonymous due to the current climate regarding AI-generated imagery.

Within the HuggingFace Waifu Diffusion 1.3 Repository are four final models:

  • Float 16 EMA Pruned: At 2GB, this is the smallest available form of the model. It is intended for inference only.
  • Float 32 EMA Pruned: At 4GB, the float32 weights are the second smallest form of the model. Also intended for inference only.
  • Float 32 Full Weights: The full weights include the EMA weights, which are not used during inference. These can be used for either training or inference.
  • Float 32 Full Weights + Optimizer Weights: This 14GB checkpoint additionally contains all of the optimizer states used during training. There is no quality difference between this model and the others; it is intended for resuming training only.
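The file sizes above follow directly from the bytes-per-weight of each precision. A back-of-the-envelope check, where the parameter count (~1.07B for an SD 1.x UNet + VAE + text encoder) is my own estimate and not a number from these release notes:

```python
# Rough size arithmetic for the checkpoint variants. PARAMS is an
# assumed parameter count, not an official figure.
PARAMS = 1.07e9

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per float16 weight -> ~2 GB
fp32_gb = PARAMS * 4 / 1e9  # 4 bytes per float32 weight -> ~4 GB
```

The full-weight and optimizer-state checkpoints grow further because they carry extra copies of the parameters (EMA weights, optimizer moments) on top of the base weights.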

Several modifications were made to the data since the Waifu Diffusion 1.2 model, including:

  • Removing underscores.
  • Removing parentheses.
  • Separating each booru tag with a comma.
  • Randomizing the tag order.
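These transformations can be sketched in a few lines of Python. This is my own reconstruction of the described preprocessing, not the actual training code, and `normalize_tags` is a hypothetical helper name:

```python
import random

def normalize_tags(tags, rng=random):
    """Sketch of the WD 1.3 caption preprocessing: strip underscores and
    parentheses, shuffle the tag order, and join with commas."""
    cleaned = [t.replace("_", " ").replace("(", "").replace(")", "")
               for t in tags]
    rng.shuffle(cleaned)  # randomized tag order
    return ", ".join(cleaned)

caption = normalize_tags(["1girl", "long_hair", "hatsune_miku_(vocaloid)"])
```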

Training Process

The finetuning was conducted with a fork of the original Stable Diffusion codebase, known as Waifu Diffusion. The differences between the main repo and the fork are that the fork includes fixes to the original training code as well as a custom dataloader for training on locally stored text-image pairs.

For finetuning, the base Stable Diffusion 1.4 model was trained on the 680k text-image samples for 10 epochs at a flat learning rate of 5e-6.

The hardware used was a GPU VM instance which had the following specs:

  • 8x 48GB A40 GPUs
  • 24 AMD Epyc Milan vCPU cores
  • 192GB of RAM
  • 250GB Storage

Training took approximately 10 days to finish, and roughly $3.1k was spent on compute.
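Working the figures above through (a rough sanity check of my own, since the notes only give the totals):

```python
# Rough arithmetic from the stated figures: 680k samples x 10 epochs,
# ~10 days on one 8x A40 node, ~$3.1k total.
samples_seen = 680_000 * 10                # 6.8M text-image pairs processed
node_hours = 10 * 24                       # ~240 hours of the 8-GPU node
usd_per_node_hour = 3100 / node_hours      # ~$12.9/hour for the whole node
usd_per_gpu_hour = usd_per_node_hour / 8   # ~$1.6 per A40-hour
```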


As a result of removing underscores and randomizing tag order, prompting the Waifu Diffusion 1.3 model is now much easier than with its 1.2 predecessor.

So now, how do you generate something? Let's say you have a regular prompt like this:

a girl wearing a hoodie in the rain

To generate an image of a girl wearing a hoodie in the rain with Waifu Diffusion 1.3, you would write the prompt with booru tags, like:

original, 1girl, solo, portrait, hoodie, wearing hoodie


Looks boring, right? Thankfully, with booru tags, all you have to do to make the output more detailed is add more tags. Let's say you want to add rain and backlighting; then you would have a prompt that looks like:

original, 1girl, solo, portrait, hoodie, wearing hoodie, red hoodie, long sleeves, simple background, backlighting, rain, night, depth of field

There, that's more like it.

Overall, to get better outputs you have to:

  • Add more tags to your prompt, especially compositional tags. A full list can be found here.
  • Include a copyright tag that is associated with high quality art like genshin impact or arknights.
  • Be specific. Use specific tags. The model cannot assume what you want, so you have to be specific.
  • Do not use underscores. They're deprecated since Waifu Diffusion 1.2.
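The tips above can be folded into a tiny helper. This is a hypothetical sketch of my own (`build_prompt` is not part of any released tooling), showing a copyright tag leading the prompt, comma separation, and underscore removal:

```python
def build_prompt(copyright_tag, tags):
    """Assemble a WD 1.3-style prompt: copyright tag first,
    comma-separated tags, no underscores."""
    all_tags = [copyright_tag] + tags
    return ", ".join(t.replace("_", " ") for t in all_tags)

prompt = build_prompt("genshin impact",
                      ["1girl", "solo", "long_sleeves", "rain", "backlighting"])
```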


The Waifu Diffusion 1.3 Weights have been released under the CreativeML Open RAIL-M License.

Sample Generations


1girl, black eyes, black hair, black sweater, blue background, bob cut, closed mouth, glasses, medium hair, red-framed eyewear, simple background, solo, sweater, upper body, wide-eyed


1girl, aqua eyes, baseball cap, blonde hair, closed mouth, earrings, green background, hat, hoop earrings, jewelry, looking at viewer, shirt, short hair, simple background, solo, upper body, yellow shirt


1girl, black bra, black hair, black panties, blush, borrowed character, bra, breasts, cleavage, closed mouth, gradient hair, hair bun, heart, large breasts, lips, looking at viewer, multicolored hair, navel, panties, pointy ears, red hair, short hair, sweat, underwear


yakumo ran, arknights, 1girl, :d, animal ears, blonde hair, breasts, cowboy shot, extra ears, fox ears, fox shadow puppet, fox tail, head tilt, large breasts, looking at viewer, multiple tails, no headwear, short hair, simple background, smile, solo, tabard, tail, white background, yellow eyes


chen, arknights, 1girl, animal ears, brown hair, cat ears, cat tail, closed mouth, earrings, face, hat, jewelry, lips, multiple tails, nekomata, painterly, red eyes, short hair, simple background, solo, tail, white background

Team Members and Acknowledgements

This project would not have been possible without the incredible work by the CompVis Researchers. I would also like to personally thank everyone for their generous support in our Discord server! Thank you guys.

In order to reach us, you can join our Discord server.

Discord Server

Birch-san commented Oct 8, 2022

congrats on the release!

when pulling the repo from huggingface: you can save ~120 GB by using git sparse-checkout to download just the epochs/distributions you want, instead of everything.

mkdir waifu-diffusion-v1-3
cd waifu-diffusion-v1-3
git init
git remote add -f origin
git config core.sparseCheckout true
echo '/wd-v1-3-float32.ckpt
/.gitattributes' > .git/info/sparse-checkout
git lfs install
git pull --depth 1 origin main


Actually, I think wget is simpler for downloading a specific model file from Hugging Face.


gwern commented Oct 9, 2022

Out of curiosity, why do the two Touhou samples use arknights in the prompt instead? Are they supposed to look Arknights-ish somehow?


As a Touhou fan since high school, I'm so proud of this repo


Out of curiosity, why do the two Touhou samples use arknights in the prompt instead? Are they supposed to look Arknights-ish somehow?

I guess so? The characters' facial expressions look somewhat more Arknights-ish, and so do the brushstroke styles.


zyddnys commented Oct 9, 2022

Fantastic work making waifu generation available to everyone! I wonder, did you finetune the VAE of SD as well, or did you keep it frozen?


gwern commented Oct 10, 2022

@harubaru Waifu Diffusion request: could you dump, say, 100 random samples as a sheet of samples (maybe 4x25) for this writeup? Prompted samples are good, but people tend to over-focus on a few scenarios (not to mention keywords, like cargo-culting 'Rutkowski' everywhere) and ignore that the models are capable of doing far more. That would represent the diversity of the model's knowledge better than just a few hand-prompted samples. After discussing it a bit with Rivers and others, AFAICT, a prompt of the empty string "" should give a random sample from the full distribution, without bias towards any particular part, so you can get an idea of the full range of what it can do. EDIT: after some further thought, because W-D didn't do conditioning dropout (did it?), the random samples may be terrible. In that case, I suggest picking 100 random images from Danbooru and generating a sample with their tags, letting you do a real vs fake comparison for each. Not the same thing, but still very useful for someone trying to get an impression of W-D's overall capabilities.


That's so cool, I'm trying it now


do brackets and such make a difference? Or how do you increase the strength of certain tags?


sorry i forgot these gist things had comments lol @gwern, we didn't do any conditioning dropout even though we should have.


gwern commented Oct 15, 2022

Yeah, I guessed not. After looking at a dump of random samples from SD courtesy of Summer-Stay, there's a noticeable quality gap, which makes me think that SD (and any derivatives) have been undertrained in terms of unconditional generation - I suspect that this may be suboptimal, because if it doesn't understand images in general, then that must compromise generating specific images, one would think. Anyway, this shouldn't apply to grabbing sets of real-world tags from random Danbooru posts and comparing side-by-side.


it is very suboptimal. i just added conditioning dropout a couple days ago and i'm planning on retraining the whole model soon. i also added some code to use the penultimate layer of clip @gwern
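For context, conditioning dropout is usually implemented by blanking the caption for a small fraction of training samples, so the model also learns unconditional generation (which classifier-free guidance needs at sampling time). A minimal sketch; the 10% rate is an assumed value, not Waifu Diffusion's actual setting:

```python
import random

def maybe_drop_caption(caption, p_drop=0.1, rng=random):
    # Conditioning dropout sketch (assumed 10% rate): occasionally train
    # on an empty caption so the model learns p(x) as well as p(x|text).
    return "" if rng.random() < p_drop else caption
```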


is it possible to use blacklist tags? i searched for a moment and i couldn't find a way to subtract terms from a prompt (at least on the command line) like you would in a booru search (e.g. 1girl -1boy)


gwern commented Oct 26, 2022

@Nicolas-GE The equivalent to blacklist tags in SD terminology would be "negative prompts", I believe.


Birch-san commented Oct 27, 2022

@Nicolas-GE I'm aware of 3 ways to perform negative prompting:

the theory of multi-cond guidance feels the most correct to me, but people get very good results with negative prompting.
negative prompts or emphasis are also more performant, because multi-cond guidance requires you to denoise per condition: going from (uncond, cond) to (uncond, cond0, cond1) is 50% more conditions to perform denoising on.
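The negative-prompt variant can be written down in a couple of lines: the negative conditioning simply takes the place of the unconditional prediction in the standard classifier-free guidance equation. A toy numeric illustration of that equation only, not runnable diffusion code:

```python
def guided_eps(eps_neg, eps_pos, scale):
    # Classifier-free guidance with a negative prompt: push the noise
    # prediction away from the negative conditioning, toward the positive.
    return [n + scale * (p - n) for n, p in zip(eps_neg, eps_pos)]

out = guided_eps([0.0, 1.0], [1.0, 0.0], scale=7.5)  # -> [7.5, -6.5]
```

At scale 1.0 the result reduces to the positive prediction; larger scales exaggerate the difference, which is why heavy guidance can over-saturate outputs.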


does the sampling method affect quality at all?


can image2image?


Girl with horsebreeches
