Observations from using LCM LoRAs with openvinotoolkit/stable-diffusion-webui.

LCM LoRAs + Accelerate with OpenVINO = 33x Speed

The struggle itself toward faster Stable Diffusion is enough to fill a man's heart. One must imagine potato laptop users happy.

This Gist accompanies my Reddit post (33x Speed on Potato Laptops with OpenVINO and LCM LoRAs) and explains how I got the 33x number on my potato laptop. I also included some notes and images from using LCM with OpenVINO in Stable Diffusion WebUI.

1: Contents

  • 2: Where the Speedup Comes From
  • 3: How To Get It Running
  • 4: Image Gallery and Why I Used 6 Steps
  • 5: Limitations
  • 6: Conclusion

2: Where the Speedup Comes From

Using PyTorch 2.0.1+cu118 to generate using 20 steps of the Euler a sampler: 09:46

Using PyTorch 2.1.0+cpu, the OpenVINO script, my potato laptop's iGPU, and the LCM LoRA to generate using 6 steps of the LCM sampler: 00:18

Yes, you read that right. 6 steps. Not the 4 step inference discussed in SDXL in 4 steps with Latent Consistency LoRAs. See section 4 for why I chose 6 steps.

2.1: PyTorch 2.1.0+cpu

I noted in my previous Gist that the PyTorch version openvinotoolkit's WebUI uses is faster than the AUTOMATIC1111 default.

Both 2.0.1+cu118 and 2.1.0+cpu seem to be faster now than when I last tested them, though I don't know what changed.

  • PyTorch version 2.0.1+cu118: 09:46, 29.33s/it, 1.00x speed
  • PyTorch version 2.1.0+cpu: 07:52, 23.61s/it, 1.24x speed

2.2: Accelerate with OpenVINO

I've noticed inconsistent model optimization times with the latest version of openvinotoolkit/stable-diffusion-webui. I disabled caching for these numbers, but I still don't recommend trusting the optimization times.

As I wrote in my previous Gist, optimization time is the difference between the reported durations of the 1st image and the 2nd image. I have an i7-8550U laptop with 16 GB of RAM. It runs on 15 watts of power and has Intel UHD 620 integrated graphics.

  • Accelerate with OpenVINO, CPU: 01:18 optimization time + 04:03 generation time, 12.16s/it, 2.41x speed

2.3: Intel(R) UHD Graphics 620

  • Accelerate with OpenVINO, GPU: 01:22 optimization time + 01:57 generation time, 5.86s/it, 5.01x speed

2.4: Latent Consistency LoRA

All the above numbers were from using 20 steps of Euler a. The number below is from using 6 steps of LCM.

  • Accelerate with OpenVINO, GPU, LCM: 01:56 optimization time + 00:18 generation time, 3.05s/it, 32.56x speed

Which we can round up to 33x. 🤓
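For anyone checking the arithmetic, here is a tiny sketch of how those ratios fall out of the reported timings (converted to seconds; the numbers are the ones quoted above):

baseline = 9 * 60 + 46          # PyTorch 2.0.1+cu118, 20 steps of Euler a: 09:46 -> 586 s
lcm_openvino = 18               # OpenVINO on the iGPU + LCM LoRA, 6 steps: 00:18 -> 18 s

print(baseline / 20)            # ~29.3 s/it for the baseline
print(lcm_openvino / 6)         # ~3.0 s/it with LCM (reported as 3.05 s/it)
print(baseline / lcm_openvino)  # ~32.6x, which rounds up to 33x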

LCMScheduler is about twice as fast per step as EulerAncestralDiscreteScheduler and I have no idea why. The hacked-in LCM sampler discussed in sections 3.1 and 3.2 below is about the same speed as Euler a. I thought Euler's method was the simplest sampler possible and that nothing could do less work per step. Maybe I shouldn't have slept through my numerical analysis course.

3: How To Get It Running

To install openvinotoolkit/stable-diffusion-webui, follow their instructions. I wrote an installation guide a couple months ago but it's outdated. For example, you now run .\webui-user.bat directly instead of the various setup batch files.

The LCM LoRA by Hugging Face is at latent-consistency/lcm-lora-sdv1-5.

Sections 3.1 and 3.2 are taken from fxwz's Reddit post (You can add the LCM sampler to A1111 with a little trick). Changing sampling.py and sd_samplers_kdiffusion.py allows you to set the LCM sampler and use the LCM LoRA (like <lcm-lora-sdv1-5:1>) outside of the OpenVINO script, in case you wanted to compare or something.

Sections 3.3 and 3.4 are for using OpenVINO acceleration and models with the LCM LoRA merged into them. This is what provides the 33x speedup.
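For reference, the ingredients the following subsections wire into the WebUI (the diffusers LCMScheduler, the LCM LoRA, a low step count, and CFG 1) look roughly like this as a standalone diffusers sketch. The base model ID and output filename are placeholders, and this runs plain PyTorch rather than OpenVINO:

import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

# Any SD v1.5 style model works here; the ID below is just a placeholder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)
# Swap in the latent consistency scheduler, reusing the existing config.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# Few steps and CFG 1 (no negative prompt consideration), as discussed in section 4.3.
image = pipe("two girls in a cafe", num_inference_steps=6, guidance_scale=1.0).images[0]
image.save("lcm_test.png")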

3.1: repositories\k-diffusion\k_diffusion\sampling.py

Append this to the bottom of the file:

@torch.no_grad()
def sample_lcm(model, x, sigmas, extra_args=None, callback=None, disable=None, noise_sampler=None):
    extra_args = {} if extra_args is None else extra_args
    noise_sampler = default_noise_sampler(x) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        # Ask the model for a full denoise at the current noise level.
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})

        # Take the denoised prediction as-is, then re-inject noise for the next step.
        x = denoised
        if sigmas[i + 1] > 0:
            x += sigmas[i + 1] * noise_sampler(sigmas[i], sigmas[i + 1])
    return x

3.2: modules\sd_samplers_kdiffusion.py

Insert this at line 39, directly under the line that mentions the 'restart' sampler:

('LCM', 'sample_lcm', ['lcm'], {}),

3.3: scripts\openvino_accelerate.py

As of the time of writing, openvinotoolkit/stable-diffusion-webui uses diffusers 0.23.0, which has the LCMScheduler needed for latent consistency LoRAs. However, the OpenVINO script does not list it as a usable sampler because the folks working on openvinotoolkit/stable-diffusion-webui have not yet verified that its output is correct.

Maybe it will be added by the time you read this. In any case, these were the changes I made to add LCM as an option.

Insert this at line 64, directly under the line that imports AutoencoderKL:

    LCMScheduler,

Inside the set_scheduler function starting at line 510, add two lines so the new LCM option will be recognized:

Old Code (starting at line 527):

    elif (sampler_name == "PLMS"):
        sd_model.scheduler = PNDMScheduler.from_config(sd_model.scheduler.config)
    else:
        sd_model.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_model.scheduler.config)

New Code:

    elif (sampler_name == "PLMS"):
        sd_model.scheduler = PNDMScheduler.from_config(sd_model.scheduler.config)
    elif (sampler_name == "LCM"):
        sd_model.scheduler = LCMScheduler.from_config(sd_model.scheduler.config)
    else:
        sd_model.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_model.scheduler.config)

At line 1174 where it makes the sampling method selection UI, change the line so you can choose "LCM":

Old Code:

        sampler_name = gr.Radio(label="Select a sampling method", choices=["Euler a", "Euler", "LMS", "Heun", "DPM++ 2M", "LMS Karras", "DPM++ 2M Karras", "DDIM", "PLMS"], value="Euler a")

New Code:

        sampler_name = gr.Radio(label="Select a sampling method", choices=["Euler a", "Euler", "LMS", "Heun", "DPM++ 2M", "LMS Karras", "DPM++ 2M Karras", "LCM", "DDIM", "PLMS"], value="Euler a")

There is one more optional change you can make. When making the model using StableDiffusionPipeline.from_single_file, the OpenVINO script leaves most arguments at their defaults. The default for scheduler_type is "pndm", which is special and also sets config["skip_prk_steps"] to True. The Euler a sampler doesn't respond to this, but the LCM sampler complains.

It outputs this to your PowerShell before the progress bar for each image:

The config attributes {'skip_prk_steps': True} were passed to LCMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.

You can stop that from showing up by changing the code starting at line 590. Don't worry, hardcoding scheduler_type="euler-ancestral" won't mess with which sampler ultimately gets used. All this is doing is stopping the model from automatically adding config["skip_prk_steps"]. The OpenVINO script later swaps out the sampler based on your choice.

Old Code:

            if model_config != "None":
                local_config_file = os.path.join(curr_dir_path, 'configs', model_config)
                sd_model = StableDiffusionPipeline.from_single_file(checkpoint_path, original_config_file=local_config_file, use_safetensors=True, variant="fp32", dtype=torch.float32)
            else:
                sd_model = StableDiffusionPipeline.from_single_file(checkpoint_path, original_config_file=checkpoint_config, use_safetensors=True, variant="fp32", dtype=torch.float32)

New Code:

            if model_config != "None":
                local_config_file = os.path.join(curr_dir_path, 'configs', model_config)
                sd_model = StableDiffusionPipeline.from_single_file(checkpoint_path, original_config_file=local_config_file, use_safetensors=True, variant="fp32", dtype=torch.float32, scheduler_type="euler-ancestral")
            else:
                sd_model = StableDiffusionPipeline.from_single_file(checkpoint_path, original_config_file=checkpoint_config, use_safetensors=True, variant="fp32", dtype=torch.float32, scheduler_type="euler-ancestral")

3.4: Merging the LCM LoRA into Models

Now that your Stable Diffusion WebUI is set up to use LCM sampling, you'll need a model with the LCM LoRA.

You can't just put <lcm-lora-sdv1-5:1> in the prompt though. openvinotoolkit/stable-diffusion-webui says the Accelerate with OpenVINO script can apply and use one LoRA if present, and the code seems to be doing the right thing, but I've never gotten the LoRA to actually work.

To work around this, you'll need to use bmaltais/kohya_ss, a GUI for the command-line utilities of kohya-ss/sd-scripts, to merge the LCM LoRA with your desired model.

Follow its installation instructions. Note that as of the time of writing, there's a bug where it fails to import a module called "library" and errors out when merging. BanzaiPi wrote a Reddit post ("No module named 'library'" Error when Captioning in Kohya SS) that explains how to fix this by creating a "PYTHONPATH" environment variable set to your kohya_ss install location.

To make a merged model, start the Kohya_ss GUI and do the following:

  • Click on Utilities, LoRA, then Merge LoRA
  • Find your SD model and the LCM LoRA
  • Set the LCM LoRA's merge ratio to 1
  • Choose a path to save the merged model to
  • Click Merge model

See the image: [screenshot of the Kohya_ss GUI Merge LoRA tab with the settings above]
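If you'd rather stay in Python, a rough alternative sketch is to fuse the LoRA with diffusers instead of Kohya_ss. This is not what I used, and save_pretrained writes a diffusers-format folder rather than the single .safetensors checkpoint the WebUI loads, so an extra conversion step would still be needed; the paths are placeholders:

from diffusers import StableDiffusionPipeline

# Load the base model you want to merge into (hub ID or local diffusers folder).
pipe = StableDiffusionPipeline.from_pretrained("your-base-model")
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipe.fuse_lora(lora_scale=1.0)                 # bake the LoRA in at ratio 1, like the GUI setting
pipe.save_pretrained("your-model-lcm-merged")  # diffusers folder, not a single-file checkpoint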

ℹ️ What about LCM_Dreamshaper_v7, the original model that introduced LCMs?

It doesn't work in Stable Diffusion WebUI.


The SimianLuo/LCM_Dreamshaper_v7 model is built different. It's not just Stable Diffusion v1.5 (v1-5-pruned-emaonly) with weights changed. It has its own code for running its constituent parts in a particular way, and expects that workflows and UIs will let it do its thing. On the other hand, Stable Diffusion WebUI assumes all models are structured like Stable Diffusion v1.5.

There is 0xbitches/sd-webui-lcm, an extension that adds support for LCM_Dreamshaper_v7 in Stable Diffusion WebUI. It lives in an "LCM" tab separate from the normal "txt2img" tab because it changes many things about how images are generated.

The significance of the LCM LoRA is twofold. It lets anyone use latent consistency sampling on any model without spending 32 A100 GPU training hours. It also lets people use the same workflows (like Stable Diffusion WebUI) that they're familiar with.

4: Image Gallery and Why I Used 6 Steps

For these comparison grids, I used four models:

  • Stable Diffusion v1.5 (v1-5-pruned-emaonly)
  • Realistic Vision
  • Dreamshaper
  • Counterfeit-V3.0

I used popular models in each category for this test. Realistic Vision, Dreamshaper, and Counterfeit have 714k, 618k, and 262k downloads on Civitai as of the time of writing.

ChilloutMix for illustration and MeinaMix for anime are more popular (they have 809k and 300k downloads respectively) but I wanted models with less NSFW tendency for this comparison.

Counterfeit uses the kl-f8-anime2.ckpt VAE from Waifu Diffusion v1.4. It produces green spots if you don't provide a VAE override. The other models used here have no overrides.

I used the prompt two girls in a cafe. It's short but introduces subtle things like poses, hand and object placement, and lighting that challenge the models a bit. Any model can generate a portrait of 1girl that looks good, but introduce 2girls in a scene with multiple objects and you'll start to notice things.

No negative prompt because I can't get OpenVINO to work with negative prompts.

4.1: Baseline

Here are the four models, 20 steps of Euler a, CFG 7, with no script selected (so using PyTorch on the CPU).

Baseline

The base Stable Diffusion is significantly worse than all other models. However, all models are lacking when it comes to demonstrating how hands and coffee cups are supposed to interact. That seems to be a limitation of SD v1.5.

Counterfeit puts the two girls a bit further away, which results in poor face generation. SD v1.5 generates faces poorly if they're smaller than about 50 pixels.

4.2: OpenVINO

Same settings, but using the Accelerate with OpenVINO script. The OpenVINO script has no grid option, so I set the batch count to 6, starting from seed 1111, and assembled the results into a grid manually.
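If you want to automate that, here is a minimal Pillow sketch that tiles six outputs into a 3 x 2 grid; the filenames and the assumption that every tile is the same size are mine:

from PIL import Image

paths = [f"out_{i}.png" for i in range(1, 7)]  # placeholder filenames for the 6 outputs
tiles = [Image.open(p) for p in paths]
w, h = tiles[0].size                           # 512 x 512 in this case
grid = Image.new("RGB", (3 * w, 2 * h))        # 3 columns, 2 rows
for idx, tile in enumerate(tiles):
    grid.paste(tile, ((idx % 3) * w, (idx // 3) * h))
grid.save("grid.png")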

You can open the baseline image in a new window and splitscreen them to compare.

OpenVINO

There are slight differences between these and the baseline.

4.3: LCM LoRA

Using the LCM LoRA (<lcm-lora-sdv1-5:1>) on top of the model. CFG is now 1 (where no consideration is given to the negative prompt). I've heard that the LCM sampler works up to CFG 2 (increasing consideration given to the negative prompt), but I can't get negative prompts to work in OpenVINO anyway. From my testing, CFGs above 1 degrade the output in terms of body proportions, number of characters, and colors and sharpness.

4 steps:

LCM LoRA 4 Steps

6 steps:

LCM LoRA 6 Steps

8 steps:

LCM LoRA 8 Steps

The 4 step grid has a very noticeable brown tint over everything. I assume that forcing the model to step quickly makes it alter most of the image to brown, since most pixels in images of cafes would have been brown. This tint gets lighter with 6 and 8 steps.

Compared to 20 steps of Euler a, both 6 and 8 steps of LCM are comparable in terms of image quality. Image composition still feels lacking, but that's probably the limitation of only having one third the number of steps to work with. The difference between 4 and 6 steps is significantly larger than between 6 and 8 steps. That's why I think LCM should be considered starting at 6 steps, at least for multi-character prompts. I've seen it generate good landscapes in as little as 2 steps though.

The LCM LoRA affects certain models in particular ways. Dreamshaper's style moves closer to realism, and the girls look more Asian. Counterfeit's scenes become darker.

4.4: OpenVINO + LCM LoRA

Using OpenVINO on models with the LCM LoRA merged into them.

4 steps:

OpenVINO LCM LoRA 4 Steps

6 steps:

OpenVINO LCM LoRA 6 Steps

8 steps:

OpenVINO LCM LoRA 8 Steps

The 4 step OpenVINO grid has less tint than the 4 step grid without OpenVINO. In fact, each grid looks slightly better than its equivalent without OpenVINO. It almost feels like these are 5, 7, and 9 steps. There might be something going on inside the diffusers LCMScheduler used by the OpenVINO script that isn't in the WebUI's hacked-in LCM sampler.

4.5: Experiments with Different Step Counts

I use anime models most of the time and noticed unusual behaviour at high step counts with them first.

Unfortunately, Counterfeit has a relatively flat style which makes it hard to show things. The effects at high step count are easier to see on a certain model with a bright, clear anime style. 😳

I also switched the prompt to 2girls, sitting, field, grass, flowers, looking at viewer because the effects (particularly putting the characters in shadow) are easier to notice in a daylight scene.

These use OpenVINO + LCM LoRA merged models. Uploaded as .jpg due to the 10 MB limit.

Counterfeit-V3.0:

Counterfeit OpenVINO LCM LoRA Step Comparison

Counterfeit likes to put characters in shadow and this is exaggerated by the LCM LoRA. Contrast also increases with step count. However, it produces good backgrounds at high step counts, so increasing step count may be useful if your goal is to generate flat anime style scenery and you aren't too concerned with characters.

At 15 to 20 steps, Counterfeit generates good scenery. Before this, contrast is too low and everything looks washed out. After this, differences are minimal and you'd probably get a better final result with a different sampler.

Hassaku v1.3:

Hassaku OpenVINO LCM LoRA Step Comparison

Hassaku with LCM LoRA increases the contrast significantly in just a few steps. I think that at some step counts it looks better than the 20 step Euler a baseline in the top row, but those are strangely low contrast results for this model. I don't know why this specific prompt gives such washed out results with Euler a. Characters in shadow are still apparent but this model is less severely impacted. Unfortunately, at step counts of 15 and above, it starts producing oversaturated and flat styles with repetitive backgrounds.

At 10 to 12 steps, Hassaku generates good characters and scenery. Before this, it can produce good quality down to 6 steps but with a higher chance of problems in the composition. After this, the results become oversaturated.

5: Limitations

  • You can't use the speedup if you need to use anything other than 512 x 512 txt2img. This includes Hires. fix in the WebUI as well as extensions.
  • You can't use negative prompting, meaning you need a model that gives decent results with no negative prompts.
  • There's going to be a reduction in quality compared to other samplers.

6: Conclusion

Any time a new technique reduces generation times to a fraction of what they originally were, it's reasonable to expect some tradeoff. A lot of the discussion around LCM LoRAs has been users misunderstanding the point of them. To someone running Stable Diffusion on the latest and greatest hardware, going from 0.6 seconds at 100% quality to 0.1 seconds at 80% quality isn't really useful. To potato laptop users, going from 120 seconds at 100% quality to 20 seconds at 80% quality is pretty neat.

I'm sure the smart people working on latent consistency and Stable Diffusion will find improvements in the coming months. In the meantime, people with RTX 4090s will keep using what works for them, and people with UHD 620 integrated graphics will check out this new method for improving anime waifu throughput.
