

@AmericanPresidentJimmyCarter
Last active December 2, 2024 03:13
how to run flux on your 16gb potato
# First, in your terminal.
#
# $ python3 -m virtualenv env
# $ source env/bin/activate
# $ pip install torch torchvision transformers sentencepiece protobuf accelerate
# $ pip install git+https://github.com/huggingface/diffusers.git
# $ pip install optimum-quanto
import torch
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import FlowMatchEulerDiscreteScheduler, AutoencoderKL
from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
from diffusers.pipelines.flux.pipeline_flux import FluxPipeline
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast
dtype = torch.bfloat16
# schnell is the distilled turbo model. For the CFG distilled model, use:
# bfl_repo = "black-forest-labs/FLUX.1-dev"
# revision = "refs/pr/3"
#
# The undistilled model that uses CFG ("pro") which can use negative prompts
# was not released.
bfl_repo = "black-forest-labs/FLUX.1-schnell"
revision = "refs/pr/1"
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(bfl_repo, subfolder="scheduler", revision=revision)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=dtype)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=dtype)
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype, revision=revision)
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2", torch_dtype=dtype, revision=revision)
vae = AutoencoderKL.from_pretrained(bfl_repo, subfolder="vae", torch_dtype=dtype, revision=revision)
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype, revision=revision)
# Experimental: Try this to load in 4-bit for <16GB cards.
#
# from optimum.quanto import qint4
# quantize(transformer, weights=qint4, exclude=["proj_out", "x_embedder", "norm_out", "context_embedder"])
# freeze(transformer)
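# Quantize the transformer and the T5 text encoder to 8-bit floats, then freeze them
# so the quantized weights replace the original bf16 weights. This is the step that
# lets the model fit on a 16 GB card.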
quantize(transformer, weights=qfloat8)
freeze(transformer)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)
pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=None,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=None,
)
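# Attach the quantized modules after constructing the pipeline, then let
# enable_model_cpu_offload() move idle sub-models to CPU RAM to keep peak VRAM low.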
pipe.text_encoder_2 = text_encoder_2
pipe.transformer = transformer
pipe.enable_model_cpu_offload()
generator = torch.Generator().manual_seed(12345)
image = pipe(
    prompt='nekomusume cat girl, digital painting',
    width=1024,
    height=1024,
    num_inference_steps=4,
    generator=generator,
    guidance_scale=3.5,
).images[0]
image.save('test_flux_distilled.png')
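# Note: if you switch bfl_repo to FLUX.1-dev above, 4 steps will look under-baked;
# dev is usually run with more steps (e.g. num_inference_steps=28) and actually uses
# guidance_scale, whereas schnell effectively ignores it.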
@DotPoker2

DotPoker2 commented Aug 2, 2024

I mean, I would, but I get an error!

Security issues. Click those links to learn more. But try a normal command prompt instead, and do not try to do this under Program Files. Create a new folder for your Python experiments.

This is my 2 TB hard drive where I store all my Automatic1111 work and my Stable Diffusion folders; I'm not storing it on my 250 GB SSD. The link takes me to Bing... Also, I can't use a normal command prompt, as I screwed it up trying to relocate the home folder from my SSD to my 2 TB hard drive, and that was 5 months ago. Not even repairing it works; I'd have to uninstall and reinstall Windows to fix it.

@DotPoker2

You can also play with the model in Google Colab; just make sure you select a GPU via Runtime > Change runtime type > T4 (it's free) and then Runtime > Run all. It takes a few minutes to download the weights and run. Have fun:

flux.1-schnell - https://colab.research.google.com/github/camenduru/flux-jupyter/blob/main/flux.1-schnell_jupyter.ipynb
flux.1-dev - https://colab.research.google.com/github/camenduru/flux-jupyter/blob/main/flux.1-dev_jupyter.ipynb
flux.1-dev-image-2-image - https://colab.research.google.com/github/camenduru/flux-jupyter/blob/main/flux.1-dev-i2i_jupyter.ipynb

OK, so it's doing... something, not sure what, but it's loading the "prompt"; is that a good thing? Also, how do I run my own prompts, and where does it save the images it makes?
[screenshot attached]
Edit: it generated the image... this is really not like what I'm used to with Stable Diffusion. It took a few minutes, too.
[screenshot of the generated image attached]
Is there any way I can get Euler A with the Colab file, or am I stuck with the old, slow Euler?

@SoftologyPro

SoftologyPro commented Aug 2, 2024

I'm using an RTX 4000 SFF, a 20 GB GPU. It takes about 30 to 60 seconds before Gradio becomes available. After that I can render pretty quickly.

That is faster than mine. Maybe the commercial grade GPUs do this faster. Do you see your physical RAM increase by around 40 GB as it gets ready?

Yes, a lot of physical memory builds up before it's ready. About 40 GB here also.

OK, thanks for the gradio script. I am adding support for it in Visions of Chaos.

@SoftologyPro

Does anybody know if the diffusers-based Flux allows image-to-image?
i.e., style an image with a prompt: feed in a portrait of someone and use a prompt like "a zombie" to make them look like a zombie.

@AmericanPresidentJimmyCarter
Author

Does anybody know if the diffusers-based Flux allows image-to-image? i.e., style an image with a prompt: feed in a portrait of someone and use a prompt like "a zombie" to make them look like a zombie.

huggingface/diffusers#9070

@kareykar

kareykar commented Aug 4, 2024

Can you tell me how I use this? I'm sorry, I'm still new.

@AdeilnAcinrst

My potato 1070 (8GB) worked with 4-bit, thank you very much. I was banging my head for the entire weekend.

@pesto-lover

I generated a 1K image using qint4 on a 2080-ti (thanks!) The time spent using the GPU was just a few seconds, but there were many minutes of CPU computation before that. Is there a way to pay that startup time only once if you have a large number of prompts that you want to loop through?

@SoftologyPro

For anyone who needs it, I've added a web UI to this snippet so you can test it from the browser: https://gist.github.com/VvanGemert/ab9c3ce63f12d429cf6075dbd764e57c

Can you modify the Gradio script to create and save PNG rather than WEBP format?

@AmericanPresidentJimmyCarter
Author

I generated a 1K image using qint4 on a 2080-ti (thanks!) The time spent using the GPU was just a few seconds, but there were many minutes of CPU computation before that. Is there a way to pay that startup time only once if you have a large number of prompts that you want to loop through?

The quantize steps take a long time. If you want to keep making images, just add an input() call inside a while loop after loading the model, so you can keep entering different prompts, as sketched below.
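A minimal sketch of that idea, assuming the pipe object from the gist above has already been built and quantized:

count = 0
while True:
    prompt = input('prompt (empty line to quit): ').strip()
    if not prompt:
        break
    image = pipe(
        prompt=prompt,
        width=1024,
        height=1024,
        num_inference_steps=4,
        generator=torch.Generator().manual_seed(12345 + count),
        guidance_scale=3.5,
    ).images[0]
    image.save(f'test_flux_distilled_{count}.png')
    count += 1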

@acaladolopes

acaladolopes commented Aug 5, 2024

Here's a quick and dirty Flask API wrapper that uses this script. It's hardcoded for flux-dev. Don't forget to "pip install Flask" in your venv.
Call it with an HTTP POST request with a JSON payload like this:

data = {
    "prompt": "an angel",
    "width": 1024,
    "height": 1024,
    "num_inference_steps": 20,
    "guidance_scale": 3.5
}

You'll get a base64 PNG image in the response. Tested and working on Vast.ai rented GPUs.
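The wrapper itself isn't reproduced in the thread; a minimal sketch of such an endpoint, assuming the pipe object from the gist is already built in the same module and using an illustrative /generate route, might look like:

import base64
import io

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json()
    image = pipe(
        prompt=data['prompt'],
        width=data.get('width', 1024),
        height=data.get('height', 1024),
        num_inference_steps=data.get('num_inference_steps', 20),
        guidance_scale=data.get('guidance_scale', 3.5),
    ).images[0]
    # Encode the PIL image as a base64 PNG for the JSON response.
    buf = io.BytesIO()
    image.save(buf, format='PNG')
    return jsonify({'image': base64.b64encode(buf.getvalue()).decode('utf-8')})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)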

@bitnom

bitnom commented Aug 7, 2024

you saved the planet

@KoppAlexander

Is it possible to save/load quantized models?

Yes.

from optimum.quanto import freeze, qfloat8, quantize
from optimum.quanto.models.diffusers_models import QuantizedDiffusersModel
from optimum.quanto.models.transformers_models import QuantizedTransformersModel

from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
from transformers import AutoModelForCausalLM, T5EncoderModel

quantize(transformer, weights=qfloat8)
freeze(transformer)
transformer.save_pretrained('/home/user/storage/hf_cache/flux_distilled/transformer')

quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)
text_encoder_2.save_pretrained('/home/user/storage/hf_cache/flux_distilled/text_encoder_2')

# To load...
#
# class QuantizedT5EncoderModelForCausalLM(QuantizedTransformersModel):
#     base_class = T5EncoderModel

# class QuantizedFluxTransformer2DModel(QuantizedDiffusersModel):
#     base_class = FluxTransformer2DModel

# text_encoder_2 = T5EncoderModel.from_pretrained('/home/user/storage/hf_cache/flux_distilled/text_encoder_2', torch_dtype=dtype)

# transformer = QuantizedFluxTransformer2DModel.from_pretrained('/home/user/storage/hf_cache/flux_distilled/transformer').to(torch_dtype=dtype)

Loading them seems very slow, there might be a bug.

When I save the model using the code above, it does not save qmap.json, even though it should;
see here for diffusers and here for transformers.

If I then try to load the saved model, I get the error

ValueError: No quantization map found in C:\...... is this a quantized model ?

since there is no qmap.json file, see here

Any thoughts on this?

@KoppAlexander

Got it almost running by manually saving the qmap; and also changed to "auto_class = T5EncoderModel":

class QuantizedFlux2DModel(QuantizedDiffusersModel):
    base_class = FluxTransformer2DModel

class QuantizedT5Model(QuantizedTransformersModel):
    auto_class = T5EncoderModel

# quantize text_encoder_2 qfloat8
print("start loading text_encoder_2...")
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)#, revision=revision)
print("start quantizing text_encoder_2...")
quantize(text_encoder_2, weights=qfloat8)
print("start freezing text_encoder_2...")
freeze(text_encoder_2)
print("start saving text_encoder_2...")
text_encoder_2.save_pretrained(f"{bfl_repo}/q_text_encoder_2")
# Save quantization map to be able to reload the model
qmap_name = os.path.join(f"{bfl_repo}/q_text_encoder_2", f"{QuantizedDiffusersModel.BASE_NAME}_qmap.json")
qmap = quantization_map(text_encoder_2)
with open(qmap_name, "w", encoding="utf8") as f:
    json.dump(qmap, f, indent=4)

Now it's saying "AttributeError: type object 'T5EncoderModel' has no attribute 'from_config'. Did you mean: '_from_config'?"

@KoppAlexander

Got it finally running with some help from here

# imports used below that are not in the original gist
import json
import os

from optimum.quanto import quantization_map

# quantize text_encoder_2 qfloat8
print("start loading text_encoder_2...")
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)#, revision=revision)
print("start quantizing text_encoder_2...")
quantize(text_encoder_2, weights=qfloat8)
print("start freezing text_encoder_2...")
freeze(text_encoder_2)
print("start saving text_encoder_2...")
text_encoder_2.save_pretrained(f"{bfl_repo}/q_text_encoder_2")
# Save quantization map to be able to reload the model
qmap_name = os.path.join(f"{bfl_repo}/q_text_encoder_2", f"{QuantizedTransformersModel.BASE_NAME}_qmap.json")
qmap = quantization_map(text_encoder_2)
with open(qmap_name, "w", encoding="utf8") as f:
    json.dump(qmap, f, indent=4)
print("start loading text_encoder_2...")
T5EncoderModel.from_config = lambda c: T5EncoderModel(c)  # Duct tape for Quanto support.
text_encoder_2 = QuantizedT5Model.from_pretrained(f"{bfl_repo}/q_text_encoder_2")#, torch_dytpe=dtype)

@KoppAlexander

KoppAlexander commented Aug 8, 2024

Still need some help with another thing: I cannot load the quantized transformer. Here is my code; I guess the problem is in the final loading step:

# manual classes are necessary since optimum.quanto does not support these yet
class QuantizedFlux2DModel(QuantizedDiffusersModel):
    base_class = FluxTransformer2DModel

class QuantizedT5Model(QuantizedTransformersModel):
    auto_class = T5EncoderModel


# quantize transformer qfloat8
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)#, revision=revision)
quantize(transformer, weights=qfloat8)
freeze(transformer)

# save transformer qfloat8
transformer.save_pretrained(f"{bfl_repo}/q_transformer_qfloat8")
# Save quantization map to be able to reload the model
qmap_name = os.path.join(f"{bfl_repo}/q_transformer_qfloat8", f"{QuantizedDiffusersModel.BASE_NAME}_qmap.json")
qmap = quantization_map(transformer)
with open(qmap_name, "w", encoding="utf8") as f:
    json.dump(qmap, f, indent=4)

# load transformer  # currently NOT WORKING
transformer = QuantizedFlux2DModel.from_pretrained(f"{bfl_repo}/q_transformer_qfloat8", torch_dtype=dtype)  # this is what I would probably need, or something like (low_cpu_mem_usage=True, device_map='auto')
transformer = QuantizedFlux2DModel.from_pretrained(f"{bfl_repo}/q_transformer_qfloat8").to(torch_dtype=dtype)  # this would probably run, but not on my machine

When I run this code:

transformer = QuantizedFlux2DModel.from_pretrained(f"{bfl_repo}/q_transformer_qfloat8", torch_dtype=dtype)

I get "TypeError: QuantizedDiffusersModel.from_pretrained() got an unexpected keyword argument 'torch_dtype'", which arises because from_pretrained() does not take a torch_dtype argument here.

@sayakpaul Do you have an idea how to solve this? Thank you!

@AlexanderGoryun

AlexanderGoryun commented Aug 13, 2024

A simple torch.save() / torch.load() combo for saving the whole model works fine and is relatively fast (loading the transformer takes about 10 seconds and the text encoder ~4 seconds). You'll get security warnings from pickle but they can be ignored in this case. Also make sure to put model.eval() after torch.load() as stated here: https://pytorch.org/tutorials/beginner/saving_loading_models.html.

Code example:

...
# quantizing and saving
quantize(transformer, weights=qfloat8)
freeze(transformer)
torch.save(transformer, 'D:/models/transformer.pt')

# loading
transformer = torch.load('D:/models/transformer.pt')
transformer.eval()
...

@smthemex

A simple torch.save() / torch.load() combo for saving the whole model works fine and is relatively fast (loading the transformer takes about 10 seconds and the text encoder ~4 seconds). You'll get security warnings from pickle but they can be ignored in this case. Also make sure to put model.eval() after torch.load() as stated here: https://pytorch.org/tutorials/beginner/saving_loading_models.html.

Code example:

...
# quantizing and saving
quantize(transformer, weights=qfloat8)
freeze(transformer)
torch.save(transformer, 'D:/models/transformer.pt')

# loading
transformer = torch.load('D:/models/transformer.pt')
transformer.eval()
...

Thanks, I tested it; it is relatively fast. Code example:
# loading

transformer = torch.load('D:/models/transformer.pt')
transformer.eval()
pipe=FluxPipeline(...)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2

@mahmoudz15

Stupid question, but what exactly are we supposed to do with this code to get it to work?

@SoftologyPro

SoftologyPro commented Aug 18, 2024

transformer = torch.load('D:/models/transformer.pt')
transformer.eval()

Thanks. That gets the run time down from 2m45s to 1m01s.
Using the same logic to save and load text_encoder_2 gets the total runtime down to 25 seconds.
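For reference, the same torch.save()/torch.load() pattern applied to the T5 encoder (paths are illustrative):

# one-off: quantize, freeze and pickle the T5 encoder
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)
torch.save(text_encoder_2, 'D:/models/text_encoder_2.pt')

# later runs: load the pickled encoder and skip quantization entirely
text_encoder_2 = torch.load('D:/models/text_encoder_2.pt')
text_encoder_2.eval()
pipe.text_encoder_2 = text_encoder_2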

@SoftologyPro

SoftologyPro commented Aug 21, 2024

Can you modify the script to support the various Flux LoRAs out there at the moment?

I tried adding in
pipeline.load_lora_weights(adapter_id)
from here
https://huggingface.co/zouzoumaki/flux-loras
but that has issues with optimum.

LoRA support would be good, especially now that the model-loading change gets the whole script down to only 25 seconds.

@AmericanPresidentJimmyCarter
Author

You need to

pipe.load_lora_weights("./pytorch_lora_weights.safetensors")
pipe.fuse_lora(lora_scale=0.125)
pipe.unload_lora_weights()

But it takes a long time. I am still waiting for a better way to do this with quanto.

@SoftologyPro

You need to

pipe.load_lora_weights("./pytorch_lora_weights.safetensors")
pipe.fuse_lora(lora_scale=0.125)
pipe.unload_lora_weights()

But it takes a long time. I am still waiting for a better way to do this with quanto.

Like this?

pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=text_encoder_2,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=transformer,
)

pipe.load_lora_weights("./pytorch_lora_weights.safetensors")
pipe.fuse_lora(lora_scale=0.125)
pipe.unload_lora_weights()

because that gives these errors

Traceback (most recent call last):
  File "flux_on_potato.py", line 146, in <module>
    pipe.load_lora_weights("./pytorch_lora_weights.safetensors")
  File "venv\lib\site-packages\diffusers\loaders\lora_pipeline.py", line 1620, in load_lora_weights
    self.load_lora_into_transformer(
  File "venv\lib\site-packages\diffusers\loaders\lora_pipeline.py", line 1700, in load_lora_into_transformer
    incompatible_keys = set_peft_model_state_dict(transformer, state_dict, adapter_name)
  File "venv\lib\site-packages\peft\utils\save_and_load.py", line 395, in set_peft_model_state_dict
    load_result = model.load_state_dict(peft_model_state_dict, strict=False)
  File "venv\lib\site-packages\torch\nn\modules\module.py", line 2201, in load_state_dict
    load(self, state_dict)
  File "venv\lib\site-packages\torch\nn\modules\module.py", line 2189, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  File "venv\lib\site-packages\torch\nn\modules\module.py", line 2189, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  File "venv\lib\site-packages\torch\nn\modules\module.py", line 2189, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  File "venv\lib\site-packages\torch\nn\modules\module.py", line 2183, in load
    module._load_from_state_dict(
  File "venv\lib\site-packages\optimum\quanto\nn\qmodule.py", line 159, in _load_from_state_dict
    deserialized_weight = QBytesTensor.load_from_state_dict(
  File "venv\lib\site-packages\optimum\quanto\tensor\qbytes.py", line 90, in load_from_state_dict
    inner_tensors_dict[name] = state_dict.pop(prefix + name)
KeyError: 'time_text_embed.timestep_embedder.linear_1.weight._data'

@AmericanPresidentJimmyCarter
Author

Yeah, something about that lora isn't working :( You can open an issue at https://github.com/huggingface/diffusers/issues

@SoftologyPro

Yeah, something about that lora isn't working :( You can open an issue at https://github.com/huggingface/diffusers/issues

So you can use other LoRAs with that syntax? Can you share one that works?

@wuziq

wuziq commented Sep 3, 2024

I tried making the edit in the script for quantize, i.e.

#quantize(transformer, weights=qfloat8)
quantize(transformer, weights=qint4, exclude=["proj_out", "x_embedder", "norm_out", "context_embedder"])

You also need to edit the optimum.quanto import to from optimum.quanto import freeze, qfloat8, qint4, quantize so it knows what qint4 is. But the script was even slower, as it seems to only use the CPU now? I gave up after the iteration counter just sat at 0/4.

I noticed the same thing too: 0% GPU usage and only CPU usage. Were you ever able to resolve this?

Edit: I should note that this seems to happen only on Windows. On my Ubuntu machine, the GPU does get utilized.

@SoftologyPro

I never got an answer so I gave up on LoRA support.

@ateamthebset

You need to add the LoRA before quantizing.
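One reading of that advice, as an untested sketch that reuses the gist's variable names: fuse the LoRA into the bf16 transformer first, then quantize.

pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=None,       # quantized T5 attached afterwards, as in the gist
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=transformer,   # not quantized yet
)

# Fuse the LoRA while the transformer weights are still plain bf16 tensors.
pipe.load_lora_weights('./pytorch_lora_weights.safetensors')
pipe.fuse_lora(lora_scale=0.125)
pipe.unload_lora_weights()

# Only now quantize and freeze the LoRA-fused transformer, then attach the rest.
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
pipe.text_encoder_2 = text_encoder_2
pipe.enable_model_cpu_offload()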

@matabear-wyx

In my case, qfloat8 works well on a V100, but a 4090 needs qfloat8_e5m2; I don't really know the reason, though :)

@owquresh

owquresh commented Nov 6, 2024

[attached output image: test_flux_distilled]
My generated image looks like this. What is the issue with the script?

@nirgoren

[attached output image: test_flux_distilled] My generated image looks like this. What is the issue with the script?

I got a similar result on my RTX 4060 Ti (Ubuntu); using qfloat8_e5m2 instead of qfloat8 solved it for me (not sure why).
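For anyone hitting the same broken output, the change is just the weight type passed to quantize(), a sketch with the rest of the gist unchanged:

from optimum.quanto import freeze, qfloat8_e5m2, quantize

quantize(transformer, weights=qfloat8_e5m2)
freeze(transformer)
quantize(text_encoder_2, weights=qfloat8_e5m2)
freeze(text_encoder_2)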
