Skip to content

Instantly share code, notes, and snippets.

@madebyollin
Last active February 10, 2024 02:25
Show Gist options
  • Save madebyollin/86b9596ffa4ab0fa7674a16ca2aeab3d to your computer and use it in GitHub Desktop.
Save madebyollin/86b9596ffa4ab0fa7674a16ca2aeab3d to your computer and use it in GitHub Desktop.
Stable Diffusion on Apple Silicon GPUs via CoreML; 2s / step on M1 Pro
# ------------------------------------------------------------------
# EDIT: I eventually found a faster way to run SD on macOS, via MPSGraph (~0.8s / step on M1 Pro):
# https://github.com/madebyollin/maple-diffusion
# The original CoreML-related code & discussion is preserved below :)
# ------------------------------------------------------------------
# you too can run stable diffusion on the apple silicon GPU (no ANE sadly)
#
# quick test portraits (each took 50 steps x 2s / step ~= 100s on my M1 Pro):
# * https://i.imgur.com/5ywISvm.png
# * https://i.imgur.com/94fMA22.png
# * https://i.imgur.com/WOpSweZ.png
# * https://i.imgur.com/Hns46Rk.png
# the default pytorch / cpu pipeline took ~4.2s / step and did not use the GPU
#
# how to use the GPU
# 0. https://coremltools.readme.io/docs/installation
# 1. https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
# (but stop at 'pipe = pipe.to("cuda")'. also I used fp32 checkpoints, but it might work fine with fp16)
# 2. run this gist's code in a new cell instead. it will be very slow the first time, faster on reruns.
# 3. if it worked, proceed to run the image generation cell, which should now use the GPU
import coremltools as ct
from pathlib import Path
import torch as th
def generate_coreml_model_via_awful_hacks(f, out_name):
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op, _TORCH_OPS_REGISTRY
import coremltools.converters.mil.frontend.torch.ops as cml_ops
orig_einsum = th.einsum
def fake_einsum(a, b, c):
if a == 'b i d, b j d -> b i j': return th.bmm(b, c.permute(0, 2, 1))
if a == 'b i j, b j d -> b i d': return th.bmm(b, c)
raise ValueError(f"unsupported einsum {a} on {b.shape} {c.shape}")
th.einsum = fake_einsum
if "broadcast_to" in _TORCH_OPS_REGISTRY: del _TORCH_OPS_REGISTRY["broadcast_to"]
@register_torch_op
def broadcast_to(context, node): return cml_ops.expand(context, node)
if "gelu" in _TORCH_OPS_REGISTRY: del _TORCH_OPS_REGISTRY["gelu"]
@register_torch_op
def gelu(context, node): context.add(mb.gelu(x=context[node.inputs[0]], name=node.name))
class Undictifier(th.nn.Module):
def __init__(self, m):
super().__init__()
self.m = m
def forward(self, *args, **kwargs): return self.m(*args, **kwargs)["sample"]
print("tracing")
f_trace = th.jit.trace(Undictifier(f), (th.zeros(2, 4, 64, 64), th.zeros(1), th.zeros(2, 77, 768)), strict=False, check_trace=False)
print("converting")
f_coreml_fp16 = ct.convert(f_trace,
inputs=[ct.TensorType(shape=(2, 4, 64, 64)), ct.TensorType(shape=(1,)), ct.TensorType(shape=(2, 77, 768))],
convert_to="mlprogram", compute_precision=ct.precision.FLOAT16, skip_model_load=True)
f_coreml_fp16.save(f"{out_name}")
th.einsum = orig_einsum
print("the deed is done")
class UNetWrapper:
def __init__(self, f, out_name="unet.mlpackage"):
self.in_channels = f.in_channels
if not Path(out_name).exists():
print("generating coreml model"); generate_coreml_model_via_awful_hacks(f, out_name); print("saved")
# not only does ANE take forever to load because it recompiles each time - it then doesn't work!
# and NSLocalizedDescription = "Error computing NN outputs."; is not helpful... GPU it is
print("loading saved coreml model"); f_coreml_fp16 = ct.models.MLModel(out_name, compute_units=ct.ComputeUnit.CPU_AND_GPU); print("loaded")
self.f = f_coreml_fp16
def __call__(self, sample, timestep, encoder_hidden_states):
args = {"sample_1": sample.numpy(), "timestep": th.tensor([timestep], dtype=th.int32).numpy(), "input_35": encoder_hidden_states.numpy()}
for v in self.f.predict(args).values(): return {"sample": th.tensor(v, dtype=th.float32)}
pipe.unet = UNetWrapper(pipe.unet)
@madebyollin
Copy link
Author

BTW i assume you've seen https://github.com/apple/ml-stable-diffusion 🔥

Yes, but it's good to link that repo in this thread!

AFAICT, most (all?) of the UNet speedup Apple got was through optimizations in CoreML itself. On 13.0, in my test device (M1 Pro MBP), Apple's CoreML model graph did not seem any faster than the one I exported back in August. I think this matches what Birchlabs observed in their testing - all of the speedup comes from 13.1 upgrade.

Once I upgrade to 13.1 I'd love to re-check how various SD runners compare (MPSGraph vs. Apple's CoreML model vs. my naive CoreML export of the diffusers model) :)

@x4080
Copy link

x4080 commented Jun 4, 2023

@madebyollin did you have any chance to test the latest stable diffusion speeds, how about compared to latest pytorch in apple silicon

@madebyollin
Copy link
Author

@x4080 I did an informal check of UNet speeds today, 2023-06-04 (on M1 Pro 16GB, macOS Ventura 13.4).

  • PyTorch 2.0.1 / MPS: 1.10 it / s (iirc this is using MPSGraph on GPU, just with a lot of overhead and minimal operator fusion)

image

  • CoreML / CPU_AND_GPU / Apple's SPLIT_EINSUM config: 1.39 it / s

image

  • MPSGraph / GPU (Maple Diffusion): 1.44 it / s (0.69 s / it)

image

  • CoreML / ALL (CPU+GPU+ANE) / Apple's SPLIT_EINSUM config: 1.85 it / s. This is Apple's recommended config for good reason, but I observe a huge on-initial-model-load delay waiting for ANECompilerService, which makes it annoying to use in practice 😞

image

Summarizing: I get 1.68x speedup from using Apple's CoreML+ANE approach vs. just using PyTorch / MPS. It seems like using the ANE is important on the lower-end M{1, 2} and M{1,2} Pro chips. For the higher-end chips, the ANE probably doesn't contribute anything, but I expect running a low-overhead GPU compute graph (CoreML or Maple) still helps.

@x4080
Copy link

x4080 commented Jun 4, 2023

@madebyollin thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment