Simulation results, ranked from highest to lowest latency. Generated using https://gist.github.com/philipturner/d408351d68b5b1701bb651d4542e26e6
Raw data is private, but there's an older, publicly available substitute at https://gist.github.com/philipturner/94e7c5094915f23438440d49da823c9d
The statistics for MFA Winograd are speculative, and eventual performance may fall short of them. For example, Winograd may not be finished this summer, and may still be unfinished when the code is first published.
System:
- 32-core Apple 7 GPU, 1.296 GHz
- 409.6 GB/s bandwidth
- macOS 14 Developer Beta
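For context, these specs imply a rough roofline. A minimal sketch, assuming 128 FP32 ALUs per core and 2 FLOPs per FMA (typical figures for the Apple 7 family, not values taken from the raw data):

```swift
// Back-of-envelope roofline implied by the specs above. The ALU count and
// FLOPs-per-FMA figures are assumptions about the Apple 7 GPU family.
let cores = 32
let clockGHz = 1.296
let aluPerCore = 128           // assumed FP32 ALUs per GPU core
let flopsPerFMA = 2.0          // one fused multiply-add = 2 FLOPs
let peakGFLOPS = Double(cores * aluPerCore) * flopsPerFMA * clockGHz
// ≈ 10617 GFLOPS (~10.6 TFLOPS)
let bandwidthGBps = 409.6
let ridgeFLOPsPerByte = peakGFLOPS / bandwidthGBps
// ≈ 25.9 FLOP/byte: kernels below this arithmetic intensity (softmax,
// elementwise, normalization) are bandwidth-bound, which is why fusing
// them away matters more than optimizing them in isolation.
print(peakGFLOPS, ridgeFLOPsPerByte)
```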
Simulated (Metal Performance Shaders)
- simulated latency @ 30 steps:
  - 15.3 s
- simulation config:
  - framework: MPS
  - precision: F16
  - operations simulated: 1650 / 1685
Distribution | Latency | Operation Class |
---|---|---|
35.2% | 5.4 s | SELF ATTENTION GEMM |
19.8% | 3.0 s | CONVOLUTION (3x3) |
16.1% | 2.5 s | OTHER GEMM |
12.6% | 1.9 s | SOFTMAX |
7.4% | 1.1 s | CROSS ATTENTION GEMM |
3.3% | 504.2 ms | CONVOLUTION (1x1) |
2.3% | 353.2 ms | ELEMENTWISE |
2.0% | 306.2 ms | TRANSPOSE |
1.3% | 192.5 ms | NORMALIZATION |
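Each table is straightforward to derive from per-class latencies. A hypothetical sketch of the aggregation (the actual script is in the private gist linked above; inputs here are the rounded MPS rows, so the shares reproduce only approximately):

```swift
import Foundation

// Hypothetical reconstruction of the table math: sum per-class latencies
// and report each class's share of the total. Values copied from the MPS
// table above, already rounded.
let mpsLatencies: [(opClass: String, seconds: Double)] = [
  ("SELF ATTENTION GEMM", 5.4),
  ("CONVOLUTION (3x3)", 3.0),
  ("OTHER GEMM", 2.5),
  ("SOFTMAX", 1.9),
  ("CROSS ATTENTION GEMM", 1.1),
  ("CONVOLUTION (1x1)", 0.5042),
  ("ELEMENTWISE", 0.3532),
  ("TRANSPOSE", 0.3062),
  ("NORMALIZATION", 0.1925),
]
let total = mpsLatencies.reduce(0.0) { $0 + $1.seconds }  // ≈ 15.3 s
for (opClass, seconds) in mpsLatencies {
  let share = 100 * seconds / total
  print(String(format: "%.1f%% | %.4f s | %@", share, seconds, opClass))
}
```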
Simulated (MFA Monolithic GEMM, MPS Everything Else)
- simulated latency @ 30 steps:
  - 11.2 s
- simulation config:
  - framework: MFA (monolithic GEMM, MPS Conv2D)
  - precision: F16
  - operations simulated: 1650 / 1685
Distribution | Latency | Operation Class |
---|---|---|
26.9% | 3.0 s | CONVOLUTION (3x3) |
19.8% | 2.2 s | OTHER GEMM |
19.1% | 2.1 s | SELF ATTENTION GEMM |
17.2% | 1.9 s | SOFTMAX |
4.9% | 551.3 ms | CROSS ATTENTION GEMM |
4.5% | 504.2 ms | CONVOLUTION (1x1) |
3.2% | 353.2 ms | ELEMENTWISE |
2.7% | 306.2 ms | TRANSPOSE |
1.7% | 192.5 ms | NORMALIZATION |
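As a cross-check on the two tables above, the 4.1 s saved relative to plain MPS comes almost entirely from the three GEMM classes (the values are rounded, leaving a small residual):

$$
\underbrace{(5.4 - 2.1)}_{\text{self attn}}
+ \underbrace{(1.1 - 0.55)}_{\text{cross attn}}
+ \underbrace{(2.5 - 2.2)}_{\text{other GEMM}}
\approx 4.15\ \mathrm{s}
\approx 15.3\ \mathrm{s} - 11.2\ \mathrm{s}.
$$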
Simulated (MFA FlashAttention, MPS Nothing Except Conv2D)
- simulated latency @ 30 steps:
  - 6.8 s
- simulation config:
  - framework: MFA (FlashAttention, MPS Conv2D)
  - precision: F16
  - operations simulated: 1650 / 1685
Distribution | Latency | Operation Class |
---|---|---|
44.3% | 3.0 s | CONVOLUTION (3x3) |
27.1% | 1.8 s | OTHER GEMM |
18.0% | 1.2 s | SELF ATTENTION GEMM |
5.2% | 355.6 ms | CROSS ATTENTION GEMM |
4.3% | 290.0 ms | CONVOLUTION (1x1) |
0.8% | 54.1 ms | SOFTMAX |
0.3% | 23.4 ms | NORMALIZATION |
0.0% | 0 µs | TRANSPOSE |
0.0% | 0 µs | ELEMENTWISE |
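SOFTMAX drops two orders of magnitude and TRANSPOSE and ELEMENTWISE vanish here because FlashAttention never materializes the attention matrix: softmax is computed online inside the attention kernel with the standard running-max recurrence, so no separate softmax, transpose, or elementwise passes are dispatched. For each new block of scores $x^{(j)}$ in a row:

$$
m^{(j)} = \max\!\bigl(m^{(j-1)},\ \max_k x^{(j)}_k\bigr), \qquad
\ell^{(j)} = e^{\,m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \sum_k e^{\,x^{(j)}_k - m^{(j)}},
$$

with the partial output accumulator rescaled by the same $e^{\,m^{(j-1)} - m^{(j)}}$ factor. The residual 54.1 ms of SOFTMAX presumably covers whatever attention layers remain unfused.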
Simulated (MFA FlashAttention, MFA Winograd 2x)
- simulated latency @ 30 steps:
  - 5.0 s
- simulation config:
  - framework: MFA (FlashAttention, MFA Winograd)
  - precision: F16
  - operations simulated: 1650 / 1685
Distribution | Latency | Operation Class |
---|---|---|
37.1% | 1.8 s | OTHER GEMM |
24.5% | 1.2 s | SELF ATTENTION GEMM |
23.9% | 1.2 s | CONVOLUTION (3x3) |
7.1% | 355.6 ms | CROSS ATTENTION GEMM |
5.8% | 290.0 ms | CONVOLUTION (1x1) |
1.1% | 54.1 ms | SOFTMAX |
0.5% | 23.4 ms | NORMALIZATION |
0.0% | 0 µs | TRANSPOSE |
0.0% | 0 µs | ELEMENTWISE |
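The "2x" label matches the multiply count of the F(2×2, 3×3) Winograd transform: each 2×2 output tile costs $(m + r - 1)^2 = 16$ multiplies instead of $m^2 r^2 = 36$ for direct convolution,

$$
\frac{m^2 r^2}{(m + r - 1)^2} = \frac{36}{16} = 2.25 \qquad (m = 2,\ r = 3),
$$

which is roughly the 3.0 s → 1.2 s change in the 3×3 convolution row; the table values are rounded, and a real kernel need not track the multiply count exactly.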
Theoretical (ConvGEMM)
- theoretical latency @ 30 steps:
  - 5.0 s
Distribution | Latency | Operation Class |
---|---|---|
45.4% | 2.3 s | CONVOLUTION (3x3) |
30.0% | 1.5 s | OTHER GEMM |
13.9% | 692.3 ms | SELF ATTENTION GEMM |
4.9% | 246.5 ms | CONVOLUTION (1x1) |
4.7% | 235.7 ms | CROSS ATTENTION GEMM |
1.0% | 51.4 ms | SOFTMAX |
0.0% | 1.8 ms | NORMALIZATION |
0.0% | 0 µs | TRANSPOSE |
0.0% | 0 µs | ELEMENTWISE |
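ConvGEMM lowers each convolution to a single GEMM (im2col / implicit GEMM), so the theoretical figures follow from GEMM arithmetic alone. A sketch of the mapping for a hypothetical Stable Diffusion-sized layer (the dimensions are illustrative, not taken from the raw data):

```swift
// ConvGEMM view of a 3x3 convolution: C[M x N] = A[M x K] * B[K x N].
// Hypothetical layer: 64x64 latent, 320 input/output channels,
// stride 1, same padding.
let (height, width) = (64, 64)
let (cIn, cOut) = (320, 320)
let kernel = 3
let M = height * width            // one row per output pixel
let K = cIn * kernel * kernel     // unrolled receptive field
let N = cOut                      // one column per output channel
let flops = 2.0 * Double(M) * Double(K) * Double(N)
print("GEMM dims: \(M) x \(K) x \(N), \(flops / 1e9) GFLOP")
// ≈ 7.55 GFLOP for this single layer
```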
Theoretical (Winograd Asymptotic)
- theoretical latency @ 30 steps:
  - 3.0 s
Distribution | Latency | Operation Class |
---|---|---|
50.3% | 1.5 s | OTHER GEMM |
23.3% | 692.3 ms | SELF ATTENTION GEMM |
8.4% | 251.4 ms | CONVOLUTION (3x3) |
8.3% | 246.5 ms | CONVOLUTION (1x1) |
7.9% | 235.7 ms | CROSS ATTENTION GEMM |
1.7% | 51.4 ms | SOFTMAX |
0.1% | 1.8 ms | NORMALIZATION |
0.0% | 0 µs | TRANSPOSE |
0.0% | 0 µs | ELEMENTWISE |
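The 3×3 convolution row is consistent with scaling the ConvGEMM figure by the asymptotic Winograd bound: as the output tile $m$ grows, the multiply reduction for an $r \times r$ filter approaches $r^2$, i.e. 9 for 3×3,

$$
\lim_{m \to \infty} \frac{m^2 r^2}{(m + r - 1)^2} = r^2 = 9 \qquad (r = 3),
$$

and $2.3\ \mathrm{s} / 9 \approx 256\ \mathrm{ms}$, matching the 251.4 ms row up to the rounding of the 2.3 s figure. Reaching this bound requires arbitrarily large tiles, so it is an upper bound on any real Winograd implementation's speedup.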