Simulation results, ranked from highest to lowest latency. Generated using https://gist.github.com/philipturner/d408351d68b5b1701bb651d4542e26e6

Raw data is private, but there's an older, publicly available substitute at https://gist.github.com/philipturner/94e7c5094915f23438440d49da823c9d

The statistics for MFA Winograd are speculative, and eventual performance may be lower. Winograd may also not be finished this summer, or by the time the code is published.


System:

  • 32-core Apple 7 GPU, 1.296 GHz
  • 409.6 GB/s bandwidth
  • macOS 14 Developer Beta
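
For context, here is a minimal sketch of the peak throughput these specs imply, assuming the commonly cited 128 FP32 ALUs per core for the Apple 7 GPU family (an assumption, not a figure taken from this gist):

```swift
import Foundation

// Back-of-the-envelope peak throughput for the system above.
// Assumption: 128 FP32 ALUs per GPU core, 2 FLOPs per FMA (Apple 7 family).
let cores = 32.0
let clockGHz = 1.296
let alusPerCore = 128.0
let flopsPerFMA = 2.0

let peakTFLOPS = cores * alusPerCore * flopsPerFMA * clockGHz / 1_000
let bandwidthGBps = 409.6

print("peak FP32: \(peakTFLOPS) TFLOPS")   // ≈ 10.6 TFLOPS
print("bandwidth: \(bandwidthGBps) GB/s")
```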

Simulated (Metal Performance Shaders)

  • simulated latency @ 30 steps:
    • 15.3 s
  • simulation config:
    • framework: MPS
    • precision: F16
    • operations simulated: 1650 / 1685
| Distribution | Latency  | Operation Class      |
|--------------|----------|----------------------|
| 35.2%        | 5.4 s    | SELF ATTENTION GEMM  |
| 19.8%        | 3.0 s    | CONVOLUTION (3x3)    |
| 16.1%        | 2.5 s    | OTHER GEMM           |
| 12.6%        | 1.9 s    | SOFTMAX              |
| 7.4%         | 1.1 s    | CROSS ATTENTION GEMM |
| 3.3%         | 504.2 ms | CONVOLUTION (1x1)    |
| 2.3%         | 353.2 ms | ELEMENTWISE          |
| 2.0%         | 306.2 ms | TRANSPOSE            |
| 1.3%         | 192.5 ms | NORMALIZATION        |
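
The Distribution column is each class's latency as a share of the total, and the per-step cost follows from dividing by 30 steps. A quick consistency check over the (rounded) values in the MPS table above:

```swift
// Recompute the Distribution column from the (rounded) latencies listed above.
// Small differences from the table come purely from that rounding.
let mpsLatencies: [(name: String, seconds: Double)] = [
    ("SELF ATTENTION GEMM", 5.4), ("CONVOLUTION (3x3)", 3.0),
    ("OTHER GEMM", 2.5), ("SOFTMAX", 1.9), ("CROSS ATTENTION GEMM", 1.1),
    ("CONVOLUTION (1x1)", 0.5042), ("ELEMENTWISE", 0.3532),
    ("TRANSPOSE", 0.3062), ("NORMALIZATION", 0.1925),
]
let total = mpsLatencies.reduce(0.0) { $0 + $1.seconds }
for entry in mpsLatencies {
    let share = (1000 * entry.seconds / total).rounded() / 10   // one decimal
    print("\(share)%  \(entry.seconds) s  \(entry.name)")
}
print("total: \(total) s, per step: \(total / 30 * 1000) ms")   // ≈ 15.3 s, ≈ 509 ms
```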

Simulated (MFA Monolithic GEMM, MPS Everything Else)

  • simulated latency @ 30 steps:
    • 11.2 s
  • simulation config:
    • framework: MFA (monolithic GEMM, MPS Conv2D)
    • precision: F16
    • operations simulated: 1650 / 1685
| Distribution | Latency  | Operation Class      |
|--------------|----------|----------------------|
| 26.9%        | 3.0 s    | CONVOLUTION (3x3)    |
| 19.8%        | 2.2 s    | OTHER GEMM           |
| 19.1%        | 2.1 s    | SELF ATTENTION GEMM  |
| 17.2%        | 1.9 s    | SOFTMAX              |
| 4.9%         | 551.3 ms | CROSS ATTENTION GEMM |
| 4.5%         | 504.2 ms | CONVOLUTION (1x1)    |
| 3.2%         | 353.2 ms | ELEMENTWISE          |
| 2.7%         | 306.2 ms | TRANSPOSE            |
| 1.7%         | 192.5 ms | NORMALIZATION        |

Simulated (MFA FlashAttention, MPS Nothing Except Winograd)

  • simulated latency @ 30 steps:
    • 6.8 s
  • simulation config:
    • framework: MFA (FlashAttention, MPS Conv2D)
    • precision: F16
    • operations simulated: 1650 / 1685
| Distribution | Latency  | Operation Class      |
|--------------|----------|----------------------|
| 44.3%        | 3.0 s    | CONVOLUTION (3x3)    |
| 27.1%        | 1.8 s    | OTHER GEMM           |
| 18.0%        | 1.2 s    | SELF ATTENTION GEMM  |
| 5.2%         | 355.6 ms | CROSS ATTENTION GEMM |
| 4.3%         | 290.0 ms | CONVOLUTION (1x1)    |
| 0.8%         | 54.1 ms  | SOFTMAX              |
| 0.3%         | 23.4 ms  | NORMALIZATION        |
| 0.0%         | 0 µs     | TRANSPOSE            |
| 0.0%         | 0 µs     | ELEMENTWISE          |
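
The near-zero SOFTMAX, TRANSPOSE, and ELEMENTWISE rows reflect FlashAttention folding those steps into the attention kernel itself. As a generic illustration (not MFA's actual Metal kernel), the online-softmax recurrence that enables the fusion looks like this:

```swift
import Foundation

// Streaming ("online") softmax: the max and normalizer are accumulated in one
// pass, so an attention kernel never needs to materialize the full score
// matrix or run separate softmax / transpose / elementwise passes.
// Generic illustration only; not taken from MFA's source.
func onlineSoftmax(_ scores: [Float]) -> [Float] {
    var runningMax = -Float.infinity
    var runningSum: Float = 0
    for s in scores {
        let newMax = max(runningMax, s)
        // Rescale the partial sum whenever the running maximum grows.
        runningSum = runningSum * exp(runningMax - newMax) + exp(s - newMax)
        runningMax = newMax
    }
    return scores.map { exp($0 - runningMax) / runningSum }
}

print(onlineSoftmax([1, 2, 3]))   // matches a conventional two-pass softmax
```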

Simulated (MFA FlashAttention, MFA Winograd 2x)

  • simulated latency @ 30 steps:
    • 5.0 s
  • simulation config:
    • framework: MFA (FlashAttention, MFA Winograd)
    • precision: F16
    • operations simulated: 1650 / 1685
| Distribution | Latency  | Operation Class      |
|--------------|----------|----------------------|
| 37.1%        | 1.8 s    | OTHER GEMM           |
| 24.5%        | 1.2 s    | SELF ATTENTION GEMM  |
| 23.9%        | 1.2 s    | CONVOLUTION (3x3)    |
| 7.1%         | 355.6 ms | CROSS ATTENTION GEMM |
| 5.8%         | 290.0 ms | CONVOLUTION (1x1)    |
| 1.1%         | 54.1 ms  | SOFTMAX              |
| 0.5%         | 23.4 ms  | NORMALIZATION        |
| 0.0%         | 0 µs     | TRANSPOSE            |
| 0.0%         | 0 µs     | ELEMENTWISE          |
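
The "Winograd 2x" label is consistent with a small-tile Winograd variant: F(2x2, 3x3) needs 16 multiplies per 2x2 output tile versus 36 for direct convolution, a 2.25x arithmetic reduction that lands near 2x once transforms and overhead are counted (here, the 3x3 convolution drops from 3.0 s to 1.2 s). Reading the "2x" this way is an assumption; the reduction formula itself is standard:

```swift
// Multiply-count reduction of Winograd F(m x m, 3 x 3) over direct 3x3
// convolution: direct uses 9 multiplies per output, Winograd uses (m + 2)^2
// multiplies per m x m output tile (input/output transforms not counted).
func winogradReduction(tileSize m: Double) -> Double {
    (9 * m * m) / ((m + 2) * (m + 2))
}

print(winogradReduction(tileSize: 2))   // 2.25 — the "2x" variant
print(winogradReduction(tileSize: 4))   // 4.0
print(winogradReduction(tileSize: 6))   // ≈ 5.06; the limit is 9x as m grows
```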

Theoretical (ConvGEMM)

  • theoretical latency @ 30 steps:
    • 5.0 s
| Distribution | Latency  | Operation Class      |
|--------------|----------|----------------------|
| 45.4%        | 2.3 s    | CONVOLUTION (3x3)    |
| 30.0%        | 1.5 s    | OTHER GEMM           |
| 13.9%        | 692.3 ms | SELF ATTENTION GEMM  |
| 4.9%         | 246.5 ms | CONVOLUTION (1x1)    |
| 4.7%         | 235.7 ms | CROSS ATTENTION GEMM |
| 1.0%         | 51.4 ms  | SOFTMAX              |
| 0.0%         | 1.8 ms   | NORMALIZATION        |
| 0.0%         | 0 µs     | TRANSPOSE            |
| 0.0%         | 0 µs     | ELEMENTWISE          |
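
The simulator's source isn't public, but one plausible way to derive per-operation "theoretical latency" is a roofline bound: take the larger of the compute-limited and bandwidth-limited times using the peak numbers estimated earlier. This is a sketch of that model under those assumptions, not the gist's actual methodology:

```swift
// Roofline-style lower bound: whichever of the ALUs or the memory bus is the
// bottleneck sets the floor on an operation's latency. Peak figures reuse the
// estimates from the system sketch above; they are assumptions, not measurements.
func theoreticalLatency(flops: Double, bytes: Double,
                        peakFLOPS: Double = 10.6e12,
                        bandwidth: Double = 409.6e9) -> Double {
    max(flops / peakFLOPS, bytes / bandwidth)
}

// Example: a hypothetical 4096 x 4096 x 4096 GEMM in F16 (2 bytes/element).
let n = 4096.0
let seconds = theoreticalLatency(flops: 2 * n * n * n, bytes: 3 * n * n * 2)
print("\(seconds * 1000) ms")   // ≈ 13 ms, compute-bound at this size
```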

Theoretical (Winograd Asymptotic)

  • theoretical latency @ 30 steps:
    • 3.0 s
| Distribution | Latency  | Operation Class      |
|--------------|----------|----------------------|
| 50.3%        | 1.5 s    | OTHER GEMM           |
| 23.3%        | 692.3 ms | SELF ATTENTION GEMM  |
| 8.4%         | 251.4 ms | CONVOLUTION (3x3)    |
| 8.3%         | 246.5 ms | CONVOLUTION (1x1)    |
| 7.9%         | 235.7 ms | CROSS ATTENTION GEMM |
| 1.7%         | 51.4 ms  | SOFTMAX              |
| 0.1%         | 1.8 ms   | NORMALIZATION        |
| 0.0%         | 0 µs     | TRANSPOSE            |
| 0.0%         | 0 µs     | ELEMENTWISE          |
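
As a consistency check, the asymptotic 3x3 entry appears to match dividing the ConvGEMM figure by the 9x limit of the Winograd reduction above, i.e., pricing a 3x3 convolution at roughly the multiply count of a 1x1 convolution of the same shape:

```swift
// 2.3 s of ConvGEMM-style 3x3 convolution divided by the 9x asymptotic
// Winograd reduction lands near the table's 3x3 entry. Consistency check only.
let convGEMM3x3Seconds = 2.3
print(convGEMM3x3Seconds / 9 * 1000)   // ≈ 256 ms vs. 251.4 ms listed above
```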