Relative performance of matmul element types on x86 and Arm

Context

Recent efforts to run LLMs have us searching for element types to quantize weights and activations into: types wide enough to provide sufficient accuracy, yet narrow enough to provide good performance and/or memory compression.

This document is about the "performance" dimension, specifically on x86 and Arm architectures.

Primer: how to think of element type performance

This section is meant to provide a mental model that's simple and generally applicable on most CPU architectures, intentionally not getting into micro-architecture details. It should be accurate enough to inform high-level decisions of which element type to quantize to.

Think of CPU arithmetic throughput as the product (literally, multiplication) of several factors:

(scalar ops per second) = (Number of threads) * (Cycles per second) * (Instructions per cycle) * (Scalar ops per instruction)

To put that in terms of familiar units,

GFlop/s =  (Number of threads) * GHz * (Instructions per cycle) * (Scalar ops per instruction)
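
To sanity-check that formula, here is a minimal sketch in C with purely hypothetical numbers, chosen only for illustration: 8 threads at 3 GHz, 2 SIMD instructions per cycle, and 32 scalar ops per instruction (what a 512-bit f32 FMA provides: 16 lanes, 2 ops each).

```c
#include <stdio.h>

int main(void) {
  // Hypothetical figures, for illustration only.
  double threads = 8;           // number of threads
  double ghz = 3.0;             // cycles per second, in GHz
  double instrs_per_cycle = 2;  // SIMD matmul-flavored instructions per cycle
  double ops_per_instr = 32;    // scalar ops per instruction (512-bit f32 FMA: 16 lanes * 2 ops)
  double gflops = threads * ghz * instrs_per_cycle * ops_per_instr;
  printf("peak: %g GFlop/s\n", gflops);  // 8 * 3.0 * 2 * 32 = 1536 GFlop/s
  return 0;
}
```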

Now let's simplify this picture.

First, since we are discussing arithmetic details, we don't need to discuss threads here. The choice of an element type is generally orthogonal to the choice of how many threads are used. Caveat: that's not always 100% true, as there are cases where some element types cause thermal throttling, and cases where some element types are handled by ALUs that are shared across hardware threads, so there can be an interplay. But at the level of detail we are interested in here, we can ignore that. It suffices to know that this consideration only adds more reward for favoring narrower element types.

Second, let's discuss the (Instructions per cycle) factor, to understand that it is essentially out of our control and can therefore be factored out of this discussion as well. CPUs have pipelines that allow multiple instructions to execute in parallel. In general, utilizing that parallelism requires careful thinking about each instruction's ability to execute simultaneously with others, which is why CPU instruction scheduling is delicate in general. But we don't need to deal with that full generality here, because we are looking specifically at matrix multiplications, and the SIMD arithmetic instructions we need are whichever instructions the ISA offers to perform some flavor of small matrix multiplication. For these specific instructions, most CPUs have the same (Instructions per cycle) across all the instructions we might want to use. There are a few exceptions, which we will call out.

In case you're curious, here's what that (Instructions per cycle) figure usually looks like on common CPUs. If you look up CPU specs, this also matches what is commonly referred to as the number of SIMD pipelines. For example, if a CPU has a 4x128bit pipeline, it means you get 4 instructions per cycle (and each of them computes a 128-bit vector).

| CPU | Architecture | SIMD instructions per cycle | SIMD vector length |
| --- | --- | --- | --- |
| High-end Intel | x86_64 | 2 | 512-bit |
| AMD Zen4 | x86_64 | 1 | 512-bit |
| AMD Zen4 | x86_64 | 2 | 256-bit |
| Low-end Intel or AMD | x86_64 | 1 | 256-bit |
| Apple performance cores | arm_64 | 4 | 128-bit |
| Cortex-X2 (biggest core in recent Android phone) | arm_64 | 4 | 128-bit |
| Cortex-A710 (second-tier big core in recent Android phone) | arm_64 | 2 | 128-bit |
| Cortex-A510 (little core in recent Android phone) | arm_64 | 1 | 128-bit |

But again, this shouldn't matter, as this can be factored out from element-type decisions as explained above.
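
To make the table concrete, here is a small sketch combining the two numeric columns into per-core f32 FMA ops per cycle. It assumes one FMA-type instruction per SIMD pipeline per cycle and counts each FMA lane as 2 ops; the constants are derived directly from two rows of the table above.

```c
// ops/cycle = (SIMD instructions per cycle) * (vector bits / 32) * 2
enum {
  // High-end Intel: 2 x 512-bit       -> 2 * 16 * 2 = 64 f32 ops/cycle
  intel_high_end_f32_ops_per_cycle = 2 * (512 / 32) * 2,
  // Apple performance core: 4 x 128-bit -> 4 * 4 * 2 = 32 f32 ops/cycle
  apple_perf_core_f32_ops_per_cycle = 4 * (128 / 32) * 2
};
```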

Anyway, with these two simplifications, we are down to only having to think about the last factor: (Scalar ops per instruction). That is, we just need to understand which instructions each ISA offers for each element type, and how many scalar ops each of them performs, and then add ad-hoc caveats when some instruction has a lower (Instructions per cycle) than others on some CPU.

SIMD instructions for matrix multiplications on x86 and Arm

Let's now catalogue SIMD instructions that can be useful to implement matrix multiplications on x86 and Arm.

Here's a mental model to categorize them: all of these instructions can be seen as doing a tiny batch-matmul-with-transposed-RHS op by themselves. Since batch-matmul ops are characterized by four dimensions (B, M, N, K), with B standing for "batch size" and K being the reduction dimension size, we can describe each of these instructions by its (B, M, N, K) tuple.

As pseudocode, each such instruction performs

```
for b in range(B):
  for m in range(M):
    for n in range(N):
      for k in range(K):
        out[b, m, n] += lhs[b, m, k] * rhs[b, n, k]
```

Any such instruction performs B*M*N*K scalar multiply-accumulates. As each multiply-accumulate is conventionally counted as 2 ops, that means that any such instruction performs 2*B*M*N*K ops.

An exception is instructions that do not accumulate into an existing accumulator, i.e. that behave as if the existing accumulator were zero-filled. We reflect that in the Ops count in the table below and add an asterisk (*) to highlight it.
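
For reference, here is a tiny C helper (a sketch; the function name is made up) that computes that op count from an instruction's (B, M, N, K) tile. The example values in the comments match rows of the table below.

```c
// Scalar-op count of a (B, M, N, K) tile, counting each multiply-accumulate as 2 ops.
static int ops_per_instruction(int B, int M, int N, int K) {
  return 2 * B * M * N * K;
}

// Examples, matching the table below:
//   SDOT (vector):      (B, M, N, K) = (4, 1, 1, 4)  ->  32 ops
//   BFMMLA:             (B, M, N, K) = (1, 2, 2, 4)  ->  32 ops
//   SMMLA:              (B, M, N, K) = (1, 2, 2, 8)  ->  64 ops
//   VPDPBUSD (zmm*zmm): (B, M, N, K) = (16, 1, 1, 4) -> 128 ops
```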

Some take-aways

Let's record some observations here before the big table below.

  1. CPUs have limited support for f16, particularly on x86. There's a current trend of getting f16 models from early GPU experiments and assuming that'll be a decent starting point for CPU as well. Often, it's not. On x86, outside of very-recent Intel Sapphire Rapids, there is no native f16 arithmetic. So it's slow.
  2. If you are going to do 16-bit floating point on CPU, do matmuls with bf16 operands and a f32 accumulator, and keep the rest in f32 (see the sketch after this list). Indeed, bf16 has surprisingly good support in recent x86 and Arm CPUs. Even when f16 is supported, bf16 is supported with better instructions: f16 gets at best plain element-wise arithmetic, while bf16 gets fancy dot-product / matmul instructions.
  3. If your workload is small-bit-depth integers, try to run it in integer arithmetic rather than dequantizing to floating point. On Arm, try hard to expand to 8bit for arithmetic. On x86, the bulk of narrow-integer arithmetic support is for 16bit operands, although if you can stomach the signed*unsigned semantics, there are some 8bit instructions that you can take further advantage of.
  4. You should probably not worry about making things fit in 16-bit accumulators. On both x86 and Arm, you have narrow-integer arithmetic into 32-bit accumulators.
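
To make take-away 2 concrete, here is a hedged sketch of those bf16 instructions via their standard C intrinsics; the wrapper function names are made up for illustration.

```c
#if defined(__AVX512BF16__)
#include <immintrin.h>
// x86 VDPBF16PS (zmm*zmm): 16 independent length-2 bf16 dot products,
// accumulated into 16 f32 lanes.
__m512 bf16_dot_tile(__m512 acc, __m512bh lhs, __m512bh rhs) {
  return _mm512_dpbf16_ps(acc, lhs, rhs);
}
#endif

#if defined(__ARM_FEATURE_BF16_VECTOR_ARITHMETIC)
#include <arm_neon.h>
// Arm BFMMLA: a 2x4 times 4x2 bf16 matmul accumulated into a 2x2 f32 tile.
float32x4_t bf16_mmla_tile(float32x4_t acc, bfloat16x8_t lhs, bfloat16x8_t rhs) {
  return vbfmmlaq_f32(acc, lhs, rhs);
}
#endif
```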

The table

| Architecture | LHS | RHS | Out | Instruction | B | M | N | K | Ops | CPU feature | Support | Caveats |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| arm_64 | f32 | f32 | f32 | FMLA (vector) | 4 | 1 | 1 | 1 | 8 | baseline | universal | |
| arm_64 | f32 | f32 | f32 | FMLA (by element) | 1 | 4 | 1 | 1 | 8 | baseline | universal | |
| arm_64 | f16 | f16 | f16 | FMLA (vector) | 8 | 1 | 1 | 1 | 16 | +fp16 | ~2018 | |
| arm_64 | f16 | f16 | f16 | FMLA (by element) | 1 | 8 | 1 | 1 | 16 | +fp16 | ~2018 | |
| arm_64 | f16 | f16 | f32 | FMLAL (vector) | 4 | 1 | 1 | 1 | 8 | +fp16fml | ~2020 | |
| arm_64 | f16 | f16 | f32 | FMLAL (by element) | 1 | 4 | 1 | 1 | 8 | +fp16fml | ~2020 | |
| arm_64 | bf16 | bf16 | f32 | BFMMLA | 1 | 2 | 2 | 4 | 32 | +bf16 | ~2022 | Slow on Apple M2 |
| arm_64 | i16 | i16 | i32 | SMLAL (vector) (variants for all signednesses) | 4 | 1 | 1 | 1 | 8 | baseline | universal | |
| arm_64 | i16 | i16 | i32 | SMLAL (by element) (variants for all signednesses) | 1 | 4 | 1 | 1 | 8 | baseline | universal | |
| arm_64 | i8 | i8 | i32 | SDOT (variants for all signedness) | 1 | 4 | 1 | 4 | 32 | +dotprod | ~2018 | |
| arm_64 | i8 | i8 | i32 | SDOT (vector) (variants for all signedness) | 4 | 1 | 1 | 4 | 32 | +dotprod | ~2018 | |
| arm_64 | i8 | i8 | i32 | SDOT (by element) (variants for all signedness) | 1 | 4 | 1 | 4 | 32 | +dotprod | ~2018 | |
| arm_64 | i8 | i8 | i32 | SMMLA (variants for all signedness) | 1 | 2 | 2 | 8 | 64 | +i8mm | ~2022 | Slow on Apple M2 |
| x86_64 | f32 | f32 | f32 | VFMADD231PS (ymm*ymm) | 8 | 1 | 1 | 1 | 16 | +fma | near-universal | |
| x86_64 | f32 | f32 | f32 | VFMADD231PS (zmm*zmm) | 16 | 1 | 1 | 1 | 32 | +avx512f | High-end Intel, AMD Zen4 | Some Intel CPUs throttle on AVX-512. AMD Zen4 only has 256-bit pipelines, so most 512-bit ops are broken into 2 256-bit ops. |
| x86_64 | f32 | f32 | f32 | VFMADD231PS (zmm*m32bcst) | 1 | 16 | 1 | 1 | 32 | +avx512f | See above about AVX-512 | See above caveats about AVX-512 |
| x86_64 | f16 | f16 | f16 | VFMADD231PH (ymm*ymm) | 16 | 1 | 1 | 1 | 32 | +avx512fp16 | Only Intel Sapphire Rapids | |
| x86_64 | f16 | f16 | f16 | VFMADD231PH (zmm*zmm) | 32 | 1 | 1 | 1 | 64 | +avx512fp16 | Only Intel Sapphire Rapids | See above caveats about AVX-512 |
| x86_64 | f16 | f16 | f16 | VFMADD231PH (zmm*m16bcst) | 1 | 32 | 1 | 1 | 64 | +avx512fp16 | Only Intel Sapphire Rapids | See above caveats about AVX-512 |
| x86_64 | bf16 | bf16 | f32 | VDPBF16PS (ymm*ymm) | 8 | 1 | 1 | 2 | 32 | +avx512bf16 | ~2022 CPUs from both Intel and AMD (Zen4) | |
| x86_64 | bf16 | bf16 | f32 | VDPBF16PS (zmm*zmm) | 16 | 1 | 1 | 2 | 64 | +avx512bf16 | ~2022 CPUs from both Intel and AMD (Zen4) | See above caveats about AVX-512 |
| x86_64 | bf16 | bf16 | f32 | VDPBF16PS (zmm*m32bcst) | 1 | 16 | 1 | 2 | 64 | +avx512bf16 | See above about AVX-512 | See above caveats about AVX-512 |
| x86_64 | s8 | u8 | s16 | VPMADDUBSW (ymm*ymm) | 16 | 1 | 1 | 2 | 48* | +avx2 | near-universal | |
| x86_64 | s8 | u8 | s16 | VPMADDUBSW (zmm*zmm) | 32 | 1 | 1 | 2 | 96* | +avx512bw | See above about AVX-512 | |
| x86_64 | s8 | u8 | s32 | VPDPBUSD (ymm*ymm) | 8 | 1 | 1 | 4 | 64 | +avx512vnni | ~2022 CPUs from both Intel and AMD (Zen4) | |
| x86_64 | s8 | u8 | s32 | VPDPBUSD (zmm*zmm) | 16 | 1 | 1 | 4 | 128 | +avx512vnni | ~2022 CPUs from both Intel and AMD (Zen4) | |
| x86_64 | s8 | u8 | s32 | VPDPBUSD (zmm*m32bcst) | 1 | 16 | 1 | 4 | 128 | +avx512vnni | ~2022 CPUs from both Intel and AMD (Zen4) | |
| x86_64 | s16 | s16 | s32 | VPMADDWD (ymm*ymm) | 8 | 1 | 1 | 2 | 24* | +avx2 | near-universal | |
| x86_64 | s16 | s16 | s32 | VPMADDWD (zmm*zmm) | 16 | 1 | 1 | 2 | 48* | +avx512bw | See above about AVX-512 | |
| x86_64 | s16 | s16 | s32 | VPDPWSSD (ymm*ymm) | 8 | 1 | 1 | 2 | 32 | +avx512vnni | ~2022 CPUs from both Intel and AMD (Zen4) | |
| x86_64 | s16 | s16 | s32 | VPDPWSSD (zmm*zmm) | 16 | 1 | 1 | 2 | 64 | +avx512vnni | ~2022 CPUs from both Intel and AMD (Zen4) | |
| x86_64 | s16 | s16 | s32 | VPDPWSSD (zmm*m32bcst) | 1 | 16 | 1 | 2 | 64 | +avx512vnni | ~2022 CPUs from both Intel and AMD (Zen4) | |
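
As a worked illustration of how a couple of these rows map to actual code, here is a hedged sketch using the standard arm_neon.h / immintrin.h intrinsics; the wrapper function names are made up for illustration.

```c
#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
// Arm SDOT (vector), the (B=4, M=1, N=1, K=4) row: four independent length-4
// dot products of signed 8-bit lanes, accumulated into four int32 lanes.
int32x4_t i8_dot_tile(int32x4_t acc, int8x16_t lhs, int8x16_t rhs) {
  return vdotq_s32(acc, lhs, rhs);
}
#endif

#if defined(__AVX512VNNI__)
#include <immintrin.h>
// x86 VPDPBUSD (zmm*zmm), the (B=16, M=1, N=1, K=4) row: sixteen independent
// length-4 dot products of 8-bit lanes (one operand unsigned, the other signed),
// accumulated into sixteen int32 lanes.
__m512i i8_dot_tile_x86(__m512i acc, __m512i a_u8, __m512i b_s8) {
  return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
#endif
```

Compile with, e.g., -march=armv8.2-a+dotprod or -mavx512vnni respectively for these guards to be active.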