Intel added the Galois Field New Instructions (GFNI) extension to its Sunny Cove and Tremont cores. What’s particularly interesting is that GFNI is the only new SIMD extension that came with SSE and VEX/AVX encodings (in addition to EVEX/AVX512), which allows it to be supported on all future Intel cores, including those which don’t support AVX512 (such as the Atom line, as well as Celeron/Pentium branded “big” cores).
I suspect GFNI was aimed at accelerating SM4 encryption, but one of its instructions can be used for many other purposes. The extension includes three instructions; of particular interest here is the affine transformation instruction (GF2P8AFFINEQB), also known as bit-matrix multiply.
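As a rough illustration of what “bit-matrix multiply” means here, the following is a scalar Python model of what GF2P8AFFINEQB does to a single byte, based on the pseudocode in Intel’s instruction reference. The names `parity`, `affine_byte` and `IDENTITY` are made up for this sketch. The real instruction applies the same operation to every byte of a vector, using one 8x8 bit matrix per 64-bit lane.

```python
def parity(v: int) -> int:
    """Return 1 if v has an odd number of set bits, else 0."""
    return bin(v).count("1") & 1


def affine_byte(matrix: int, x: int, imm8: int = 0) -> int:
    """Scalar model of GF2P8AFFINEQB for one byte (illustrative, not an intrinsic).

    matrix: 64-bit integer holding eight row masks, one per byte
    x:      the input byte
    imm8:   constant byte XORed into the result
    """
    out = 0
    for i in range(8):
        # Output bit i is driven by byte (7 - i) of the matrix qword.
        row = (matrix >> (8 * (7 - i))) & 0xFF
        # Dot product over GF(2): AND the row with x, then XOR the bits together.
        bit = parity(row & x) ^ ((imm8 >> i) & 1)
        out |= bit << i
    return out


# With this byte order, the identity matrix is the well-known constant below.
IDENTITY = 0x0102040810204080

print(hex(affine_byte(IDENTITY, 0x5A)))        # 0x5a
print(hex(affine_byte(IDENTITY, 0x5A, 0xFF)))  # 0xa5 (identity, then XOR with 0xff)
```

Note the reversed row order: the row for output bit 0 lives in the most significant byte of the matrix qword, which is why the identity constant looks “backwards”.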
Various articles discuss “out-of-band” use cases for the instruction, but they’re somewhat spread around, so rather than re-explain it all, this post just lists them.
- Why Ice Lake is Important (a bit-basher’s perspective)
  - Bit permutation within bytes (8-bit shift/rotate), 8x8 bit matrix transpose
- Use AVX512 Galois field affine transformation for bit shuffling
  - Provides more of an explanation of the first article, plus examples
  - Additional samples: bit replication (or bit test), bit interleave, bit shuffle macro
- InstLatX64’s Twitter series:
  - Bit reversal, rotate, shift
  - 8x8 bit transpose, left-shift + add
  - Prefix-xor, 8x8 binary matrix multiply, Rijndael xtime
  - Replicate MSB/LSB, mirror on diagonal
  - 512-bit prefix xor
  - (more) 512-bit prefix xor
  - Broadcast imm8 byte
  - Parallel byte-histogramming
  - pospopcnt (plus link to implementations of the above)
- Bit Matrix Multiplication in Commodity Processors
  - Bit-permutation, bit gather/scatter
- A list of “out-of-band” uses for the GF2P8AFFINEQB instruction I haven’t seen documented elsewhere
  - Count leading/trailing zero bits, arbitrary modular GF(2^w) multiplication, fixed 2-bit packed arithmetic, bit-wise variable shift
- Wunk’s Yuzu emulator acceleration explanation
- Intel’s GFNI Technology Guide
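To make a few of the simpler tricks from the list concrete (bit reversal, rotate and prefix-xor within a byte), here is a small Python sketch that builds the matrices for a scalar model of the instruction. The matrices are derived directly from the instruction’s definition rather than copied from the linked articles, and all names (`affine_byte`, `matrix_from_rows`, `BIT_REVERSE`, `rotate_left_matrix`, `PREFIX_XOR`) are hypothetical helpers for illustration only.

```python
def affine_byte(matrix: int, x: int, imm8: int = 0) -> int:
    """Scalar model of GF2P8AFFINEQB for one byte (illustrative only)."""
    out = 0
    for i in range(8):
        row = (matrix >> (8 * (7 - i))) & 0xFF          # row for output bit i
        bit = (bin(row & x).count("1") & 1) ^ ((imm8 >> i) & 1)
        out |= bit << i
    return out


def matrix_from_rows(rows: list[int]) -> int:
    """Pack rows[i] (mask of input bits feeding output bit i) into a qword."""
    return sum((rows[i] & 0xFF) << (8 * (7 - i)) for i in range(8))


# Bit reversal: output bit i takes input bit 7 - i.
BIT_REVERSE = matrix_from_rows([1 << (7 - i) for i in range(8)])


def rotate_left_matrix(n: int) -> int:
    """Rotate left by n: output bit i takes input bit (i - n) mod 8."""
    return matrix_from_rows([1 << ((i - n) % 8) for i in range(8)])


# Prefix-xor: output bit i is the XOR of input bits 0..i.
PREFIX_XOR = matrix_from_rows([(1 << (i + 1)) - 1 for i in range(8)])

print(hex(BIT_REVERSE))                                # 0x8040201008040201
print(hex(affine_byte(BIT_REVERSE, 0x01)))             # 0x80
print(hex(affine_byte(rotate_left_matrix(1), 0x81)))   # 0x3
print(hex(affine_byte(PREFIX_XOR, 0x11)))              # 0xf
```

On real hardware, each of these constants would be broadcast to every 64-bit lane of the matrix operand; the more involved uses in the list (transposes, histogramming, pospopcnt) combine the instruction with other shuffles and are best read in the original articles.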