Sparse pattern for AMX

As we all know, the AMX ISA introduces the `tdpbf16ps` instruction, which multiplies a 16x32 BF16 matrix by a 32x16 BF16 matrix as follows:

```
FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (a.colsb / 4) - 1                                                         // colsb => bytes per col, in BF16 case k = [0, 16)
		FOR n := 0 TO (dst.colsb / 4) - 1                                               // colsb => bytes per col, in BF16 case n = [0, 16)
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+0]) * FP32(b.row[k].bf16[2*n+0])
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+1]) * FP32(b.row[k].bf16[2*n+1])
		ENDFOR
	ENDFOR
	write_row_and_zero(dst, m, tmp, dst.colsb)
ENDFOR
zero_upper_rows(dst, dst.rows)
zero_tileconfig_start()
```
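
For reference, here is a minimal sketch of how this instruction is driven from C via the AMX intrinsics. This is assumed driver code, not part of the gist; the tile-config struct layout follows the AMX architecture spec, and on Linux the AMX tile state must first be enabled via `arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)`:

```c
#include <immintrin.h>
#include <stdint.h>

// 64-byte tile configuration, as defined by the AMX architecture.
typedef struct {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];  // bytes per row for each tile
    uint8_t  rows[16];   // rows for each tile
} __attribute__((packed)) tilecfg_t;

// C (16x16 fp32) += A (16x32 bf16) * B (32x16 bf16, already VNNI-packed
// into 16 rows of 32 bf16). Compile with -mamx-tile -mamx-bf16.
void bf16_tile_gemm(const void* A, const void* B, float* C) {
    tilecfg_t cfg = {0};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 16 * sizeof(float);     // tmm0: C
    cfg.rows[1] = 16; cfg.colsb[1] = 32 * sizeof(uint16_t);  // tmm1: A
    cfg.rows[2] = 16; cfg.colsb[2] = 32 * sizeof(uint16_t);  // tmm2: packed B
    _tile_loadconfig(&cfg);

    _tile_zero(0);
    _tile_loadd(1, A, 64);    // 64-byte row stride
    _tile_loadd(2, B, 64);
    _tile_dpbf16ps(0, 1, 2);  // tmm0 += tmm1 * tmm2
    _tile_stored(0, C, 64);
    _tile_release();
}
```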

Like VNNI, AMX-BF16 also needs a re-layout of its inputs, as shown in the following image.

(image: VNNI-style re-layout for AMX-BF16)

So, as a successor to the AVX512-series sparse pattern, the AMX pattern splits the matrix into many 1x16 blocks and then concatenates them to meet AMX's 32x16 requirement (a sketch of this packing step follows below).
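
To make that packing step concrete, here is a hypothetical sketch (names and layout are assumptions, not the gist's actual code) of compressing the sparse matrix offline: scan it in 1x16 blocks, pack the nonzero blocks contiguously, and record each block's source row so that 32 consecutive packed blocks can later be loaded as one 32x16 tile:

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;  // raw bf16 bits

// Pack the nonzero 1x16 blocks of a row-major sparse matrix contiguously.
// `cols` must be a multiple of 16; returns the number of packed blocks,
// with block_rows[i] holding the source row of packed block i.
int pack_1x16_blocks(const bf16_t* A, int rows, int cols,
                     bf16_t* packed, int* block_rows) {
    int n = 0;
    for (int c = 0; c < cols; c += 16) {
        for (int r = 0; r < rows; ++r) {
            const bf16_t* blk = A + (size_t)r * cols + c;
            int nonzero = 0;
            for (int i = 0; i < 16; ++i) nonzero |= blk[i];
            if (nonzero) {
                memcpy(packed + (size_t)n * 16, blk, 16 * sizeof(bf16_t));
                block_rows[n++] = r;
            }
        }
    }
    return n;
}
```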

(image: 1x16 blocks concatenated into a 32x16 AMX tile)

Let the first matrix in the above image be A and the second be B. In our case, A is the transposed weight (sparse) and B is the transposed activation. A can be compressed offline or before inference, so we can directly use the tileloadd instruction for 32 nonzero blocks in A (they are stored contiguously in memory). For B, we need 32 rows of 16 contiguous columns for one tdpbf16ps. This may be good because the 16 values in one row are contiguous and can be loaded via _mm256_loadu_epi32 (note this is BF16). However, there are two tradeoffs. The first is that we need to re-layout the 32x16 values in B into the AMX layout, as the image shows. The second is that we need to transpose the activation matrix, which may be more time-consuming.

EDIT: now consider the opposite assignment, where A is the activation without transpose and B is the weight without transpose. We can save a lot of time because the weight is fixed and can be compressed, concatenated and reordered offline. However, loading the activation becomes a disaster, because all 32 16x1 blocks are non-contiguous, which will greatly impact performance; related experiments are WIP (the sketch below illustrates the access pattern).
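
To see why, here is a scalar sketch (hypothetical helper, assuming a row-major activation matrix with leading dimension `ld` in elements) of what gathering one 16x1 block costs: each element sits a full row apart, so the 16 loads touch 16 different cache lines for any realistic leading dimension:

```c
#include <stdint.h>

// Load one 16x1 column block from a row-major bf16 activation matrix.
static void load_16x1_block(const uint16_t* act, int ld,
                            int row0, int col, uint16_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = act[(size_t)(row0 + i) * ld + col];  // strided, not contiguous
}
```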

Then the key question is how to re-layout the activation on the fly. Luckily, some really smart guy gave a brief and talented answer; you can refer to this link for the details. Although the answer targets the VNNI layout, it is also applicable to our case. The related code is as follows:

```c
const static __m512i mask = _mm512_set_epi16(31,15,30,14,29,13,28,12,27,11,26,10,25,9,24,8,23,7,22,6,21,5,20,4,19,3,18,2,17,1,16,0);
__m256i lo = _mm256_loadu_epi16(...);   // 16 bf16 values from row 2k
__m256i hi = _mm256_loadu_epi16(...);   // 16 bf16 values from row 2k+1
__m512i vec = _mm512_inserti32x8(_mm512_castsi256_si512(lo), hi, 1);  // vec = [lo | hi]
__m512i permuted = _mm512_permutexvar_epi16(mask, vec);               // lo[0], hi[0], lo[1], hi[1], ...
_mm512_storeu_epi32(..., permuted);     // one VNNI-packed row of the tile
```
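
Putting the trick into a loop, here is a sketch (assumed helper name and layout, not from the gist) that re-layouts a 32x16 activation panel into the 16x32 VNNI-packed buffer one tdpbf16ps expects:

```c
#include <immintrin.h>
#include <stdint.h>

// Re-layout 32 rows x 16 cols of bf16 (row-major, stride `ld` in elements)
// into 16 VNNI-packed rows of 32 bf16: packed row k interleaves source
// rows 2k and 2k+1 element by element.
void relayout_32x16(const uint16_t* src, int ld, uint16_t* dst) {
    const __m512i mask = _mm512_set_epi16(
        31,15,30,14,29,13,28,12,27,11,26,10,25,9,24,8,
        23,7,22,6,21,5,20,4,19,3,18,2,17,1,16,0);
    for (int k = 0; k < 16; ++k) {
        __m256i lo = _mm256_loadu_epi16(src + (size_t)(2 * k + 0) * ld);  // row 2k
        __m256i hi = _mm256_loadu_epi16(src + (size_t)(2 * k + 1) * ld);  // row 2k+1
        __m512i v  = _mm512_inserti32x8(_mm512_castsi256_si512(lo), hi, 1);
        _mm512_storeu_epi32(dst + (size_t)k * 32,
                            _mm512_permutexvar_epi16(mask, v));
    }
}
```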