Sparse Pattern for VNNI.md

As we all know, sparse patterns must align with the target ISA, especially its GEMM instructions. VNNI introduces the following GEMM primitive (vpdpbusd, exposed as the _mm512_dpbusd_epi32 intrinsic):

Description

Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.

Operation

FOR j := 0 to 15
	tmp1.word := Signed(ZeroExtend16(a.byte[4*j]) * SignExtend16(b.byte[4*j]))
	tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]))
	tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]))
	tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]))
	dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
ENDFOR
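A minimal intrinsics sketch of one such step (assuming AVX512-VNNI hardware and a compiler flag like -mavx512vnni; the function name is mine):

#include <immintrin.h>
#include <cstdint>

// One vpdpbusd step: for each of the 16 output dwords, accumulate the sum
// of 4 adjacent u8*s8 products into the existing 32-bit accumulator.
__m512i vnni_dot_step(__m512i acc, const uint8_t *a, const int8_t *b) {
    __m512i va = _mm512_loadu_si512(a);  // 64 unsigned 8-bit values
    __m512i vb = _mm512_loadu_si512(b);  // 64 signed 8-bit values
    return _mm512_dpbusd_epi32(acc, va, vb);
}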
Sparsity pattern collection.md

We will enable the following patterns, which are currently introduced in different ways, for the NLP ToolKits.

The first is the so-called 4x1 pattern, which I have already clarified in this gist.

The second, for AMX, is the so-called x16 pattern, which is clarified here.

The above sparse patterns are obtainable with the current INC pruning. As a concrete reading of the 4x1 constraint, see the sketch below.
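A minimal sketch, assuming "4x1" means each aligned 4-row-by-1-column block of a weight matrix is pruned or kept as a whole (the function is illustrative, not INC code):

#include <cstddef>

// A dense row-major matrix follows the 4x1 pattern if every aligned
// 4x1 block is either all-zero (prunable) or fully dense.
bool is_4x1_sparse(const float *w, size_t rows, size_t cols) {
    for (size_t r = 0; r + 4 <= rows; r += 4) {
        for (size_t c = 0; c < cols; ++c) {
            bool any = false, all = true;
            for (size_t i = 0; i < 4; ++i) {
                bool nz = w[(r + i) * cols + c] != 0.0f;
                any = any || nz;
                all = all && nz;
            }
            if (any && !all) return false;  // mixed block breaks the pattern
        }
    }
    return true;
}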

Sparse pattern for AMX.md

As we all know, the AMX ISA introduces tdpbf16ps, which multiplies a 16x32 matrix by a 32x16 matrix as follows:

FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (a.colsb / 4) - 1                                                         // colsb => bytes per col, in BF16 case k = [0, 16)
		FOR n := 0 TO (dst.colsb / 4) - 1                                               // colsb => bytes per col, in BF16 case n = [0, 16)
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+0]) * FP32(b.row[k].bf16[2*n+0])
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+1]) * FP32(b.row[k].bf16[2*n+1])
		ENDFOR
	ENDFOR
	write_row_and_zero(dst, m, tmp, dst.colsb)
ENDFOR
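A minimal intrinsics sketch of one tile multiply (assuming AMX-BF16 hardware, AMX state permission already granted by the OS, and a 16-row x 64-byte tile configuration already loaded; the tile-register numbers are arbitrary):

#include <immintrin.h>

// C (16x16 fp32) += A (16x32 bf16) * B (32x16 bf16, pair-interleaved rows),
// one tdpbf16ps tile instruction; compile with -mamx-tile -mamx-bf16.
void amx_bf16_tile_madd(const void *A, const void *B, void *C) {
    _tile_loadd(2, A, 64);    // tmm2 <- 16 rows x 64 bytes of A
    _tile_loadd(3, B, 64);    // tmm3 <- 16 rows x 64 bytes of B
    _tile_loadd(1, C, 64);    // tmm1 <- fp32 accumulator tile
    _tile_dpbf16ps(1, 2, 3);  // tmm1 += tmm2 * tmm3
    _tile_stored(1, C, 64);   // write the accumulator back
}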
Xbyak Learning Note.md

Let's start with a naive case. The following Code struct JIT-generates a function that takes pointers and integers as input and puts the sum into the address the fourth pointer points to.

#include <xbyak/xbyak_util.h>

struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        // xbyak also provides advanced usage like StackFrame
        // see xbyak/sample/sf_test.cpp for how to use other parameters
        // Xbyak::util::StackFrame sf(this, 4);
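        // A minimal body sketched under the System V AMD64 calling convention,
        // assuming (my assumption, not the gist's) a signature like
        // void f(int64_t *a, int64_t *b, int64_t n, int64_t *out).
        mov(rax, ptr[rdi]);   // rax = *a
        add(rax, ptr[rsi]);   // rax += *b
        add(rax, rdx);        // rax += n
        mov(ptr[rcx], rax);   // *out = rax
        ret();
    }
};

The entry point can then be fetched and called like an ordinary function pointer, e.g. code.getCode<void (*)(int64_t*, int64_t*, int64_t, int64_t*)>().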
Debugging Xbyak via GDB.md

The oneDNN team suggests using SDE to dump the JIT-ted code, like the following:

You can dump the JIT-ted kernel via the following C++ code:

#include <cstdio>

void dump(const void *code, size_t code_size)
{
    FILE *file = fopen("dump.bin", "wb+");
    if (file) {
        size_t unused = fwrite(code, code_size, 1, file);
        (void)unused;  // silence unused-variable warnings
        fclose(file);
    }
}
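Hypothetical usage with an Xbyak generator such as the Code struct from the previous note (getCode()/getSize() are Xbyak::CodeGenerator accessors); the dumped file can then be disassembled offline, for example with XED via xed -64 -ir dump.bin (my suggestion, not from the gist):

Code code;                             // any Xbyak::CodeGenerator subclass
dump(code.getCode(), code.getSize());  // write the JIT-ted bytes to dump.bin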
MLIR Hello.md

This is just my personal learning note on MLIR, recording my questions; it may or may not overlap with existing tutorials.

Frontend (parser)

8.30 update: the parser is not necessary for understanding MLIR, and it even makes MLIR itself harder to understand.

I am not interested in any parser or frontend, since it depends heavily on the source language you choose and is not general enough. However, from my attempts, at least one thing about the parser is important. You have two options for getting the attributes of your customized ops (for example, the LHS of an AddOp). One is like the MLIR Toy tutorial: you define this method in your original IR (also called the AST) and pass your method to the parser, then MLIR

/// Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
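  // Continuation as in the Toy tutorial's AST.h (quoted from memory, so
  // treat the exact member names as approximate):
  char op;
  std::unique_ptr<ExprAST> lhs, rhs;

public:
  char getOp() { return op; }
  ExprAST *getLHS() { return lhs.get(); }
  ExprAST *getRHS() { return rhs.get(); }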
Add model support.md

1. Model weight conversion

1.1 PyTorch weight parsing

1.2 tokenizer

2. Model enablement

2.1 model loading

model class

model struct (ffn, attn, norm tensors), tensor name mapping (see the sketch after this outline)

model_context

model_load_internal (explanation of variables)
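As a rough illustration of the tensor-name-mapping step, a hypothetical table in the llama.cpp style; every concrete string below is made up for illustration:

#include <map>
#include <string>

// Map PyTorch checkpoint tensor names to the names the C++ model struct
// expects; %d stands for the layer index.
static const std::map<std::string, std::string> kTensorNameMap = {
    {"model.embed_tokens.weight",               "token_embd.weight"},
    {"model.layers.%d.self_attn.q_proj.weight", "blk.%d.attn_q.weight"},
    {"model.layers.%d.mlp.up_proj.weight",      "blk.%d.ffn_up.weight"},
    {"model.norm.weight",                       "output_norm.weight"},
};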
2d memcpy. opencl VS sycl.md

How lucky we are to have a genius team like the oneAPI compiler team. One of their great contributions is that they never obey any common sense or ease of use, and they are never stingy with their talents. The 2D load/store API is one of the examples we should indeed be grateful for, especially after several hours of failed attempts.

The definition of the 2D memcpy in OpenCL:

// Enqueue command to write a 2D or 3D rectangular region to a buffer object from host memory.
cl_int clEnqueueWriteBufferRect(cl_command_queue command_queue,
                                cl_mem buffer,
                                cl_bool blocking_write,
                                const size_t *buffer_origin,  // buffer offset, up to 3D
                                const size_t *host_origin,    // host offset, up to 3D
                                const size_t *region,         // copied extent, up to 3D
                                size_t buffer_row_pitch, size_t buffer_slice_pitch,
                                size_t host_row_pitch, size_t host_slice_pitch,
                                const void *ptr,
                                cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event);
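For comparison, DPC++ provides a dedicated 2D copy extension (sycl_ext_oneapi_memcpy2d). A minimal sketch, assuming a recent oneAPI compiler:

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    constexpr size_t width = 64, height = 16;  // bytes per row x rows
    char *src = sycl::malloc_host<char>(width * height, q);
    char *dst = sycl::malloc_device<char>(width * height, q);
    // Copy a width x height rectangle; both pitches equal width here.
    q.ext_oneapi_memcpy2d(dst, width, src, width, width, height).wait();
    sycl::free(src, q);
    sycl::free(dst, q);
}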
NeuralSpeed X ITREX.md

NeuralSpeed (NS) is designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) model compression techniques. The work is highly inspired by llama.cpp.

Intel® Extension for Transformers (ITREX) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular effective on the 4th Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids).

Install

Basically, NS is an optional dependency of ITREX. You can install ITREX via a binary wheel, and NS will be installed as one of the requirements.

# define install requirements
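# install ITREX from the PyPI wheel; NS is pulled in as a requirement
pip install intel-extension-for-transformers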
pytorch_lowerprecision.md

Proposal to introduce block-aware sub-byte dtypes to PyTorch

Authors:

  • @xinhe3, hengyume

Summary

The community is working on deep learning acceleration with sub-byte support. Considering alignment, elements are organized as blocks, and each block shares a scale (and maybe a zero point). Some great examples include
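To make "each block shares a scale" concrete, here is a hedged sketch of block-wise int4 dequantization (the block size of 32, the packing order, and the fixed zero point of 8 are my assumptions, not part of the proposal):

#include <cstddef>
#include <cstdint>

constexpr size_t kBlock = 32;  // assumed block size

// Unpack n int4 values (two per byte, low nibble first) and dequantize:
// out[i] = scale[block(i)] * (q[i] - zero_point).
void dequant_int4_blockwise(const uint8_t *q, const float *scale,
                            float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        uint8_t byte = q[i / 2];
        int v = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        out[i] = scale[i / kBlock] * (v - 8);  // 8 = assumed zero point
    }
}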