airMeng / XeTLA.md
Last active June 29, 2024 02:25
XeTLA

HW Target

| Device   | PVC    | MTL    | DG2    | LNL/BMG(TODO) | ARL(TODO) |
|----------|--------|--------|--------|---------------|-----------|
| ISA      | Xe     | Xe-lpg | Xe-hpg | Xe2           | Xe-lpg+   |
| DPAS     | 8,8,16 | NA     | 8,8,8  | 8,4,16        | 8,8,8     |
| 2D Block | 32, 64 | NA     | NA     | 32, 64        | NA        |
| 1D Block | 64     | 32     | 32     | 64            | 32        |
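
As a hedged sketch (all names below are hypothetical, not XeTLA's actual headers), such per-device capabilities can be captured as compile-time traits that kernels specialize on:

```c++
// Hypothetical compile-time capability table mirroring the table above;
// the real project keeps this information in its own arch/config headers.
enum class gpu_arch { pvc, mtl, dg2, lnl_bmg, arl };

template <gpu_arch Arch>
struct arch_traits;  // one specialization per device

template <>
struct arch_traits<gpu_arch::pvc> {
    static constexpr bool has_dpas = true;
    static constexpr int dpas_shape[3] = {8, 8, 16};  // "DPAS 8,8,16"
    static constexpr bool has_2d_block = true;        // "2D Block 32, 64"
    static constexpr int max_1d_block = 64;           // "1D Block 64"
};

template <>
struct arch_traits<gpu_arch::dg2> {
    static constexpr bool has_dpas = true;
    static constexpr int dpas_shape[3] = {8, 8, 8};   // "DPAS 8,8,8"
    static constexpr bool has_2d_block = false;       // 2D block loads: NA
    static constexpr int max_1d_block = 32;           // "1D Block 32"
};
```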

How to Add a new HW

airMeng / LLM Int4 Inference on Arc.md
Last active January 31, 2024 05:56
LLM Int4 Inference on Arc

IPEX

Intel® Extension for PyTorch (IPEX) extends PyTorch* with up-to-date feature optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of the Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.

XeTLA

Intel® Xe Templates for Linear Algebra (Intel® XeTLA) is a collection of SYCL/ESIMD templates that enable high-performance General Matrix Multiply (GEMM), Convolution (CONV), and related computations on Intel Xe GPU architecture. Intel® XeTLA offers reusable C++ templates for kernel, group and subgroup levels, allowing developers to optimize and specialize kernels based on data types, tiling policies, algorithms, fusion policies, and more.

Thanks to XeTLA's template design, users can easily define a new compression/de-compression prologue and insert it right before the BRGEMM to fully accelerate WOQ GEMM.
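
As a rough, hypothetical sketch (plain C++, not XeTLA's real template interface), such a de-compression prologue is essentially a callable that turns a packed int4 weight block plus its scale into the compute dtype right before BRGEMM consumes it:

```c++
#include <cstddef>
#include <cstdint>

// Hypothetical WOQ de-compression prologue: unpack an int4 block and apply
// its per-block scale to produce float values for the BRGEMM micro-kernel.
// Real XeTLA prologues work on register tiles via SYCL/ESIMD, not raw pointers.
struct int4_dequant_prologue {
    void operator()(const uint8_t *packed, float scale, float *out, size_t n) const {
        for (size_t i = 0; i < n; i += 2) {
            uint8_t byte = packed[i / 2];
            out[i] = (static_cast<int>(byte & 0x0F) - 8) * scale;        // low nibble
            if (i + 1 < n)
                out[i + 1] = (static_cast<int>(byte >> 4) - 8) * scale;  // high nibble
        }
    }
};
```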

airMeng / part_3_vectorization_techniques.md
Created December 26, 2023 07:16 — forked from mingfeima/part_3_vectorization_techniques.md
PyTorch CPU Performance Optimization Tutorial - Section III
airMeng / pytorch_lowerprecision.md
Last active December 26, 2023 07:56
pytorch_lowerprecision.md

Proposal of block-aware sub-byte dtype introduction to PyTorch

Authors:

  • @xinhe3, hengyume

Summary

The community is working on deep learning acceleration with sub-byte support. Considering alignment, elements are organized into blocks, and each block shares a scale (and possibly a zero point). Some great examples include
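
For illustration only (a hypothetical layout, not a dtype the proposal has already fixed), a block of 32 int4 elements sharing one scale and one zero point could look like this:

```c++
#include <array>
#include <cstdint>

// Hypothetical block-aware int4 layout: 32 sub-byte elements packed two per
// byte, plus one scale and one zero point shared by the whole block. The
// actual proposal may pick different block sizes or metadata dtypes.
struct int4_block32 {
    std::array<uint8_t, 16> packed;  // 32 x 4-bit values
    float scale;                     // shared per-block scale
    uint8_t zero_point;              // shared per-block zero point
};

// Dequantize element i of a block: value = (q - zero_point) * scale.
inline float dequant(const int4_block32 &blk, int i) {
    uint8_t byte = blk.packed[i / 2];
    int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return (q - blk.zero_point) * blk.scale;
}
```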

airMeng / NeuralSpeed X ITREX.md
Last active November 21, 2023 07:20
NeuralSpeed X ITREX.md

NeuralSpeed (NS) is designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) model compression techniques. The work is heavily inspired by llama.cpp.

Intel® Extension for Transformers (ITREX) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids).

Install

Basically NS is an optional dependency of ITREX. You can install ITREX via binary wheel, and NS will be installed as one of the requirements.

# define install requirements
airMeng / 2d memcpy. opencl VS sycl.md
Last active December 25, 2023 06:57
2d memcpy. opencl VS sycl.md

How lucky we are to have a genius team like the oneAPI compiler team. One of their great contributions is that they never bow to common sense or ease of use, never stingy with their talents. The 2D load/store API is one of the examples we should indeed be grateful for, especially after several hours of failed attempts.

The definition of 2d memcpy in OpenCL

// Enqueue command to write a 2D or 3D rectangular region to a buffer object from host memory.
cl_int clEnqueueWriteBufferRect(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write,
                                const size_t *buffer_origin,  // buffer offset, up to 3D
                                const size_t *host_origin,    // host offset, up to 3D
                                const size_t *region,         // copy region, up to 3D
                                size_t buffer_row_pitch, size_t buffer_slice_pitch,
                                size_t host_row_pitch, size_t host_slice_pitch,
                                const void *ptr,
                                cl_uint num_events_in_wait_list, const cl_event *event_wait_list,
                                cl_event *event);
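
For comparison, the SYCL side collapses to a single call. A minimal sketch, assuming a DPC++ compiler that implements the sycl_ext_oneapi_memcpy2d extension (pitches and width in bytes, height in rows):

```c++
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t width = 64, height = 32, pitch = 128;  // bytes / rows / bytes
    sycl::queue q;
    std::vector<char> host(pitch * height, 1);
    char *dev = sycl::malloc_device<char>(pitch * height, q);

    // Copy only the leading `width` bytes of each of the `height` rows.
    q.ext_oneapi_memcpy2d(dev, pitch, host.data(), pitch, width, height).wait();

    sycl::free(dev, q);
    return 0;
}
```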
airMeng / Add model support.md
Created October 30, 2023 07:02
Add model support.md

1. Model weight conversion

1.1 PyTorch weight parsing

1.2 tokenizer

2. Model enablements

2.1 model loading

model class

model struct (ffn, attn, norm tensors), tensor name mapping

model_context

model_load_internal (explanation of variables)
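
As a hedged illustration of the "tensor name mapping" item above (names invented, not NS's real tables), the converter typically keeps a small map from the checkpoint's tensor names to the names the C++ loader expects:

```c++
#include <map>
#include <string>

// Hypothetical name mapping; the real one lives in the conversion scripts
// and depends on the architecture being enabled.
static const std::map<std::string, std::string> tensor_name_map = {
    {"model.embed_tokens.weight", "token_embd.weight"},
    {"model.layers.0.self_attn.q_proj.weight", "layers.0.attn.q.weight"},
    {"model.norm.weight", "output_norm.weight"},
};
```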
airMeng / MLIR Hello.md
Last active December 25, 2023 06:57
MLIR Hello.md

This is just my personal learning note on MLIR, recording my questions; it may or may not overlap with existing tutorials.

Frontend(parser)

Update 8.30: a parser is not necessary for understanding MLIR; it even makes MLIR itself harder to understand.

I am not interested in any parser or frontend since it heavily depends on the source language you choose and is not general enough. However, from my attempts, at least one thing about the parser is important. You have two options for getting the attributes (for example, the LHS of an AddOp) of your customized Ops. One is like the MLIR toy tutorial: you can define this method in your original IR (also called the AST) and pass your method to the parser, then MLIR

/// Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
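
For context, the rest of that class looks roughly like the following (paraphrased from memory of the toy tutorial's AST.h, so exact signatures may differ); the point is that the parser builds the node and mlirGen later reads the operands back through the accessors:

```c++
/// Expression class for a binary operator (paraphrased from the toy tutorial).
class BinaryExprAST : public ExprAST {
  char op;
  std::unique_ptr<ExprAST> lhs, rhs;

public:
  char getOp() { return op; }
  ExprAST &getLHS() { return *lhs; }
  ExprAST &getRHS() { return *rhs; }

  BinaryExprAST(Location loc, char op, std::unique_ptr<ExprAST> lhs,
                std::unique_ptr<ExprAST> rhs)
      : ExprAST(Expr_BinOp, std::move(loc)), op(op),
        lhs(std::move(lhs)), rhs(std::move(rhs)) {}
};
```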
airMeng / Debugging Xbyak via GDB.md
Last active December 25, 2023 06:57
Debugging Xbyak via GDB.md

The oneDNN team suggests using SDE to dump the JITted code, like the following:

You can dump the JITted kernel via the following C++ code:

#include <cstdio>

void dump(const void *code, size_t code_size)
{
    FILE *file = fopen("dump.bin", "wb+");
    if (file) {
        size_t unused = fwrite(code, code_size, 1, file);
        (void)unused;  // silence unused-variable warnings
        fclose(file);
    }
}
airMeng / Xbyak Learning Note.md
Last active December 25, 2023 06:57
Xbyak Learning Note.md

Let's start with a naive case. The following Code defines a function which takes integers and a pointer as input and puts the sum into the address the fourth (pointer) argument points to.

#include <xbyak/xbyak_util.h>

struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        // xbyak also provides advanced usage like StackFrame
        // see xbyak/sample/sf_test.cpp for how to use other parameters
        // Xbyak::util::StackFrame sf(this, 4);
        // Assumed signature (System V x86-64): void(int, int, int, int *)
        mov(eax, edi);       // first integer argument
        add(eax, esi);       // second
        add(eax, edx);       // third
        mov(ptr[rcx], eax);  // store the sum through the fourth (pointer) argument
        ret();
    }
};
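
Assuming the signature sketched in the comments above (three integers plus an output pointer), the JITted code can then be fetched with getCode and called directly:

```c++
Code code;
int sum = 0;
// getCode<T>() returns the generated buffer cast to the given function type.
auto fn = code.getCode<void (*)(int, int, int, int *)>();
fn(1, 2, 3, &sum);  // sum == 6
```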