This doc provides some examples of translating oneDNN verbose output into benchdnn commands. Generally, debugging a oneDNN problem requires making sure the verbose produced by benchdnn is identical to the one from the workload, so we need to check the verbose part by part, which is a little time-consuming.
This doc intends to provide baseline translations that we can quickly modify to generate another benchdnn command for real-life IPEX workloads. For example, if we want to analyze a culprit convolution verbose line, we can first copy an example below and fine-tune the command. Then we can quickly test the correctness/performance of the workload's verbose.
git clone https://github.com/oneapi-src/oneDNN.git
git checkout <the-commit-you-want>
source ~/intel/oneapi/compiler/xxxx/env/vars.sh
source ~/intel/oneapi/tbb/xxx/env/vars.sh
mkdir -p build;
cd build;
export CC=icx;
export CXX=icpx;
cmake .. -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP;
make -j;
export DNNL_VERBOSE=1
# build with ninja
mkdir -p build
cd build
export CC=icx
export CXX=icpx
cmake .. -G Ninja -DDNNL_GPU_RUNTIME=SYCL -DDNNL_CPU_RUNTIME=THREADPOOL
cmake --build .
# How to use
cd build;
./tests/benchdnn/benchdnn xxxxx
# oneDNN's benchdnn input scripts are at
# tests/benchdnn/inputs/
- Abbreviations:
- --sdt, --ddt, --wdt: source/destination/weight data type (dt)
- tag: format tag (stag/wtag/dtag)
- gpu engine: --engine=gpu
- --mode=p: check performance issues
- --mode=C: check accuracy (correctness) issues
- FP32 case (reorder)
Plain to blocked format
Verbose:
onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:Abcd16a::f0,,,64x1x1x1,0.137939
Benchdnn:
DNNL_VERBOSE=1 ./benchdnn --reorder --engine=gpu --sdt=f32 --ddt=f32 --stag=abcd --dtag=Abcd16a 64x1x1x1
- Special case: Quantization, FP32 to s8
The following quantization cases are from a ResNet-50 workload. Dequantization is similar to quantization; the only difference is the sdt/ddt. The special part of quantization is that scale and zero-point may need to be set. IPEX supports both symmetric and asymmetric quantization: only scale is used for oneDNN symmetric quantization, while zero_point is additionally used for oneDNN asymmetric quantization.
a. Pertensor
The example quantizes an fp32 tensor to an int8 tensor. attr-scales:dst:0 means the scale is applied to the whole tensor (mask 0, a single common scale).
Verbose:
onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_s8::blocked:abcd::f0,attr-scales:dst:0 ,,256x3x224x224,0.709961
Benchdnn command:
DNNL_VERBOSE=1 ./benchdnn --reorder --engine=gpu --sdt=f32 --ddt=s8 --stag=abcd --dtag=abcd --attr-scales=dst:common 256x3x224x224
b. Perchannel
Pertensor/Perchannel is controlled by the --attr-scales and --attr-zero-points arguments plus a policy: common means per-tensor, per_dim_0 means dimension 0 needs a scale/zp (mask = 1<<0 = 1), and per_dim_1 means dimension 1 (mask = 1<<1 = 2). For weight tensors the channel (output-channel) dimension is the first one, e.g. oihw, so in most per-channel weight quantization cases we use per_dim_0; this matches the mask value 1 (attr-scales:dst:1) in the verbose below.
onednn_verbose,primitive,exec,gpu,reorder,ocl:ref:any,undef,src_f32::blocked:abcd::f0 dst_s8::blocked:abcd::f0,attr-scales:dst:1 attr-zero-points:dst:1 ,,64x3x7x7,0.302979
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --reorder --engine=gpu --sdt=f32 --ddt=s8 --stag=abcd --dtag=abcd --attr-scales=dst:per_dim_0 --attr-zero-points=dst:per_dim_0 64x3x7x7
- FP32 case (convolution)
For convolution, whether there is a bias is controlled by the --dir argument: use --dir=FWD_B with bias and --dir=FWD_D for the no-bias case. Usually, IPEX produces convolution verbose with blocked formats, so we need to add --allow-enum-tags-only=0 before any tag argument (stag, wtag, dtag); otherwise benchdnn complains about the non-enumerated tags.
oneDNN verbose
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_f32:a:blocked:ABcd16a16b::f0 wei_f32:a:blocked:ABcd16b16a::f0 bia_f32::blocked:a::f0 dst_f32:a:blocked:ABcd16a16b::f0,attr-scratchpad:user attr-post-ops:binary_add:f32:15:ABcd16a16b+eltwise_relu ,alg:convolution_direct,mb256_ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,6.6731
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=f32:f32:f32 --allow-enum-tags-only=0 --dir=FWD_B --attr-scratchpad=user --stag=ABcd16a16b --wtag=ABcd16b16a --dtag=ABcd16a16b --attr-post-ops=binary_add:f32:15:ABcd16a16b+eltwise_relu mb256_ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0
- INT8 cases
INT8 cases are more complex than other data types because the quantization config and post-ops, especially the quantization config, need to be handled carefully. Generally, the weight tensor should always be s8 and symmetric, and activations should not be per-channel quantized. Another difference from the FP32 case is that the bias is fused as a post-op, so most cases use --dir=FWD_D even though they have a bias.
a. Pertensor
The following example is a per-tensor quantized convolution. attr-scales shows that src, weight and dst all use a single scalar scale. This conv has post-ops attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:41.7858; the first binary_add could be the bias or a real binary post-op.
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_s8:a:blocked:ABcd32a32b::f0 wei_s8:a:blocked:ABcd8b8a4b::f0 bia_undef::undef::: dst_s8:a:blocked:ABcd32a32b::f0,attr-scratchpad:user attr-scales:src0:0+wei:0 attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:41.7858 ,alg:convolution_direct,mb256_ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,0.49707
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=s8:s8:s8 --allow-enum-tags-only=0 --dir=FWD_D --attr-scratchpad=user --attr-scales=wei:common:0.5+src:common:0.5 --stag=ABcd32a32b --wtag=ABcd8b8a4b --dtag=ABcd32a32b --attr-post-ops=binary_add:f32:2+eltwise_relu+eltwise_linear:41.7858 mb256_ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1
b. Perchannel
Src and dst tensors should always be per-tensor quantized, while the weight can be per-channel quantized. wei:1 in the verbose shows that the weight is per-channel quantized. The corresponding benchdnn argument is --attr-scales=wei:per_oc, as follows.
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_s8:a:blocked:ABcd32a32b::f0 wei_s8:a:blocked:ABcd8b8a4b::f0 bia_undef::undef::: dst_s8:a:blocked:ABcd32a32b::f0,attr-scratchpad:user attr-scales:src0:0+wei:1 attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:60.9217 ,alg:convolution_direct,mb256_ic64oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.376953
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=s8:s8:s8 --allow-enum-tags-only=0 --dir=FWD_D --attr-scratchpad=user --stag=ABcd32a32b --wtag=ABcd8b8a4b --dtag=ABcd32a32b --attr-scales=wei:per_oc+src:common:0.5+dst:common:0.5 --attr-post-ops=binary_add:f32:2+eltwise_relu+eltwise_linear:60.9217 mb256_ic64oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0
Note that per-channel weight quantization should never have a zero-point (not implemented in oneDNN); in other words, the weight should always be symmetrically quantized.
c. Per-channel weight + Asymmetric + Post-ops
The following case is extracted from a ResNet-50 per-channel asymmetric quantization workload. attr-scales:src0:0+wei:1 means the src is per-tensor quantized while the weight is per-channel quantized. The src also has a zero-point, as shown by attr-zero-points:src0:0; the zero-point is set with the --attr-zero-points argument in benchdnn. Because the activation is always quantized per-tensor, src:common is used.
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_u8::blocked:abcd::f0 wei_s8:ap:blocked:AcdB8a4b::f0 bia_undef::undef::: dst_u8:a:blocked:ABcd32a32b::f0,attr-scratchpad:user attr-scales:src0:0+wei:1 attr-zero-points:src0:0 attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:39.2598 ,alg:convolution_direct,mb1024_ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3,21.595
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=u8:s8:u8 --allow-enum-tags-only=0 --dir=FWD_D --attr-scratchpad=user --stag=abcd --wtag=AcdB8a4b --dtag=ABcd32a32b --attr-scales=wei:per_oc+src:common:0.5 --attr-zero-points=src:common:0.5 --attr-post-ops=binary_add:f32:2+eltwise_relu+eltwise_linear:39.2598 mb1024_ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3
Matmul is actually similar to convolution; only a few matmul-specific arguments need to be set.
FP32 case:
Verbose
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_f32::blocked:ab::f0 wei_f32::blocked:ba::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-post-ops:binary_add:f32:2 ,,1x2:2x6,0.35498
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=f32:f32:f32 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-post-ops=binary_add:f32:2 1x2:2x6
Alternatively, you can use the --strides argument to describe the tensor formats separately, like the following. It has the same effect as the benchdnn command above.
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=f32:f32:f32 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --strides=2x1:1x6:6x1 --attr-post-ops=binary_add:f32:2 1x2:2x6
The --strides argument is not supported in all primitive tests (e.g. there is no support for conv); matmul, batchnorm and reorder support it according to the benchdnn documentation. Check the latest doc for the current status.
Int8 cases
a. Pertensor
The --bia_mask argument controls how the bias tensor is broadcast: each mask bit marks a dimension in which the bias keeps its full size, while the remaining dimensions are broadcast. The _mask2 in the verbose below therefore maps to --bia_mask=2 (a 1x1000 bias broadcast over the rows of the 256x1000 dst).
Verbose
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ba::f0 bia_f32::blocked:ab::f0_mask2 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:0 attr-post-ops:eltwise_linear:4.37147 ,,256x2048:2048x1000,0.104004
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=s8:s8:s8 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:common:0+src:common:0.5 --attr-post-ops=eltwise_linear:4.37147 256x2048:2048x1000
b. Perchannel
Per-channel weight quantization in IPEX shows wei:2 in the matmul verbose, while conv per-channel quantization shows wei:1. This should be because PyTorch permutes the weight from OCxIC to ICxOC before feeding the tensor into the primitive, so the output-channel dimension becomes dimension 1 (mask 1<<1 = 2).
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_u8::blocked:ab::f0 wei_s8::blocked:ba::f0 dst_u8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:2 attr-zero-points:src0:0 attr-post-ops:eltwise_linear:2:2 ,,30x5:5x5,0.101074
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=u8:s8:u8 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:per_oc+src:common:0.5+dst:common:0.5 --attr-post-ops=eltwise_linear:2:2 30x5:5x5
Sometimes we may meet a special issue that only occurs when two primitives run sequentially. For example, suppose we hit a problem only when two matmul primitives run back to back. The verbose is:
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_u8::blocked:ab::f0 wei_s8::blocked:ba::f0 dst_u8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:2 attr-zero-points:src0:0 attr-post-ops:eltwise_linear:2:2 ,,30x5:5x5,0.101074
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ba::f0 bia_f32::blocked:ab::f0_mask2 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:0 attr-post-ops:eltwise_linear:4.37147 ,,256x2048:2048x1000,0.104004
If we want to reproduce the error, we can provide a batch file containing the benchdnn arguments for both, like:
--dt=u8:s8:u8 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:per_oc+src:common:0.5+dst:common:0.5 --attr-post-ops=eltwise_linear:2:2 30x5:5x5
--dt=s8:s8:s8 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:common:0+src:common:0.5 --attr-post-ops=eltwise_linear:4.37147 256x2048:2048x1000
We can trigger the tests with:
DNNL_VERBOSE=1 ./benchdnn --matmul --mode=c --engine=gpu --batch=batch_file.txt
In a large number of situations, benchdnn can cover our issue. However, some cases pass benchdnn but still show problems in real workloads. For example, the whole workload may have an accuracy issue while every single op passes benchdnn; another example is that wrong memory operations such as out-of-bound writes are hard to detect via benchdnn.
This part illustrates how to reproduce an issue by utilizing the existing demo code in oneDNN. It works at the C++ level and is similar to the integration code in IPEX, which gives us more flexibility to approximate the problem seen in IPEX. The oneDNN example code supports most primitives and also provides network inference examples. The sources are easy to understand and are good starting points for us to modify.
# Prepare build
mkdir -p build; cd build;
export CC=icx;
export CXX=icpx; # Use dpcpp for compilation
cmake .. -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP -DBUILD_EXAMPLES=ON; # remember to set build examples
make -j
The built example binaries are at build/examples. The source files are at oneDNN/examples.
Single-operator examples are located at oneDNN/examples/primitives. We can modify a source file there and re-run make -j in the build directory to put the modification into effect. Model inference examples can be found at oneDNN/examples, like cnn_inference_f32.cpp.
- Utils provided by oneDNN for examples (see the sketch below):
- write_to_dnnl_memory: write host data into a dnnl::memory
- read_from_dnnl_memory: copy data from a dnnl::memory back to host memory
- parse_engine_kind: decide which engine to use (cpu/gpu)
- We can use the reorder primitive to change the format
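A minimal sketch of how these utilities are typically used (assuming the file is built inside the oneDNN examples tree so that example_utils.hpp is available; the shape and the function name are made up for illustration):
#include <vector>
#include "example_utils.hpp" // write_to_dnnl_memory, read_from_dnnl_memory, parse_engine_kind
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

void reorder_plain_to_blocked(engine::kind ekind) {
    engine eng(ekind, 0);
    stream strm(eng);

    memory::dims dims = {64, 3, 7, 7};
    std::vector<float> host_src(64 * 3 * 7 * 7, 1.0f);
    std::vector<float> host_dst(host_src.size());

    // Plain (abcd) source and blocked (Abcd16a) destination.
    auto src_mem = memory({dims, memory::data_type::f32, memory::format_tag::abcd}, eng);
    auto dst_mem = memory({dims, memory::data_type::f32, memory::format_tag::Abcd16a}, eng);

    // Copy host data into the device-side dnnl::memory.
    write_to_dnnl_memory(host_src.data(), src_mem);

    // Use a reorder primitive to change the format.
    reorder(src_mem, dst_mem).execute(strm, src_mem, dst_mem);
    strm.wait();

    // Copy the blocked result back to host memory.
    read_from_dnnl_memory(host_dst.data(), dst_mem);
}
In the examples' main(), parse_engine_kind(argc, argv) is what turns the cpu/gpu command-line argument into the engine::kind passed to such a function.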
- Use gpu as engine
./build/examples/primitives-convolution-cpp gpu
dnnl::engine engine(engine_kind, 0);
The engine index is hard-coded to 0 in the source file; if we need to run on a different device, we have to change the index manually, for example as sketched below.
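For instance, to target the second device of the requested kind (a hypothetical index that is only valid if the machine exposes more than one such device):
dnnl::engine engine(engine_kind, 1); // pick device index 1 instead of the default 0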
This section provides a small showcase illustrating the procedure of using the oneDNN demos to approximate an IPEX problem. Suppose we are going to reproduce a convolution UT using the oneDNN examples. The IPEX UT for the GPU part is:
x = torch.randn(
[1, 3, 256, 256], dtype=dtype, device=xpu_device, requires_grad=True
)
conv_xpu = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False).to("xpu")
y = conv_xpu(x)
The first step is to modify the configuration of the memory::desc and primitive::desc structs, like the following modifications in oneDNN/examples/primitives/convolution.cpp:
// Tensor dimensions.
const memory::dim N = 3, // batch size
IC = 3, // input channels
IH = 256, // input height
IW = 256, // input width
...
OH = (IH - KH + PH_L + PH_R) / SH + 1, // output height
OW = (IW - KW + PW_L + PW_R) / SW + 1; // output width
The default behavior of the convolution example is similar to the block-format usage in IPEX. Tag any means the oneDNN library chooses the desired format for computation; if the input or weight tensor has a different memory::desc from the desired one, the example manually reorders the input/weight before executing the primitive. So the example already provides the expected behavior on the ATSM platform. If we want similar behavior on PVC, we can use the reorder util to create a channels-last tensor (see the sketch after the code below) and copy the plain-format path code from csrc/gpu/oneDNN/Conv.h into this example.
auto conv_src_md = memory::desc(src_dims, dt::f32, tag::any);
auto conv_weights_md = memory::desc(weights_dims, dt::f32, tag::any);
auto conv_dst_md = memory::desc(dst_dims, dt::f32, tag::any);
if (conv_pd.src_desc() != user_src_mem.get_desc()) {
conv_src_mem = memory(conv_pd.src_desc(), engine);
reorder(user_src_mem, conv_src_mem)
.execute(engine_stream, user_src_mem, conv_src_mem);
}
// src plain to cl
//onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:acdb::f0,,,3x3x256x256,21.3879
//onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:Abcd16a::f0,,,64x3x3x3,0.75708
//onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Abcd16a::f0 bia_f32::blocked:a::f0 dst_f32:a:blocked:aBcd16b::f0,attr-post-ops:eltwise_relu ,alg:convolution_direct,mb3_ic3oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,13.5359
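For the channels-last path mentioned above, a minimal sketch could look like the following (user_src_mem, engine, engine_stream, src_dims and the dt/tag aliases are the ones already defined in convolution.cpp; whether the convolution really picks an NHWC kernel still depends on the library):
// Create a channels-last (acdb, i.e. NHWC) source and reorder the plain
// (abcd/NCHW) user tensor into it before building the convolution.
auto cl_src_md = memory::desc(src_dims, dt::f32, tag::acdb);
auto cl_src_mem = memory(cl_src_md, engine);
reorder(user_src_mem, cl_src_mem)
        .execute(engine_stream, user_src_mem, cl_src_mem);
engine_stream.wait();
// Pass cl_src_md when creating the convolution primitive descriptor and
// cl_src_mem at execution time to exercise the channels-last path.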
Usually, IPEX may auto-trigger block format on platforms like ATSM or in quantization usage, so the problem we want to reproduce often already involves block-format tensors. A block-format tensor can be created by setting the format tag of the memory::desc, like auto diff_dst_md = memory::desc(dst_dims, dt::bf16, tag::ABcd32a16b);
To reproduce the simple conv example above, we only need a small modification to the shape configuration. We can re-trigger the compilation with make, then run the binary in the examples directory with the gpu string argument to validate the change.
make -j
./examples/primitives-convolution-cpp gpu
A class of problems that is quite difficult to reproduce is out-of-bound writes; benchdnn may fail to report such an error. For these cases, the examples provided by oneDNN can help greatly.
The general idea is that we acquire extra space beyond what the operator really requires and fill this extra space with special values like zero/NaN or some other constant. If an out-of-bound write occurs, some elements in this extra space will have been overwritten.
We use inner_product.cpp here as an example. The first step is to request extra memory and to modify how the input/weight tensor memory is filled. As shown below, we request an extra 100 floats (100x4 bytes) for diff_weight_data and diff_bias, then fill the memory behind diff_weight_ptr and diff_bias_ptr with NaN.
With the created memory, we use the pointers to initialize the dnnl::memory objects. We can also use the memory::desc::get_size() method to query the needed size of a blocked tensor (see the sketch after the code below).
const int EXTRA = 100;
std::vector<float> src_data(product(src_dims));
std::vector<float> diff_dst_data(product(diff_dst_dims));
std::vector<float> diff_weights_data(product(diff_weights_dims) + EXTRA);
std::vector<float> diff_bias_data(OC + EXTRA);
// Allocate more memory than needed
auto q = dnnl::sycl_interop::get_queue(engine_stream);
auto diff_weight_ptr = sycl::aligned_alloc_shared<float>(
64, product(diff_weights_dims) + EXTRA, q);
q.fill<float>(diff_weight_ptr, std::numeric_limits<float>::quiet_NaN(),
product(diff_weights_dims) + EXTRA);
auto diff_bias_ptr =
sycl::aligned_alloc_shared<float>(64, product(diff_bias_dims) + EXTRA, q);
q.fill<float>(diff_bias_ptr, std::numeric_limits<float>::quiet_NaN(),
product(diff_bias_dims) + EXTRA);
auto diff_weight_md = memory::desc(diff_weights_dims, dt::f32, tag::io);
// Create onednn memory using created memory
auto diff_weights_mem =
memory(diff_weight_md, engine, static_cast<void*>(diff_weight_ptr));
auto diff_bias_mem =
memory(diff_bias_md, engine, static_cast<void*>(diff_bias_ptr));
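As mentioned above, the sizing can also be derived from the descriptor itself via memory::desc::get_size(). A minimal sketch (get_size() returns bytes, so divide by the element size; diff_weight_ptr2 is a hypothetical name to avoid clashing with the example's variables):
// Query the (possibly padded) blocked size from the descriptor instead of
// computing product(dims) by hand.
size_t wei_elems = diff_weight_md.get_size() / sizeof(float);
auto diff_weight_ptr2 = sycl::aligned_alloc_shared<float>(64, wei_elems + EXTRA, q);
q.fill<float>(diff_weight_ptr2, std::numeric_limits<float>::quiet_NaN(),
        wei_elems + EXTRA);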
Then we can feed the memory to the primitive computation. After the computation finishes, we copy the result back into host memory.
auto diff_weight_size = (product(diff_weights_dims) + EXTRA) * sizeof(float);
auto diff_bias_size = (product(diff_bias_dims) + EXTRA) * sizeof(float);
q.memcpy(diff_weights_data.data(), diff_weights_mem.get_data_handle(),
diff_weight_size)
.wait();
q.memcpy(diff_bias_data.data(), diff_bias_mem.get_data_handle(),
diff_bias_size)
.wait();
Then we can check whether there are dirty values in the extra allocated memory.
for (int i = 0; i < EXTRA; ++i) {
auto diff_weight_val = diff_weights_data[product(diff_weights_dims) + i];
if (!std::isnan(diff_weight_val)) {
std::cout << diff_weight_val << std::endl;
std::cout << "Error! diff_weight write more bits than expected."
<< std::endl;
break;
}
auto diff_bias_val = diff_bias_data[product(diff_bias_dims) + i];
if (!std::isnan(diff_bias_val)) {
std::cout << diff_bias_val << std::endl;
std::cout << "Error! diff_bias write more bits than expected."
<< std::endl;
break;
}
}
- RNN cases
./benchdnn --rnn --cfg=bf16 --alg=VANILLA_LSTM --engine=gpu l1t115mb1sic1024slc1024dhc1024dic1024 #LSTM
./benchdnn --rnn --engine=gpu --cfg=bf16f32 --alg=LBR_GRU --prop=BWD_DW l1t63mb512sic681slc681dhc681dic681 #GRU