This doc provides some examples of translating oneDNN verbose output into benchdnn commands. Generally, debugging a oneDNN problem requires making sure the verbose produced by benchdnn is identical to the one from the workload, so we need to check the verbose part by part, which is a little time-consuming.
This doc intends to provide baseline translations that we can quickly modify to generate another benchdnn command for real-life IPEX workloads. For example, if we want to analyze a culprit convolution verbose line, we can first copy an example below and fine-tune the command. Then we can quickly test the correctness/performance of the workload's verbose.
git clone https://github.com/oneapi-src/oneDNN.git
git checkout <the-commit-you-want>
source ~/intel/oneapi/compiler/xxxx/env/vars.sh
source ~/intel/oneapi/tbb/xxx/env/vars.sh
mkdir -p build;
cd build;
export CC=icx;
export CXX=icpx;
cmake .. -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP;
make -j;
export DNNL_VERBOSE=1
# build with ninja
mkdir -p build
cd build
export CC=icx
export CXX=icpx
cmake .. -G Ninja -DDNNL_GPU_RUNTIME=SYCL -DDNNL_CPU_RUNTIME=THREADPOOL
cmake --build .
# How to use
cd build;
./tests/benchdnn/benchdnn xxxxx
# oneDNN's benchdnn input scripts are at
# tests/benchdnn/inputs/
- Abbreviations:
- --sdt, --ddt, --wdt: source/destination/weight data type (dt)
- tag: format tag (stag/wtag/dtag)
- gpu engine: --engine=gpu
- --mode=p: check performance issues
- --mode=C: check accuracy (correctness) issues
- FP32 case (reorder)
Plain to blocked format
Verbose:
onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:Abcd16a::f0,,,64x1x1x1,0.137939
Benchdnn:
DNNL_VERBOSE=1 ./benchdnn --reorder --engine=gpu --sdt=f32 --ddt=f32 --stag=abcd --dtag=Abcd16a 64x1x1x1
- Special case: Quantization, FP32 to s8
The following quantization cases are from a ResNet-50 workload. Dequantization is similar to quantization; the only difference is the sdt/ddt. The special part of quantization is that scale and zero-point may need to be set. IPEX supports both symmetric and asymmetric quantization: only scale is used for oneDNN symmetric quantization, while zero_point is additionally used for oneDNN asymmetric quantization.
a. Pertensor
The example quantizes an fp32 tensor to an int8 tensor. attr-scales:dst:0 means the scale is applied to the whole tensor (mask 0, a single common scale).
Verbose:
onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_s8::blocked:abcd::f0,attr-scales:dst:0 ,,256x3x224x224,0.709961
Benchdnn command:
DNNL_VERBOSE=1 ./benchdnn --reorder --engine=gpu --sdt=f32 --ddt=s8 --stag=abcd --dtag=abcd --attr-scales=dst:common 256x3x224x224
b. Perchannel
Pertensor/Perchannel is controlled by the --attr-scales and --attr-zero-points arguments plus a policy: common means per-tensor, per_dim_0 means dimension 0 needs a scale/zp (mask = 1<<0 = 1), and per_dim_1 means dimension 1 (mask = 1<<1 = 2). For weight tensors the channel (output-channel) dimension is the first one, e.g. oihw, so in most per-channel weight quantization cases we use per_dim_0; this matches the mask value 1 (attr-scales:dst:1) in the verbose below.
onednn_verbose,primitive,exec,gpu,reorder,ocl:ref:any,undef,src_f32::blocked:abcd::f0 dst_s8::blocked:abcd::f0,attr-scales:dst:1 attr-zero-points:dst:1 ,,64x3x7x7,0.302979
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --reorder --engine=gpu --sdt=f32 --ddt=s8 --stag=abcd --dtag=abcd --attr-scales=dst:per_dim_0 --attr-zero-points=dst:per_dim_0 64x3x7x7
- FP32 case (convolution)
For convolution, whether there is a bias is controlled by the --dir argument: use --dir=FWD_B with bias and --dir=FWD_D for the no-bias case. Usually, IPEX produces convolution verbose with blocked formats, so we need to add --allow-enum-tags-only=0 before any tag argument (stag, wtag, dtag); otherwise benchdnn complains about the non-enumerated tags.
oneDNN verbose
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_f32:a:blocked:ABcd16a16b::f0 wei_f32:a:blocked:ABcd16b16a::f0 bia_f32::blocked:a::f0 dst_f32:a:blocked:ABcd16a16b::f0,attr-scratchpad:user attr-post-ops:binary_add:f32:15:ABcd16a16b+eltwise_relu ,alg:convolution_direct,mb256_ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,6.6731
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=f32:f32:f32 --allow-enum-tags-only=0 --dir=FWD_B --attr-scratchpad=user --stag=ABcd16a16b --wtag=ABcd16b16a --dtag=ABcd16a16b --attr-post-ops=binary_add:f32:15:ABcd16a16b+eltwise_relu mb256_ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0
- INT8 cases
INT8 cases are more complex than other data types because the quantization config and post-ops, especially the quantization config, need to be handled carefully. Generally, the weight tensor should always be s8 and symmetric, and activations should not be per-channel quantized. Another difference from the FP32 case is that the bias is fused as a post-op, so most cases use --dir=FWD_D even though they have a bias.
a. Pertensor
The following example is a per-tensor quantized convolution. attr-scales shows that src, weight and dst all use a single scalar scale. This conv has post-ops attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:41.7858; the first binary_add could be the bias or a real binary post-op.
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_s8:a:blocked:ABcd32a32b::f0 wei_s8:a:blocked:ABcd8b8a4b::f0 bia_undef::undef::: dst_s8:a:blocked:ABcd32a32b::f0,attr-scratchpad:user attr-scales:src0:0+wei:0 attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:41.7858 ,alg:convolution_direct,mb256_ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,0.49707
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=s8:s8:s8 --allow-enum-tags-only=0 --dir=FWD_D --attr-scratchpad=user --attr-scales=wei:common:0.5+src:common:0.5 --stag=ABcd32a32b --wtag=ABcd8b8a4b --dtag=ABcd32a32b --attr-post-ops=binary_add:f32:2+eltwise_relu+eltwise_linear:41.7858 mb256_ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1
b. Perchannel
Src and dst tensors should always be per-tensor quantized, while the weight can be per-channel quantized. wei:1 in the verbose shows that the weight is per-channel quantized. The corresponding benchdnn argument is --attr-scales=wei:per_oc, as follows.
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_s8:a:blocked:ABcd32a32b::f0 wei_s8:a:blocked:ABcd8b8a4b::f0 bia_undef::undef::: dst_s8:a:blocked:ABcd32a32b::f0,attr-scratchpad:user attr-scales:src0:0+wei:1 attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:60.9217 ,alg:convolution_direct,mb256_ic64oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.376953
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=s8:s8:s8 --allow-enum-tags-only=0 --dir=FWD_D --attr-scratchpad=user --stag=ABcd32a32b --wtag=ABcd8b8a4b --dtag=ABcd32a32b --attr-scales=wei:per_oc+src:common:0.5+dst:common:0.5 --attr-post-ops=binary_add:f32:2+eltwise_relu+eltwise_linear:60.9217 mb256_ic64oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0
Note that per-channel weight quantization should never have a zero-point (not implemented in oneDNN); in other words, the weight should always be symmetrically quantized.
c. Per-channel weight + Asymmetric + Post-ops
The following case is extracted from a ResNet-50 per-channel asymmetric quantization workload. attr-scales:src0:0+wei:1 means the src is per-tensor quantized while the weight is per-channel quantized. The src also has a zero-point, as shown by attr-zero-points:src0:0; the zero-point is set with the --attr-zero-points argument in benchdnn. Because the activation is always quantized per-tensor, src:common is used.
onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_u8::blocked:abcd::f0 wei_s8:ap:blocked:AcdB8a4b::f0 bia_undef::undef::: dst_u8:a:blocked:ABcd32a32b::f0,attr-scratchpad:user attr-scales:src0:0+wei:1 attr-zero-points:src0:0 attr-post-ops:binary_add:f32:2+eltwise_relu+eltwise_linear:39.2598 ,alg:convolution_direct,mb1024_ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3,21.595
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --conv --dt=u8:s8:u8 --allow-enum-tags-only=0 --dir=FWD_D --attr-scratchpad=user --stag=abcd --wtag=AcdB8a4b --dtag=ABcd32a32b --attr-scales=wei:per_oc+src:common:0.5 --attr-zero-points=src:common:0.5 --attr-post-ops=binary_add:f32:2+eltwise_relu+eltwise_linear:39.2598 mb1024_ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3
Matmul is actually similar to convolution; only a few matmul-specific arguments need to be set.
FP32 case:
Verbose
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_f32::blocked:ab::f0 wei_f32::blocked:ba::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-post-ops:binary_add:f32:2 ,,1x2:2x6,0.35498
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=f32:f32:f32 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-post-ops=binary_add:f32:2 1x2:2x6
Alternatively, you can use the --strides argument to describe the tensor formats separately, like the following. It has the same effect as the benchdnn command above.
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=f32:f32:f32 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --strides=2x1:1x6:6x1 --attr-post-ops=binary_add:f32:2 1x2:2x6
The --strides argument is not supported in all primitive tests (e.g. there is no support for conv); matmul, batchnorm and reorder support it according to the benchdnn documentation. Check the latest doc for the current status.
Int8 cases
a. Pertensor
The --bia_mask argument controls how the bias tensor is broadcast: each mask bit marks a dimension in which the bias keeps its full size, while the remaining dimensions are broadcast. The _mask2 in the verbose below therefore maps to --bia_mask=2 (a 1x1000 bias broadcast over the rows of the 256x1000 dst).
Verbose
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ba::f0 bia_f32::blocked:ab::f0_mask2 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:0 attr-post-ops:eltwise_linear:4.37147 ,,256x2048:2048x1000,0.104004
Benchdnn command
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=s8:s8:s8 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:common:0+src:common:0.5 --attr-post-ops=eltwise_linear:4.37147 256x2048:2048x1000
b. Perchannel
Per-channel weight quantization in IPEX shows wei:2 in the matmul verbose, while conv per-channel quantization shows wei:1. This should be because PyTorch permutes the weight from OCxIC to ICxOC before feeding the tensor into the primitive, so the output-channel dimension becomes dimension 1 (mask 1<<1 = 2).
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_u8::blocked:ab::f0 wei_s8::blocked:ba::f0 dst_u8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:2 attr-zero-points:src0:0 attr-post-ops:eltwise_linear:2:2 ,,30x5:5x5,0.101074
DNNL_VERBOSE=1 ./benchdnn --engine=gpu --matmul --dt=u8:s8:u8 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:per_oc+src:common:0.5+dst:common:0.5 --attr-post-ops=eltwise_linear:2:2 30x5:5x5
Sometimes we may meet a special issue that only occurs when two primitives run sequentially. For example, suppose we hit a problem only when two matmul primitives run back to back. The verbose is:
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_u8::blocked:ab::f0 wei_s8::blocked:ba::f0 dst_u8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:2 attr-zero-points:src0:0 attr-post-ops:eltwise_linear:2:2 ,,30x5:5x5,0.101074
onednn_verbose,primitive,exec,gpu,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ba::f0 bia_f32::blocked:ab::f0_mask2 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0+wei:0 attr-post-ops:eltwise_linear:4.37147 ,,256x2048:2048x1000,0.104004
If we want to reproduce the error, we can provide a batch file containing the benchdnn arguments for both, like:
--dt=u8:s8:u8 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:per_oc+src:common:0.5+dst:common:0.5 --attr-post-ops=eltwise_linear:2:2 30x5:5x5
--dt=s8:s8:s8 --bia_dt=f32 --bia_mask=2 --attr-scratchpad=user --stag=ab --wtag=ba --dtag=ab --attr-scales=wei:common:0+src:common:0.5 --attr-post-ops=eltwise_linear:4.37147 256x2048:2048x1000
We can trigger the tests with:
DNNL_VERBOSE=1 ./benchdnn --matmul --mode=c --engine=gpu --batch=batch_file.txt
In a large number of situations, benchdnn can cover our issue. However, some cases pass benchdnn but still show problems in real workloads. For example, the whole workload may have an accuracy issue while every single op passes benchdnn; another example is that wrong memory operations such as out-of-bound writes are hard to detect via benchdnn.
This part illustrates how to reproduce an issue by utilizing the existing demo code in oneDNN. It works at the C++ level and is similar to the integration code in IPEX, which gives us more flexibility to approximate the problem seen in IPEX. The oneDNN example code supports most primitives and also provides network inference examples. The sources are easy to understand and are good starting points for us to modify.
# Prepare build
mkdir -p build; cd build;
export CC=icx;
export CXX=icpx; # Use dpcpp for compilation
cmake .. -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP -DBUILD_EXAMPLES=ON; # remember to set build examples
make -j
The built example binaries are at build/examples. The source files are at oneDNN/examples.
Single-operator examples are located at oneDNN/examples/primitives. We can modify a source file there and re-run make -j in the build directory to put the modification into effect. Model inference examples can be found at oneDNN/examples, like cnn_inference_f32.cpp.
- Utils provided by oneDNN for examples (see the sketch below):
- write_to_dnnl_memory: write host data into a dnnl::memory
- read_from_dnnl_memory: copy data from a dnnl::memory back to host memory
- parse_engine_kind: decide which engine to use (cpu/gpu)
- We can use the reorder primitive to change the format
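A minimal sketch of how these utilities are typically used (assuming the file is built inside the oneDNN examples tree so that example_utils.hpp is available; the shape and the function name are made up for illustration):
#include <vector>
#include "example_utils.hpp" // write_to_dnnl_memory, read_from_dnnl_memory, parse_engine_kind
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

void reorder_plain_to_blocked(engine::kind ekind) {
    engine eng(ekind, 0);
    stream strm(eng);

    memory::dims dims = {64, 3, 7, 7};
    std::vector<float> host_src(64 * 3 * 7 * 7, 1.0f);
    std::vector<float> host_dst(host_src.size());

    // Plain (abcd) source and blocked (Abcd16a) destination.
    auto src_mem = memory({dims, memory::data_type::f32, memory::format_tag::abcd}, eng);
    auto dst_mem = memory({dims, memory::data_type::f32, memory::format_tag::Abcd16a}, eng);

    // Copy host data into the device-side dnnl::memory.
    write_to_dnnl_memory(host_src.data(), src_mem);

    // Use a reorder primitive to change the format.
    reorder(src_mem, dst_mem).execute(strm, src_mem, dst_mem);
    strm.wait();

    // Copy the blocked result back to host memory.
    read_from_dnnl_memory(host_dst.data(), dst_mem);
}
In the examples' main(), parse_engine_kind(argc, argv) is what turns the cpu/gpu command-line argument into the engine::kind passed to such a function.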
- Use gpu as engine
./build/examples/primitives-convolution-cpp gpu
dnnl::engine engine(engine_kind, 0);
The engine index is hard-coded to 0 in the source file; if we need to run on a different device, we have to change the index manually, for example as sketched below.
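For instance, to target the second device of the requested kind (a hypothetical index that is only valid if the machine exposes more than one such device):
dnnl::engine engine(engine_kind, 1); // pick device index 1 instead of the default 0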
This section provides a small showcase illustrating the procedure of using the oneDNN demos to approximate an IPEX problem. Suppose we are going to reproduce a convolution UT using the oneDNN examples. The IPEX UT for the GPU part is:
x = torch.randn(
[1, 3, 256, 256], dtype=dtype, device=xpu_device, requires_grad=True
)
conv_xpu = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False).to("xpu")
y = conv_xpu(x)
The first step is to modify the configuration of the memory::desc and primitive::desc structs, like the following modifications in oneDNN/examples/primitives/convolution.cpp:
// Tensor dimensions.
const memory::dim N = 3, // batch size
IC = 3, // input channels
IH = 256, // input height
IW = 256, // input width
...
OH = (IH - KH + PH_L + PH_R) / SH + 1, // output height
OW = (IW - KW + PW_L + PW_R) / SW + 1; // output width
The default behavior of the convolution example is similar to the block-format usage in IPEX. Tag any means the oneDNN library chooses the desired format for computation; if the input or weight tensor has a different memory::desc from the desired one, the example manually reorders the input/weight before executing the primitive. So the example already provides the expected behavior on the ATSM platform. If we want similar behavior on PVC, we can use the reorder util to create a channels-last tensor (see the sketch after the code below) and copy the plain-format path code from csrc/gpu/oneDNN/Conv.h into this example.
auto conv_src_md = memory::desc(src_dims, dt::f32, tag::any);
auto conv_weights_md = memory::desc(weights_dims, dt::f32, tag::any);
auto conv_dst_md = memory::desc(dst_dims, dt::f32, tag::any);
if (conv_pd.src_desc() != user_src_mem.get_desc()) {
conv_src_mem = memory(conv_pd.src_desc(), engine);
reorder(user_src_mem, conv_src_mem)
.execute(engine_stream, user_src_mem, conv_src_mem);
}
// src plain to cl
//onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:acdb::f0,,,3x3x256x256,21.3879
//onednn_verbose,primitive,exec,gpu,reorder,jit:ir,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:Abcd16a::f0,,,64x3x3x3,0.75708
//onednn_verbose,primitive,exec,gpu,convolution,jit:ir,forward_training,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Abcd16a::f0 bia_f32::blocked:a::f0 dst_f32:a:blocked:aBcd16b::f0,attr-post-ops:eltwise_relu ,alg:convolution_direct,mb3_ic3oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,13.5359
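For the channels-last path mentioned above, a minimal sketch could look like the following (user_src_mem, engine, engine_stream, src_dims and the dt/tag aliases are the ones already defined in convolution.cpp; whether the convolution really picks an NHWC kernel still depends on the library):
// Create a channels-last (acdb, i.e. NHWC) source and reorder the plain
// (abcd/NCHW) user tensor into it before building the convolution.
auto cl_src_md = memory::desc(src_dims, dt::f32, tag::acdb);
auto cl_src_mem = memory(cl_src_md, engine);
reorder(user_src_mem, cl_src_mem)
        .execute(engine_stream, user_src_mem, cl_src_mem);
engine_stream.wait();
// Pass cl_src_md when creating the convolution primitive descriptor and
// cl_src_mem at execution time to exercise the channels-last path.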
Usually, IPEX may auto-trigger block format on platforms like ATSM or in quantization usage, so the problem we want to reproduce often already involves block-format tensors. A block-format tensor can be created by setting the format tag of the memory::desc, like auto diff_dst_md = memory::desc(dst_dims, dt::bf16, tag::ABcd32a16b);
To reproduce the simple conv example above, we only need a small modification to the shape configuration. We can re-trigger the compilation with make, then run the binary in the examples directory with the gpu string argument to validate the change.
make -j
./examples/primitives-convolution-cpp gpu
A class of problems that is quite difficult to reproduce is out-of-bound writes; benchdnn may fail to report such an error. For these cases, the examples provided by oneDNN can help greatly.
The general idea is that we acquire extra space beyond what the operator really requires and fill this extra space with special values like zero/NaN or some other constant. If an out-of-bound write occurs, some elements in this extra space will have been overwritten.
We use inner_product.cpp here as an example. The first step is to request extra memory and to modify how the input/weight tensor memory is filled. As shown below, we request an extra 100 floats (100x4 bytes) for diff_weight_data and diff_bias, then fill the memory behind diff_weight_ptr and diff_bias_ptr with NaN.
With the created memory, we use the pointers to initialize the dnnl::memory objects. We can also use the memory::desc::get_size() method to query the needed size of a blocked tensor (see the sketch after the code below).
const int EXTRA = 100;
std::vector<float> src_data(product(src_dims));
std::vector<float> diff_dst_data(product(diff_dst_dims));
std::vector<float> diff_weights_data(product(diff_weights_dims) + EXTRA);
std::vector<float> diff_bias_data(OC + EXTRA);
// Allocate more memory than needed
auto q = dnnl::sycl_interop::get_queue(engine_stream);
auto diff_weight_ptr = sycl::aligned_alloc_shared<float>(
64, product(diff_weights_dims) + EXTRA, q);
q.fill<float>(diff_weight_ptr, std::numeric_limits<float>::quiet_NaN(),
product(diff_weights_dims) + EXTRA);
auto diff_bias_ptr =
sycl::aligned_alloc_shared<float>(64, product(diff_bias_dims) + EXTRA, q);
q.fill<float>(diff_bias_ptr, std::numeric_limits<float>::quiet_NaN(),
product(diff_bias_dims) + EXTRA);
auto diff_weight_md = memory::desc(diff_weights_dims, dt::f32, tag::io);
// Create onednn memory using created memory
auto diff_weights_mem =
memory(diff_weight_md, engine, static_cast<void*>(diff_weight_ptr));
auto diff_bias_mem =
memory(diff_bias_md, engine, static_cast<void*>(diff_bias_ptr));
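As mentioned above, the sizing can also be derived from the descriptor itself via memory::desc::get_size(). A minimal sketch (get_size() returns bytes, so divide by the element size; diff_weight_ptr2 is a hypothetical name to avoid clashing with the example's variables):
// Query the (possibly padded) blocked size from the descriptor instead of
// computing product(dims) by hand.
size_t wei_elems = diff_weight_md.get_size() / sizeof(float);
auto diff_weight_ptr2 = sycl::aligned_alloc_shared<float>(64, wei_elems + EXTRA, q);
q.fill<float>(diff_weight_ptr2, std::numeric_limits<float>::quiet_NaN(),
        wei_elems + EXTRA);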
Then we can feed the memory to the primitive computation. After the computation finishes, we copy the result back into host memory.
auto diff_weight_size = (product(diff_weights_dims) + EXTRA) * sizeof(float);
auto diff_bias_size = (product(diff_bias_dims) + EXTRA) * sizeof(float);
q.memcpy(diff_weights_data.data(), diff_weights_mem.get_data_handle(),
diff_weight_size)
.wait();
q.memcpy(diff_bias_data.data(), diff_bias_mem.get_data_handle(),
diff_bias_size)
.wait();
Then we can check whether there are dirty values in the extra allocated memory.
for (int i = 0; i < EXTRA; ++i) {
auto diff_weight_val = diff_weights_data[product(diff_weights_dims) + i];
if (!std::isnan(diff_weight_val)) {
std::cout << diff_weight_val << std::endl;
std::cout << "Error! diff_weight write more bits than expected."
<< std::endl;
break;
}
auto diff_bias_val = diff_bias_data[product(diff_bias_dims) + i];
if (!std::isnan(diff_bias_val)) {
std::cout << diff_bias_val << std::endl;
std::cout << "Error! diff_bias write more bits than expected."
<< std::endl;
break;
}
}
- RNN cases
./benchdnn --rnn --cfg=bf16 --alg=VANILLA_LSTM --engine=gpu l1t115mb1sic1024slc1024dhc1024dic1024 #LSTM
./benchdnn --rnn --engine=gpu --cfg=bf16f32 --alg=LBR_GRU --prop=BWD_DW l1t63mb512sic681slc681dhc681dic681 #GRU