General guidelines for CPU performance on PyTorch
This file serves as a BKM (Best Known Methods) for getting better performance on CPU with PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.
1. Use mkldnn layout
`layout` refers to how data is organized in a tensor. PyTorch's default layout is `NCHW`; from an optimization perspective, the MKL-DNN library (recently renamed to DNNL) may choose a different layout, sometimes referred to as an internal layout or primitive layout. This is a common technique for acceleration libraries: it is well known that `NHWC` runs faster than `NCHW` for convolution, and changing the default `NCHW` to `NHWC` is called a `reorder`. MKL-DNN may choose different internal layouts based on the input pattern and the algorithm selected, e.g. `nChw16c`, which reorders a 4-dim tensor into a 5-dim one by blocking dimension C in chunks of 16, for vectorization purposes (an AVX512 register is 512 bits wide, i.e. 16 × 32-bit floats).
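
For illustration, here is a minimal sketch of such a reorder using PyTorch's public MKL-DNN tensor API (`Tensor.to_mkldnn()` / `Tensor.to_dense()` and the `torch.utils.mkldnn.to_mkldnn` module converter). It assumes a PyTorch build with MKL-DNN enabled; the shapes are arbitrary examples:

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# Reorder a default-layout (NCHW, strided) tensor into MKL-DNN's
# internal layout, and back. Shapes here are arbitrary examples.
x = torch.randn(1, 64, 56, 56)      # default PyTorch layout (NCHW)
x_mkldnn = x.to_mkldnn()            # reorder into MKL-DNN internal layout
print(x_mkldnn.layout)              # torch._mkldnn
y = x_mkldnn.to_dense()             # reorder back to the default layout

# Convert a module's weights to MKL-DNN layout ahead of time so the
# weight reorder is not paid on every forward call (inference only).
conv = torch.nn.Conv2d(64, 128, kernel_size=3).eval()
conv_mkldnn = mkldnn_utils.to_mkldnn(conv)
out = conv_mkldnn(x_mkldnn)         # input must also be an MKL-DNN tensor
```

Keeping tensors in the MKL-DNN layout across consecutive ops avoids paying the reorder cost at every layer boundary.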
By default on CPU, `conv2d` will run