General guidelines for CPU performance on PyTorch
This file serves as a BKM (Best Known Methods) to get better performance on CPU for PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.
1. Use mkldnn layout
Layout refers to how data is organized in a tensor. The PyTorch default layout is `NCHW`. From an optimization perspective, the MKL-DNN library (recently renamed DNNL) may choose a different layout, sometimes referred to as an internal layout or primitive layout. This is a common technique for acceleration libraries: it is well known that `NHWC` runs faster than `NCHW` for convolution, and converting a tensor from the default `NCHW` to `NHWC` is called a reorder. MKL-DNN may choose different internal layouts based on the input pattern and the algorithm selected, e.g. `nChw16c`, which reorders a 4-dim tensor into a 5-dim one by blocking dimension C by 16, for vectorization purposes (an AVX512 register holds 16 x 32-bit lanes).
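As a minimal sketch of what a reorder looks like from the PyTorch side (assuming a PyTorch build with MKL-DNN enabled), a strided tensor can be converted to the opaque MKL-DNN layout with `Tensor.to_mkldnn()` and back with `Tensor.to_dense()`:

```python
import torch

# a tensor in PyTorch's default strided NCHW layout
x = torch.randn(1, 3, 224, 224)

if torch.backends.mkldnn.is_available():
    # reorder into MKL-DNN's opaque internal layout
    y = x.to_mkldnn()
    # the internal layout exposes no strides; reorder back
    # before passing the result to non-MKL-DNN operators
    z = y.to_dense()
```

The round trip is lossless; the internal blocking (such as `nChw16c`) is chosen by the library and is not visible through the tensor's shape.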
By default on CPU, `conv2d` will run the MKL-DNN kernel, but with strided `NCHW` input and output, so reorders happen on every call.
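To keep weights in the MKL-DNN layout across calls instead of reordering them each time, a module can be converted with `torch.utils.mkldnn.to_mkldnn`. A minimal sketch, assuming a PyTorch build with MKL-DNN enabled and an inference-only (`eval`, `no_grad`) setting:

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).eval()
x = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    # default path: strided NCHW in and out, reorders inside each call
    ref = conv(x)

    if torch.backends.mkldnn.is_available():
        # convert the module once: weights stay in MKL-DNN layout
        mkldnn_conv = mkldnn_utils.to_mkldnn(conv)
        # feed an mkldnn tensor and reorder the output back to strided
        out = mkldnn_conv(x.to_mkldnn()).to_dense()
```

Both paths compute the same convolution; the converted module merely avoids the per-call weight reorder.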