The purpose of this proposal is to further improve PyTorch CPU performance on both the imperative path and the JIT path.
MKLDNN requires reordering memory from plain layout to blocked layout to achieve optimal performance on CPU, e.g. from `nchw` to `nChw16c`, etc. At the moment, MKLDNN operators in PyTorch reuse the plain CPU tensor, which means each MKLDNN operator takes three steps to finish its computation:
1. `input_reorder(plain_layout, blocked_layout)`
2. `mkldnn_computation()`
3. `output_reorder(blocked_layout, plain_layout)`
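The two reorders around each op can be illustrated in NumPy. The helpers below are illustrative sketches, not MKLDNN's actual implementation; they assume the channel count is a multiple of 16 (real MKLDNN pads channels otherwise):

```python
import numpy as np

def nchw_to_nChw16c(x):
    """Reorder a plain nchw tensor into blocked nChw16c layout.

    Channels are split into blocks of 16 and the block axis is moved
    innermost, which is what lets MKLDNN vectorize over channels.
    Illustrative only; assumes C is a multiple of 16.
    """
    n, c, h, w = x.shape
    assert c % 16 == 0
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2)

def nChw16c_to_nchw(x):
    """Inverse reorder: blocked nChw16c back to plain nchw."""
    n, cb, h, w, b = x.shape
    return x.transpose(0, 1, 4, 2, 3).reshape(n, cb * b, h, w)

x = np.random.rand(1, 32, 8, 8).astype(np.float32)
blocked = nchw_to_nChw16c(x)       # shape (1, 2, 8, 8, 16)
restored = nChw16c_to_nchw(blocked)
assert np.array_equal(x, restored)
```

Paying this data-movement cost on the way in and out of every single operator is exactly the overhead described below.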
These reorders take roughly 50% of the runtime on a typical ImageNet topology, e.g. ResNet50. Also, MKLDNN chooses different blocked formats for convolution depending on the input configuration, but since `nn.Conv2d` always outputs in plain layout, subsequent layers (`BatchNorm`, `Pooling`) execute only on plain layout, which is the slow path for MKLDNN. With these problems solved, CNN models would see a 3~4x speedup over current performance.