(Training material on PyTorch CPU performance optimization)
- Part II: Parallelization Techniques
- Part III: Vectorization Techniques
- Part IV: BFloat16 Kernel Optimization
Chinese version for this chapter, link.
(Internal Training Material)
Usually the first step in performance optimization is profiling, e.g. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch; you will learn:
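As a starting point, PyTorch ships an autograd profiler that reports per-operator CPU time. A minimal sketch (the model and tensor shapes are illustrative assumptions, not from the workload above):

```python
import torch
from torch.autograd import profiler

# tiny model to profile; shapes are illustrative assumptions
model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

with profiler.profile() as prof:
    with torch.no_grad():
        model(x)

# print operator-level hotspots, sorted by total CPU time
print(prof.key_averages().table(sort_by="cpu_time_total"))
```

The table lists `aten::` operators with their call counts and CPU time, which is usually enough to locate the hotspot before digging into kernel-level tools.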
This tutorial takes one of my recent projects - pssp-transformer - as an example to guide you through the path of PyTorch CPU performance optimization. The focus will be on Part 1 & Part 2.
This file serves as a BKM (Best Known Method) for getting better CPU performance from PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.
Right now, on the PyTorch CPU path, you may choose among 3 types of memory formats.
This section contains the following subjects:
("mkldnn" has been renamed to "oneDNN", but exsiting PyTorch APIs still use "mkldnn", future work will align PyTorch user level APIs to "oneDNN")
NB: Memory format refers to the data representation that describes how a multidimensional (nD) array is stored in linear (1D) memory address space. Memory format has the same semantics as layout in oneDNN. Layout in PyTorch has a different semantic: it describes dense vs. sparse with the attributes 'torch.strided' and 'torch.sparse_coo'.
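The 3 memory formats can be seen directly from the tensor API. A minimal sketch (the NCHW shape is an illustrative assumption):

```python
import torch

x = torch.randn(1, 3, 224, 224)              # NCHW, default contiguous (strided)
y = x.to(memory_format=torch.channels_last)  # NHWC physical order, still strided
print(x.is_contiguous())                                   # True
print(y.is_contiguous(memory_format=torch.channels_last))  # True

# blocked (oneDNN) layout: opaque tensor, consumed by mkldnn kernels
if torch.backends.mkldnn.is_available():
    z = x.to_mkldnn()
    print(z.layout)  # torch._mkldnn
```

Note that `channels_last` only permutes the strides; the logical shape stays NCHW, which is why regular PyTorch ops keep working on it, while the mkldnn blocked tensor must be converted back via `to_dense()` before non-mkldnn ops can use it.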
This gist keeps a record of the MKLDNN RNN integration job into PyTorch and serves as a backup of PR26387; only the inference feature is provided at the moment.
To use MKLDNN RNN in PyTorch, start from the mkldnn utility module (example: how to enable mkl-dnn for an RNN):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils
```
Performance evaluation is based on the huggingface repo; the actual benchmark run script is placed in that repo. To reproduce the performance numbers, set GLUE_DIR to the actual dataset path in run_inference.sh. Inference performance results below are on Xeon 6148 (2x20 cores), single socket and single thread.
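The environment setup can be sketched as follows (the dataset path is a hypothetical placeholder; adjust it to your local GLUE data, and the numactl pinning line is shown for the single-socket, single-thread configuration):

```shell
# hypothetical dataset path -- point GLUE_DIR at your GLUE data
export GLUE_DIR=/path/to/glue_data
# single thread, as in the reported configuration
export OMP_NUM_THREADS=1
# pin to one socket, then launch the benchmark script from the repo:
# numactl --cpunodebind=0 --membind=0 bash run_inference.sh
```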
PyTorch can be installed via different channels: `conda`, `pip`, `docker`, source code ...
By default, mkl and mkl-dnn are enabled; but this might not always be true, so it is still useful to learn how to check this by yourself:
```shell
### check where your torch is installed
python -c 'import torch; print(torch.__path__)'
```
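You can also query the build directly from Python; `torch.backends` reports whether MKL and MKL-DNN (oneDNN) support was compiled in:

```python
import torch

# True only if the build was compiled with the corresponding library
print(torch.backends.mkl.is_available())
print(torch.backends.mkldnn.is_available())
```

If either prints `False`, the corresponding kernels are not available in your build and you will need a different install channel or a source build.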