MKLDNN RNN integration in PyTorch

This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387. Only the inference path is provided at the moment.

To use MKLDNN RNN in PyTorch:

  1. convert the model to mkldnn
  2. (optional) convert input and hx/cx to mkldnn

Example: how to enable MKLDNN RNN

import torch
from torch.utils import mkldnn as mkldnn_utils

# replace LSTM with MkldnnLSTM
rnn = torch.nn.LSTM(10, 20)
mkldnn_rnn = mkldnn_utils.to_mkldnn(rnn)

# random input
input = torch.randn(1, 5, 10)
hx = torch.randn(1, 5, 20)
cx = torch.randn(1, 5, 20)

# (optional) convert inputs into mkldnn layout
# The logic here is that
#   a) if input/hx/cx are in mkldnn layout, output/hy/cy will be in mkldnn layout
#   b) if input/hx/cx are in dense layout, output/hy/cy will be in dense layout
# to_mkldnn() is an out-of-place memory copy; from a performance perspective,
# try to avoid doing this on every iteration
input = input.to_mkldnn()
hx = hx.to_mkldnn()
cx = cx.to_mkldnn()

# evaluation
output, hidden = mkldnn_rnn(input, (hx, cx))
hy, cy = hidden
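
If the inputs were converted to MKLDNN layout, the output and hidden states also come back in MKLDNN layout (rule (a) above). A minimal sketch of converting them back to dense tensors for downstream ops, continuing from the example above:

# to_dense() copies an MKLDNN-layout tensor back to the default dense (strided) layout;
# like to_mkldnn(), it is an out-of-place copy
output = output.to_dense()
hy = hy.to_dense()
cy = cy.to_dense()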

MKLDNN RNN APIs

MKLDNN RNN uses weight and bias formats that differ from PyTorch:

 /* MKLDNN RNN weight format:
  * mkldnn expects 3 tensor for all layers/directions:
  *   weight_ih (ldigo): {num_layers, num_directions, input_size, num_gates, hidden_size}
  *   weight_hh (ldigo): {num_layers, num_directions, hidden_size, num_gates, hidden_size}
  *   bias (ldgo): {num_layers, num_directions, num_biases, hidden_size}
  *
  * for LSTM, bias has 4 gates:
  *   bias = bias_ih + bias_hh
  *
  * for GRU, the mkldnn bias tensor has 4 gate slots:
  *   (PyTorch GRU bias)     (MKLDNN GRU bias)
  *   bias_ih    bias_hh          bias
  *   +-----+    +-----+       +---------+
  *   | rt1 |    | rt2 |       | zt1+zt2 |
  *   |-----|    |-----|       |---------|
  *   | zt1 |    | zt2 |       | rt1+rt2 |
  *   |-----|    |-----|       |---------|
  *   | nt1 |    | nt2 |       |   nt1   |
  *   +-----+    +-----+       |---------|
  *                            |   nt2   |
  *                            +---------+
  *
  * PyTorch RNN weight format:
  *   a list of length num_layers * num_directions:
  *   {
  *     weight_ih_00, weight_hh_00, bias_ih_00, bias_hh_00 // layer = 0, direction = 0
  *     weight_ih_01, weight_hh_01, bias_ih_01, bias_hh_01 // layer = 0, direction = 1
  *     ..., ..., ..., ...,
  *     weight_ih_ld, weight_hh_ld, bias_ih_ld, bias_hh_ld // layer = l, direction = d
  *   }
  *   weight_ih_ld: {num_gates * hidden_size, input_size}
  *   weight_hh_ld: {num_gates * hidden_size, hidden_size}
  *   bias_ih_ld: {num_gates * hidden_size}
  *   bias_hh_ld: {num_gates * hidden_size}
  */
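
To illustrate the layout difference, here is a minimal sketch (not the integration code itself) that reshapes the per-layer PyTorch LSTM parameters into the ldigo/ldgo shapes described above. It assumes a single-layer, unidirectional LSTM and that the gate order already matches between PyTorch and MKLDNN:

import torch

input_size, hidden_size, num_gates = 10, 20, 4   # LSTM has 4 gates
num_layers, num_directions = 1, 1

rnn = torch.nn.LSTM(input_size, hidden_size)

# PyTorch weight_ih_l0: {num_gates * hidden_size, input_size}
# MKLDNN ldigo:         {num_layers, num_directions, input_size, num_gates, hidden_size}
w_ih = rnn.weight_ih_l0.view(num_gates, hidden_size, input_size) \
    .permute(2, 0, 1) \
    .reshape(num_layers, num_directions, input_size, num_gates, hidden_size)

# PyTorch weight_hh_l0: {num_gates * hidden_size, hidden_size}
w_hh = rnn.weight_hh_l0.view(num_gates, hidden_size, hidden_size) \
    .permute(2, 0, 1) \
    .reshape(num_layers, num_directions, hidden_size, num_gates, hidden_size)

# LSTM bias: bias = bias_ih + bias_hh
# MKLDNN ldgo: {num_layers, num_directions, num_biases, hidden_size}
bias = (rnn.bias_ih_l0 + rnn.bias_hh_l0) \
    .view(num_layers, num_directions, num_gates, hidden_size)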

Performance Improvements

MKLDNN RNN improves LSTM inference performance by up to 5x; use the benchmark to reproduce the results. The benchmark uses input_size=250 and hidden_size=200, and runs on a single socket (20 cores) and a single core respectively.

For the scenario of time_step=1 with single-core inference, memory allocation consumes a considerable amount of time (~1/3); using jemalloc can significantly improve overall performance. Follow the wiki to compile libjemalloc.so. This gives an additional ~30% performance boost, a free lunch.

### run original
./run_single_batch_inference.sh

### run mkldnn
./run_single_batch_inference.sh --mkldnn

### run original with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh

### run mkldnn with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh --mkldnn

Performance results on Xeon 6148 (unit: sentences per second; higher is better):

| time_step | cores | original | mkldnn | original (jemalloc) | mkldnn (jemalloc) | mkldnn vs. original | mkldnn jemalloc boost |
| --------- | ----- | -------- | ------ | ------------------- | ----------------- | ------------------- | --------------------- |
| 15        | 20    | 629      | 3184   | 768                 | 4114              | 5.06                | 1.29                  |
| 15        | 1     | 807      | 2976   | 900                 | 3676              | 3.69                | 1.24                  |
| 1         | 1     | 5100     | 6653   | 5668                | 8418              | 1.30                | 1.27                  |

Future Work

To further improve the performance:

  1. mkldnn requires hx, cx to be concatenated into one tensor src_iter; the concat inside ideep is 3x slower than at::cat (see the sketch after this list).
  2. correspondingly, mkldnn requires dst_iter to be split into hy, cy; the split with at::chunk is in-place and takes no time, while ideep::splitter is a memory copy.
  3. (done) double check whether exp and tanh are properly vectorized: from v0.20 on, elementwise ops in RNN are properly vectorized.
  4. provide in-place conversion between cpu tensor and mkldnn tensor.
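
For reference, the concat/split in items 1 and 2 amounts to the following shape manipulation, sketched here with plain PyTorch ops rather than the actual ideep code:

import torch

num_layers, num_directions, batch, hidden = 1, 1, 5, 20
ld = num_layers * num_directions

hx = torch.randn(ld, batch, hidden)
cx = torch.randn(ld, batch, hidden)

# item 1: hx and cx packed into a single src_iter tensor
src_iter = torch.cat([hx, cx], dim=0)        # {2 * ld, batch, hidden}

# item 2: dst_iter split back into hy and cy;
# torch.chunk returns views (no copy), whereas ideep::splitter does a memory copy
dst_iter = torch.randn_like(src_iter)        # stand-in for the RNN's output states
hy, cy = torch.chunk(dst_iter, 2, dim=0)     # each {ld, batch, hidden}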
@XiaoShen666-git

Hi Mingfei,

I am trying to use MKLDNN to accelerate LSTM inference on a dual-core Xeon server, and got the following error message:
RuntimeError: mkldnn_linear: weight and bias need to be mkldnn layout

After much searching and debugging I still have no clue and couldn't find any documentation about this error. Could you please give some guidance?

Thank you very much!
Xiao

@mingfeima
Author

@XiaoShen666-git, this work is outdated now. For CPU inference, you may try an fbgemm-based approach.
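
For context, a minimal sketch of that fbgemm-based path using PyTorch dynamic quantization (this uses the standard torch.quantization API and is not part of this gist; on x86 the quantized kernels dispatch to fbgemm):

import torch

rnn = torch.nn.LSTM(10, 20)

# quantize the LSTM weights to int8 dynamically; activations stay fp32
quantized_rnn = torch.quantization.quantize_dynamic(
    rnn, {torch.nn.LSTM}, dtype=torch.qint8)

input = torch.randn(1, 5, 10)
output, (hy, cy) = quantized_rnn(input)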
