Based on the huggingface repo for performance evaluation; the actual benchmark run script is placed in the repo. How to reproduce the performance numbers:
- prepare the dataset according to the link;
- update `GLUE_DIR` in `run_inference.sh` to the actual dataset path;
- change the env settings if needed; the default setting uses 20 cores.
Inference performance results on Xeon 6148 (2x20 cores), single socket:
- MKL: version 2019.4 (`conda install mkl mkl-include`)
- MKLDNN: as proposed in 21851
single instance (20 threads)
- MKL
```
>>> ./run_inference.sh
408/408 [00:24<00:00, 16.69it/s]
```
- MKLDNN
```
>>> ./run_inference.sh --mkldnn
408/408 [00:18<00:00, 21.95it/s]
```
multi instance (1 thread per instance)
- MKL
```
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
```
- MKLDNN
```
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
```
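As a quick sanity check (a small Python sketch, not part of the benchmark scripts), the reported iterations-per-second figures follow directly from the iteration counts and total times above:

```python
# Multi-instance throughput = total iterations / total wall time.
mkl_ips = 1000 / 23.453      # MKL run
mkldnn_ips = 1000 / 18.525   # MKLDNN run
print(round(mkl_ips, 2), round(mkldnn_ips, 2))  # 42.64 53.98
```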
- Skylake has a special requirement on the leading dimension of GEMM: when `LDA`/`LDB`/`LDC` is a multiple of 128, a cache flush issue degrades performance, see ref.
- The following table compares the performance of BERT (glue/MRPC) GEMMs on MKL and MKLDNN with the original sizes and the padded sizes (+16).
Table-1: single socket test result (20 threads), unit: Gflops

| size (original) | MKL | MKLDNN | size (padded) | MKL | MKLDNN |
|---|---|---|---|---|---|
| N=128, I=768, O=768 | 818.57 | 417.03 | N=128, I=784, O=784 | 1246.08 | 1282.33 |
| N=128, I=768, O=3072 | 1369.88 | 1818.96 | N=128, I=784, O=3088 | 1908.46 | 1931.12 |
| N=128, I=3072, O=768 | 676.20 | 1262.61 | N=128, I=3088, O=784 | 1768.28 | 1658.30 |
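The padded sizes above simply bump each dimension off a multiple of 128. A minimal sketch of that padding rule (the helper name `pad_dim` is hypothetical; the benchmark just adds 16 by hand):

```python
def pad_dim(dim, pad=16):
    # Hypothetical helper: bump a GEMM dimension off a multiple of 128
    # to sidestep the Skylake leading-dimension cache issue described above.
    return dim + pad if dim % 128 == 0 else dim

# 768 and 3072 are multiples of 128; the padded 784 and 3088 are not.
print(pad_dim(768), pad_dim(3072))  # 784 3088
```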
- Use the following script to reproduce this result:
run.sh:
```shell
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
test_linear.py:
```python
import torch
import torch.nn as nn
from time import time

warmups = 1000
iters = 10000

def test_linear(batch_size, input_channel, output_channel):
    input = torch.randn(batch_size, input_channel)
    linear = nn.Linear(input_channel, output_channel)
    # warm up
    for i in range(warmups):
        output = linear(input)
    t1 = time()
    for i in range(iters):
        output = linear(input)
    t2 = time()
    tt = (t2 - t1) / iters
    # 2 * N * I * O flops per GEMM (one multiply and one add per MAC)
    print("### Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
          % (batch_size, input_channel, batch_size, output_channel,
             tt * 1000, 2 * batch_size * input_channel * output_channel / tt / 1e9))

test_linear(128, 768, 768)
test_linear(128, 768, 3072)
test_linear(128, 3072, 768)
test_linear(128, 768 + 16, 768 + 16)
test_linear(128, 768 + 16, 3072 + 16)
test_linear(128, 3072 + 16, 768 + 16)
```
To run on a single socket with 20 OMP threads:
```
./run.sh 20 test_linear.py
```
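The Gflops figure printed by test_linear.py comes from the standard GEMM flop count, 2·N·I·O. A small cross-check against the first entry of the mkldnn multi-instance log below (1.879133 ms):

```python
def gemm_gflops(n, i, o, seconds):
    # A (n x i) @ (i x o) GEMM performs 2*n*i*o floating-point ops.
    return 2 * n * i * o / seconds / 1e9

print(round(gemm_gflops(128, 768, 768, 1.879133e-3), 2))  # 80.35
```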
Multi-instance raw logs and the launcher script are attached:
script
mkldnn multi_instance
```
(pytorch-mingfei) [mingfeim@mlt-skx084 gemm]$ ./run_multi_instance.sh
Process [0/20] => run with core 0 - 0
Process [1/20] => run with core 1 - 1
Process [2/20] => run with core 2 - 2
Process [3/20] => run with core 3 - 3
Process [4/20] => run with core 4 - 4
Process [5/20] => run with core 5 - 5
Process [6/20] => run with core 6 - 6
Process [7/20] => run with core 7 - 7
Process [8/20] => run with core 8 - 8
Process [9/20] => run with core 9 - 9
Process [10/20] => run with core 10 - 10
Process [11/20] => run with core 11 - 11
Process [12/20] => run with core 12 - 12
Process [13/20] => run with core 13 - 13
Process [14/20] => run with core 14 - 14
Process [15/20] => run with core 15 - 15
Process [16/20] => run with core 16 - 16
Process [17/20] => run with core 17 - 17
Process [18/20] => run with core 18 - 18
Process [19/20] => run with core 19 - 19
Linear: (128, 768) => (128, 768): 1.879133 ms, 80.353507 Gflops
Linear: (128, 768) => (128, 3072): 7.797926 ms, 77.453900 Gflops
Linear: (128, 3072) => (128, 768): 9.332923 ms, 64.714961 Gflops
Linear: (128, 784) => (128, 784): 1.935849 ms, 81.283183 Gflops
Linear: (128, 784) => (128, 3088): 7.886846 ms, 78.583249 Gflops
Linear: (128, 3088) => (128, 784): 7.418364 ms, 83.545908 Gflops
```
mkl multi_instance
```
(pytorch-mingfei) [mingfeim@mlt-skx084 gemm]$ ./run_multi_instance.sh
Process [0/20] => run with core 0 - 0
Process [1/20] => run with core 1 - 1
Process [2/20] => run with core 2 - 2
Process [3/20] => run with core 3 - 3
Process [4/20] => run with core 4 - 4
Process [5/20] => run with core 5 - 5
Process [6/20] => run with core 6 - 6
Process [7/20] => run with core 7 - 7
Process [8/20] => run with core 8 - 8
Process [9/20] => run with core 9 - 9
Process [10/20] => run with core 10 - 10
Process [11/20] => run with core 11 - 11
Process [12/20] => run with core 12 - 12
Process [13/20] => run with core 13 - 13
Process [14/20] => run with core 14 - 14
Process [15/20] => run with core 15 - 15
Process [16/20] => run with core 16 - 16
Process [17/20] => run with core 17 - 17
Process [18/20] => run with core 18 - 18
Process [19/20] => run with core 19 - 19
Linear: (128, 768) => (128, 768): 1.699797 ms, 88.831161 Gflops
Linear: (128, 768) => (128, 3072): 7.188918 ms, 84.015391 Gflops
Linear: (128, 3072) => (128, 768): 7.282895 ms, 82.931275 Gflops
Linear: (128, 784) => (128, 784): 1.882290 ms, 83.596032 Gflops
Linear: (128, 784) => (128, 3088): 7.615179 ms, 81.386659 Gflops
Linear: (128, 3088) => (128, 784): 7.280856 ms, 85.123780 Gflops
```