BERT Optimization

benchmark

Performance evaluation is based on the huggingface repo; the actual benchmark run script is placed in repo. To reproduce the performance numbers:

  1. Prepare the dataset according to link.
  2. Update GLUE_DIR in run_inference.sh to the actual dataset path.
  3. Change the env settings as needed; the default setting uses 20 cores.

MKL vs. MKLDNN

Inference performance results on Xeon 6148 (2x20 cores), measured on a single socket, in single-instance (20 threads) and multi-instance (1 thread per instance) modes.

  • MKL: version 2019.4 (conda install mkl mkl-include)
  • MKLDNN: proposed in 21851

single instance (20 threads)

  • MKL
>>> ./run_inference.sh
408/408 [00:24<00:00, 16.69it/s]
  • MKLDNN
>>> ./run_inference.sh --mkldnn
408/408 [00:18<00:00, 21.95it/s]

multi instance (1 thread per instance)

  • MKL
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
  • MKLDNN
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
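
A quick sanity check of how the multi-instance numbers relate (assuming 20 concurrent single-thread instances, one per physical core; values taken from the MKL run above):

avg_latency_s = 0.469058                      # average latency per example
num_instances = 20
total_iters = 1000

throughput = num_instances / avg_latency_s    # ~42.6 it/s across all instances
total_time = total_iters / throughput         # ~23.5 s for 1000 iterations
print("%.2f it/s, %.3f s" % (throughput, total_time))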

Impact of leading dimension padding

  • Skylake has special requirements on the leading dimension of GEMM: when LDA/LDB/LDC is a multiple of 128, it triggers a cache flush (conflict) issue, see ref.
  • The following table compares the performance of BERT (glue/MRPC) GEMMs on MKL and MKLDNN, at the original sizes and at padded sizes (+16).

Table-1: single socket test result (20 threads)

size (original)         MKL      MKLDNN    size (padded)           MKL      MKLDNN
N=128, I=768,  O=768    818.57   417.03    N=128, I=784,  O=784    1246.08  1282.33
N=128, I=768,  O=3072   1369.88  1818.96   N=128, I=784,  O=3088   1908.46  1931.12
N=128, I=3072, O=768    676.20   1262.61   N=128, I=3088, O=784    1768.28  1658.30

unit: Gflops
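
A worked example of how these Gflops values are derived (same formula as in test_linear.py below; the 0.184461 ms timing comes from the MKL raw log attached in the comments, and the factor of 2 counts the multiply and add per GEMM element):

flops = 2 * 128 * 768 * 768          # ~151.0 MFLOP for the N=128, I=768, O=768 Linear
gflops = flops / 0.184461e-3 / 1e9   # ~818.6 Gflops, matching the first MKL entry in Table-1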

  • Use the following script to reproduce this result:

run.sh:

#!/bin/sh
# usage: ./run.sh <num_threads> <python_script>
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

# pin the OMP threads and keep memory on socket 0
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script

test_linear.py:

import torch
import torch.nn as nn
from time import time

warmups = 1000
iters = 10000

def test_linear(batch_size, input_channel, output_channel):
    input = torch.randn(batch_size, input_channel)
    linear = nn.Linear(input_channel, output_channel)

    # warm up to exclude one-time overheads
    for i in range(warmups):
        output = linear(input)

    t1 = time()
    for i in range(iters):
        output = linear(input)
    t2 = time()
    tt = (t2-t1)/iters

    print("### Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
            % (batch_size, input_channel, batch_size, output_channel,
              tt*1000, 2* batch_size*input_channel*output_channel/tt/1e9))

test_linear(128, 768, 768)
test_linear(128, 768, 3072)
test_linear(128, 3072, 768)
test_linear(128, 768+16, 768+16)
test_linear(128, 768+16, 3072+16)
test_linear(128, 3072+16, 768+16)

To run on a single socket with 20 OMP threads:

./run.sh 20 test_linear.py
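
Related to the padding results in Table-1, a hedged sketch of how an existing Linear layer could be zero-padded to the +16 sizes without changing its outputs (pad_linear is illustrative only and not part of the benchmark scripts):

import torch
import torch.nn as nn
import torch.nn.functional as F

def pad_linear(linear, pad=16):
    # build a larger Linear whose in/out features are not multiples of 128;
    # the extra weight rows/columns and bias entries are zero, so the first
    # out_features output channels match the original layer
    in_f, out_f = linear.in_features, linear.out_features
    padded = nn.Linear(in_f + pad, out_f + pad)
    with torch.no_grad():
        padded.weight.zero_()
        padded.bias.zero_()
        padded.weight[:out_f, :in_f].copy_(linear.weight)
        padded.bias[:out_f].copy_(linear.bias)
    return padded

linear = nn.Linear(768, 768)
padded = pad_linear(linear)
x = torch.randn(128, 768)
y_ref = linear(x)
y_pad = padded(F.pad(x, (0, 16)))[:, :768]   # matches y_ref up to float error
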
mingfeima commented Jun 12, 2019

Attaching raw logs (note: all Gflops numbers below need to be multiplied by 2):

  • mkl
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 20 test_linear.py
using 20 OMP threads
bind cores to 0~19
### Linear: (128, 768) => (128, 768): 0.184461 ms, 409.286541 Gflops
### Linear: (128, 768) => (128, 3072): 0.440899 ms, 684.940432 Gflops
### Linear: (128, 3072) => (128, 768): 0.893200 ms, 338.098965 Gflops
### Linear: (128, 784) => (128, 784): 0.126277 ms, 623.041711 Gflops
### Linear: (128, 784) => (128, 3088): 0.324750 ms, 954.231869 Gflops
### Linear: (128, 3088) => (128, 784): 0.350496 ms, 884.138455 Gflops
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 1 test_linear.py
using 1 OMP threads
bind cores to 0~0
### Linear: (128, 768) => (128, 768): 0.920864 ms, 81.985440 Gflops
### Linear: (128, 768) => (128, 3072): 3.543361 ms, 85.226953 Gflops
### Linear: (128, 3072) => (128, 768): 3.875439 ms, 77.924046 Gflops
### Linear: (128, 784) => (128, 784): 0.973221 ms, 80.840824 Gflops
### Linear: (128, 784) => (128, 3088): 3.605958 ms, 85.937485 Gflops
### Linear: (128, 3088) => (128, 784): 3.609848 ms, 85.844887 Gflops
  • mkldnn
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 20 test_linear.py
using 20 OMP threads
bind cores to 0~19
### Linear: (128, 768) => (128, 768): 0.362074 ms, 208.513726 Gflops
### Linear: (128, 768) => (128, 3072): 0.332046 ms, 909.481931 Gflops
### Linear: (128, 3072) => (128, 768): 0.478356 ms, 631.307384 Gflops
### Linear: (128, 784) => (128, 784): 0.122708 ms, 641.164074 Gflops
### Linear: (128, 784) => (128, 3088): 0.320939 ms, 965.562278 Gflops
### Linear: (128, 3088) => (128, 784): 0.373740 ms, 829.151734 Gflops
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 1 test_linear.py
using 1 OMP threads
bind cores to 0~0
### Linear: (128, 768) => (128, 768): 1.077302 ms, 70.080142 Gflops
### Linear: (128, 768) => (128, 3072): 4.235816 ms, 71.294389 Gflops
### Linear: (128, 3072) => (128, 768): 5.813250 ms, 71.294389 Gflops
### Linear: (128, 784) => (128, 784): 1.034091 ms, 71.294389 Gflops
### Linear: (128, 784) => (128, 3088): 4.010942 ms, 77.260391 Gflops
### Linear: (128, 3088) => (128, 784): 3.925236 ms, 78.947353 Gflops


mingfeima commented Jun 12, 2019

TODOs:

  1. Root-cause why the Transformer gets lower performance.
  2. Update multi-process performance (no weight sharing) on a single socket. Run with bash &? A bit clumsy...
  3. Rewrite MkldnnLinear so that it accepts plain-layout tensors in place (see the sketch after this list).
  4. (tbd) Try eular conv2d (kernel=1) to replace GEMM? Make it an extension...
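
For context on TODO item 3, a rough sketch of the current MKLDNN path, where both the module and the input tensor have to be converted to MKLDNN layout explicitly (assuming the torch.utils.mkldnn helpers; the exact API may differ in the PyTorch build used here):

import torch
import torch.nn as nn
from torch.utils import mkldnn as mkldnn_utils

linear = nn.Linear(768, 768).eval()
mkldnn_linear = mkldnn_utils.to_mkldnn(linear)    # weights prepacked into MKLDNN layout

x = torch.randn(128, 768)
y = mkldnn_linear(x.to_mkldnn()).to_dense()       # explicit layout conversion in and out
# item 3 above would let MkldnnLinear consume the plain (dense) tensor directly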

@mingfeima

Attaching multi-instance run raw logs and the script:

script

#!/bin/bash

# number of physical cores per socket
CORES=`lscpu | grep Core | awk '{print $4}'`

corePerInstance=1
numInstance=$CORES

export OMP_NUM_THREADS=${corePerInstance}
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0

GLUE_DIR=./dataset/glue_data
TASK_NAME=MRPC

# launch one instance per core, each pinned to its own core
for ((i=0; i<${numInstance}; i++))
do
    startCore=$((i * corePerInstance))
    endCore=$((startCore + corePerInstance - 1))

    echo "# Process [${i}/${numInstance}] => run with core ${startCore} - ${endCore}"
    ## only show the last instance's output
    if [ ${i} -eq $((numInstance - 1)) ]
    then
        taskset -c ${startCore}-${endCore} numactl -l python test_linear.py
    else
        taskset -c ${startCore}-${endCore} numactl -l python test_linear.py > /dev/null &
    fi
done

mkldnn multi_instance

(pytorch-mingfei) [mingfeim@mlt-skx084 gemm]$ ./run_multi_instance.sh
Process [0/20] => run with core 0 - 0
Process [1/20] => run with core 1 - 1
Process [2/20] => run with core 2 - 2
Process [3/20] => run with core 3 - 3
Process [4/20] => run with core 4 - 4
Process [5/20] => run with core 5 - 5
Process [6/20] => run with core 6 - 6
Process [7/20] => run with core 7 - 7
Process [8/20] => run with core 8 - 8
Process [9/20] => run with core 9 - 9
Process [10/20] => run with core 10 - 10
Process [11/20] => run with core 11 - 11
Process [12/20] => run with core 12 - 12
Process [13/20] => run with core 13 - 13
Process [14/20] => run with core 14 - 14
Process [15/20] => run with core 15 - 15
Process [16/20] => run with core 16 - 16
Process [17/20] => run with core 17 - 17
Process [18/20] => run with core 18 - 18
Process [19/20] => run with core 19 - 19
Linear: (128, 768) => (128, 768): 1.879133 ms, 80.353507 Gflops
Linear: (128, 768) => (128, 3072): 7.797926 ms, 77.453900 Gflops
Linear: (128, 3072) => (128, 768): 9.332923 ms, 64.714961 Gflops
Linear: (128, 784) => (128, 784): 1.935849 ms, 81.283183 Gflops
Linear: (128, 784) => (128, 3088): 7.886846 ms, 78.583249 Gflops
Linear: (128, 3088) => (128, 784): 7.418364 ms, 83.545908 Gflops

mkl multi_instance

(pytorch-mingfei) [mingfeim@mlt-skx084 gemm]$ ./run_multi_instance.sh
Process [0/20] => run with core 0 - 0
Process [1/20] => run with core 1 - 1
Process [2/20] => run with core 2 - 2
Process [3/20] => run with core 3 - 3
Process [4/20] => run with core 4 - 4
Process [5/20] => run with core 5 - 5
Process [6/20] => run with core 6 - 6
Process [7/20] => run with core 7 - 7
Process [8/20] => run with core 8 - 8
Process [9/20] => run with core 9 - 9
Process [10/20] => run with core 10 - 10
Process [11/20] => run with core 11 - 11
Process [12/20] => run with core 12 - 12
Process [13/20] => run with core 13 - 13
Process [14/20] => run with core 14 - 14
Process [15/20] => run with core 15 - 15
Process [16/20] => run with core 16 - 16
Process [17/20] => run with core 17 - 17
Process [18/20] => run with core 18 - 18
Process [19/20] => run with core 19 - 19
Linear: (128, 768) => (128, 768): 1.699797 ms, 88.831161 Gflops
Linear: (128, 768) => (128, 3072): 7.188918 ms, 84.015391 Gflops
Linear: (128, 3072) => (128, 768): 7.282895 ms, 82.931275 Gflops
Linear: (128, 784) => (128, 784): 1.882290 ms, 83.596032 Gflops
Linear: (128, 784) => (128, 3088): 7.615179 ms, 81.386659 Gflops
Linear: (128, 3088) => (128, 784): 7.280856 ms, 85.123780 Gflops
