BERT Optimization

benchmark

Performance evaluation is based on the huggingface repo; the actual benchmark run script is placed in repo. To reproduce the performance numbers:

  1. Prepare the dataset according to link.
  2. Update GLUE_DIR in run_inference.sh to the actual dataset path.
  3. Change the env settings as needed; the default setting uses 20 cores.

MKL vs. MKLDNN

Inference performance results on Xeon 6148 (2x20 cores), measured on a single socket, in single-instance (20 threads) and multi-instance (1 thread per instance) modes.

  • MKL: version 2019.4 (conda install mkl mkl-include)
  • MKLDNN: proposed in 21851

single instance (20 threads)

  • MKL
>>> ./run_inference.sh
408/408 [00:24<00:00, 16.69it/s]
  • MKLDNN
>>> ./run_inference.sh --mkldnn
408/408 [00:18<00:00, 21.95it/s]

multi instance (1 thread per instance)

  • MKL
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
  • MKLDNN
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
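
A quick sanity check of how the multi-instance numbers relate (assuming 20 concurrent single-thread instances, one per physical core; values taken from the MKL run above):

avg_latency_s = 0.469058                      # average latency per example
num_instances = 20
total_iters = 1000

throughput = num_instances / avg_latency_s    # ~42.6 it/s across all instances
total_time = total_iters / throughput         # ~23.5 s for 1000 iterations
print("%.2f it/s, %.3f s" % (throughput, total_time))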

Impact of leading dimension padding

  • Skylake has special requirements on the leading dimension of GEMM: when LDA/LDB/LDC is a multiple of 128, it triggers a cache flush (conflict) issue, see ref.
  • The following table compares the performance of BERT (glue/MRPC) GEMMs on MKL and MKLDNN, at the original sizes and at padded sizes (+16).

Table-1: single socket test result (20 threads)

size (original)         MKL      MKLDNN    size (padded)           MKL      MKLDNN
N=128, I=768,  O=768    818.57   417.03    N=128, I=784,  O=784    1246.08  1282.33
N=128, I=768,  O=3072   1369.88  1818.96   N=128, I=784,  O=3088   1908.46  1931.12
N=128, I=3072, O=768    676.20   1262.61   N=128, I=3088, O=784    1768.28  1658.30

unit: Gflops
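
A worked example of how these Gflops values are derived (same formula as in test_linear.py below; the 0.184461 ms timing comes from the MKL raw log attached in the comments, and the factor of 2 counts the multiply and add per GEMM element):

flops = 2 * 128 * 768 * 768          # ~151.0 MFLOP for the N=128, I=768, O=768 Linear
gflops = flops / 0.184461e-3 / 1e9   # ~818.6 Gflops, matching the first MKL entry in Table-1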

  • Use the following script to reproduce this result:

run.sh:

#!/bin/sh
# usage: ./run.sh <num_threads> <python_script>
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

# pin the OMP threads and keep memory on socket 0
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script

test_linear.py:

import torch
import torch.nn as nn
from time import time

warmups = 1000
iters = 10000

def test_linear(batch_size, input_channel, output_channel):
    input = torch.randn(batch_size, input_channel)
    linear = nn.Linear(input_channel, output_channel)

    # warm up to exclude one-time overheads
    for i in range(warmups):
        output = linear(input)

    t1 = time()
    for i in range(iters):
        output = linear(input)
    t2 = time()
    tt = (t2-t1)/iters

    print("### Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
            % (batch_size, input_channel, batch_size, output_channel,
              tt*1000, 2* batch_size*input_channel*output_channel/tt/1e9))

test_linear(128, 768, 768)
test_linear(128, 768, 3072)
test_linear(128, 3072, 768)
test_linear(128, 768+16, 768+16)
test_linear(128, 768+16, 3072+16)
test_linear(128, 3072+16, 768+16)

To run on a single socket with 20 OMP threads:

./run.sh 20 test_linear.py
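
Related to the padding results in Table-1, a hedged sketch of how an existing Linear layer could be zero-padded to the +16 sizes without changing its outputs (pad_linear is illustrative only and not part of the benchmark scripts):

import torch
import torch.nn as nn
import torch.nn.functional as F

def pad_linear(linear, pad=16):
    # build a larger Linear whose in/out features are not multiples of 128;
    # the extra weight rows/columns and bias entries are zero, so the first
    # out_features output channels match the original layer
    in_f, out_f = linear.in_features, linear.out_features
    padded = nn.Linear(in_f + pad, out_f + pad)
    with torch.no_grad():
        padded.weight.zero_()
        padded.bias.zero_()
        padded.weight[:out_f, :in_f].copy_(linear.weight)
        padded.bias[:out_f].copy_(linear.bias)
    return padded

linear = nn.Linear(768, 768)
padded = pad_linear(linear)
x = torch.randn(128, 768)
y_ref = linear(x)
y_pad = padded(F.pad(x, (0, 16)))[:, :768]   # matches y_ref up to float error
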
mingfeima commented Jun 12, 2019

Attaching raw logs (note: all Gflops numbers below need to be multiplied by 2):

  • mkl
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 20 test_linear.py
using 20 OMP threads
bind cores to 0~19
### Linear: (128, 768) => (128, 768): 0.184461 ms, 409.286541 Gflops
### Linear: (128, 768) => (128, 3072): 0.440899 ms, 684.940432 Gflops
### Linear: (128, 3072) => (128, 768): 0.893200 ms, 338.098965 Gflops
### Linear: (128, 784) => (128, 784): 0.126277 ms, 623.041711 Gflops
### Linear: (128, 784) => (128, 3088): 0.324750 ms, 954.231869 Gflops
### Linear: (128, 3088) => (128, 784): 0.350496 ms, 884.138455 Gflops
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 1 test_linear.py
using 1 OMP threads
bind cores to 0~0
### Linear: (128, 768) => (128, 768): 0.920864 ms, 81.985440 Gflops
### Linear: (128, 768) => (128, 3072): 3.543361 ms, 85.226953 Gflops
### Linear: (128, 3072) => (128, 768): 3.875439 ms, 77.924046 Gflops
### Linear: (128, 784) => (128, 784): 0.973221 ms, 80.840824 Gflops
### Linear: (128, 784) => (128, 3088): 3.605958 ms, 85.937485 Gflops
### Linear: (128, 3088) => (128, 784): 3.609848 ms, 85.844887 Gflops
  • mkldnn
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 20 test_linear.py
using 20 OMP threads
bind cores to 0~19
### Linear: (128, 768) => (128, 768): 0.362074 ms, 208.513726 Gflops
### Linear: (128, 768) => (128, 3072): 0.332046 ms, 909.481931 Gflops
### Linear: (128, 3072) => (128, 768): 0.478356 ms, 631.307384 Gflops
### Linear: (128, 784) => (128, 784): 0.122708 ms, 641.164074 Gflops
### Linear: (128, 784) => (128, 3088): 0.320939 ms, 965.562278 Gflops
### Linear: (128, 3088) => (128, 784): 0.373740 ms, 829.151734 Gflops
(pytorch-mingfei) [mingfeim@mlt-skx059 gemm]$ ./run.sh 1 test_linear.py
using 1 OMP threads
bind cores to 0~0
### Linear: (128, 768) => (128, 768): 1.077302 ms, 70.080142 Gflops
### Linear: (128, 768) => (128, 3072): 4.235816 ms, 71.294389 Gflops
### Linear: (128, 3072) => (128, 768): 5.813250 ms, 71.294389 Gflops
### Linear: (128, 784) => (128, 784): 1.034091 ms, 71.294389 Gflops
### Linear: (128, 784) => (128, 3088): 4.010942 ms, 77.260391 Gflops
### Linear: (128, 3088) => (128, 784): 3.925236 ms, 78.947353 Gflops


mingfeima commented Jun 12, 2019

TODOs:

  1. Root-cause why the Transformer gets lower performance.
  2. Update multi-process performance (no weight sharing) on a single socket. Run with bash &? A bit clumsy...
  3. Rewrite MkldnnLinear so that it accepts plain-layout tensors in place (see the sketch after this list).
  4. (tbd) Try eular conv2d (kernel=1) to replace GEMM? Make it an extension...
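
For context on TODO item 3, a rough sketch of the current MKLDNN path, where both the module and the input tensor have to be converted to MKLDNN layout explicitly (assuming the torch.utils.mkldnn helpers; the exact API may differ in the PyTorch build used here):

import torch
import torch.nn as nn
from torch.utils import mkldnn as mkldnn_utils

linear = nn.Linear(768, 768).eval()
mkldnn_linear = mkldnn_utils.to_mkldnn(linear)    # weights prepacked into MKLDNN layout

x = torch.randn(128, 768)
y = mkldnn_linear(x.to_mkldnn()).to_dense()       # explicit layout conversion in and out
# item 3 above would let MkldnnLinear consume the plain (dense) tensor directly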

@mingfeima

Attaching multi-instance run raw logs and the script:

script

#!/bin/bash

# number of physical cores per socket
CORES=`lscpu | grep Core | awk '{print $4}'`

corePerInstance=1
numInstance=$CORES

export OMP_NUM_THREADS=${corePerInstance}
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0

GLUE_DIR=./dataset/glue_data
TASK_NAME=MRPC

# launch one instance per core, each pinned to its own core
for ((i=0; i<${numInstance}; i++))
do
    startCore=$((i * corePerInstance))
    endCore=$((startCore + corePerInstance - 1))

    echo "# Process [${i}/${numInstance}] => run with core ${startCore} - ${endCore}"
    ## only show the last instance's output
    if [ ${i} -eq $((numInstance - 1)) ]
    then
        taskset -c ${startCore}-${endCore} numactl -l python test_linear.py
    else
        taskset -c ${startCore}-${endCore} numactl -l python test_linear.py > /dev/null &
    fi
done

mkldnn multi_instance

(pytorch-mingfei) [mingfeim@mlt-skx084 gemm]$ ./run_multi_instance.sh
Process [0/20] => run with core 0 - 0
Process [1/20] => run with core 1 - 1
Process [2/20] => run with core 2 - 2
Process [3/20] => run with core 3 - 3
Process [4/20] => run with core 4 - 4
Process [5/20] => run with core 5 - 5
Process [6/20] => run with core 6 - 6
Process [7/20] => run with core 7 - 7
Process [8/20] => run with core 8 - 8
Process [9/20] => run with core 9 - 9
Process [10/20] => run with core 10 - 10
Process [11/20] => run with core 11 - 11
Process [12/20] => run with core 12 - 12
Process [13/20] => run with core 13 - 13
Process [14/20] => run with core 14 - 14
Process [15/20] => run with core 15 - 15
Process [16/20] => run with core 16 - 16
Process [17/20] => run with core 17 - 17
Process [18/20] => run with core 18 - 18
Process [19/20] => run with core 19 - 19
Linear: (128, 768) => (128, 768): 1.879133 ms, 80.353507 Gflops
Linear: (128, 768) => (128, 3072): 7.797926 ms, 77.453900 Gflops
Linear: (128, 3072) => (128, 768): 9.332923 ms, 64.714961 Gflops
Linear: (128, 784) => (128, 784): 1.935849 ms, 81.283183 Gflops
Linear: (128, 784) => (128, 3088): 7.886846 ms, 78.583249 Gflops
Linear: (128, 3088) => (128, 784): 7.418364 ms, 83.545908 Gflops

mkl multi_instance

(pytorch-mingfei) [mingfeim@mlt-skx084 gemm]$ ./run_multi_instance.sh
Process [0/20] => run with core 0 - 0
Process [1/20] => run with core 1 - 1
Process [2/20] => run with core 2 - 2
Process [3/20] => run with core 3 - 3
Process [4/20] => run with core 4 - 4
Process [5/20] => run with core 5 - 5
Process [6/20] => run with core 6 - 6
Process [7/20] => run with core 7 - 7
Process [8/20] => run with core 8 - 8
Process [9/20] => run with core 9 - 9
Process [10/20] => run with core 10 - 10
Process [11/20] => run with core 11 - 11
Process [12/20] => run with core 12 - 12
Process [13/20] => run with core 13 - 13
Process [14/20] => run with core 14 - 14
Process [15/20] => run with core 15 - 15
Process [16/20] => run with core 16 - 16
Process [17/20] => run with core 17 - 17
Process [18/20] => run with core 18 - 18
Process [19/20] => run with core 19 - 19
Linear: (128, 768) => (128, 768): 1.699797 ms, 88.831161 Gflops
Linear: (128, 768) => (128, 3072): 7.188918 ms, 84.015391 Gflops
Linear: (128, 3072) => (128, 768): 7.282895 ms, 82.931275 Gflops
Linear: (128, 784) => (128, 784): 1.882290 ms, 83.596032 Gflops
Linear: (128, 784) => (128, 3088): 7.615179 ms, 81.386659 Gflops
Linear: (128, 3088) => (128, 784): 7.280856 ms, 85.123780 Gflops
