Ma Mingfei mingfeima

## pytorch_perf_optimization_cpu.md

      
mingfeima / pytorch_perf_optimization_cpu.md, last active December 29, 2017 01:51

PyTorch Performance Optimization on CPU


pytorch mkldnn integration prototype design


mkldnn conv integration
conv3d parallelization: vol2col, col2vol
LSTM optimization non-fused: tanh/sigmoid parallelization


Create MKLDNN conda channel


MKLDNN tensor type


create lib/THMKL?


## [BKM] VTune.md

      
mingfeima / [BKM] VTune.md, last active May 22, 2019 01:42

Description: vtune tips

Hotspot analysis:
/opt/intel/vtune_amplifier/bin64/amplxe-cl -collect hotspots -knob analyze-openmp=true -knob sampling-interval=10 --resume-after 5 -d 20 \
  -- /home/mingfeim/pytorch/unit_tests/run.sh
/opt/intel/vtune_amplifier/bin64/amplxe-cl -archive -r $1
Interpret vtune log function names:
e.g.


## embedding_optimization.md

      
mingfeima / embedding_optimization.md, created May 27, 2019 01:28

Description: Recommendation

nn.Embedding()
ref: TensorFlow

  
## topk.md

      
mingfeima / topk.md, last active July 2, 2019 02:43

Description: topk_optimization_backups

backups for PR19736 of topk() performance optimization on CPU.

description

Suppose the input tensor has shape [N, C]. Measure the performance of input.topk(K, sorted=Sorted) for the following scenarios:

C = 10000, 40000, 320000
K = 10, 50, 100, C/10, C/2, C-5
Test with 20 threads and 1 thread
Test with Sorted=True and Sorted=False
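
The scenario grid above can be enumerated with a small script. This is only a sketch and not the benchmark used for the PR: the helper names (`k_values`, `scenarios`, `run`) are hypothetical, and since the notes do not give a value for N, the default `n=128` is a placeholder assumption.

```python
import itertools
import time

# Values copied from the scenario list above.
C_VALUES = [10000, 40000, 320000]
SORTED_VALUES = [True, False]
THREAD_VALUES = [20, 1]


def k_values(c):
    """K = 10, 50, 100, C/10, C/2, C-5 for a given C."""
    return [10, 50, 100, c // 10, c // 2, c - 5]


def scenarios():
    """Yield every (C, K, sorted, threads) combination from the list above."""
    for c, is_sorted, threads in itertools.product(
        C_VALUES, SORTED_VALUES, THREAD_VALUES
    ):
        for k in k_values(c):
            yield c, k, is_sorted, threads


def run(n=128, iters=10):
    """Time input.topk() for each scenario; requires PyTorch.

    n is a placeholder assumption, the notes do not state N.
    """
    import torch  # imported lazily so the grid helpers work without torch

    for c, k, is_sorted, threads in scenarios():
        torch.set_num_threads(threads)
        x = torch.randn(n, c)
        x.topk(k, sorted=is_sorted)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(iters):
            x.topk(k, sorted=is_sorted)
        avg_ms = (time.perf_counter() - start) / iters * 1e3
        print(f"C={c} K={k} sorted={is_sorted} threads={threads}: {avg_ms:.3f} ms")
```

Calling `run()` prints one averaged timing per scenario; the grid itself contains 3 x 2 x 2 x 6 = 72 combinations.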


## cat_perf_regression.md

      
mingfeima / cat_perf_regression.md, last active February 14, 2020 00:40

Description: keep log of cat performance regression

Tracking #30806, the torch.cat() performance regression.
benchmark_all_test result, command line:
python -m benchmark_all_test --operators cat --tag_filter all

with commit 7b50e76255aebbbcdae702ee1f00d07d86b0112b

(pytorch-mingfei) [mingfeim@mlt-skx090 operator_benchmark]$ python -m benchmark_all_test --operators cat --tag_filter all


## mkldnn_integration_plan.md

      
mingfeima / mkldnn_integration_plan.md, last active May 15, 2020 20:03

Description: mkldnn integration plan, RFC draft

MKL-DNN Integration Plan

The purpose is to further improve PyTorch CPU performance on both the imperative path and the JIT path.
MKLDNN needs to reorder memory from a plain layout to a blocked layout to achieve optimal performance on CPU, e.g. from nchw to nChw16c. At present, MKLDNN operators in PyTorch reuse the CPU tensor, which means each MKLDNN operator takes three steps to finish its computation:
input_reorder(plain_layout, blocked_layout)
mkldnn_computation()
output_reorder(blocked_layout, plain_layout)
These reorders take about 50% of the time on a typical ImageNet topology, e.g. ResNet50. In addition, MKLDNN chooses different blocked formats depending on the input configuration of the convolution. Because nn.Conv2d always outputs in plain layout, subsequent layers (BatchNorm, Pooling) execute only on plain layout, which is the slow path for MKLDNN. With these problems solved, CNN models would see a 3~4x speedup vs. current performance.
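
To make the cost of a reorder concrete, here is a minimal pure-Python sketch of the index mapping between plain nchw and blocked nChw16c. It is an illustration only, not MKLDNN's implementation: the function names are hypothetical, and it assumes the channel count is a multiple of 16, so the padding a real library would apply is not handled.

```python
BLOCK = 16  # channel block size implied by the "16c" suffix


def plain_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in plain nchw layout."""
    return ((n * C + c) * H + h) * W + w


def blocked_offset(n, c, h, w, C, H, W, block=BLOCK):
    """Flat offset in nChw16c, viewed as dims [N, C/block, H, W, block]."""
    cb = C // block  # assumption: C is a multiple of block (no padding)
    return (((n * cb + c // block) * H + h) * W + w) * block + c % block


def reorder(src, shape, to_blocked=True):
    """Copy a flat buffer between plain and blocked layouts.

    Every element is touched once, which is why doing this before and
    after each operator is expensive.
    """
    N, C, H, W = shape
    dst = [0] * len(src)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    p = plain_offset(n, c, h, w, C, H, W)
                    b = blocked_offset(n, c, h, w, C, H, W)
                    if to_blocked:
                        dst[b] = src[p]
                    else:
                        dst[p] = src[b]
    return dst
```

In the blocked buffer, 16 consecutive values hold 16 channels of the same pixel, which is the arrangement that a 16-lane vector instruction can load directly.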

  
## dlrm_task_track.md

      
mingfeima / dlrm_task_track.md, last active April 16, 2021 05:04

Description: DLRM Task

This Gist records optimization effort of DLRM on PyTorch CPU path.
Branch on track: dlrm
Task list:

 LAMB fused optimizer (fp32)
 Adagrad fused optimizer (fp32)
 Split-SGD (bf16)


## part_4_bfloat16_kernel_optimization.md

      
mingfeima / part_4_bfloat16_kernel_optimization.md, last active July 8, 2022 06:04

Description: PyTorch CPU Performance Optimization Tutorial - Section IV

Part IV: BFloat16 Kernel Optimization

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part II: Parallelization Techniques
Part III: Vectorization Techniques

Chinese version for this chapter, link.
This section contains the following subjects:

  
## pytorch_check_mkl_mkldnn.md

      
mingfeima / pytorch_check_mkl_mkldnn.md, last active July 8, 2022 06:09

BKMs to check whether mkl or mkldnn is enabled on PyTorch

PyTorch can be installed via different channels: conda, pip, docker, source code...
By default, mkl and mkl-dnn are enabled, but this might not always be true, so it is still useful to learn how to check this yourself:
1. How to check whether mkl is enabled?

### check where your torch is installed
python -c 'import torch; print(torch.__path__)'
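
As a complementary check from inside Python, recent PyTorch releases expose availability flags and the build configuration. The sketch below is not necessarily the procedure these notes go on to describe; `check_backends` is a hypothetical helper name, and it returns `None` when torch cannot be imported.

```python
def check_backends():
    """Return MKL / MKL-DNN availability, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return {
        # Same information as the one-liner above.
        "torch_path": list(torch.__path__),
        # True when this build of PyTorch was compiled with MKL.
        "mkl": torch.backends.mkl.is_available(),
        # True when this build of PyTorch was compiled with MKL-DNN.
        "mkldnn": torch.backends.mkldnn.is_available(),
        # Human-readable build summary (compiler, BLAS, build flags).
        "build_config": torch.__config__.show(),
    }


info = check_backends()
if info is None:
    print("torch is not installed in this environment")
else:
    print("mkl:", info["mkl"], "mkldnn:", info["mkldnn"])
```

The `build_config` string is long, so it is stored rather than printed here; printing it shows which BLAS and which build flags were used.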

  
## bert_optimization.md

      
mingfeima / bert_optimization.md, last active July 8, 2022 06:13

Description: BERT Optimization

benchmark

Performance is evaluated with the huggingface repo; the actual benchmark run script is placed at repo.
How to reproduce the performance:

Prepare the dataset according to link.
Update GLUE_DIR to the actual dataset path in run_inference.sh.
Change the env settings; the default setting uses 20 cores.

MKL v.s. MKLDNN

Inference performance result on Xeon 6148 (2x20 cores), single socket and single thread.