Ma Mingfei mingfeima

## part_3_vectorization_techniques.md

      
mingfeima / part_3_vectorization_techniques.md
1 file, 2 forks, 0 comments, 7 stars
Last active: June 28, 2024 11:03
PyTorch CPU Performance Optimization Tutorial - Section III

Part III: Vectorization Techniques

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part II: Parallelization Techniques
Part IV: BFloat16 Kernel Optimization

Chinese version for this chapter, link.
This section contains the following subjects:

  
## part_2_parallelization_techniques.md

      
mingfeima / part_2_parallelization_techniques.md
1 file, 1 fork, 0 comments, 3 stars
Last active: June 28, 2024 11:03
PyTorch CPU Performance Optimization Tutorial - Section II

Part II: Parallelization Techniques

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part III: Vectorization Techniques
Part IV: BFloat16 Kernel Optimization

Chinese version for this chapter, link.
This section contains the following subjects:

  
## part_4_bfloat16_kernel_optimization.md

      
mingfeima / part_4_bfloat16_kernel_optimization.md
1 file, 1 fork, 0 comments, 1 star
Last active: June 28, 2024 11:03
PyTorch CPU Performance Optimization Tutorial - Section IV

Part IV: BFloat16 Kernel Optimization

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part II: Parallelization Techniques
Part III: Vectorization Techniques

Chinese version for this chapter, link.
This section contains the following subjects:

  
## dlrm_task_track.md

      
mingfeima / dlrm_task_track.md
1 file, 0 forks, 1 comment, 0 stars
Last active: April 16, 2021 05:04
DLRM Task

This Gist records the optimization effort for DLRM on the PyTorch CPU path.
Branch on track: dlrm
Task list:

 LAMB fused optimizer (fp32)
 Adagrad fused optimizer (fp32)
 Split-SGD (bf16)


## part_1_memory_format_and_channels_last_optimization.md

      
mingfeima / part_1_memory_format_and_channels_last_optimization.md
1 file, 3 forks, 0 comments, 16 stars
Last active: June 28, 2024 11:03
PyTorch CPU Performance Optimization Tutorial - Section I

Part I: Memory Formats and Channels Last Optimization

(Training material on PyTorch CPU performance optimization)

Part II: Parallelization Techniques
Part III: Vectorization Techniques
Part IV: BFloat16 Kernel Optimization

Chinese version for this chapter, link.

  
## pytorch_channels_last_perf_optimization.md

      
mingfeima / pytorch_channels_last_perf_optimization.md
1 file, 3 forks, 1 comment, 17 stars
Last active: September 1, 2023 03:02
PyTorch Channels Last memory format perf optimization and oneDNN integration plan.

PyTorch Channels Last Memory Format Performance Optimization on CPU Path

("mkldnn" has been renamed to "oneDNN", but existing PyTorch APIs still use "mkldnn"; future work will align the PyTorch user-level APIs with "oneDNN".)
Table of Contents


PyTorch Channels Last memory format introduction
oneDNN API for NHWC layout
Generic Channels Last memory format optimization with ATen native
oneDNN NHWC integration

NB: Memory format refers to the data representation that describes how multidimensional (nD) arrays are stored in a linear (1D) memory address space. Memory format has the same meaning as layout in oneDNN. Layout in PyTorch has a different meaning: it describes whether a tensor is dense or sparse, with attributes such as 'torch.strided' and 'torch.sparse_coo'.
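
To make the "nD array in 1D memory" idea concrete, here is a minimal pure-Python sketch of how the same logical (N, C, H, W) index maps to different linear offsets depending on the physical dimension order. The helper names `strides_for` and `offset` are hypothetical and used only for illustration; they are not PyTorch or oneDNN APIs.

```python
def strides_for(shape, order):
    """Strides (in elements) of a logical NCHW shape stored in `order`.

    `order` lists the dimensions from outermost to innermost,
    e.g. "NCHW" (contiguous) or "NHWC" (channels last).
    """
    sizes = dict(zip("NCHW", shape))
    strides, step = {}, 1
    for dim in reversed(order):  # innermost (fastest-varying) dimension first
        strides[dim] = step
        step *= sizes[dim]
    # Always report strides in logical N, C, H, W order.
    return tuple(strides[d] for d in "NCHW")


def offset(index, strides):
    """Linear offset of a logical (n, c, h, w) index for the given strides."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)  # N=2, C=3, H=4, W=5
nchw = strides_for(shape, "NCHW")
nhwc = strides_for(shape, "NHWC")
print("NCHW strides:", nchw)
print("NHWC strides:", nhwc)
# Moving one step along the channel dimension:
print(offset((0, 1, 0, 0), nchw), offset((0, 1, 0, 0), nhwc))
```

In the NHWC case the channel stride is 1, so all channels of one pixel sit next to each other in memory, whereas in NCHW they are a full H*W plane apart.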

  
## pytorch_performance_profiling.md

      
mingfeima / pytorch_performance_profiling.md
2 files, 10 forks, 3 comments, 47 stars
Last active: June 7, 2024 01:51
How to do performance profiling on PyTorch

(Internal Training Material)
Usually the first step in performance optimization is profiling, e.g. to identify the performance hotspots of a workload.
This gist covers the basics of performance profiling on PyTorch. It answers the following questions:

How to find the bottleneck operator?
How to trace source file of a particular operator?
How do I identify threading issues? (oversubscription)
How do I tell a specific operator is running efficiently or not?

This tutorial takes one of my recent projects, pssp-transformer, as an example to guide you through the path of PyTorch CPU performance optimization. The focus will be on Part 1 & Part 2.
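
As a rough illustration of the first question (finding the bottleneck operator), the sketch below profiles a tiny stand-in model with `torch.autograd.profiler` and prints the operators sorted by self CPU time. The model, the input shape, and the helper name `profile_top_ops` are assumptions made for this example; they are not taken from pssp-transformer. The import is guarded so the snippet degrades gracefully where PyTorch is not installed.

```python
import importlib.util


def profile_top_ops(row_limit=5):
    """Profile a toy model and return a table of the most expensive ops."""
    import torch
    from torch.autograd import profiler

    # Hypothetical stand-in workload; replace with the real model and input.
    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    x = torch.randn(32, 64)

    with torch.no_grad(), profiler.profile() as prof:
        for _ in range(10):
            model(x)

    # Aggregate events by operator name and sort by time spent in the op itself.
    return prof.key_averages().table(
        sort_by="self_cpu_time_total", row_limit=row_limit
    )


if importlib.util.find_spec("torch") is not None:
    import torch

    # The intra-op thread count is a useful first check for oversubscription.
    print("intra-op threads:", torch.get_num_threads())
    print(profile_top_ops())
```

The operators at the top of the table are the natural starting point for tracing back to source files and checking per-operator efficiency.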

  
## cat_perf_regression.md

      
mingfeima / cat_perf_regression.md
1 file, 0 forks, 0 comments, 0 stars
Last active: February 14, 2020 00:40
keep log of cat performance regression

Tracking #30806, the torch.cat() performance regression.
benchmark_all_test result, command line:
python -m benchmark_all_test --operators cat --tag_filter all

with commit 7b50e76255aebbbcdae702ee1f00d07d86b0112b

(pytorch-mingfei) [mingfeim@mlt-skx090 operator_benchmark]$ python -m benchmark_all_test --operators cat --tag_filter all


## pytorch_check_mkl_mkldnn.md

      
mingfeima / pytorch_check_mkl_mkldnn.md
1 file, 3 forks, 5 comments, 7 stars
Last active: July 8, 2022 06:09

BKMs to check whether mkl or mkldnn is enabled on PyTorch

PyTorch can be installed via different channels: conda, pip, docker, source code...
By default, mkl and mkl-dnn are enabled, but this might not always be true, so it is still useful to know how to check for yourself:
1. How to check whether mkl is enabled?

### check where your torch is installed
python -c 'import torch; print(torch.__path__)'
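
A complementary check can be done from Python itself. The sketch below is one possible approach, not necessarily the method this gist goes on to describe: it queries `torch.backends.mkl.is_available()` and `torch.backends.mkldnn.is_available()`. The helper name `backend_report` is hypothetical, and the import is guarded so the function still returns a result when PyTorch is not installed.

```python
import importlib.util


def backend_report():
    """Report whether torch is importable and whether mkl / mkldnn are built in."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": False}

    import torch

    return {
        "torch": True,
        "mkl": torch.backends.mkl.is_available(),
        "mkldnn": torch.backends.mkldnn.is_available(),
    }


print(backend_report())
```

For more detail, `print(torch.__config__.show())` dumps the build configuration string of the installed PyTorch binary.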

  
## pytorch_cpu_perf_bkm.md

      
mingfeima / pytorch_cpu_perf_bkm.md
1 file, 6 forks, 8 comments, 33 stars
Last active: June 28, 2024 11:04
BKM for PyTorch CPU Performance

General guidelines for CPU performance on PyTorch

This file serves as a BKM for getting better performance on CPU with PyTorch, focusing mostly on inference or deployment. Chinese version available here.
1. Use channels last memory format

On the PyTorch CPU path, you can currently choose among three memory formats.

torch.contiguous_format: the default memory format, also referred to as NCHW.
torch.channels_last: also referred to as NHWC.
torch._mkldnn: mkldnn blocked format.
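
As a minimal sketch of switching between the first two formats, the example below converts a 4D tensor to channels last and compares strides. The helper name `channels_last_strides` and the tensor shape are assumptions chosen for illustration; the import is guarded so the snippet does nothing where PyTorch is not installed.

```python
import importlib.util


def channels_last_strides(shape=(2, 3, 4, 5)):
    """Return strides before and after converting to channels last."""
    import torch

    x = torch.randn(*shape)                      # default: contiguous (NCHW)
    y = x.to(memory_format=torch.channels_last)  # same shape, NHWC in memory
    is_cl = y.is_contiguous(memory_format=torch.channels_last)
    return x.stride(), y.stride(), is_cl


if importlib.util.find_spec("torch") is not None:
    nchw_strides, nhwc_strides, is_cl = channels_last_strides()
    print("contiguous strides:   ", nchw_strides)
    print("channels_last strides:", nhwc_strides)
    print("is channels_last:     ", is_cl)
```

The logical shape stays (N, C, H, W) after the conversion; only the strides, and therefore the physical order in memory, change.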