Skip to content

Instantly share code, notes, and snippets.

@victoryang00
Last active May 8, 2023 17:39
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save victoryang00/cd6324acffc5ac79464d8409f768656a to your computer and use it in GitHub Desktop.
Save victoryang00/cd6324acffc5ac79464d8409f768656a to your computer and use it in GitHub Desktop.
[CAS-PRA sgemm optimization] For strength growth #牛逼

Why sgemm

Numerical linear algebra is a basic calculation module in scientific computing. The calculation of solving linear equations, linear least squares problems, eigenvalues ​​and singular values ​​is the most computationally intensive part of scientific computing. With the advent of numerical programming, it is very effective to use complex subroutine libraries to solve such problems. When we write program code that involves linear algebra operations, we usually decompose calculations into basic subroutines such as point multiplication or matrix vector multiplication. As a result, structured programming came into being, specifying basic building blocks and using unique mnemonic names to identify these operations, and in order to improve the efficiency of the algorithmic use of these algebra programs, the names and parameter lists of some of the basic operations were uniformly planned .

From 1973 to 1977, the first "level" Basic Linear Algebra Subprograms (BLAS) identified some kernel operations, mainly Fortran specifications and implementations of subroutines for scalar and vector operations [1]. With the advent of vectors, hierarchical storage, and parallel shared memory machines, 1984-1988 successively improved the second "level" BLAS of matrix vector operations and the third "level" BLAS [2,3] of operations between matrices. Specification. The three "levels" of BLAS are not only the division of its development process, but also a measure of the complexity of the algorithm [4]. In order to further develop the BLAS library, a BLAS technical forum meeting was started at the University of Tennessee symposium in 1995 to discuss the overall functions of BLAS, sparse BLAS, dense BLAS with distributed memory, extended BLAS calculation accuracy, and mixed BLAS calculation accuracy, interval BLAS and extensions to existing BLAS.

With the continuous development of the BLAS benchmark library, it has been able to be applied to many hardware platforms and serves programs related to numerical calculation in various industries. Among them, GeneralMatrix-matrixMultiplication (GEMM) is scientific computing (high-performance computing, machine learning). For basic operations in engineering and data applications, people have been aiming for different computing platforms, trying to find corresponding optimization methods to make their calculations faster.

Problem description

For decades, General Matrix-Matrix Multiplication (GEMM) has been the standard benchmark for computing performance. GEMM is the most commonly used type of computing model in high-performance computing. Whether in the HPC field, such as FFT, convolution, correlation, filtering, etc., or in the field of DeepLearning, such as convolution layers, fully connected layers, etc., its core algorithms can be directly or indirectly converted into matrix multiplication operations. The GEMM calculation formula is as follows:

$C \leftarrow \alpha \ op(A)\ op(B) + \beta \ C$

Among them, $op(X)$ represents matrix X, or transposed $X^{T}$ of matrix X, or conjugate transposed $X^{H}$ of matrix X, α and β are scalars, matrix A, m rows and k columns, matrix B, k rows and n columns, Matrix C, m rows and n columns.

There are two possible combinations of different numerical types and precisions in mixed precision:

  1. All scalar parameters and output parameters (scalar or array) are double precision, and at least one array is single precision. Then the type of combination is as follows: (S = Singlereal, D = Doublereal, C = Singlecomplex, Z = Doublecomplex)
$\alpha$ A B $\beta$ C
D S S D D
D S D D D
D D S D D
Z C C Z Z
Z C Z Z Z
Z Z C Z Z
  1. The precision of all floating-point parameters must be all single precision or all double precision. All scalar parameters and output parameters (scalars or arrays) are complex numbers (unless due to mathematical calculations, all scalar parameters must be real numbers, such as the sum in HERK). The types of combination are as follows:
$\alpha$ A B $\beta$ C
C S S C C
C S C C C
C C S C C
Z D D Z Z
Z D Z Z Z
Z Z D Z Z

BLAS implementations are usually optimized for calculation speed for specific machines, so using them can bring significant performance advantages. This competition focuses on the calculation performance of the single-precision real number matrix (SGEMM) on domestic advanced computing platforms. Players can refer to the rocBLAS function library [5, 6] for understanding of the relevant content. The API functions implemented by BatchSgemm in the rocblas function library For: rocblas_sgemm_strided_batched.

As the amount of data continues to increase, the size of matrices that need to be calculated in numerical calculations increases. People have proposed to use multiple batches to accelerate matrix multiplication calculations, because multi-batch matrix multiplication can better utilize the computing resources of hardware calculation accelerators. The sub-matrix in each batch in the calculation has a stride address offset and has the same size. Calculated as follows:

C[i* stride_c]←αop(A[i stride_a])* op(B[i* stride_b])+ βC[i stride_c] i ∈ [0,batch count - 1]

In order to further improve the calculation efficiency of the matrix, batch and strided calculation strategies are introduced on the basis of the original matrix multiplication. In order to make full use of the performance advantages of the GPU-like heterogeneous accelerator used in the cluster in this calculation method, the function implementation needs to be further optimized.

Test Methods

The example function to implement stridedbatchedmatrix-matrixoperations is as follows. In order to facilitate the better optimization of the players, the function and its parameters are explained as follows:

sgemm strided batched(  sgemm operation trans a,
                        sgemm operation trans b,
                        intm,
                        intn,
                        int k,
                        const float* alpha
                        const float A,
                        int lda,
                        int stride a,
                        const float* B
                        int ldb,
                        int stride b,
                        const float* beta,
                        float C,
                        int ldc,
                        int stride C 
                        int batch count)
typedef enum sgemm operation_ {
        operation none = 0,
        operation transpose = 1,
        operation conjugate transpose = 2
} sgemm operation;

Input parameters:

Parameter trans_a: sgemm_operation type. Details the format of op (A) used in matrix multiplication

If trans_a = operation_none, then op (A) = A;

If trans_a = operation_transpose, then op (A) = A ';

If trans_a = operation_conjugate_transpose, then op (A) = conjg (A ').

Parameter trans_b: sgemm_operation type. The definition is the same as trans_a;

Parameter m: the number of rows of matrix A, m> 0;

Parameter n: the number of columns in matrix B, n> 0;

Parameter k: the number of columns of matrix A and the number of rows of matrix B, K> 0;

Parameter alpha: is a single-precision real number, representing the scalar coefficient of matrix A;

Parameter A: indicates that the matrix stored in the form of a pointer on the GPU is a single-precision real number matrix A;

Parameter Ida: refers to the size of the first dimension of matrix A when it is actually stored, that is, if the matrix is stored first by row, then Ida K; if it is stored first by column, Ida ≯ M

Parameter stride_a: represents the span from the A matrix to the next matrix;

Parameter B: indicates that the matrix stored in the form of a pointer on the GPU is a single-precision real number matrix B;

Parameter Idb: refers to the size of the first dimension of matrix B in actual storage, and the meaning of details is the same as lda; stride_b represents the span from the start of matrix B to the next matrix;

Parameter beta: a single-precision real number representing the scalar coefficient of matrix C. If beta = 0, you do not need to define matrix C;

Parameter C: indicates that the matrix C stores the elements in the form of pointers as single-precision real numbers;

Parameter Idc: refers to the size of the first dimension of the matrix C in actual storage, with the same details as lda;

Parameter stride_c: represents the span from the start of the C matrix to the next matrix;

Parameter batch_count: indicates the number of sgemm operations in the same batch.

Output parameters

Parameter C: C matrix covering the input.

Claim:

  1. The player performs matrix multiplication performance optimization based on the given interface function. In a given test code, the API interface function cannot be changed. The non-fixed parameter players involved can be tuned by themselves according to the matrix size involved in the calculation;

  2. In the test sample section, the code gives a pseudo-random number generated dense matrix as test data to verify the performance of the algorithm. The main test case sizes are the following three:

M N K Batch
1 64 64 32 30
2 128 128 64 20
3 128 512 256 10
  1. The competitor needs to submit the function to implement part of the code, the compilation method of the function library, and the generated dynamic link library, test samples, test process, and the encrypted sequence of the running results.

Note: The contestants use the test script provided by the contest group as the basis to improve the function implementation part. To avoid affecting the contestants' performance, please compile and execute the time-encrypted serial code generated by calculating the corresponding matrix to the designated location on the web page.

Reference

[1] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN usage. ACM Trans.Math.Software,5:308-323,1979. [2] J. J. Dongarra, J. Du Croz, I. S. Duff, and D. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math.Software,16:1-28,1990. [3] J. J. Dongarra, J. Du Croz, D. Hammarling, R. J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 14:1-32, 399, 1988. [4] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

What's new in CAS-PRA

The Device is controlled through slurm management software which requires sbatch scripts to run.

The device we can use

srun: job 1504488 queued and waiting for resources
srun: job 1504488 has been allocated resources
sh: lsmod: command not found
ROCk module is NOT loaded, possibly no GPU devices
Failed to get user name to check for video group membership
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Hygon C86 7185 32-core Processor   
  Marketing Name:          Hygon C86 7185 32-core Processor   
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32694044(0x1f2df1c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32694044(0x1f2df1c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    Hygon C86 7185 32-core Processor   
  Marketing Name:          Hygon C86 7185 32-core Processor   
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33008704(0x1f7ac40) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33008704(0x1f7ac40) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 3                  
*******                  
  Name:                    Hygon C86 7185 32-core Processor   
  Marketing Name:          Hygon C86 7185 32-core Processor   
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        2                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33024496(0x1f7e9f0) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33024496(0x1f7e9f0) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 4                  
*******                  
  Name:                    Hygon C86 7185 32-core Processor   
  Marketing Name:          Hygon C86 7185 32-core Processor   
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        3                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33024028(0x1f7e81c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33024028(0x1f7e81c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 5                  
*******                  
  Name:                    gfx906                             
  Marketing Name:          Device 66a1                        
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    4                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26273(0x66a1)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   1024                               
  Internal Node ID:        4                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 6                  
*******                  
  Name:                    gfx906                             
  Marketing Name:          Device 66a1                        
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    5                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26273(0x66a1)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   9728                               
  Internal Node ID:        5                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 7                  
*******                  
  Name:                    gfx906                             
  Marketing Name:          Device 66a1                        
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    6                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26273(0x66a1)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   17152                              
  Internal Node ID:        6                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 8                  
*******                  
  Name:                    gfx906                             
  Marketing Name:          Device 66a1                        
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    7                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26273(0x66a1)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   25344                              
  Internal Node ID:        7                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***         ```bash

The result of our implementation

rocblas自身的实现为架构通用的tensile 机器码,很少会顾及isa架构相关的优化如内存读入,寄存器分配,block size大小等。在对rocblas_sgemm_strided_batch 和自己写的naive版本的batch进行profiling和extractkernel后,着重发现和了解了几个重难点。首先是块与线程部分。第一最高线程速率,硬件生成Threads的速率将直接影响最终程序的效率, 例如GPU显存的读写速度,测试发现gfx906获得64 threads/Cycles的极限性能;第二是1D形状的线程速率曲线。测试得到仅仅当 BlockDim = 256, 512, 1024时, 线程产生速度达到峰值。也即是如果能够将原来4 threads的工作合并到一个thread,每个线程处理的事务随之提高到4倍,例如读写操作,将极大地提高理论极限。在测试了2D 和 3D 之后得出以 256倍的BlockDim性能最佳。其次,在Instruction cahceline的部分,结果表明增加越多的线程能更有效的增加SIMD的效率。在对dump后的VGPR寄存器分析后,发现:HIPCC大多数时候不使用s_load_dword,C ++ Statemetn不能将内联汇编的输出用作操作数,C ++的操作数必须来自C ++,64位地址或数据在内联汇编中很难调用,内联汇编是一个很难用C ++变量,if-esle,for-loop等控制的程序。HIPCC分配SGPR / VGPR的情况不太好GCN限制:s_load_DWORDx4,x8,X16 SGPR必须以x4地址开头内联汇编很难具有正确的VGPR / SGPR设置。由此引出了我们的解决方案:每个 WorkGroup 的 Macro-Tile和Micro-Tile 的分配问题,也即是VLP SGEMM 大小为128的WorkGroup的Macro使用 M=64,N=128,256 的Macro则为 M=64,N=256,每个线程为的Micro tile 大小为M=64, N=1,即每个线程运算Matrix A= 64xK, Matrix B = Kx64, 结果在 Matrix-C 64 x1。对于64个线程,每个M的Matrix-C地址是连续的。每个Wave 的 Matrix A 的Basic Offset 被设定为 N/64 *64 *lda;而B为M/64 *64 lda,取数据的指令为s_load_dwordx8 s[32:39], s[12:13], s18。GCN架构总共有96个可用的SGPR 这个算法使用s32到95,只有64个读取A,而88的并行读入设计使得效率提升。对B来说,每个线程使用微区块大小M = 64,N = 1。每个线程需要8个VGPR来加载1个N的8xK数据。该算法使用global_load_dwordx4来获得最佳的缓存行命中率。下一条存储器读取指令读取同一高速缓存行的下4个DWORD。关于VGPR分配,每个线程需要V [2:3]作为矩阵B的每个线程偏移量。矩阵B的双缓冲区加载需要16倍VGPR。这样总共83个剩下的VGPR负责每个SIMD 3个Waves 得到了很好的性能表现。还有,这种先由先分配变量至寄存器再反编译到机器码(相当于inline 静态库)的方式使得完全没有调度器带来的barrier 和LDS(LDS访存慢于L1 和VGPR)。最终,完成这些操作能使gdx906最高达到77%的性能释放。

doc

编译运行

make
./lib/sgemm_strided_batch_final -m 512 -n 512 -k 256 --batch_count 10

生成

make compile_co

Creativity

深入研读gpu,也即是gcn架构的体系结构相关知识。用汇编和反编译代码的方式优化。主要为优化代码,尤其是gpu代码提供一种优化思路,即先编译分配好VGPR的inline函数和其他一些工具到机器码,再反编译到.co文件,被需要的cpp文件当作外部库来使用,可以极大地利用体系机构地优势从而加速sgemm。

Applications

SgemmBatchedStrided 的应用领域非常多。但是我认为最能体现本答卷价值的是CNN的Convolution,即用体系架构的优化代码方式优化现有CNN代码,用profile 和dump工具对现有的CNN Convolution 汇编分析,比如可以看的点主要有一/二级缓存命中延时、缓存行长度,接下来就可以用简单的汇编代码inlin 再反汇编进行优化。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment