While working on the implementation of a deep learning framework, I became aware of an assembler called Maxas (https://github.com/NervanaSystems/maxas), which can generate GPU machine code that outperforms nVidia's official GEMM library, and I got interested in it.
The author of that project, Scott Gray, provided detailed documentation about it (https://github.com/NervanaSystems/maxas/wiki/SGEMM); however, due to the complexity of the algorithm, that document is still hard to follow. This article can be regarded as a commentary on the original document, based on my own understanding, covering as many details as possible and explaining the intention of all the code. The main structure follows the original document, as do all the figures.
Note that the Maxas algorithm depends heavily on features of the Maxwell architecture, so the project has become out-of-date as new GPU architectures have evolved. However, the ideas behind it are still insightful and worth documenting.
Single-precision general matrix multiply (SGEMM) is the most familiar example for any programmer who has learned CUDA. It has been the example in the official tutorial since nVidia released the first version of CUDA, not only because SGEMM is a critical building block of almost all computationally intensive software, but also because it is a good vehicle for showing optimization tricks on the GPU, especially those that exploit the memory hierarchy.
For the multiplication of two matrices A and B, the easiest way to parallelize is to launch as many threads as there are elements in the output matrix C. Each thread loads a row of A and a column of B and calculates the inner product of the two vectors. The problem is that the latency of accessing main GPU memory is huge (~100 cycles). For a row of A it may still be possible to exploit the huge bandwidth of GPU memory, or caching, to load many successive elements and amortize the latency. But for a big matrix B (N > 1000) there can be thousands of elements between successive elements of a column, which means everything fetched by each load is useless except the one column element, and there will be no cache hits. In a word, the efficiency of such memory access is horrible.
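To make the cost concrete, here is a minimal plain-C sketch of this naïve scheme. The CUDA thread grid is simulated with two loops, and all names are mine, not from Maxas; the point is the strided walk down B's column.

```c
#include <stddef.h>

/* Naive scheme: one logical thread per output element C[i][j].
 * Each thread walks a full row of A and a full column of B.
 * The column walk of B strides by n floats per step, which is
 * exactly the access pattern that defeats caching on a GPU. */
static void naive_thread(const float *A, const float *B, float *C,
                         size_t n, size_t i, size_t j)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; ++k)
        acc += A[i * n + k] * B[k * n + j];  /* B access strides by n */
    C[i * n + j] = acc;
}

static void naive_gemm(const float *A, const float *B, float *C, size_t n)
{
    /* On a GPU these two loops would be the thread grid. */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            naive_thread(A, B, C, n, i, j);
}
```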
Optimizing the naïve method above requires shared memory, which can be regarded as an on-chip cache of the GPU. The latency of shared memory is as low as that of the first-level cache, and it can be shared among all the threads in a thread block. Its only shortcoming is that its size is limited. To take advantage of this small piece of high-speed memory, the naïve parallelization needs some change: the matrix needs to be divided into
If each block is regarded as an element, the scale of the matrix is reduced by
This algorithm is already pretty fast, and more importantly, further optimization will build on it. In order to follow the rest of this article, the reader is encouraged to study the CUDA official tutorial to get familiar with the blocking method, and to understand the CUDA programming model as deeply as possible.
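As a refresher on the blocking method the tutorial teaches, the following plain-C sketch mimics it: the small stack arrays stand in for shared memory, and every element copied into them is reused TILE times. This is a sketch under my own assumptions (square matrices, size a multiple of TILE, C zero-initialized), not Maxas code.

```c
#include <stddef.h>

#define TILE 16  /* stands in for the blockDim of a CUDA thread block */

/* Blocked MM: the k dimension is consumed one TILE-wide slab at a time.
 * On a GPU, As and Bs would live in shared memory and be filled
 * cooperatively by the threads of a block; plain arrays stand in here. */
static void blocked_gemm(const float *A, const float *B, float *C, size_t n)
{
    for (size_t bi = 0; bi < n; bi += TILE)
    for (size_t bj = 0; bj < n; bj += TILE)
        for (size_t bk = 0; bk < n; bk += TILE) {
            float As[TILE][TILE], Bs[TILE][TILE];
            /* cooperative load of one tile of A and B ("shared memory") */
            for (size_t i = 0; i < TILE; ++i)
                for (size_t k = 0; k < TILE; ++k) {
                    As[i][k] = A[(bi + i) * n + (bk + k)];
                    Bs[i][k] = B[(bk + i) * n + (bj + k)];
                }
            /* every element of As/Bs is reused TILE times */
            for (size_t i = 0; i < TILE; ++i)
                for (size_t j = 0; j < TILE; ++j) {
                    float acc = 0.0f;
                    for (size_t k = 0; k < TILE; ++k)
                        acc += As[i][k] * Bs[k][j];
                    C[(bi + i) * n + (bj + j)] += acc;
                }
        }
}
```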
As described in the section above, once shared memory is used in the blocking method, not only can the speed of the matrix multiplication (MM) of block matrices be greatly improved, it also becomes possible to transfer the next blocks from main memory to shared memory while doing the MM of the current blocks. In other words, the IO time can be fully hidden by the MM of blocks. To further improve performance, something has to be done about the multiplication of the blocks themselves.
Although MM in shared memory is already fast, there is still quite a lot to do to reach the hardware limit. There are 2 main bottlenecks. Firstly, the latency of shared memory is still higher than that of registers: on Maxwell/Pascal GPUs the latency of registers is 6 cycles, while that of shared memory is 23 cycles. In addition, the computation units on the GPU cannot work on data in shared memory directly, so a transfer instruction is needed to move the data into registers. The mov instruction can take as much time as the real calculation instructions, so the overhead is huge. To reach the peak performance of the hardware, all the computation units of the GPU must have their pipelines fully occupied by calculation instructions, so that a numerical result comes out of the pipeline every clock cycle. To achieve that, Maxas, and some previous research it cites, propose the following methods:
1. Take advantage of the newly added vector instructions, which can transfer 4 successive floats between shared memory and registers in one instruction. This greatly reduces the number of transfer instructions and makes it easier for the calculation instructions to hide the transfer time.
2. Interleave the calculation instructions and transfer instructions to build a pipeline of data prefetching and calculation.
3. The blocking algorithm makes use of the fast shared memory to cache data that needs to be accessed multiple times. If this idea is taken one level further, each block matrix can be divided into smaller sub-blocks, with registers caching the data held in shared memory, and some additional speedup can be expected. Note that this lower-level blocking differs from the higher-level blocking, and there is some additional difficulty to resolve.
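Idea 3 can be sketched in plain C as follows: each "thread" keeps a small accumulator tile in registers (local variables here) and copies a short sliver of A's column and B's row out of "shared memory" before the inner multiply. The 4x4 tile size and all names are illustrative assumptions of mine, not Maxas's exact parameters.

```c
#include <stddef.h>

/* Second-level blocking sketch: a 4x4 accumulator tile lives in
 * registers; for every step k, a 4-float sliver of A's column and of
 * B's row (the data that would sit in shared memory) is copied into
 * registers before the 16 multiply-adds, so each sliver is reused
 * 4 times from registers instead of shared memory. */
static void thread_tile_4x4(const float *a_col, const float *b_row,
                            float C[4][4], size_t n_k, size_t stride)
{
    for (size_t k = 0; k < n_k; ++k) {
        float a[4], b[4];
        for (int i = 0; i < 4; ++i) {   /* "vector load" into registers */
            a[i] = a_col[k * stride + i];
            b[i] = b_row[k * stride + i];
        }
        for (int i = 0; i < 4; ++i)     /* 16 FMAs on registers only */
            for (int j = 0; j < 4; ++j)
                C[i][j] += a[i] * b[j];
    }
}
```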
The precise control of GPU instructions and registers needed to implement the ideas above is beyond the expressive capacity of CUDA, so these ideas have to be implemented in the native assembly language of the GPU (not even the pseudo-assembly PTX is enough). However, it is still possible to describe the implementation in more expressive C-like pseudo-code. The conversion from pseudo-code to assembly is straightforward, and is implemented with a Perl script in Maxas.
Here consider the multiplication of 2
However, instead of looking from the perspective of the output matrix C, consider the problem from the perspective of the input matrices. It can be seen that the k-th column (not row!) of A is only ever multiplied with the k-th row (not column!) of B. In other words, taking the k-th column of A and the k-th row of B, we can multiply all the element pairs and add each result to the output matrix element it contributes to:
$$ \left[
\begin{matrix}
C_{00} += A_{0k}*B_{k0} & C_{01} += A_{0k}*B_{k1} & \cdots & C_{0N} += A_{0k}*B_{kN} \\
C_{10} += A_{1k}*B_{k0} & C_{11} += A_{1k}*B_{k1} & \cdots & C_{1N} += A_{1k}*B_{kN} \\
\vdots & \vdots & \ddots & \vdots \\
C_{N0} += A_{Nk}*B_{k0} & C_{N1} += A_{Nk}*B_{k1} & \cdots & C_{NN} += A_{Nk}*B_{kN} \\
\end{matrix}
\right] $$
The column and row can then be discarded, as they have completed their task in the MM and will no longer be needed. This method greatly reduces the register footprint of the input matrices. Moreover, the loading of
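In plain C, the reformulation above turns the k loop into the outer loop, with each iteration applying one rank-1 update (an outer product of A's column k and B's row k) to all of C. A sketch under my own assumptions: square n×n matrices, C zero-initialized.

```c
#include <stddef.h>

/* Outer-product formulation: for each k, take column k of A and row k
 * of B, multiply every pair, and accumulate into C. After iteration k
 * that column and row are never touched again and can be discarded,
 * which is why only one column of A and one row of B need registers. */
static void outer_product_gemm(const float *A, const float *B, float *C,
                               size_t n)
{
    for (size_t k = 0; k < n; ++k)            /* one rank-1 update per k */
        for (size_t i = 0; i < n; ++i) {
            const float a_ik = A[i * n + k];  /* element of column k of A */
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += a_ik * B[k * n + j];  /* row k of B */
        }
}
```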
Maxas uses 64 threads to implement the parallelization of block matrix multiplication, each thread taking care of the multiplication of 4 (
The implementation of Maxas exists precisely to serve the algorithm described in this section, and the problems solved in the following sections are those encountered during that implementation. The choice of parameters, e.g. why 64 threads, is based on the hardware resources of the GPU: create as many warps as possible while each thread still gets as many registers as it requires. The purpose is to let the scheduler launch some warps for calculation while other warps are waiting for their data.
The first thing the 64 threads described above must do is load the data they need from the two input matrices in main memory. It is worth pointing out an implicit assumption not mentioned in the original document: matrices in Maxas are stored in column-major format, i.e. the columns of a matrix are contiguous in memory. Otherwise it is not possible to explain the algorithms in the following sections. Compared with the loading method in the CUDA official tutorial, the loading here needs some changes due to the change in the calculation algorithm, together with some optimizations:
1. Since row data of matrix B is used, and rows are not contiguous when B is stored column-major, B needs to be transposed so that its rows become columns. A and B can then be loaded with the same method, which simplifies the code.
2. Matrices A and B are loaded as 1D textures, so that not only can the data be cached in the texture cache, but the loads also benefit from the texture feature that all out-of-bounds reads return 0, so such data has no effect on the MM result.
Knowing about this preprocessing will help in understanding the pseudo-code in the following sections. The creation of the textures and the transpose are done before the execution of the GEMM kernel on the GPU and do not affect the performance of the kernel.
The data in texture memory is loaded into shared memory segment by segment. According to the original method, each segment should be a tile of
The loading method above is certainly not the only option. My understanding is that since the loading methods for A and B are identical, apart from reading from different textures, the same code serves both; compared with loading A and B in a single thread, this reduces the number of loading instructions, which contribute nothing to computation.
The loaded data is staged in registers, waiting to be stored to shared memory. The data layout in shared memory is shown in the following figure:
Since the unit of offset in shared memory is 1 byte, we can go back to the straightforward representation. It can then be seen that the data storage patterns in the 2 figures above are exactly the same; both are
There are still two things in the code above worth clarifying:
`track0 = blk*64/4 + tid15 + (ldx * tid2);`

where the first term `blk*64/4` selects the 1D offset, within the texture, of the top-left corner of the `blk`-th stripe of input matrix A (or transposed matrix B). Since the top-left corners of all stripes lie in the first column of the input matrix, and in column-major storage the offset of any element in the first column is just its row number, the top-left corner of the `blk`-th stripe is at `blk*64`; the `/4` comes from the factor of the vector instruction. The meaning of `tid15 + (ldx * tid2)` is more straightforward: it is the relative position of the current thread's yellow cell with respect to the green cell in Fig. 3. `tid15` can be regarded as the column coordinate and `tid2` as the row coordinate, which must be multiplied by the stride `ldx` of that dimension in the 1D representation.

Four `track` variables are used to record the offsets of the 4 loading instructions. The reader may wonder why not use just one `track` variable and advance the offset by 2 rows after each load to do the same work: https://gist.github.com/d392c541e90636e8b5247a4daeae1b05
The problem is that after a `tex.1d.v4.f32.s32` instruction is issued (and before it completes), its `track` operand is not saved anywhere. To ensure the operand is not changed by the following increment instruction, control code must be inserted to wait for the previous loading instruction to complete, which in the worst case can take hundreds of clock cycles. Using 4 offset variables avoids waiting for the data transfers: all 4 loading instructions can be issued one after another right away, acting like a pipeline. The cost is 3 additional registers per thread; fortunately, there are still enough of them on the GPU.
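The four offsets can be sketched as follows. The names `blk`, `ldx`, `tid15`, and `tid2` follow the pseudo-code above, while the assumption that each load covers 2 rows (so successive loads step by `2*ldx`) is my reading of the stripe layout and may differ in detail.

```c
/* Compute the 4 texture offsets used by the 4 pipelined loads.
 * tid15 picks one of 16 vec4 slots in a 64-float row, tid2 picks a row
 * within one load, and each subsequent load steps down 2 rows (2*ldx).
 * The 2-row step is an assumption for illustration. */
static void compute_tracks(int tid, int blk, int ldx, int track[4])
{
    int tid15 = tid & 15;        /* vec4 slot within a row */
    int tid2  = (tid >> 4) & 1;  /* which of the 2 rows of one load */
    int base  = blk * 64 / 4 + tid15 + ldx * tid2;
    for (int i = 0; i < 4; ++i)
        track[i] = base + 2 * ldx * i;
}
```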
After the job above is done, there are 8 rows of data for both A and B, each row containing 64 floats. Taking one row from each matrix, we can do fused-multiply-add operations between the elements of the 2 rows; when one pair of rows is finished we take the next, until the 8 rows in shared memory are consumed. By then the other warp should have completed the transfer from texture memory into the other group of shared memory buffers, and we can switch to computing on that data. As shown in Fig. 2, each thread actually needs just 8 of the 64 floats in total, and their offsets in the vectors of A and B can be worked out from the figure. In the code this is done by a series of bit manipulations, which can be explained here in advance: https://gist.github.com/9d4bcf438bb6e6849efb497e8b0a963f
Threads `2*i` and `2*i+1` in the figure use the same segment of A, which can be written as `(tid/2) % 8`. The series of bit manipulations is just the implementation of this expression, in which `<<4` produces the 16-byte interval (the size of the 4 floats fetched by one vector load) between the segments of the column vector of A.
https://gist.github.com/d3c87fcf61055ed284af1b40b612eaf1
The selection within the row vector of B is more complicated. First note that, among the threads with even index, the distance along the direction of B grows by 2 segments (8 floats) for every 16 threads. Therefore, for the 64 threads, which can be represented with 6 bits, `tid & 0x30` masks out the last 4 bits representing `tid mod 16`, and only the top 2 bits are meaningful for the selection in B. The following `>>3` can be read as first moving those 2 bits down to the lowest digits (a `>>4`), then representing the distance of 2 segments (a `*2`). `| (tid & 1)` is equivalent to `+ (tid & 1)`, indicating that thread `2*i+1` always selects the segment right after the one of thread `2*i`; that segment also fills the gap left by the 2-segment stride.
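Putting the two selections together, the byte offsets of each thread's A and B segments can be written as below. The exact constants are my reconstruction of the bit manipulations described above, so treat them as an illustration rather than the verbatim Maxas code.

```c
/* Byte offset of the A segment: threads 2*i and 2*i+1 share segment
 * (tid/2) % 8; <<4 converts a segment index into a byte offset
 * (4 floats = 16 bytes). */
static int read_as(int tid) { return ((tid >> 1) & 7) << 4; }

/* Byte offset of the B segment: bits 4-5 of tid, moved down and
 * doubled by the combined >>3, give the 2-segment stride; the lowest
 * bit of tid picks the odd segment of the pair. */
static int read_bs(int tid) { return (((tid & 0x30) >> 3) | (tid & 1)) << 4; }
```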
It may also have been noticed that the layout of the threads in Fig. 2 is really awkward: it is not sorted by thread index in either the row or the column direction. The reason is to avoid bank conflicts when accessing shared memory. The definition of a bank conflict, and the conditions for one to happen, are described in detail in the CUDA official documentation. Briefly, shared memory is divided into a number of banks according to address (in the simplest scheme, by taking the address modulo the bank count). If 2 threads access shared memory addresses in the same bank, the accesses cannot complete simultaneously and have to be serialized, so the access time is multiplied by the number of threads hitting the same bank. That is only the most general case, though; the GPU provides some optimization mechanisms, for example broadcast, to reduce the negative effect of bank conflicts. The other awkward thing is that each thread does the calculation of 4
There is another trick used in the implementation. Although each thread requires only 16 input numbers, the number of registers allocated is actually twice that. The purpose is the same as before: using two groups of registers to implement a pipeline, in which each thread prefetches the next line of data while calculating on the current line.
https://gist.github.com/7025ed8bc556333adf5f9530b5149898
Now all the data needed for calculation has been moved into registers efficiently, and it looks like it is time to manipulate it directly with FFMA instructions, which is the real job of a GEMM kernel. Unfortunately, there is one more difficulty before that, perhaps the biggest trouble in this whole project: bank conflicts in register access.
To fill a streaming multiprocessor with a whole lot of threads, the GPU contains as many as 32K register files, so access to them cannot be as straightforward as on a CPU; instead it goes through banks (similar to shared memory access). As a result bank conflicts are inevitable, and performance can degrade a lot once they happen. The register files on Maxwell have 4 banks of 32 bits, and each register is mapped to its bank as `<register id> mod 4`. Bank conflicts can happen in the following instruction during the calculation of the C matrix:
https://gist.github.com/3f32a98bbed00f6e89456c82e63196bb
Note that the register bank layout varies with each new generation of GPU; the Volta architecture, for example, has 2 banks of 64 bits. That is the main reason Maxas cannot perform as well on current mainstream GPUs as on Maxwell.
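The `mod 4` mapping can be made concrete with a tiny helper. This is a simplified model for illustration only: on real hardware, conflict behavior also depends on operand slots and the reuse cache discussed below.

```c
/* Maxwell: 4 register banks of 32 bits each, bank(r) = r mod 4. */
static int bank(int reg) { return reg & 3; }

/* Two source operands of one FFMA conflict if they would have to be
 * fetched from the same bank in the same cycle (reuse cache ignored). */
static int conflicts(int reg_a, int reg_b)
{
    return bank(reg_a) == bank(reg_b);
}
```

For example, registers 64 and 72 both map to bank 0, so an FFMA reading both in one cycle would stall, while registers 64 and 73 (banks 0 and 1) would not.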
One of the advantages of coding in assembly language is that registers can be allocated manually to minimize bank conflicts:

- Registers 0-63 for the C matrix
- Registers 64-71 and 80-87 for the A matrix, 72-79 and 88-95 for the B matrix (twice the actually needed registers are allocated for the prefetch pipeline)

Since loading is done with vector instructions, the 4 registers of each load of A and B must be consecutive, so all four banks are accessed simultaneously and there is no room for optimization there. All Maxas can do is shuffle the order of the 64 registers allocated to C; the numbering is illustrated in the figure below:
Obviously this is the best result shuffling can achieve. The bank conflicts marked with black boxes are inevitable no matter how the indices of the C matrix are shuffled, because those conflicts originate in A and B operands that must use the same bank. The operands of A and B not only occupy all 4 banks (two elements per bank), but each also needs to be paired with every operand of the other matrix, so each register of A must bank-conflict with 2 registers of B. In fact, if the most naïve numbering were used for C (say, the first row numbered 0~7), then each C register would also bank-conflict with its corresponding B operand, which would be a horrible numbering.
To further reduce the bank conflicts that cannot be eliminated by register allocation, the operand reuse feature of Maxwell has to be used. When an instruction is issued, some of its operands can be flagged as `reuse`, and the hardware will send those operands to a reuse cache. If the following instruction uses the same operand in the same slot, it can get it directly from the reuse cache instead of going through the register bank, and the bank conflict is avoided.
https://gist.github.com/006c2a214bb94f4dbf37284a905875c6
If the 64 registers of matrix C in Fig. 5 are traversed line by line (or column by column) and the registers of A are flagged for reuse, then 14 of the 16 bank conflicts between the registers of A and B can be eliminated. The only remaining conflicts involve registers R3 and R35, because they are the A operands used by the first instruction of a line and therefore have not yet been saved in the reuse cache. Once the cause is understood, these 2 conflicts are easily resolved as well: simply traverse the even lines from right to left (from 26 to 3 for row 0) and the odd lines from left to right (from 7 to 30 for row 1). Maxas is still not satisfied with this result, though; it proposes an even trickier traversal that applies an additional spiral on top of the back-and-forth order: https://gist.github.com/5e71bb97aed78027f631167ffbb5f277
According to the Maxas document, each operand slot has 8 bytes of reuse cache, enough for two 4-byte registers, and line-by-line traversal uses only 1 of them to cache registers of A, so the usage of the reuse cache is still low. My guess is that the unused reuse cache of the B operands is also counted, giving line-by-line traversal a reuse-cache usage of 4/8/2 = 25%. The estimate for back-and-forth traversal is not as straightforward; the Maxas document directly gives 39%, and 49% for the spiral traversal. From the assembly code generated by Maxas, it is confirmed that there are instructions where the reuse caches of both the A and B operands are used, each caching 2 registers: https://gist.github.com/724c06bee0483ea70aa0a4a9af5faca3
However, given that back-and-forth traversal already resolves all the bank conflicts, the purpose of increasing reuse-cache usage is not to raise the reuse frequency for its own sake. It is actually to avoid the latency of loading instructions spilling over into the computation instructions that depend on their data. The first 8 computation instructions, together with the loading instructions inserted between them while traversing the C registers, are as follows: https://gist.github.com/814b9bcf786f9657c301ae48ff6d9702
Since a loading instruction takes more than 20 clock cycles (the latency of shared memory) to complete, and during that time its operands may still be fetched through the register banks, a computation instruction issued in the meantime may need to access the same bank. This is the cause of so-called delayed bank conflicts. However, this is only the principle; I am not clear how exactly the traversal order in Maxas avoids these delayed conflicts.
This section shows that even if all the computation is conducted in registers, 2 tricks are still necessary to achieve optimal performance:
- Optimal register numbering
- Optimal traversal order

As for the computation itself, it has become too trivial to be worth mentioning in the Maxas document.
After each thread block has completed the computation of the sub-block of the matrix assigned to it, the last job is to transfer the result from registers to main memory. Since registers cannot be shared among threads (there is a series of _shfl instructions for that purpose, but they are not applicable here), each thread has to first transfer the 4
According to the thread layout in Fig. 2, 64 consecutive bits of data in a column are distributed across 8 threads; for example, the results of the first 4 rows of the first column are saved in registers owned by threads 0, 2, 4, 6, 8, 10, 12, 14, each thread holding 8 registers in that row. To avoid bank conflicts, those 8 registers are not consecutive, so vector transfer instructions cannot be used here. It would then take 8 rounds to complete the transfer for a warp, while the remaining 24 threads of the warp sit idle because their data is not in the same row. To solve this problem we can first stage the data of all the threads in shared memory, and then transfer the data from shared memory to main memory with a different thread layout.
First, the data in registers still needs to be written to shared memory. The 4
The implementation code is shown below. Although this method transfers data between registers and shared memory again and again, the latency of shared memory can be hidden by swapping tasks between the 2 warps, and in any case it is still much faster than accessing main memory many times. Note that although the method proceeds step by step, there are no synchronization instructions in the code: all the data exchange happens among the 32 threads of the same warp, and there is none between warps. The execution model of the GPU guarantees that the same instruction across the threads of a warp completes at the same time. Needing no synchronization instructions can be regarded as an advantage of the parallelization layout in Fig. 2. https://gist.github.com/42d23e56527c43b824a794cb10e6ce2c
There is another figure in the Maxas document illustrating this process, but I may not fully understand it and find it somewhat confusing, so it is not quoted here; the code itself should be self-explanatory enough. With the transfer to main memory complete, the job of the GEMM kernel generated by Maxas is done.
Based on the 64-threads-per-block implementation described above, it is possible to scale up by 4 times to 256 threads while the work per thread remains the same. The block matrix each thread block calculates is then enlarged 2 times along each dimension, becoming
Although the pseudo-code of the original document has been commented here in as much detail as possible, it is still a relatively high-level implementation. An important topic at the level of GPU machine code, the control codes, has not been covered. Since the purpose here is just to introduce some ideas and implementation methods in GPU optimization, the part of the Maxas document about control codes is not covered either. In summary, the overall optimization ideas used by Maxas are clear, and according to the Maxas document they had already been proposed in earlier literature. The hardest part is that the author of Maxas had to do a lot of tough reverse engineering to derive the hardware implementation details that nVidia refuses to disclose, in order to reach the peak performance of the hardware. Quite possibly he built a test platform to figure out the performance impact of subtle differences between instructions. In any case it is a great piece of work, well worth a deep dive for any engineer aiming at extreme performance.