# Ultimate Optimization of GEMM on Maxwell GPU: the dissect of Maxas assembler

When I was working as a deep learning software engineer in Intel on a AI chip project, I was aware of a assembler called Maxas ( https://github.com/NervanaSystems/maxas) which can generate the GPU machine code which outperforms the nVidia official GEMM library, and started to get interested in it.

The author of that project, Scott Gray, provided detailed documentations about it ( https://github.com/NervanaSystems/maxas/wiki/SGEMM), however due to the complexity of the algorithm, the document is still hard to follow. This article can be regarded as the document of the original document, according to the author’s own understanding, with as many details as possible covered, and intention of all the code explained. The main structure still follows the original document, so does all the figures.

Note that the Maxas algorithm depends heavily on the features of Maxwell architect, the project is out-of-date with the evolution of new GPU architect. However the ideas behind it are still insightful and worth being documented.

# Background

Single-precision general matrix multiply (SGEMM) is the most familiar case for any programmers who have learned CUDA. This is the only example in the official tutorial since nVidia release the first version of CUDA. It is not just because SGEMM is the most critical building block of almost and computational intensive software, but also because it is a good example to show the optimization tricks on GPU, especially those exploiting of memory hierarchy.

For the multiplication of two matrices A and B, the easiest way to parallelize is to launch as many threads as the number of elements of in the output matrix C. Each thread loads a line of A and a column of B and calculates the inner product of the two vectors. The problem is the latency accessing the main GPU memory is huge (~100 cycles). For a row in A it may be still possible to take advantage of the huge band width of GPU memory or caching to load many successive elements of A to amortize the latency. While for big matrix (N>1000) B, there can be thousands of elements between the successive element in its column vector, which means the results of each loading are all useless except the exact column element, and there will be not cache hit. In a word the efficiency of such memory access is horrible.

The optimization of the naïve method above needs shared memory, which can be regarded as on-chip cache of GPU. The latency of shared memory is as fast as first-level cache, and can be shared among all the threads in a thread block. The only shortcoming is that its size is limited. To take advantage of this small piece of high speed memory, there is some change to the native parallelization: The matrix needs to be divided into $k$ blocks on each dimension，therefore the output matrix C can be written as:

The same division is applied for both A and B, in which $A_{ij}, B_{ij}, C_ij$ is no longer element but block. Of course when the size of block is 1 the block is reduced to element. Obviously the definition of matrix multiplication is still valid here: $C_{ij} = \sum_{k} A_{ik}B_{kj}$

If each block is regarded as an element, the scale of the matrix is reduced by $K$ times. For each block a threads block assigned for its calculation, in which one thread for one element. The thread block loads the row blocks of A and column blocks of B to shared memory one by one, do GEMM for each pair of row block and column block, and accumulate the result. Therefore each element in this block loaded to the shared memory can be accessed K time, which is much more efficient than the native method.

The algorithm is actually pretty fast, and more importantly, further optimization will be based on it. In order to follow the rest of this paper, the reader is encouraged to study the CUDA official tutorial to get familiar with the blocking method, and understand the CUDA programming model as deep as possible.

# Principle

As described in the section above, after shared memory is used in blocking method, not only the speed of matrix multiplication (MM) of blocks matrices can be greatly improved, it is also possible to transfer the next blocks for calculation from main memory to shared memory while doing MM of blocks to. In another word, the time for IO can be fully covered by the MM of blocks. In order to further improve the performance, something has to be done on the multiplication of blocks.

Although it has already been fast to do MM in shared memory, there is still quite a lot of things to do to reach the hardware limitation. There are 2 main bottlenecks. Firstly the latency of shared memory is still higher than register. On Maxwell/Pascal GPU the latency of registers is 6 cycles, while that of shared memory is 23 cycles. In addition, the computation unit on GPU cannot work on data in shared memory directly, and therefore a transferring instruction is needed to move the data to register. The mov instruction can take as much time as the real calculation instructions, so the overhead is pretty huge. To reach the peak performance of hardware, it is required that all the computation units of GPU have their pipelines fully occupied by calculation instructions, and for each clock cycle there will be numerical result coming out of the pipeline. In order to achieve that, Maxas and some previous research it quoted propose the following method:

1. Taking advantage of the newly added vector instructions, which can transfer 4 successive float numbers between shared memory and registers in one instructions, therefore greatly reduce to number of transferring instructions, and easier to enable the calculation instructions to hide the transferring time.

2. Interleaving the calculation instructions and transferring instructions to implement the pipeline of data prefetching and calculation

3. Blocking algorithm make use of the fast shared memory to cache the data which needs to be accessed multiple times. If the idea is developed and do another level of blocking, each block matrix is further blocked into smaller sub-blocks, and cache the data in shared memory with registers, some additional speedup can be expected. Note that the lower level blocking method is different from the higher level blocking, and there is some addition difficulty to resolve for it.

The precise control for GPU instructions and registers to implement the ideas above has been beyond the expression capacity of CUDA, therefore these ideas have to be implemented with native assemble language of GPU (not even the pseudo assemble language like PTX). However, it is still possible to use some C-like pseudo-code which is more expressive to describe the implementation. There is straightforward conversion form pseudo code to assemble code, which is implemented will a Perl script in Maxas

# Outline of maxas algorithms

Here consider the multiplication of 2 $64\times64$ matrices all on registers. In the previous straightforward method, each element of C matrix is calculated according to the definition of MM $C_{ij} = \sum_{k} A_{ik}B_{kj}$, which is the inner product of a row in A and a column in B. Therefore a row in A and a column in B will bu used 64 times. In order to make full use of the register, the 3 $64\times64$ matrix, each occupying 16KB, is very big pressure for register file, if they are all stored in registers. Another problem of doing on-register MM is the registers are not shared among threads. A thread not only need to allocate registers not only for the elements it is responsible for in output C, but all rows in A and columns in B needed. As a result, many registers for different threads are used to store duplicate data, which is totally unacceptable.

However, instead from the perspective of output matrix C, considering the problem from the perspective of input matrices. It can be figured out that the k-th column (not row!) of A is only used to do multiplication with the k-th row (not column!) of B. In another word, taking the k-th column of A and the k-th row of B, we can multiply all the element pairs and add the result to the output matrix element it contributes:

The row and column then can be discarded as they have completed their task in MM an will no longer be needed. The method can greatly reduce to occupation of input matrices on registers. Moreover, the loading of $2N$ elements ($N$ for A and $N$ for B) is accompanyed by $N²$ add-multiply operation, which is exactly the benefit of using registers to cache data in shared memory. The method can be easily extented to A and B whose have different number of rows and columns, if only the column index of A and the row index of B are the same.

Maxas uses 64 threads to implement the parallelization of block matrix multiplication, each thread taking care of the multiplication of 4 ($2\times2$) $4\times4$ matrices. The layout of the 64 threads are $8\times8$, therefore the block matrix whose length is $N=2\times4\times8=64$ is determined. This is the main difference from original blocking algorithm, where each thread calculates each element of the matrix, and is critical to make full use of the super low latenty of registers.

The vector on the left is a column of A, and the vector on the top is the corresponding row in B matrix. The green data (8 float number for each vector) is what thread 0 needs. It is easy to derived the data needed by the other threads)

The implementation of Maxas is just to serve the algorithm decribed in this section, the problems solved in the following sections are also what will be encountered during the implementation. The choice of the parameters, i.e. why 64 threads, are based on the hardware resource of GPU, to create as many thread wars as possible while each thread can still get enough registers requried. The purpose is to enable scheduler to launch some warp to do calculation while waiting for other warps to fetch the data.

# Load input matrix to shared memory

The first thing for the 64 threads, described in the section above, is to load the data they need of the two input matrices from main memory. It is worthwhile pointing out an implicit assumption not mentioned in the original document, which is matrices are stored in row-major format in Maxas, namely the columns in matrix are continuous. Otherwise it is not possible the explain the algorithms in the following section. Compared with the loading method in CUDA official tutorial, the loading method here also need some change due to the change in calculation algorithm, together with some optimization:

1. Since row data is used for B matrix, which is not successive when the format of B is column-major, B matrix needs to be transposed to make column to row. Therefore A and B can be loaded with the same method to simplify the code
2. A and B matrices are loaded as 1D texture, so that not only the texture data can be cached with texture cache, but also can benefit from the feature of texture loading that all the data out of boundary are set to 0, so that the such data will have no effect on the MM result.

The knowledge of the preprocessing about will help understand the psedu-code in the following section. The creation of texture and transpose should have been done before the execution of GEMM kernel on GPU and will not affect the performance of the kernel.

The data in texture memory are also loaded into shared memory segment by segment. According to the original method, each segment should be a tile of $64\times64$ elements. However in order to make full use of register resources, Maxas use some totally different calculation method. For a thread block responsible for $C_{ij}$, matrix A is first divided into $64 \times N$ stripes, each contains 64 rows, and all the data needed are located on $i$th stripe. However the amount of data on the stripe is still large and needs to be loaded by smaller segment. In Maxas each loading consumes $64\times8$ and, there are $\frac{N}{8}$ loadings to finish the work. The method for matrix B is similar, except that the stripes are divided into $N\times 64$ stripes by columns. After transposing the loading method is exactly the same as A. The memory layout is illustrated in the following figure.

In the figure each lattice is the data element for each thread which is responsible for loading it, in which green lattice is the 4 elements that thread 0 needs to load one by one, and yellow lattice are the data to be loaded by the rest 31 threads in the first iteration. The whole warp loads 2 rows each time, and finish after repeating for 4 times. The execution unit on GPU is called warp consisting of 32 threads, therefore the 64 threads are executed by 2 warps. One of the warp (thread 0–31) loads A while the other (thread 32–63) loads B. There is a confusing place in the figure, the dimensions of the matrix in the figure is $8\times16$ instead of $8\times64$. This is because each thread behind will use vector instruction to load 4 float number at the same time, therefore each lattice itself indicates 4 float numbers. In the following code it can be seen that when using vector instruction on texture memory, the offset is the actual number of elements divided by 4.

The loading method above is not unique for sure, and my understanding is since the loading methods for A and B are the same, except applying for different textures. Therefore compared with loading A and B with a single thread, the number of loading instructions, which is unrelated to computation, can be reduced. The data loaded is staged in registers, waiting to be stored in shared memory. The data layout in shared memory is shown in the following figure:

Since the unit of offset in shared memory is 1 byte, we can get back to the straightforward representation. Therefore it can be seen that the data storage patterns in the 2 figures above are exactly the same, both are $8\times64$ column major storage. The only difference is in shared memory the data of B follow right after A.

There are still two problems in the code above worth clarification:

1. track0 = blk*64/4 + tid15 + (ldx * tid2); where the first term blk*64/4 is used to select the 1D offset of the top-left corner of blk-th stripe with respect to the whole matrix of input matrix A or transposed matrix B in texture. Since the top-left corners of all stripes are located in the first column of input matrix, and in row-major storage the offset of any element in the first column is just the row number, therefore for the 'blk'-th stripe its top-left corner is blk*64, and /4 comes from the factor of vector instruction. The meaning of tid15 + (ldx * tid2) is more straightforward, which is the relative position of corresponding yellow lattice with respect to green lattice of the current thread in Fig.3. tid15 can be regarded as the column coordinator, and tid2 the row coordinator, which should be multiplied by the stride idx at this dimension in 1D representation.
2. 4 track variables has been used to record the offset of 4 loading instructions. The reader may wonder why not just just 1 track variable, and add the offset by 2 rows after each loading to do the same thing:

The problem is after tex.1d.v4.f32.s32 instruction is issued (and before completion) the its track operand is not going to be saved. In order to ensure that it is not changed by the following increment instruction, some control code must be inserted to wait for the previous loading instruction to complete. In the worst case that instruction may take hundreds of clock cycles. Using 4 offset variables can avoid waiting for the completion of data transfer and help GPU issue all the 4 loading instruction one after another, and therefore acts like a pipeline. The cost is for each thread an addition of 3 registers need to be occupied, and fortunately there are still enough of them on GPU.

# Load data in shared memory to registers

After the job above is done, there are 8 rows of data for both A and B, and each row contains 64 floats. Getting a row from each matrix, then we can do the fused-multiply-add operations with the elements in the 2 rows. Once completed getting another row until the 8 rows in shared memory are consumed, when the other warp should have completed the transferring from texture memory to the other group in shared memory, and can be switched to do computation on the data. As shown in Fig.2, each thread actually just needs 8 of the 64 floats in total, and their offsets in vector A and B can be calculated according the figure. The detailed process in the code is done by a series of bit manipulations, which can be explained here in advance:

Thread 2*i and 2*i+1 in the figure use the same segment of A, which can be written as (i/2) % 8. The series of bit manipulations is just the implementation of this expression, in which <<4 implements the interval of 16 bytes, which is the size of 4 floats loaded by vector loading, between the segment of column vector of A.

The selection in the row vector of B is more complicated. Firstly note that for threads with even index, every 16 threads the distance long the direction of B the distance is 2 segments (8 floats), therefore for the 64 threads which can be represented with 6 bits, tid & 0x30 indicates the last 4 bits representing tid mod 16 can be masked, and only the first 2 bits is meaningful for the selection of B. The following >>3 actually draws the first 2 bits to the lowest digits with >>3 firstly, then represent the distance of 2 segments with *2.| (tid & 1) is equivalent to +(tid & 1), indicating thread 2*i+1 always selects the segment of data after thread 2*i. That segment of data also fills the gap between the distance of 2 segments. It also may have been noted that the layout of the threads in Fig.2 is really awkward, which is not sorted by thread index in either row or column direction. The reason to do that is to avoid bank conflict during access to shared memory. The definition of bank conflict and the condition for it to happen has been described in details in the CUDA official document. Briefly, the access to shared memory is divided into a couple of banks according to the address (the easiest method is by doing modulus), and if 2 threads need to access the shared memory whose addresses are in the same bank, then it is not possible to complete the access simultaneously, and they have to be access in series, therefore the access time is multiplied by the number of threads which need to access the same bank. However, that is just the most general case, and there are some optimization mechanisms provided by GPU, for example, broadcast, to reduce the negative effect of bank conflict. The other awkward thing is each thread does the calculation of 4 $4\times4$ blocks instead of a single $8\times8$ directly. This is actually another trick to avoid bank conflict, the calculation process of 4 $4\times4$ blocks for each thread is equivalent to a single $8\times8$.

There is another trick used in the implementation. Although each thread requires only 16 input numbers, the number of registers allocated is actually twice of that number. The purpose of doing that is the same as before, to use two groups of registers to implement a pipeline, in which each thread can prefetch the next line of data while calculating the current line.

# Calculation of C matrix: register allocation and order of calculation

Now all the data needed for calculation have been sent to the registers with high-efficiency method, and looks like it is the time to use FFMA instruction to manipulate them directly, which is what the GEMM kernel should do. Unfortunately before that there is one difficulty, which maybe the biggest trouble in this whole project, the bank conflict of register access.

In order to fill a stream process unit with a whole lot of threads, GPU contains as many as 32K register files, therefore the access to them cannot be as straightforward as on CPU, instead it is via bank (similar to shared memory access). As a result the bank conflict is inevitable, and the performance can degrade a lot once it happens. The register files on Maxas have 4 bank of 32-bits, and each register can be mapped to its corresponding bank with <register id> mod 4. Bank conflict can happen in the following instruction during the calculation of C matrix:

Note that the register bank of each new generation of GPU varies, like in Volta architecture there are 2 banks of 64-bits, and that the main reason Maxas cannot perform as well on the current mainstream GPU as on Maxwell.

One of the advantage of coding in assembly language is to make it possible to allocate registers manually to minimize the bank conflict: * Register 0–63 for C matrix * Register 64–71 and 80–87 for A matrix, 72–79 and 88–95 for B matrix (twice of actually needed registers are allocated for prefetch in pipeline) Since the loading is done with vector instruction, the 4 registers allocated for A and B must be successive, therefore all the four banks will be accessed simultaneously and there is no space to optimize it. All Maxas can do is to shuffle the order of 63 registers allocated, the numbering is illustrated in the figure below: ：

Obviously that’s the best result of shuffling the register indices. The bank conflict marked with black box is inevitable no matter how the indices of C matrix are shuffled, as the source of bank conflict happens in A and B, which need to use the same bank. The operands in A and B not only needs to occupy all the 4 banks (two element on each bank), but also need to be paired with all the other operands in the other matrix, therefore each register of A must bank conflict with 2 registers in B. Actually if the most naïve numbering method is used on C, for example, the first row is numbered as 0~7, the each register will bank conflict with its corresponding B operand, which is horrible numbering method.

In order to further reduce the bank conflict which cannot be eliminated by register allocation, the operand reuse feature in Maxwell has to be used. When an instruction is issued, it is possible to set some of the operands as reuse, and hardware will send this operand to a reuse cache. If the following instruction is going to use the same operand in the same slot, then the instruction can get it directly from the reuse cache instead of via the register bank, and therefore bank conflict can be avoided.

If the 64 registers of matrix C in Fig.5 are traversed line-by-line or column-by-column, and the registers of A are set as reuse, the 14 of 16 bank conflicts between registers of A and B can be eliminated. The only remaining bank conflicts are between register R3 and R35, as they are the operands of matrix A which are used first instruction, and therefore haven’t been saved in reuse cache. Once the cause is understood these 2 bank conflicts can also be easily resolved, simply by traversing the even lines from right to left (from 26 to 3 for row 0) and traversing the odd rows from left to right (from 7 to 30 for row 1). Moreover Maxas is still not satisfied with this result. It proposed a more tricky traversal method which applies an additional spiral on top of the back-and-forth traversal:

According to the Maxas document, each operand has 8 bytes of reuse cache for 2 4-bit register, and line-by-line traversal only uses 1 of them to cache data in registers for A, therefore the usage of reuse cache is still low. My guess is it is considered that some B operands also has reuse cache but not utilized, therefore the usage of reuse cache of line-by-line traversal is 4/8/2=25%. The estimate of usage for back-and-forth traversal is not that straightforward. Maxas document gives the result 39% directly, also the usage of spiral traversal to be 49%. From the assembly code generated by Maxas, it is confirmed that there are instructions where reuse cache are used for both A and B matrices, and meanwhile caches 2 registers for each operand:

However, the purpose of increasing the usage of reuse cache is not to increase the reuse frequency, given that back-and-forth traversal can fully resolve all the bank conflicts. It is actually to avoid the latency incurred by loading instruction being filled with the computation instructions which depend on the data of the loading instructions. The first 8 computation instructions and the loading instructions inserted between them to traverse the C matrix registers are as below:

Since loading instructions take >20 clock cycles (the latency of shared memory) to complete, during that time the all their 1st operand may be accessed via bank, it is possible that the following computational instructions are issued and need to access the same bank. This is the cause of delayed bank conflict. However this is just a principle, I’m not clear about how the traversal order in Maxas can avoid the delayed bank conflict.

This section shows that even if all the computation is conducted in registers, 2 tricks are still necessary to achieve the optimal performance:

1. Optimal register numbering
2. Optimal traversal order

For the computation itself, it has become to trivial to be mentioned in the document of Maxas

# Transfer C matrix to main memory

After each thread block has completed the computation of the sub-block of matrix assigned to it, the last job is to transfer it from register to main memory. Since registers cannot be shared among threads (there are a series of _shfl_sync() instructions for the purpose but they are not applicable here), therefore each thread has to firstly transfer the 4 $4\times4$ matrices it calculated to shared memory. The reason not to transfer them directly from register to main memory is the current thread layout cannot make use of the huge bandwidth of GPU. In order to make full use of it we wish the data transferred by the 32 threads in each warp are continuous, so that in one clock cycle 128 bytes of data can be transferred at the same time. If the data are separated, they need to be transferred for 32 times in the worst case.

According to the thread layout in Fig.2, the continuous 64-bits of data in a column distribute in 8 threads, for example, the result of the first 4 rows of the first column are saved in registers controlled by thread 0,2,4,6,8,10,12,14, in which each thread has 8 registers in that row. And in order to avoid bank conflict the 8 registers are not continuous, and therefore vector transferring instruction cannot be used here. Then it takes 8 times to complete the transferring of a warp, meanwhile the rest 24 threads in the warp will be idle as their data are not in the same row. To solve this problem we can first use shared memory to stage the data of all the threads, and then transfer the data in shared memory to main memory with another thread layout.

Firstly still need to write the data in register to shared memory. The 4 $4\times4$ matrices saved in the registers of each thread are divided into two pairs aligned on column. According the the register allocation illustrated by Fig.5, there are 8 registers on each row. For example, register 3,7,1,5 are on the first row, belonging to a column of a $4\times4$ matrix, and register 35,39,33,37 for a column in another $4\times4$ matrix. As the result matrix C is also column major, if copy register 3,7,1,5 to 4 continuous registers (named as cs<0–3> in Maxas), and register 35,39,33,37 to cs<4–7>, then it is possible to use only 2 vector saving instruction to copy the 8 numbers to the corresponding position in shared memory. The left figure in Fig.6, which can be regarded as extracting a slice in every 4 columns from the the $64\times64$ in Fig.2 and concatenate them together, illustrates this process. After completion there will be a $64\times8$ matrix in shared memory, in which each column is continuous and corresponds to a row in matrix C. Then change the order of threads, and let one thread in the warp to transfer a number on this column to finish the 32 continuous float numbers simultaneously. The buffer in shared memory can take the space allocated for loading data in A and B, which are no longer useful after C has been calculated.

The implementation code is shown below. Although this method needs to transfer data between registers and shared memory again and again, the latency of shared memory can be hidden by swapping tasks between 2 warps. Anyway it is still faster than accessing main memory for many times. Note that the method is implemented step-by-step, but there is no synchronization instructions in the code, as all the data exchange is done in the 32 threads of the same warp directly, and there is no data exchange among warps. The execution model of GPU can guarantee that the same instruction in the threads of a warp can be completed at the same time. No synchronization instructions can be regarded as an advantage of the parallelization method in Fig.2.

There is another Figure in Maxas document illustrating this process. However the author may not fully understand it and feels it is kind of confusing, therefore it is not quoted here. The code itself should be self-explained enough. The job of the GEMM kernel generate by Maxas is done after the completion of transferring data to main memory.

Based on the implementation of use 64 threads per thread block as described above, it is possible to scale it for 4 times to 256 threads and the work each thread to do remains the same. Therefore the block matrix each thread block calculates will be enlarged for 2 times along each dimension of the block, becoming $128\times128$. The loading of input matrix and the output of the result will have some corresponding change, which should be straightforward after understanding the 64-thread implementation and omitted here. For large matrix the 256-thread implementation has some advantage on performance. See the detailed performance test result in Maxas document.

# Conclusion

Although the pseudo-code in the original document has been further commented as detailed as possible, it is still a relative high-level implementation. There is still an important topic at the level of GPU machine code, control code, which has not been covered. Since the purpose is just to introduce some ideas and implementation method in GPU optimization, the part of Maxas document related to control code is also not covered here. In summary, the overall ideas of optimization used by Maxas is clear, which have already been proposed by other literature according to the Maxas document. The most difficult part in it is that the author of Maxas has to do a lot of tough reverse engineering to derive the hardware implementation details, which nVidia refuses to disclose, to achieve the peak performance of hardware. It is quite possible that the author of Maxas built a test platform to figure out the performance impact of subtle difference between some instructions. Anyway it is a great piece of work, and is worthwhile for all the engineers aiming at the extreme performance to deep dive.

--

--

## More from Wang Xiaoyu

Computer Vision Engineer