Paper Presentation – EATSS

Author

Michael Maurer

Introduction

I recently reviewed and presented a paper titled Energy-Aware Tile Size Selection for Affine Programs on GPUs (EATSS), authored by Malith Jayaweera, Martin Kong, Yanzhi Wang, and David Kaeli. The paper develops a technique for selecting tile sizes for affine programs, with both power consumption and performance as optimization targets.

Relevance

Power consumption has, in recent years, become a critical consideration when designing not only hardware systems but also algorithms. While hardware has made steady gains in power efficiency, large-scale software has yet to make similar improvements. This paper is one of the first to explore optimizing GPU programs for both power and performance, a field ripe for future work.

Background

Affine Kernels

Affine kernels are chunks of code containing nested loops whose bounds and array subscripts are affine functions of the loop indices. These types of loops are amenable to polyhedral optimization techniques (one of which is loop tiling).
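To make this concrete, here is a hypothetical example (mine, not the paper's): a matrix-multiply loop nest in which every loop bound and every array subscript is an affine function of the loop indices and problem size, which is exactly what makes polyhedral analysis applicable.

```python
# A hypothetical affine kernel: all loop bounds and subscripts are affine.
def matmul(A, B, C, N):
    for i in range(N):          # affine bound: 0 <= i < N
        for j in range(N):      # affine bound: 0 <= j < N
            for k in range(N):  # affine bound: 0 <= k < N
                C[i][j] += A[i][k] * B[k][j]  # affine subscripts
```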

Loop Tiling

Loop tiling is an optimization technique that aims to increase data locality and take advantage of the system's memory hierarchy, generally to improve performance (i.e. execution time). It is a common optimization for affine programs.
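As a sketch (again mine, not the paper's code), here is the matrix-multiply nest above after tiling with illustrative tile sizes Ti, Tj, Tk. Each tile operates on a small block of the arrays that can stay resident in a fast level of the memory hierarchy, without changing the computed result.

```python
# The same loop nest, tiled. min() handles partial tiles at the boundary.
def matmul_tiled(A, B, C, N, Ti=32, Tj=32, Tk=32):
    for ii in range(0, N, Ti):          # inter-tile loops
        for jj in range(0, N, Tj):
            for kk in range(0, N, Tk):
                for i in range(ii, min(ii + Ti, N)):  # intra-tile loops
                    for j in range(jj, min(jj + Tj, N)):
                        for k in range(kk, min(kk + Tk, N)):
                            C[i][j] += A[i][k] * B[k][j]
```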

GPUs

This technique is designed with GPUs in mind, and was evaluated on a set of NVIDIA GPUs. These GPUs all share the same style of memory hierarchy and enable something called “coalesced memory accesses,” or “CMAs.” CMAs are a particular phenomenon by which, if many threads in a warp access contiguous elements, fewer memory accesses are required: nearby elements are pulled in with a single memory transaction as opposed to multiple. This is a core tenet of the algorithm.
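A small back-of-the-envelope illustration of why this matters (an assumption-laden model, not a hardware simulation): count how many 128-byte memory segments a warp of 32 threads touches under a contiguous versus a strided access pattern.

```python
# Count distinct 128-byte segments touched by a warp's addresses.
def segments_touched(addresses, seg_bytes=128):
    return len({addr // seg_bytes for addr in addresses})

ELEM = 4          # bytes per float32
warp = range(32)  # 32 threads in a warp

coalesced = [t * ELEM for t in warp]         # thread t reads A[t]
strided   = [t * 64 * ELEM for t in warp]    # thread t reads A[64 * t]

print(segments_touched(coalesced))  # 1  -> one memory transaction
print(segments_touched(strided))    # 32 -> thirty-two transactions
```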

The Algorithm

EATSS aims to find tile sizes that maximize the “performance per watt” (PPW) of a given workload. Optimization strategies typically target performance (i.e. runtime) or code size; performance per watt is simply another optimization target. Determining the performance per watt of a workload requires knowing both how fast the workload completes and how much energy it consumes. However, the authors do not take any power measurements while optimizing. Instead, they use a proxy, the following objective function:

\[OBJ = \prod_{i \,\in\, \text{par}} T_{i} + \sum_{i}\left(H_{i} \times T_{i}\right)\]

The idea is that maximizing this objective results in high inter-thread data sharing, which leads to more coalesced memory accesses. Why is this useful? Coalesced memory accesses result in fewer overall memory transactions and less data movement, and the authors claim, and demonstrate, that this translates into lower power usage. This is a neat idea, and it makes a lot of sense to the hardware-aware reader.

In \(OBJ\), a handful of terms define the solution space. The authors use an SMT solver (Z3) to search this space for an optimal solution.
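As a rough illustration of this style of search (not the authors' actual encoding), the following sketch uses Z3's Python bindings to maximize an objective of this shape over two hypothetical tile sizes, treating both dimensions as parallel and imposing an assumed shared-memory capacity constraint; the bounds, weights, and byte counts are all illustrative.

```python
from z3 import Int, Optimize, sat

T1, T2 = Int("T1"), Int("T2")  # tile sizes for two loop dimensions

opt = Optimize()
# Tile sizes must be positive and reasonably small (illustrative bounds).
opt.add(T1 >= 1, T1 <= 64, T2 >= 1, T2 <= 64)
# The tile's working set must fit in shared memory (assumed 48 KB budget,
# 4-byte elements, illustrative footprint model).
opt.add(4 * T1 * T2 <= 48 * 1024)

# EATSS-style objective: product over parallel dims plus a weighted sum,
# with hypothetical weights H1, H2.
H1, H2 = 2, 3
opt.maximize(T1 * T2 + H1 * T1 + H2 * T2)

if opt.check() == sat:
    m = opt.model()
    print("T1 =", m[T1], "T2 =", m[T2])
```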

The compiler chooses tile sizes for affine programs, but it has some additional nice features. One is that it supports both single-precision and double-precision data types. Another is that it can vary the split of the GPU's combined on-chip memory between L1 cache and shared memory.
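To sketch what choosing such a split might look like (a hypothetical helper of my own; the capacity and the discrete carveout options are illustrative, not taken from the paper), one could pick the smallest shared-memory carveout that still fits a tile's working set and leave the remainder to L1:

```python
# Hypothetical carveout chooser: assumed 192 KB combined L1/shared capacity
# and an illustrative list of discrete shared-memory splits.
COMBINED_KB = 192
CARVEOUTS_KB = [0, 8, 16, 32, 64, 100, 132, 164]

def pick_carveout(tile_footprint_kb):
    for smem_kb in CARVEOUTS_KB:
        if smem_kb >= tile_footprint_kb:
            return smem_kb, COMBINED_KB - smem_kb  # (shared, L1)
    return None  # tile does not fit in any available split

print(pick_carveout(48))  # -> (64, 128): 64 KB shared, 128 KB left for L1
```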

Results

The authors test their optimization technique on two NVIDIA GPUs, a GA100 and a Xavier. They find that their system performs very well on many of the benchmarks tested, though to varying degrees on each GPU. Many of their comparisons are against PPCG (the Polyhedral Parallel Code Generator); their optimizer operates on the code PPCG generates by default and looks to improve on it. In one comparison, they run between 200 and 800 different PPCG configurations for given kernels and compare the results produced by EATSS against that entire space. On some kernels, such as 2mm, 3mm, and gemm, EATSS beats every configuration PPCG finds in the search space. However, this is not true on all benchmarks, and the geometric-mean PPW achieved by EATSS is lower than the best achieved by PPCG, which is to say that exhaustively searching with PPCG generally delivers better results.

The authors report additional results, such as that certain kernels (BLAS3-like kernels, for example) perform better with more shared memory, while low-dimensional kernels benefit from a higher proportion of allocation to L1 cache. This shows that the ability to rebalance L1 cache and shared memory is a genuinely useful feature of the compiler. The authors also find that their metric is a good approximation of power usage, observing a high correlation between L2 cache reads and average power consumption (0.85 on 2mm, 0.75 on gemm). However, this correlation does not hold on kernels with O(1) memory reuse (it does hold when memory reuse is ~O(n)).

Additionally, the authors compare their results against cuDNN and cuBLAS, two libraries with highly optimized affine kernels. They show that EATSS delivers 75% of cuBLAS's and cuDNN's PPW on the GA100, and performs significantly better than these libraries on the Xavier (2.1x PPW).

And finally, the authors compare against the autotuner ytopt. In every case observed, the autotuner develops worse solutions than EATSS, and it takes around 17 minutes to produce a solution, whereas EATSS converges in on the order of 1 second.

What Could be Improved?

There are some areas for improvement in this paper. One is that the authors run their technique on only a small set of GPUs, both of them from NVIDIA. Their results demonstrate that performance varies quite a bit between the two, so I would argue that really understanding the impact of this technique requires a wider set of hardware. For example, the memory hierarchy on AMD GPUs is markedly different from NVIDIA's, and it is not clear whether the proxy for energy usage holds across GPU types. If I were to give these authors one piece of advice, it would be to run experiments on more GPUs. In discussions of this work with others, this point resonated strongly as something that could have been improved.

Another shortcoming is that some of the comparisons are not really apples-to-apples. cuBLAS and cuDNN both use tensor cores in their implementations, whereas EATSS does not (because PPCG does not). If EATSS could be modified to use them, that could significantly bolster the usefulness of the results.

Two other changes I would have liked to see, though they are less critical, concern the power evaluation and the lack of analysis of why EATSS does not always find the best result.

The measurement of how much power the GPUs under test consume is based on averages of readings from nvidia-smi and tegrastats rather than physical measurements. It has been shown in the past that nvidia-smi undersamples power consumption, so the reported power metrics could (though it is unlikely) be missing useful information.

The paper also does not analyze why EATSS fails to find the optimal solution in all cases. In some contexts the reason is clear, such as for programs with limited CMAs, but an analysis of what other factors are at play would be useful for readers and for those developing future techniques in this space.

An interesting comment made by a colleague was that the runtime of the algorithm should blow up with deeper loop nesting, and this is correct: the authors show that convergence takes 2.2s with 5 loop nests, versus 1.4s with 3. That said, this increase in runtime does not seem alarming to me, though it would have been nice for the authors to comment on it further.

What’s Good?

Now that I’ve complained sufficiently, I’ll say that I think this is quite a useful paper, and the proposed technique looks very strong. EATSS is able to find a solution very quickly (in 1.4s), which is really quite fast for what it’s doing, and it is a much faster optimization approach than running PPCG 200-800 times, or using ytopt (which takes ~17 minutes). It even seems as though this technique could be used in JIT compilers, which is fantastic.

The authors also run their optimizer on a large set of benchmarks (21 of them, of varying types). This means the results they present on the GA100 and Xavier GPUs are likely quite reliable.

The contribution of this paper is also quite novel, and the performance of the optimized code is quite strong, even on par with the highly optimized cuBLAS and cuDNN libraries (though these libraries are optimized for raw performance, not for PPW).

Conclusion

I think this paper is quite good, and I would say it’s worth reading for those working in the GPU or compilers space. The results are clearly strong, and the tool is well validated on the hardware tested. I would like to see additional work done to expand the space of hardware explored, along with a more rigorous analysis of the optimization function, but that seems like feasible future work. Overall, I am glad I read this paper and am aware of the work, and I would recommend it to others.

Additionally, I found that this paper motivated a useful and hearty discussion, and so I would recommend it to professors interested in talking about GPU optimization techniques with their students.
