Traditionally, the fastest supercomputers in the world are ranked according to their performance on HPL, a portable implementation of the High-Performance Computing Linpack Benchmark for distributed memory systems. This software package measures the rate of execution, expressed in binary64 floating-point operations per second, at which a computer solves a large dense linear system using LU factorization with row partial pivoting. This metric is an excellent predictor of the performance a machine will achieve on compute-bound tasks executed in binary64 arithmetic, but the test problem is not representative of typical artificial intelligence computations, where lower precision is typically considered satisfactory.
In an endeavour to consider both workloads at once, the Innovative Computing Laboratory (ICL) at the University of Tennesse, Knoxville, developed the HPL-AI Mixed-Precision Benchmark. The reference implementation of the benchmark solves a dense linear system of order to binary64 accuracy by complementing a low-precision direct solver with a high-precision refinement strategy. The code generates and in binary64, computes the LU factorization without pivoting of using only binary32 arithmetic, finds an approximate binary32 solution using forward and back substitution, and finally solves the linear system with GMRES in binary64 arithmetic, using as starting vector and the low-precision LU factors of as preconditioners.
In order to obtain the best possible rate of execution, the coefficient matrix must be dense and, as the benchmark is expected to be needed for , it should also be easy to generate at scale on a distributed memory environment. Furthermore, we require that have growth factor of order 1, so as to guarantee the stability of Gaussian elimination without pivoting, and be not too ill-conditioned, in order to ensure the convergence of GMRES. Currently, the reference implementation relies on a randomly generated matrix that is diagonally dominant by rows. This choice ensures the stability of LU factorization without pivoting, but the matrix thus constructed requires an amount of communication that depends on , and turns out to be extremely well conditioned, with an -norm condition number just above 4 for large .
In our EPrint Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization, Nick Higham and I present a new parametric class of dense square matrices for which LU factorization without pivoting is stable. The parameters and that yield a specified -norm condition number can be computed efficiently, and once they have been chosen the matrix can be formed from an explicit formula in floating-point operations.
We also discuss several adaptations aimed at further improving the suitability of the generated matrix as a test problem for the HPL-AI benchmark, as our main goal is to provide a robust and efficient matrix generator for it. In particular, we explain how scaling the matrix can help transition from binary32 to binary16, in order to take advantage of the hardware accelerators for low precision such as the tensor cores that equip recent NVIDIA GPUs, and we discuss how the matrices can be tweaked to make the entries in the LU factors less predictable and slow down the convergence of GMRES.