## Mixed Precision LU Factorization on GPU Tensor Cores

Modern GPUs are equipped with mixed precision units called tensor cores that offer the capability of computing matrix–matrix products both at very high performance and with high accuracy. GPU tensor cores have been used to accelerate various numerical linear algebra algorithms. Among these, LU factorization is a natural candidate, since it can be written so as to heavily rely on matrix–matrix operations.

Haidar et al. were the first to propose an algorithm to compute the LU factorization of a general dense matrix exploiting GPU tensor cores, and obtained up to a 4x-5x speedup in the solution of various linear systems. Moreover, they also showed that the accuracy boost delivered by tensor cores could significantly improve the convergence of iterative methods preconditioned by the computed LU factors.

In a recent paper, Pierre Blanchard, Nick Higham, Florent Lopez, Srikara Pranesh, and I performed the rounding error analysis of an LU factorization algorithm exploiting what we called block FMAs—a generalization of tensor cores that also includes similar mixed precision units such as Google’s TPUs. We explained how block FMAs can be chained together to accumulate the result of intermediate computations in the higher precision (single precision for tensor cores), which allows for a significant accuracy boost.

However, these algorithms shared a common limitation: the matrix was stored in single, rather than half, precision. Indeed, if the matrix is stored in half precision, the accuracy boost is lost and the computed LU factors are essentially identical to those obtained by a standard LU factorization run entirely in half precision. This represents a lost opportunity, because storing the matrix in half precision not only halves the memory footprint, but also greatly reduces data movement, which brings a significant performance boost.

In a recent preprint, Florent Lopez and I propose a new LU factorization algorithm that is able to store the matrix in half precision without incurring such a significant accuracy loss. There are two main ingredients to this new algorithm. First, we switch from the standard right-looking scheme to a left-looking one, which, together with single precision buffers of controlled size, allows us to perform the update operations accurately. Second, we doubly partition the matrix so as to accurately perform the panel factorizations by exploiting tensor cores.

In our experiments, we factorize a range of test matrices using an NVIDIA V100 GPU. The two figures below summarize our main results by comparing three LU factorization algorithms: the standard algorithm (of Haidar et al. and Blanchard et al.) with the matrix stored in either single or half precision, and the new algorithm.

The figure on the left reports the backward error obtained by solving a linear system $Ax=b$ for a dense $n\times n$ matrix with random entries. The figure on the right displays the performance in GigaFLOPS. Unlike the standard algorithm, the new one does not incur a significant loss of accuracy by storing the matrix in half precision. In doing so, it greatly reduces data movement which leads to significant speedup: the new algorithm can be up to twice faster than the standard one.