Fast Solution of Linear Systems via GPU Tensor Cores’ FP16 Arithmetic and Iterative Refinement

Over the last 30 years, hierarchical computer memories, multicore processors and graphics processing units (GPUs) have all necessitated the redesign of numerical linear algebra algorithms, and in doing so have led to algorithmic innovations. Mixed precision arithmetic—a concept going back to the earliest computers, which had the ability to accumulate inner products in extra precision—attracted renewed interest in the late 1990s once Intel chips were able to execute single precision at twice the rate of double precision. Now the increasing availability of low precision arithmetic is offering new opportunities.

In the paper Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers, presented at SC18 (the leading supercomputing conference), Azzam Haidar, Stanimire Tomov, Jack Dongarra and Nick Higham show how to exploit the half precision (fp16) arithmetic that is now available in hardware. Whereas fp16 arithmetic can be expected to run at twice the rate of fp32 (single precision) arithmetic, the NVIDIA V100 GPU has tensor cores that can execute half precision at up to eight times the speed of single precision and can deliver the results to single precision accuracy. Developing algorithms that can exploit half precision arithmetic is important both for a workstation connected to a single V100 GPU and for the world’s fastest computer (as of November 2018): Summit at Oak Ridge National Laboratory, which contains 27,648 V100 GPUs.

The paper shows that a dense n-by-n double precision linear system $Ax = b$ can be solved using mixed precision iterative refinement at a rate up to four times faster than a highly optimized double precision solver and with a reduction in energy consumption by a factor of five.

The key idea is to LU factorize the matrix $A$ in a mix of half precision and single precision, then apply iterative refinement. The update equations in the refinement process are solved by an inner GMRES iteration that uses the LU factors as preconditioners. This GMRES-IR algorithm was proposed by Carson and Higham in two (open access) papers in SIAM J. Sci. Comput. (2017 and 2018). In the form used here, the algorithm converges for matrices with condition numbers up to about $10^8$. It provides a backward stable, double precision solution while carrying out almost all the flops at lower precision.
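The structure of mixed precision iterative refinement can be illustrated with a minimal sketch. The code below shows classical iterative refinement: factorize in low precision, then compute residuals and apply corrections in double precision. It is a simplification of the paper's method in two ways, both assumptions of this sketch: LAPACK offers no fp16 factorization, so fp32 stands in for the low precision, and the correction step here reuses the LU factors directly, whereas GMRES-IR would instead run a GMRES iteration on the correction equation preconditioned by those factors.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_ir(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b to double precision accuracy using an LU
    factorization computed in low precision plus iterative refinement.

    fp32 stands in for the low-precision factorization (LAPACK has no
    fp16 routines); the algorithmic structure is unchanged.
    """
    A_lo = A.astype(np.float32)            # factorize in low precision
    lu, piv = lu_factor(A_lo)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                      # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        # Correction solve reuses the low-precision LU factors;
        # GMRES-IR would instead solve A d = r by GMRES with these
        # factors as the preconditioner.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x
```

For a reasonably well conditioned matrix the loop converges in a handful of iterations, with the O(n^3) factorization cost paid entirely at the lower precision and only O(n^2) work per refinement step at double precision.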

Codes implementing this work will be released through the open-source MAGMA library.