Author Archives: Theo Mary

Mixed Precision LU Factorization on GPU Tensor Cores

Modern GPUs are equipped with mixed precision units called tensor cores that offer the capability of computing matrix–matrix products both at very high performance and with high accuracy. GPU tensor cores have been used to accelerate various numerical linear algebra algorithms. Among these, LU factorization is a natural candidate, since it can be written so as to heavily rely on matrix–matrix operations.

Haidar et al. were the first to propose an algorithm to compute the LU factorization of a general dense matrix exploiting GPU tensor cores, and obtained up to a 4x-5x speedup in the solution of various linear systems. Moreover, they also showed that the accuracy boost delivered by tensor cores could significantly improve the convergence of iterative methods preconditioned by the computed LU factors.

In a recent paper, Pierre Blanchard, Nick Higham, Florent Lopez, Srikara Pranesh, and I performed the rounding error analysis of an LU factorization algorithm exploiting what we called block FMAs—a generalization of tensor cores that also includes similar mixed precision units such as Google’s TPUs. We explained how block FMAs can be chained together to accumulate the result of intermediate computations in the higher precision (single precision for tensor cores), which allows for a significant accuracy boost.

However, these algorithms shared a common limitation: the matrix was stored in single, rather than half, precision. Indeed, if the matrix is stored in half precision, the accuracy boost is lost and the computed LU factors are essentially identical to those obtained by a standard LU factorization run entirely in half precision. This represents a lost opportunity, because storing the matrix in half precision not only halves the memory footprint, but also greatly reduces data movement, which brings a significant performance boost.

In a recent preprint, Florent Lopez and I propose a new LU factorization algorithm that is able to store the matrix in half precision without incurring such a significant accuracy loss. There are two main ingredients to this new algorithm. First, we switch from the standard right-looking scheme to a left-looking one, which, together with single precision buffers of controlled size, allows us to perform the update operations accurately. Second, we doubly partition the matrix so as to accurately perform the panel factorizations by exploiting tensor cores.

In our experiments, we factorize a range of test matrices using an NVIDIA V100 GPU. The two figures below summarize our main results by comparing three LU factorization algorithms: the standard algorithm (of Haidar et al. and Blanchard et al.) with the matrix stored in either single or half precision, and the new algorithm.

The figure on the left reports the backward error obtained by solving a linear system Ax=b for a dense n\times n matrix with random entries. The figure on the right displays the performance in GigaFLOPS. Unlike the standard algorithm, the new one does not incur a significant loss of accuracy by storing the matrix in half precision. In doing so, it greatly reduces data movement which leads to significant speedup: the new algorithm can be up to twice faster than the standard one.

Sharper Probabilistic Backward Error Bounds

Most backward error bounds for numerical linear algebra algorithms are of the form nu, for a machine precision u and a problem size n. The dependence on n of these bounds is known to be pessimistic: together with Nick Higham, our recent probabilistic analysis [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], which assumes rounding errors to be independent random variables of mean zero, proves that n can be replaced by a small multiple of \sqrt{n} with high probability. However, even these smaller bounds can still be pessimistic, as the figure below illustrates.

The figure plots the backward error for summation (in single precision) of n floating-point numbers randomly sampled from a uniform distribution. For numbers in the [0,1] distribution, the bound \sqrt{n}u is almost sharp and accurately predicts the error growth. However, for the [-1,1] distribution, the error is much smaller, seemingly not growing with n. This strong dependence of the backward error on the data cannot be explained by the existing bounds, which do not depend on the values of the data.

In our recent preprint, we perform a new probabilistic analysis that combines a probabilistic model of the rounding errors with a second probabilistic model of the data. Our analysis reveals a strong dependence of the backward error on the mean of the data \mu: indeed, our new backward error bounds are proportional to \mu\sqrt{n}u + u. Therefore, for data with small or zero mean, these new bounds are much sharper as they bound the backward error by a small multiple of the machine precision independent of the problem size n.

Motivated by this observation, we also propose new algorithms that transform the data to have zero mean, so as to benefit from these more favorable bounds. We implement this idea for matrix multiplication and show that our new algorithm can produce significantly more accurate results than standard matrix multiplication.

Solving Block Low-Rank Linear Systems by LU Factorization is Numerically Stable

A BLR matrix associated with the root separator of a 128^3 Poisson problem. The darker the blocks, the higher their numerical rank.

In many applications requiring the solution of a linear system Ax=b, the matrix A possesses a block low-rank (BLR) property: most of its off-diagonal blocks are of low numerical rank and can therefore be well approximated by low-rank matrices. This property arises for example in the solution of discretized partial differential equations, because the numerical rank of a given off-diagonal block is closely related to the distance between the two subdomains associated with this block. BLR matrices have been exploited to significantly reduce the cost of solving Ax=b, in particular in the context of sparse direct solvers such as MUMPS, PaStiX, and STRUMPACK.

However, the impact of these low-rank approximations on the numerical stability of the solution in floating-point arithmetic has not been previously analyzed. The difficulty of such an analysis lies in the fact that, unlike for classical algorithms without low-rank approximations, there are two kinds of errors to analyze: floating-point errors (which depend on the unit roundoff u), and low-rank truncation errors (which depend on the the low-rank threshold \varepsilon, the parameter controlling the accuracy of the blockwise low-rank approximations). Moreover, the two kinds of error cannot easily be isolated in the analysis, because the BLR compression and factorization stages are often interlaced.

In our recent preprint, with Nick Higham, we present rounding error analysis for the solution of a linear system by LU factorization of BLR matrices. Assuming that a stable pivoting scheme is used, we prove backward stability: the relative backward error is bounded by a modest constant times \varepsilon, and does not depend on the unit roundoff u as long as u is safely smaller than \varepsilon. This is a very desirable theoretical guarantee that can now be given to the users, who can therefore control the numerical behavior of BLR solvers simply by setting \varepsilon to the target accuracy. We illustrate this key result in the figure below, which shows a strong correlation between the backward error and the \varepsilon threshold for 26 matrices from the SuiteSparse collection coming from a wide range of real-life applications.

A Class of Fast and Accurate Summation Algorithms

Summing n numbers is a key computational task at the heart of many numerical algorithms. When performed in floating-point arithmetic, summation is subject to rounding errors: for a machine precision u, the error bound for the most basic summation algorithms, such as recursive summation, is proportional to nu.

Nowadays, with the growing interest in low floating-point precisions and ever increasing n in applications, such error bounds have become unacceptably large. While summation algorithms leading to smaller error bounds are known (compensated summation is an example), they are computationally expensive.

In our recent preprint, Pierre Blanchard, Nick Higham and I propose a class of fast and accurate summation algorithms called FABsum. We show that FABsum has an error bound of the form bu+O(u^2), where b is a block size, which is independent of n to first order. As illustrated by the figure below, which plots the measured error using single precision as a function of n, FABsum can deliver substantially more accurate results than recursive summation. Moreover, FABsum can be easily incorporated in high-performance numerical linear algebra kernels in order to boost accuracy with only a modest drop in performance, as we demonstrate in the paper with the PLASMA library.

SIAM CSE19 Minisymposium on “Advances in Analyzing Floating-point Errors in Computational Science”

by Pierre Blanchard, Nick Higham, and Theo Mary

Last February the SIAM Computational Science and Engineering (CSE19) conference took place in Spokane, WA, USA. We organized a two-part minisymposium on recent Advances in Analyzing Floating-point Errors in Computational Science (see links to part 1 and part 2). Below is the list of the eight talks along with the slides which we have made available.

For a Few Equations More

On the occasion of the Gilles Kahn prize award ceremony, I was asked to write an article about my PhD thesis for the popular science blog Binaire from the French newspaper Le Monde (the French version of the article is available here). You can read the English translation below.

Can you tell what the following problems have in common: predict tomorrow’s weather, build crash-resisting cars, scan the bottom of the oceans searching for oil? These are all difficult problems that are too costly to be tackled physically. Importantly, they can also be described by a fundamental tool of mathematics: linear equations. Therefore, the solution of these physical problems can be numerically simulated by solving instead systems of linear equations.

A system of linear equations (under matrix form).

You probably remember from math class in high school how tedious solving these systems could get, even when they had a small number of equations. In practice, it is actually quite common to face systems with thousands or even millions or equations. While computers can fortunately solve these systems for us, the computational cost of the solution can become very high for such large numbers of equations.

To respond to this need, great quantities of resources and money have been dedicated to the construction of supercomputers of great computational power, equipped with a large number of computing units called “cores”. For example, while your personal computer is likely to have less than a dozen cores, the most powerful supercomputer in the world has several millions of these cores. Nevertheless, the size of the problems that we must tackle today is so great that even these supercomputers are not sufficient.

The Inteprid supercomputer, equipped with 164000 cores © Argonne National Laboratory

To take up this challenge, I have worked during my PhD thesis on new algorithms to solve systems of linear equations whose computational cost is greatly reduced. More precisely, a crucial property of these algorithms is that their cost grows slowly with the number of equations: this is referred to as their complexity. Methods of very low complexity (so-called “hierarchical”) have been proposed since the 2000s. However, these hierarchical methods are quite complex and sophisticated, which makes them unable to attain high performance on supercomputers: that is, their high reduction of the theoretical complexity is translated into only very modest gains in terms of actual computing time.

For this reason, my PhD thesis focused on another method (so-called “Block Low Rank”), that is better suited than hierarchical methods for high performance computing. My first achievement was to compute the complexity of this method, which was previously unknown. I proved that, even if its complexity is slightly higher than than of hierarchical methods, it is still low enough to tackle systems of very large dimensions. In the second part of my thesis, I worked on efficiently implementing this method on supercomputers, so as to translate this theoretical reduction into actual time gains.

By significantly reducing the cost of solving systems of linear equations, this work allowed us to solve several physical problems that were previously too large to be tackled. For instance, it took less than an hour to solve a system of 130 million equations arising in a geophysical application, using a supercomputer equipped with 2400 cores.

A new preconditioner exploiting low-rank factorization error

The solution of a linear system Ax = b is a fundamental task in scientific computing. Two main classes of methods to solve such a system exist.

  • Direct methods compute a factorization of matrix A, such as LU factorization, to then directly obtain the solution x=U^{-1}L^{-1}b by triangular substitution; they are very reliable but also possess a high computational cost, which limits the size of problems that can be tackled.
  • Iterative methods compute a sequence of iterates x_k converging towards the solution x; they are inexpensive but their convergence and thus reliability strongly depends on the matrix properties, which limits the scope of problems that can be tackled.

A current major challenge in the field of numerical linear algebra is therefore to develop methods that are able to tackle a large scope of problems of large size.

To accelerate the convergence of iterative methods, one usually uses a preconditioner, that is, a matrix M ideally satisfying three conditions: (1) M is cheap to compute; (2) M can be easily inverted; (3) M^{-1} is a good approximation to A^{-1}. With such a matrix M, the preconditioned system M^{-1}Ax=M^{1}b is then cheap to solve with an iterative method and often requires a small number of iterations only. An example of a widely used class of preconditioners is when M is computed as a low-accuracy LU factorization.

Unfortunately, for many important problems it is quite difficult to find a preconditioner that is both of good quality and cheap to compute, especially when the matrix A is ill conditioned, that is, when the ratio between its largest and smallest singular values is large.

In our paper A New Preconditioner that Exploits Low-rank Approximations to Factorization Error, with Nick Higham, which recently appeared in SIAM Journal of Scientific Computing, we propose a novel class of general preconditioners that builds on an existing, low-accuracy preconditioner M=A-\Delta A.

This class of preconditioners is based on the following key observation: ill-conditioned matrices that arise in practice often have a small number of small singular values. The inverse of such a matrix has a small number of large singular values and so is numerically low rank. This observation suggests that the error matrix E = M^{-1}A - I = M^{-1}\Delta A \approx A^{-1}\Delta A is of interest, because we may expect E to retain the numerically low-rank property of A^{-1}.

In the paper, we first investigate theoretically and experimentally whether E is indeed numerically low rank; we then describe how to exploit this property to accelerate the convergence of iterative methods by building an improved preconditioner M(I+\widetilde{E}), where \widetilde{E} is a low-rank approximation to E. This new preconditioner is equal to A-M(E-\widetilde{E}) and is thus almost a perfect preconditioner if \widetilde{E}\approx E. Moreover, since \widetilde{E} is a low-rank matrix, (I+\widetilde{E})^{-1} can be cheaply computed via the Sherman–Morrison–Woodbury formula, and so the new preconditioner can be easily inverted.

We apply this new preconditioner to three different types of approximate LU factorizations: half-precision LU factorization, incomplete LU factorization (ILU), and block low-rank (BLR) LU factorization. In our experiments with GMRES-based iterative refinement, we show that the new preconditioner can achieve a significant reduction in the number of iterations required to solve a variety of real-life Ax=b problems.