Summing numbers is a key computational task at the heart of many numerical algorithms. When performed in floating-point arithmetic, summation is subject to rounding errors: for a machine precision , the error bound for the most basic summation algorithms, such as recursive summation, is proportional to .
Nowadays, with the growing interest in low floating-point precisions and ever increasing in applications, such error bounds have become unacceptably large. While summation algorithms leading to smaller error bounds are known (compensated summation is an example), they are computationally expensive.
In our recent preprint, Pierre Blanchard, Nick Higham and I propose a class of fast and accurate summation algorithms called FABsum. We show that FABsum has an error bound of the form , where is a block size, which is independent of to first order. As illustrated by the figure below, which plots the measured error using single precision as a function of , FABsum can deliver substantially more accurate results than recursive summation. Moreover, FABsum can be easily incorporated in high-performance numerical linear algebra kernels in order to boost accuracy with only a modest drop in performance, as we demonstrate in the paper with the PLASMA library.
Last February the SIAM Computational Science and Engineering (CSE19) conference took place in Spokane, WA, USA. We organized a two-part minisymposium on recent Advances in Analyzing Floating-point Errors in Computational Science (see links to part 1 and part 2). Below is the list of the eight talks along with the slides which we have made available.
On the occasion of the Gilles Kahn prize award ceremony, I was asked to write an article about my PhD thesis for the popular science blog Binaire from the French newspaper Le Monde (the French version of the article is available here). You can read the English translation below.
Can you tell what the following problems have in common: predict tomorrow’s weather, build crash-resisting cars, scan the bottom of the oceans searching for oil? These are all difficult problems that are too costly to be tackled physically. Importantly, they can also be described by a fundamental tool of mathematics: linear equations. Therefore, the solution of these physical problems can be numerically simulated by solving instead systems of linear equations.
You probably remember from math class in high school how tedious solving these systems could get, even when they had a small number of equations. In practice, it is actually quite common to face systems with thousands or even millions or equations. While computers can fortunately solve these systems for us, the computational cost of the solution can become very high for such large numbers of equations.
To respond to this need, great quantities of resources and money have been dedicated to the construction of supercomputers of great computational power, equipped with a large number of computing units called “cores”. For example, while your personal computer is likely to have less than a dozen cores, the most powerful supercomputer in the world has several millions of these cores. Nevertheless, the size of the problems that we must tackle today is so great that even these supercomputers are not sufficient.
To take up this challenge, I have worked during my PhD thesis on new algorithms to solve systems of linear equations whose computational cost is greatly reduced. More precisely, a crucial property of these algorithms is that their cost grows slowly with the number of equations: this is referred to as their complexity. Methods of very low complexity (so-called “hierarchical”) have been proposed since the 2000s. However, these hierarchical methods are quite complex and sophisticated, which makes them unable to attain high performance on supercomputers: that is, their high reduction of the theoretical complexity is translated into only very modest gains in terms of actual computing time.
For this reason, my PhD thesis focused on another method (so-called “Block Low Rank”), that is better suited than hierarchical methods for high performance computing. My first achievement was to compute the complexity of this method, which was previously unknown. I proved that, even if its complexity is slightly higher than than of hierarchical methods, it is still low enough to tackle systems of very large dimensions. In the second part of my thesis, I worked on efficiently implementing this method on supercomputers, so as to translate this theoretical reduction into actual time gains.
By significantly reducing the cost of solving systems of linear equations, this work allowed us to solve several physical problems that were previously too large to be tackled. For instance, it took less than an hour to solve a system of 130 million equations arising in a geophysical application, using a supercomputer equipped with 2400 cores.
The solution of a linear system is a fundamental task in scientific computing. Two main classes of methods to solve such a system exist.
Direct methods compute a factorization of matrix , such as LU factorization, to then directly obtain the solution by triangular substitution; they are very reliable but also possess a high computational cost, which limits the size of problems that can be tackled.
Iterative methods compute a sequence of iterates converging towards the solution ; they are inexpensive but their convergence and thus reliability strongly depends on the matrix properties, which limits the scope of problems that can be tackled.
A current major challenge in the field of numerical linear algebra is therefore to develop methods that are able to tackle a large scope of problems of large size.
To accelerate the convergence of iterative methods, one usually uses a preconditioner, that is, a matrix ideally satisfying three conditions: (1) is cheap to compute; (2) can be easily inverted; (3) is a good approximation to . With such a matrix , the preconditioned system is then cheap to solve with an iterative method and often requires a small number of iterations only. An example of a widely used class of preconditioners is when is computed as a low-accuracy LU factorization.
Unfortunately, for many important problems it is quite difficult to find a preconditioner that is both of good quality and cheap to compute, especially when the matrix is ill conditioned, that is, when the ratio between its largest and smallest singular values is large.
This class of preconditioners is based on the following key observation: ill-conditioned matrices that arise in practice often have a small number of small singular values. The inverse of such a matrix has a small number of large singular values and so is numerically low rank. This observation suggests that the error matrix is of interest, because we may expect to retain the numerically low-rank property of .
In the paper, we first investigate theoretically and experimentally whether is indeed numerically low rank; we then describe how to exploit this property to accelerate the convergence of iterative methods by building an improved preconditioner , where is a low-rank approximation to . This new preconditioner is equal to and is thus almost a perfect preconditioner if . Moreover, since is a low-rank matrix, can be cheaply computed via the Sherman–Morrison–Woodbury formula, and so the new preconditioner can be easily inverted.
We apply this new preconditioner to three different types of approximate LU factorizations: half-precision LU factorization, incomplete LU factorization (ILU), and block low-rank (BLR) LU factorization. In our experiments with GMRES-based iterative refinement, we show that the new preconditioner can achieve a significant reduction in the number of iterations required to solve a variety of real-life problems.