Summing numbers is a key computational task at the heart of many numerical algorithms. When performed in floating-point arithmetic, summation is subject to rounding errors: for a machine precision , the error bound for the most basic summation algorithms, such as recursive summation, is proportional to .

Nowadays, with the growing interest in low floating-point precisions and ever increasing in applications, such error bounds have become unacceptably large. While summation algorithms leading to smaller error bounds are known (compensated summation is an example), they are computationally expensive.

In our recent preprint, Pierre Blanchard, Nick Higham and I propose a class of fast and accurate summation algorithms called FABsum. We show that FABsum has an error bound of the form , where is a block size, which is independent of to first order. As illustrated by the figure below, which plots the measured error using single precision as a function of , FABsum can deliver substantially more accurate results than recursive summation. Moreover, FABsum can be easily incorporated in high-performance numerical linear algebra kernels in order to boost accuracy with only a modest drop in performance, as we demonstrate in the paper with the PLASMA library.

In earlier blog posts, I wrote about the benefits of using half precision arithmetic (fp16) and about the problems of overflow and underflow in fp16 and how to avoid them. But how can one experiment with fp16, or other low precision formats such as bfloat16, in order to study how algorithms behave in these arithmetics? (For an accessible introduction to fp16 and bfloat16 see the blog post by Nick Higham.)

As of now, fp16 is supported by several GPUs, but these are specialist devices and they can be very expensive. Moreover, architectures that support bfloat16 have not yet not been released. Therefore software that simulates these floating-point formats is needed.

In our latest EPrint, Nick Higham and I investigate algorithms for simulating fp16, bfloat16 and other low precision formats. We have also written a MATLAB function chop that can be incorporated into other MATLAB codes to simulate low precision arithmetic. It can easily be used to study the effect of low precision formats on various algorithms.

Imagine a hypothetical situation where the computer can just represent integers. Then the question is how do we represent numbers like 4/3? An obvious answer would be to represent it via the integer closest to it, 1 in this case. However, one will have to come with a convention to handle the case where the number is in the centre. Now replace the integer in the example with floating-point numbers, and a similar question arises. This process of converting any given number to a floating-point number is called rounding. If we adopt a rule where we choose the closest floating-point number (as above), then we formally call it as ‘round to nearest’. There are other ways to round as well, and different rounding modes can yield different results for the same code. However meddling with the parameters of a floating-point format without a proper understanding of their consequences can be a recipe for disaster. Cleve Moler in his blog on sub-normal numbers makes this point by warning ‘don’t try this at home’. The MATLAB software we have written provides a safe environment to experiment with the effects of changing any parameter of a floating-point format (such as rounding modes and support of subnormal numbers) on the output of a code. All the technical details can be found in the Eprint and our MATLAB codes.

The wave-kernel functions and arise in the solution of second order differential equations such as with initial conditions at . Here, is an arbitrary square matrix and . The square root in these formulas is illusory, as both functions can be expressed as power series in , so there are no questions about existence of the functions.

How can these functions be computed efficiently? In Computing the Wave-Kernel Matrix Functions (SIAM J. Sci. Comput., 2018) Prashanth Nadukandi and I develop an algorithm based on Padé approximation and the use of double angle formulas. The amount of scaling and the degree of the Padé approximant are chosen to minimize the computational cost subject to achieving backward stability for in exact arithmetic.

In the derivation we show that the backward error of any approximation to can be explicitly expressed in terms of a hypergeometric function. To bound the backward error we derive and exploit a new bound for in terms of the norms of lower powers of ; this bound is sharper than one previously obtained by Al-Mohy and Higham.

Numerical experiments show that the algorithm behaves in a forward stable manner in floating-point arithmetic and is superior in this respect to the general purpose Schur–Parlett algorithm applied to these functions.

The fundamental regions of the

function cosh(sqrt(z)), needed for the backward error analysis

The solution of a linear system is a fundamental task in scientific computing. Two main classes of methods to solve such a system exist.

Direct methods compute a factorization of matrix , such as LU factorization, to then directly obtain the solution by triangular substitution; they are very reliable but also possess a high computational cost, which limits the size of problems that can be tackled.

Iterative methods compute a sequence of iterates converging towards the solution ; they are inexpensive but their convergence and thus reliability strongly depends on the matrix properties, which limits the scope of problems that can be tackled.

A current major challenge in the field of numerical linear algebra is therefore to develop methods that are able to tackle a large scope of problems of large size.

To accelerate the convergence of iterative methods, one usually uses a preconditioner, that is, a matrix ideally satisfying three conditions: (1) is cheap to compute; (2) can be easily inverted; (3) is a good approximation to . With such a matrix , the preconditioned system is then cheap to solve with an iterative method and often requires a small number of iterations only. An example of a widely used class of preconditioners is when is computed as a low-accuracy LU factorization.

Unfortunately, for many important problems it is quite difficult to find a preconditioner that is both of good quality and cheap to compute, especially when the matrix is ill conditioned, that is, when the ratio between its largest and smallest singular values is large.

This class of preconditioners is based on the following key observation: ill-conditioned matrices that arise in practice often have a small number of small singular values. The inverse of such a matrix has a small number of large singular values and so is numerically low rank. This observation suggests that the error matrix is of interest, because we may expect to retain the numerically low-rank property of .

In the paper, we first investigate theoretically and experimentally whether is indeed numerically low rank; we then describe how to exploit this property to accelerate the convergence of iterative methods by building an improved preconditioner , where is a low-rank approximation to . This new preconditioner is equal to and is thus almost a perfect preconditioner if . Moreover, since is a low-rank matrix, can be cheaply computed via the Sherman–Morrison–Woodbury formula, and so the new preconditioner can be easily inverted.

We apply this new preconditioner to three different types of approximate LU factorizations: half-precision LU factorization, incomplete LU factorization (ILU), and block low-rank (BLR) LU factorization. In our experiments with GMRES-based iterative refinement, we show that the new preconditioner can achieve a significant reduction in the number of iterations required to solve a variety of real-life problems.

According to Moler and Van Loan, truncating the Taylor series expansion to the exponential at 0 is the least effective of their nineteen dubious ways to compute the exponential of a matrix. Such an undoubtedly questionable approach can, however, become a powerful tool for evaluating matrix functions when used in conjunction with other strategies, such as, for example, the scaling and squaring technique. In fact, truncated Taylor series are just a special case of a much larger class of rational approximants, known as the Padé family, on which many state-of-the-art algorithms rely in order to compute matrix functions.

A customary choice for evaluating these approximants at a matrix argument is the Paterson–Stockmeyer method, an evaluation scheme that was originally proposed as an asymptotically optimal algorithm for evaluating polynomials of matrices, but generalizes quite naturally to rational functions, which are nothing but the solutions of a linear system whose coefficients and right-hand side are polynomials of the same matrix. This technique exploits a clever rewriting of a polynomial in as a polynomials in , for some positive integer , and overall requires about matrix products to evaluate a polynomial of degree .

As shown in the figure, when the Paterson–Stockmeyer scheme is used the number of matrix multiplications required to evaluate a polynomial of degree grows slower than itself, with the result that evaluating polynomials of different degree will asymptotically have the same cost. For example, evaluating a polynomial of any degree between 43 and 49 requires 12 matrix multiplications, thus there is little point in considering an approximant of degree 43 when evaluating that of degree 49 has roughly the same cost but will in all likelihood deliver a more accurate result.

When designing algorithms to evaluate functions of matrices, one is interested in finding the optimal degrees, those marked with a red circle in the figure above, since they guarantee maximal accuracy for a given computational cost. When fixed precision is considered, finding all such degrees is not a problem: a backward error analysis can be used to determine the maximum degree that will ever be needed, say, and then looking at the plot is enough to find all the optimal degrees smaller than . In order to deal with arbitrary precision arithmetic, however, a different strategy is needed, as depending on the working precision and the required tolerance, approximants of arbitrarily high degree may be needed. The new Eprint Optimality of the Paterson–Stockmeyer Method for Evaluating Matrix Polynomials and Rational Matrix Functions studies the cost of the Paterson–Stockmeyer method for polynomial evaluation and shows that a degree is optimal if and only if it is a quarter-square, that is, a number of the form for some nonnegative integer , where is the floor function.

Similar results can be obtained for Paterson–Stockmeyer-like algorithms for evaluating diagonal Padé approximants, rational functions whose numerator and denominator have same degree. In that case, one can show that an approximant is optimal if and only if the degree of numerator and denominator is an eight-square, an integer of the form for some . In particular, for diagonal Padé approximants to the exponential, due to a symmetry of the coefficients of numerator and denominator, faster algorithms can be developed, and an explicit formula—not as nice as that in the two previous cases—can be derived for the optimal orders of these approximants.

Edmond Laguerre was the first to discuss, en passant in a paragraph of a long letter to Hermite, the exponential of a matrix. In fact, the French mathematician confined himself to defining via the Taylor series expansion of at and remarking that the scalar identity does not generalise to matrices. Surprisingly perhaps, truncating the Taylor expansion is still today, just over 150 years later, one of the most effective way of approximating the matrix exponential, along with the use of a family of rational approximants discovered a couple of decades later by Henri Padé, a student of Hermite’s.

Despite the fact that the exponential is an entire function, Taylor and Padé approximants converge only locally, and may require an approximant of exceedingly high order for matrices of large norm. Therefore, scaling the input is often necessary, in order to ensure fast and accurate computations. This can be effectively achieved by means of a technique called scaling and squaring, first proposed in 1967 by Lawson, who noticed that by exploiting the identity one can compute the matrix exponential by evaluating one of the approximants above at and then squaring the result times.

The choice of and of the order of the approximant hinges on a delicate trade-off that has been thoroughly studied in the last 50 years. The most efficient algorithms are tailored to double precision, and rely on the computation of precision-dependent constants, a relatively expensive task that needs to be performed only once during the design stage of the algorithm, and produces algorithms that are very efficient at runtime.

These techniques, however, do not readily extend to arbitrary precision environments, where the working precision at which the computation will be performed is known only at runtime. In the EPrint An arbitrary precision scaling and squaring algorithm for the matrix exponential, Nick Higham and I show how the scaling and squaring method can be adapted to the computation of the matrix exponential in precisions higher than double.

The algorithm we propose combines a new analysis of the forward truncation error of Padé approximants to the exponential with an improved strategy for selecting the parameters required to perform the scaling and squaring step. We evaluate at runtime a bound that arises naturally from our analysis in order to check whether a certain choice of those parameters will deliver an accurate evaluation of the approximant or not. Moreover, we discuss how this technique can be implemented efficiently for truncated Taylor series and diagonal Padé approximants.

Top left: forward error of several algorithms on a test set of non-Hermitian matrices. Top right: corresponding performance profile. Bottom: legend for the other two plots. Variants of our algorithm are in the top row, existing implementations for computing the matrix exponential in arbitrary precision are in the bottom row.

We evaluate experimentally several variants of the proposed algorithm in a wide range of precisions, and find that they all behave in a forward stable fashion. In particular, the most accurate of our implementations outperforms existing algorithms for computing the matrix exponential in high precision.

In my previous post, I gave a general introduction to the importance of using half precision (fp16) in the solution of linear systems of equations. In this post I will focus on one specific obstacle in using fp16 for general scientific computing: handling very large numbers and very small numbers.

To clarify the meaning of very large and very small it is helpful to draw an analogy with a ruler, which we use it on daily basis to measure length.

The minimum length which one can measure with the ruler shown in the picture (Image credits splashmath) is 1 mm (millimeter), and this is referred to as the least count of this measuring instrument. The maximum length that can be measured is 10.5 cm (centimeters). If we have a pencil whose length falls between 5.6 cm and 5.7 cm, then we decide if it is 5.6 or 5.7 based on to which it is closer to. A similar process also happens in a computer!

Drawing parallels with the example above, we use a similar ruler to measure in a computer, but we measure numbers rather than lengths, and this ruler is called a “floating point system”. Just like a ruler there is a minimum number which the floating point system can measure, and any number less than that is treated as zero. This process of numbers becoming zero in a floating point system because they are very small is called “underflow”. Next, any number which is too large to be measured by a floating point system is made infinity, and this process is called “overflow”. Finally just like a scale, any number is represented by a number closest to it in the floating point format, and this process is called as ‘rounding’. For a detailed rigorous and accessible introduction to floating point arithmetic, I would refer interested readers to this blog post by Cleve Moler.

There are four standard ways of measuring numbers, and they are called half precision, single precision, double precision, and quadruple precision. They are in the increasing order of maximum number they can represent, and decreasing order of the minimum number they can represent. For the sake of concreteness, lets consider half precision and double precision. The maximum number that can be represented by double precision is 1.80 × 10^{308}, and for half precision is 65500. The maximum number which can be represented in double precision is enough for most of the problems arising in scientific computing. On the other hand 65500 is extremely small! For example, the modulus of elasticity of many metals is in the order of 10^{9}, and this is infinity in fp16! Similarly the minimum positive (and normalized) numbers are 1.18 × 10^{-38 }for double precision, and 6.10 x 10^{-5} for fp16. This limitation in the range of numbers poses a serious limitation for using fp16 in scientific computing.

Summarising, we have a dichotomy between the computational efficiency and the limitation in representing large and small numbers in fp16. To address this, Prof. Nick Higham and I have developed an algorithm that squeezes a matrix into the range of numbers that can be represented by fp16. We use the well known technique of diagonal scaling to scale all the matrix entries between -1 and +1, and next we multiply the matrix by θ x 65500 to make complete use of the range of numbers in fp16. θ < 1, and is used to avoid overflow in subsequent computation. The scaling algorithm proposed is not restricted to any particular application, but we concentrate on solution of system of linear equations. We employ GMRES-based iterative refinement, which has generated a lot of interest in the scientific computing community. The main contribution of this scaling algorithm is that it greatly expands the class of problems in which fp16 can be used. All the technical details and results of numerical experiments can be found in the following EPrint of the manuscript.

Backward error for random linear system of dimension n.

James Wilkinson developed a systematic way of analyzing rounding errors in numerical algorithms in the 1960s—a time when computers typically used double precision arithmetic and a matrix of dimension was considered large. As a result, an error bound with a dominant term , for a low degree polynomial and the machine precision, was acceptably small.

Today, the supercomputers heading the TOP500 list solve linear systems of equations of dimension and half precision arithmetic (accurate to about 4 significant decimal digits) is increasingly available in hardware, notably on graphical processing units (GPUs) from AMD and NVIDIA. Traditional rounding error bounds cannot guarantee accurate results when the dimension is very large or the precision is low. Yet computations are often successful in such cases—for example in machine learning algorithms making use of half, or even lower, precision.

This discrepancy between theory and practice stems from the fact that traditional rounding error bounds are worst-case bounds and so tend to be pessimistic. Indeed, while a single floating-point operation incurs a rounding error bounded in modulus by , the composition of operations leads to a worst-case error bound with dominant term proportional to . But this worst-case error bound is attained only when each rounding error is of maximal magnitude and identical sign, which is very unlikely. Since the beginning of the digital computer era many researchers have modelled rounding errors as random variables in an attempt to obtain better estimates of how the error behaves on average. This line of thinking has led to the well-known rule of thumb, based on informal arguments and assumptions, that constants in rounding error bounds can be replaced by their square roots.

In our EPrint A New Approach to Probabilistic Rounding Error Analysis we make minimal probabilistic assumptions on the rounding errors and make use of a tool from probability theory called a concentration inequality. We show that in several core linear algebra algorithms, including matrix-vector products and LU factorization, the backward error can be bounded with high probability by a relaxed constant proportional to instead of . Our analysis provides the first rigorous justification of the rule of thumb.

This new bound is illustrated in the figure above, where we consider the solution of a linear system by LU factorization. The matrix and vector have entries from the random uniform [0,1] distribution and is formed as . We compare the backward error with its traditional worst-case bound and our relaxed probabilistic bound. The figure shows that the probabilistic bound is in very good agreement with the actual backward error and is much smaller than the traditional bound. Moreover, it successfully captures the asymptotic behavior of the error growth, which follows rather than .

The assumptions underlying our analysis—that the rounding errors are independent random variables of mean zero—do not always hold, as we illustrate with examples in the paper. Nevertheless, our experiments show that the bounds do correctly predict the backward error for a selection of real-life matrices from the SuiteSparse collection.

NVIDIA Founder & CEO Jensen Huang talking about the work reported here in his special address at Supercomputing 2018 (8:30 onwards).

Over the last 30 years, hierarchical computer memories, multicore processors and graphical processing units (GPUs) have all necessitated the redesign of numerical linear algebra algorithms, and in doing so have led to algorithmic innovations. Mixed precision arithmetic—a concept going back to the earliest computers, which had the ability to accumulate inner products in extra precision—attracted renewed interest in the late 1990s once Intel chips were able to execute single precision at twice the rate of double precision. Now the increasing availability of low precision arithmetic is offering new opportunities.

In the paper Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers presented at SC18 (the leading supercomputing conference), Azzam Haidar, Stanimire Tomov, Jack Dongarra and Nick Higham show how to exploit the half precision (fp16) arithmetic that is now available in hardware. Whereas fp16 arithmetic can be expected to run at twice the rate of fp32 (single precision) arithmetic, the NVIDIA V100 GPU has tensor cores that can execute half precision at up to eight times the speed of single precision and can deliver the results to single precision accuracy. Developing algorithms that can exploit half precision arithmetic is important both for a workstation connected to a single V100 GPU and the world’s fastest computer (as of November 2018): Summit at Oak Ridge National Laboratory, which contains 27,648 V100 GPUs.

The paper shows that a dense n-by-n double precision linear system can be solved using mixed precision iterative refinement at a rate up to four times faster than a highly optimized double precision solver and with a reduction in energy consumption by a factor five.

The key idea is to LU factorize the matrix in a mix of half precision and single precision then apply iterative refinement. The update equations in the refinement process are solved by an inner GMRES iteration that uses the LU factors as preconditioners. This GMRES-IR algorithm was proposed by Carson and Higham in two (open access) papers in SIAM J. Sci. Comput. (2017 and 2018). In the form used here, the algorithm converges for matrices with condition numbers up to about . It provides a backward stable, double precision solution while carrying out almost all the flops at lower precision.

Codes implementing this work will be released through the open-source MAGMA library.

Max Fasi and Bruno Iannazzo have shown how to compute all primary solutions of a matrix equation for rational functions . A primary solution is one that can be written as a polynomial in . The proposed algorithm exploits the Schur decomposition and generalizes earlier algorithms for matrix roots.