Author Archives: Massimiliano Fasi

CPFloat: A C library for emulating low-precision arithmetic

Detail of a replica of the mechanical computer Z1 on display at the German Museum of Techonolgy in Berlin. This machine, developed by Konrad Zuse between 1936 and 1938, could perform floating-point computations using a 24-bit representation. “Mechanischer Computer Z1, Nachbildung” by Stiftung Deutsches Technikmuseum Berlin is licensed under CC BY 4.0.

Computing machines with floating-point capabilities started appearing in the late 1940s, and despite being regarded as an optional feature for decades to come, floating-point hardware equipped a number of the mainframe computers produced in the 1960s and 1970s. At the time, there was no consensus on what features a number system ought to have, and the arithmetics implemented by any two vendors could differ widely in terms of word sizes, floating-point representations, rounding behaviour, and accuracy of the operations implemented in hardware. This lack of standardization was a major hindrance to the portability of the source code, as it made it significantly harder to write and maintain architecture-independent code.

The way out of this predicament was the establishment of the IEEE 754-1985 standard, which dictated word sizes, representations, and accuracy requirements for two default floating-point formats: single and double precision. For about three decades, hardware vendors conformed with the standard, and the large majority of general-purpose processing units came equipped with IEEE-compliant arithmetic units. Motivated by the low-precision requirements of machine learning applications, however, in recent years hardware manufacturers have started exploring the realm of low-precision arithmetics, commercialising a variety of highly optimised hardware units that pursue efficiency at the expense of accuracy. The high-performance computing community has attempted to channel the extreme performance of this hardware and utilise it in more traditional scientific computing applications, where more stringent accuracy needs typically require the use of higher precision.

The broad range of different formats available poses a serious challenge to those trying to develop mixed-precision algorithms, since studying the numerical behaviour of an algorithm in different working precisions may require access to a number of high-end hardware platforms, including supercomputer-grade CPUs and GPUs. In order to alleviate the problem, several software packages for simulating low-precision floating-point arithmetics in software have been developed. Some of these are described in the table below. All these solutions execute each arithmetic operation in hardware, and use the software layer only to round the results to the desired number of significant bits of precision.

Available software packages for simulating low-precision floating-pointarithmetic. The first three columns reports the name of the package, the main pro-gramming language in which the software is written, and what storage formats are supported. The following three columns describe the parameters of the target formats: whether the number of bits of precision in the significand is arbitrary (A) or limited to the number of bits in the significand of the storage format (R); whether the exponentrange can be arbitrary (A), must be the same as the storage format (R), or a sub-range thereof (S); whether the target format supports subnormal numbers (S), does not support them (N), supports them only for builtin types (P), or supports them but allows the user to switch the functionality off (F). The following column lists the floating-point formats that are built into the system. The last five columns indicate whetherthe software supports round-to-nearest wih ties-to-even (RNE), ties-to-zero (RNZ),or ties-to-away (RNA), the three directed rounding modes of the IEEE 754 standardround-toward-zero (RZ), round-to-+∞ and round-to-−∞ (RUD), round-to-odd (RO), and the two variants of stochastic rounding discussed in Section 3 (SR). The abbreviations bf16, tf32, fp16, fp32, and fp64 denote the formats bfloat16, TensorFloat-32, binary16, binary32, and binary64, respectively.

If the hardware and the software-simulated floating-point format are both IEEE compliant, rounding requires only standard mathematical library functions, but a straightforward implementation of the rounding formulae may potentially cause two kinds of issues. From a theoretical point of view, handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly when evaluated in finite arithmetic. In terms of performance, on the other hand, the algorithms obtained in this way are not necessarily efficient, as they build upon library functions that, being designed to handle a broad range of cases, might not be optimised for the specific needs of floating-point rounding functions.

In MATLAB, low-precision arithmetics can be simulated using the chop function, developed by Higham and Pranesh and discussed in a previous post. Mantas Mikaitis and I decided to port the functionalities of this MATLAB implementation to a lower level language, in order to obtain better performance. The scope of the project has progressively broadened, and we have recently released the first version of CPFloat, an open-source C library that offers a wide range of routines for rounding arrays of binary32 or binary64 numbers to lower precision. Our code exploits the bit-level representation of the underlying floating-point formats, and performs only low-level bit manipulation and integer arithmetic without relying on costly library calls.

The project is hosted in a GitHub repository where, along with the C code, we provide a MEX interface for MATLAB and Octave which is fully compatible with the chop function. The algorithms that underlie our implementation, as well as the design choices and the testing infrastructure of CPFloat are discussed in detail in a recent EPrint. In the numerical experiments shown there, CPFloat brings a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, our library is currently the most efficient and complete software for experimenting with custom low-precision floating-point arithmetic available in any language.

Related articles

Pivoted Matrices with Tunable Condition Number and the HPL-AI Benchmark

The RIKEN Center for Computational Science in Kobe, Japan, home to the supercomputer Fugaku. This machine, scheduled to become fully operational in 2021, heads the June 2020 edition of the TOP500 list and achieved an execution rate of 1.421 x 1018 (mixed-precision) floating-point operations per second (1.421 exaflops) on the HPL-AI Mixed-Precision Benchmark by solving a system of order 13,516,800.

Traditionally, the fastest supercomputers in the world are ranked according to their performance on HPL, a portable implementation of the High-Performance Computing Linpack Benchmark for distributed memory systems. This software package measures the rate of execution, expressed in binary64 floating-point operations per second, at which a computer solves a large dense linear system using LU factorization with row partial pivoting. This metric is an excellent predictor of the performance a machine will achieve on compute-bound tasks executed in binary64 arithmetic, but the test problem is not representative of typical artificial intelligence computations, where lower precision is typically considered satisfactory.

In an endeavour to consider both workloads at once, the Innovative Computing Laboratory (ICL) at the University of Tennesse, Knoxville, developed the HPL-AI Mixed-Precision Benchmark. The reference implementation of the benchmark solves a dense linear system Ax = b of order n to binary64 accuracy by complementing a low-precision direct solver with a high-precision refinement strategy. The code generates A and b in binary64, computes the LU factorization without pivoting of A using only binary32 arithmetic, finds an approximate binary32 solution \widetilde x using forward and back substitution, and finally solves the linear system with GMRES in binary64 arithmetic, using \widetilde x as starting vector and the low-precision LU factors of A as preconditioners.

In order to obtain the best possible rate of execution, the coefficient matrix A must be dense and, as the benchmark is expected to be needed for n > 10^7, it should also be easy to generate at scale on a distributed memory environment. Furthermore, we require that A have growth factor of order 1, so as to guarantee the stability of Gaussian elimination without pivoting, and be not too ill-conditioned, in order to ensure the convergence of GMRES. Currently, the reference implementation relies on a randomly generated matrix that is diagonally dominant by rows. This choice ensures the stability of LU factorization without pivoting, but the matrix thus constructed requires an amount of communication that depends on n, and turns out to be extremely well conditioned, with an \infty-norm condition number just above 4 for large n.

In our EPrint Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization, Nick Higham and I present a new parametric class of dense square matrices A(\alpha,\beta) for which LU factorization without pivoting is stable. The parameters \alpha and \beta that yield a specified \infty-norm condition number can be computed efficiently, and once they have been chosen the matrix A(\alpha,\beta) can be formed from an explicit formula in O(n^2) floating-point operations.

We also discuss several adaptations aimed at further improving the suitability of the generated matrix as a test problem for the HPL-AI benchmark, as our main goal is to provide a robust and efficient matrix generator for it. In particular, we explain how scaling the matrix can help transition from binary32 to binary16, in order to take advantage of the hardware accelerators for low precision such as the tensor cores that equip recent NVIDIA GPUs, and we discuss how the matrices can be tweaked to make the entries in the LU factors less predictable and slow down the convergence of GMRES.

Extreme-Scale Test Matrices with Specified 2-Norm Condition Number

Row H of the compute racks of Summit, the most powerful supercompter in the TOP500 list since June 2018.

The supercomputers in the TOP500 list are ranked using the High-Performance Linpack (HPL) Benchmark, which gauges the performance of a computer by solving a large dense linear system by Gaussian elimination with partial pivoting. The size of the coefficient matrix depends on the computational power of the machine being assessed, because more powerful systems require larger benchmark problems in order to reach their peak performance.

The test matrices used by HPL have random entries uniformly distributed on the interval (-1/2,1/2]. The 2-norm condition number of such matrices depends on their size, and can potentially be very large for the matrices required by today’s most powerful computers: the largest linear systems solved on Summit, the machines that leads the November 2019 TOP500 ranking, have order 10^7, and even larger systems will be needed to benchmark the coming generations of exascale systems.

An n \times n matrix with specified 2-norm condition number can be generated as A := U \varSigma V^T, where U and V are random real orthogonal n \times n matrices from the Haar distribution (a natural uniform probability distribution on the orthogonal matrices) and \varSigma is diagonal with nonnegative entries \sigma_1\ge \cdots \ge \sigma_n \ge 0. It is well known that A has 2-norm condition number \kappa_2(A) = \sigma_1/\sigma_n if \sigma_n \neq 0 and \kappa(A) = \infty otherwise. This technique, which is used by the gallery('randsvd', ...) function, requires 2n^3 floating-point operations to generate a test matrix of order n, and would thus be impractical in an extreme-scale setting.

In our recent EPrint Generating extreme-scale matrices with specified singular values or condition numbers, Nick Higham and I present four methods that, by giving up the requirement that the matrices U and V be from the Haar distribution, reduce the cost of the approach above from cubic to quadratic. The matrices generated retain a number of desirable properties that make them a suitable choice for testing linear solvers at scale.

These cheaper algorithms are particularly well suited to distributed-memory environments, since all communication between the nodes can be avoided at the price of a negligible increase in the overall computational cost. They are appealing to MATLAB users too, as the following example demostrates.

n = 10000; kappa = 1e6; mode = 2; rng(1)
% gallery('randsvd',...)
fprintf('gallery(''randsvd'',...):                %5.2f seconds elapsed.\n',...
% Algorithm 3.1 in our EPrint.
method = 1; matrix = 0;
fprintf('Alg. 3.1 with Haar U:                  %5.2f seconds elapsed.\n',...
matrix = 2;
fprintf('Alg. 3.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
% Algorithm 4.1 in our EPrint.
method = 3; matrix = 0;
fprintf('Alg. 4.1 with Haar U:                  %5.2f seconds elapsed.\n',...
matrix = 2;
fprintf('Alg. 4.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...

In the listing above, randsvdfast is an implementation of our algorithms available on the MATLAB Central File Exchange. Setting the argument matrix to 0 tells our function to pick U from the Haar distribution, whereas setting it to 2 causes U to be chosen in a non-random way. Running this code in MATLAB 2019b on a machine equipped with an Intel processor I5-6500 running at 3.20GHz the script produces the output

gallery('randsvd',...):                79.52 seconds elapsed.
Alg. 3.1 with Haar U:                  19.70 seconds elapsed.
Alg. 3.1 with U=gallery('orthog',n,2):  1.90 seconds elapsed.
Alg. 4.1 with Haar U:                  19.28 seconds elapsed.
Alg. 4.1 with U=gallery('orthog',n,2):  1.43 seconds elapsed.

Therefore, for n = 10{,}000 randsvdfast is up to 56 times faster than gallery('randsvd',...).

Determinants of Bohemian matrix families

Density plots of eigenvalues of matrices from various Bohemian families. Collection of images by Steven Thornton. To find out more about these figures, or to create your own, check out the gallery at the Bohemian Matrices website.

The adjective “Bohemian” was used for the first time in a linear algebra context by Robert Corless and Steven Thornton to describe the eigenvalues of matrices whose entries are taken from a finite discrete set, usually of integers. The term is a partial acronym for “BOunded Height Matrix of Integers”, but the origin of the term was soon forgotten, and the expression “Bohemian matrix” is now widely accepted.

As Olga Taussky observed already in 1960, the study of matrices with integer elements is “very vast and very old”, with early work of Sylvester and Hadamard that dates back to the second half of the nineteenth century. These names are the first two in a long list of mathematicians that worked on what is now known as the “Hadamard conjecture”: for any positive integer n multiple of 4, there exists an n by n matrix H, with entries -1 and +1, such that HH^T = nI.

If this is the best-known open problem surrounding Bohemian matrices, it is far from being the only one. During the 3-day workshop “Bohemian Matrices and Applications” that our group hosted in June last year, Steven Thornton released the Characteristic Polynomial Database, which collects the determinants and characteristic polynomials of billions of samples from certain families of structured as well as unstructured Bohemian matrices. All the available data led Steven to formulate a number of conjectures regarding the determinants of several families of Bohemian upper Hessenberg matrices.

Gian Maria Negri Porzio and I attended the workshop, and set ourselves the task of solving at least one of these open problems. In our recent preprint, we enumerate all the possible determinants of Bohemian upper Hessenberg matrices with ones on the subdiagonal. We consider also the special case of families with main diagonal fixed to zero, whose determinants turn out to be related to some generalizations of Fibonacci numbers. Many of the conjectures stated in the Characteristic Polynomial Database follow from our results.

The Paterson–Stockmeyer method for evaluating polynomials and rational functions of matrices

According to Moler and Van Loan, truncating the Taylor series expansion to the exponential at 0 is the least effective of their nineteen dubious ways to compute the exponential of a matrix. Such an undoubtedly questionable approach can, however, become a powerful tool for evaluating matrix functions when used in conjunction with other strategies, such as, for example, the scaling and squaring technique. In fact, truncated Taylor series are just a special case of a much larger class of rational approximants, known as the Padé family, on which many state-of-the-art algorithms rely in order to compute matrix functions.

A customary choice for evaluating these approximants at a matrix argument is the Paterson–Stockmeyer method, an evaluation scheme that was originally proposed as an asymptotically optimal algorithm for evaluating polynomials of matrices, but generalizes quite naturally to rational functions, which are nothing but the solutions of a linear system whose coefficients and right-hand side are polynomials of the same matrix. This technique exploits a clever rewriting of a polynomial in A as a polynomials in A^s, for some positive integer s, and overall requires about 2\sqrt{k} matrix products to evaluate a polynomial of degree k.

Number of matrix multiplications required to evaluate a polynomial of degree between 0 and 50.

As shown in the figure, when the Paterson–Stockmeyer scheme is used the number of matrix multiplications required to evaluate a polynomial of degree k grows slower than k itself, with the result that evaluating polynomials of different degree will asymptotically have the same cost. For example, evaluating a polynomial of any degree between 43 and 49 requires 12 matrix multiplications, thus there is little point in considering an approximant of degree 43 when evaluating that of degree 49 has roughly the same cost but will in all likelihood deliver a more accurate result.

When designing algorithms to evaluate functions of matrices, one is interested in finding the optimal degrees, those marked with a red circle in the figure above, since they guarantee maximal accuracy for a given computational cost. When fixed precision is considered, finding all such degrees is not a problem: a backward error analysis can be used to determine the maximum degree that will ever be needed, m_{max} say, and then looking at the plot is enough to find all the optimal degrees smaller than m_{max}. In order to deal with arbitrary precision arithmetic, however, a different strategy is needed, as depending on the working precision and the required tolerance, approximants of arbitrarily high degree may be needed. The new Eprint Optimality of the Paterson–Stockmeyer Method for Evaluating Matrix Polynomials and Rational Matrix Functions studies the cost of the Paterson–Stockmeyer method for polynomial evaluation and shows that a degree is optimal if and only if it is a quarter-square, that is, a number of the form \lfloor n^2/4 \rfloor for some nonnegative integer n, where \lfloor \cdot \rfloor is the floor function.

Similar results can be obtained for Paterson–Stockmeyer-like algorithms for evaluating diagonal Padé approximants, rational functions whose numerator and denominator have same degree. In that case, one can show that an approximant is optimal if and only if the degree of numerator and denominator is an eight-square, an integer of the form \lfloor n^2/8 \rfloor for some n \in \mathbb{N}. In particular, for diagonal Padé approximants to the exponential, due to a symmetry of the coefficients of numerator and denominator, faster algorithms can be developed, and an explicit formula—not as nice as that in the two previous cases—can be derived for the optimal orders of these approximants.

Computing the matrix exponential in arbitrary precision

Edmond Laguerre was the first to discuss, en passant in a paragraph of a long letter to Hermite, the exponential of a matrix. In fact, the French mathematician confined himself to defining e^X via the Taylor series expansion of e^z at 0 and remarking that the scalar identity e^{x+y} = e^x e^y does not generalise to matrices. Surprisingly perhaps, truncating the Taylor expansion is still today, just over 150 years later, one of the most effective way of approximating the matrix exponential, along with the use of a family of rational approximants discovered a couple of decades later by Henri Padé, a student of Hermite’s.

Despite the fact that the exponential is an entire function, Taylor and Padé approximants converge only locally, and may require an approximant of exceedingly high order for matrices of large norm. Therefore, scaling the input is often necessary, in order to ensure fast and accurate computations. This can be effectively achieved by means of a technique called scaling and squaring, first proposed in 1967 by Lawson, who noticed that by exploiting the identity e^A = \bigl(e^{2^{-s}A}\bigr)^{2^s} one can compute the matrix exponential by evaluating one of the approximants above at 2^{-s}A and then squaring the result s times.

The choice of s and of the order of the approximant hinges on a delicate trade-off that has been thoroughly studied in the last 50 years. The most efficient algorithms are tailored to double precision, and rely on the computation of precision-dependent constants, a relatively expensive task that needs to be performed only once during the design stage of the algorithm, and produces algorithms that are very efficient at runtime.

These techniques, however, do not readily extend to arbitrary precision environments, where the working precision at which the computation will be performed is known only at runtime. In the EPrint An arbitrary precision scaling and squaring algorithm for the matrix exponential, Nick Higham and I show how the scaling and squaring method can be adapted to the computation of the matrix exponential in precisions higher than double.

The algorithm we propose combines a new analysis of the forward truncation error of Padé approximants to the exponential with an improved strategy for selecting the parameters required to perform the scaling and squaring step. We evaluate at runtime a bound that arises naturally from our analysis in order to check whether a certain choice of those parameters will deliver an accurate evaluation of the approximant or not. Moreover, we discuss how this technique can be implemented efficiently for truncated Taylor series and diagonal Padé approximants.

Forward error in 1024 digits and performance profile
Top left: forward error of several algorithms on a test set of non-Hermitian matrices. Top right: corresponding performance profile. Bottom: legend for the other two plots. Variants of our algorithm are in the top row, existing implementations for computing the matrix exponential in arbitrary precision are in the bottom row.

We evaluate experimentally several variants of the proposed algorithm in a wide range of precisions, and find that they all behave in a forward stable fashion. In particular, the most accurate of our implementations outperforms existing algorithms for computing the matrix exponential in high precision.