## Computational Graphs for Matrix Functions

Numerical algorithms for evaluating a function $f$ at an $n \times n$ matrix $A$ are based on a variety of different approaches, but for a large class of methods the approximation is formed by relying on three basic operations:

• linear combination of matrices $Z \gets c_X X + c_Y Y$,
• matrix multiplication $Z \gets X \cdot Y$, and
• solution of a linear system with $n$ right-hand sides $X^{-1} \cdot Y$,

where $X$ and $Y$ are $n \times n$ matrices, and $c_X$ and $c_Y$ are scalars.

Algorithms that combine only these three basic building blocks are particularly attractive, as they correspond to functions that are easy to work with: if an expression for the scalar ($n = 1$) function $g$ features only linear combinations, multiplications, and inversions, and $g$ is defined on the spectrum of $A$, then a formula for $g(A)$ can be obtained by replacing all occurrences of $z$ in the formula for $g(z)$ with $A$.

The choice of considering only three operations may seem restrictive, but, in fact, we are looking at a large class of methods—a class that comprises most algorithms based on polynomial or rational approximation and includes, among others, the scaling and squaring algorithm for the matrix exponential and many iterative algorithms for matrix roots.

These algorithms can be interpreted as graphs, and in particular as directed acyclic graphs (DAGs). Each node in the graph represents a matrix: the inputs are stored in the leaves (nodes without incoming edges), the internal nodes (with both incoming and outgoing edges) hold the intermediate results, and the output of the computation can be read off the root nodes (nodes without outgoing edges). In such a graph, the arcs represent data dependencies and tell us in what order the matrices ought to be computed for the algorithm to return the desired result. Two examples of such computational graphs are in the figure below.

Computational graphs are arguably an elegant way of looking at matrix functions and at algorithms to compute them. In recent work, we have shown that they can also be a practical tool to study existing techniques, improve them, and ultimately derive new ones. The accuracy of an algorithm for evaluating a matrix function is determined by the accuracy of its scalar counterpart, thus the design of new methods for functions of matrices can be seen as a scalar-valued optimization problem. Computational graphs provide a particularly compelling way of carrying out the optimization, as the scalar derivatives can be evaluated automatically by exploiting the structure of the graph. This technique is akin to backpropagation, an approach for training neural networks customarily used in machine learning.

The workflow to design by optimization a new algorithm to approximate $f$ consists of three steps.

First, we choose the topology of the graph, that is, we fix how many linear combinations, matrix multiplications, and matrix inversions the algorithm is allowed to use, and in which order these operations are to be combined. In practice, we take an existing algorithm for evaluating $f$ at a matrix argument, we recast it as a computational graph, and we add nodes representing linear combinations wherever possible. These extra nodes are beneficial, because they increase the number of available degrees of freedom without changing the computational cost of the underlying algorithm.

Next, we optimize the coefficients of the graph. Let us denote by $g$ the function represented by the computational graph. We write $g(z; c)$ to stress that the value of $g$ at $z$ depends on the set of parameters $c$, which is a vector containing the coefficients of all the linear combinations that appear in the computational graph. These coefficients are the scalars $c_X$ and $c_Y$ that appear in all linear combinations of the form $c_X X + c_Y Y$ in the graph, and they are the quantities that we want to optimize. The goal of the optimization is to find a vector of coefficients $c^*$ such that $g(\cdot; c^*)$ approximates a given function $f$ in a domain $\Omega \subset \mathbb C$. In practice, we set $\Omega$ to a disc where $f$ and $g$ are both analytic, so that we only need to minimize the error $|g(z,c) - f(z)|$ over (a discretization of) the boundary of $\Omega$. This produces a surrogate nonlinear least-square problem, which we solve by means of the Gauss–Newton algorithm, a simple choice that already produces good results.

Finally, we use error analysis to bound the error of $g$ as an approximant to $f$. By relying on higher precision, we are able to guarantee that within $\Omega$ the forward and backward error are below a tolerance parameter $\varepsilon$. By choosing a suitable set of domains $\Omega_1 \subset \ldots \subset \Omega_\ell$, we can obtain a set of functions $g_1$, , $g_\ell$ that approximate $f$ to a specified accuracy and at increasing computational cost. Given these functions and a matrix $A$, we can then choose, inexpensively and at runtime, the function $g_i$ that represents the best trade-off between accuracy and speed, and we can use it to compute $f(A) \approx g_i(A)$.

Numerical results show that this process has the potential to improve state-of-the-art algorithms when $f$ is the matrix exponential or the square root of a matrix. But care is needed in assessing the quality of the new algorithms, as the accuracy of the new algorithms depends on both the topology of the graph and the quality of the starting guess. This is illustrated in the figure below for two algorithms for the computation of $\sqrt{z+1}$ within the disk of radius $1/2$ centered at 0: optimizing the coefficients of a graph representing four steps of the Denman–Beavers iteration brings down the maximum absolute approximation error by over nine orders of magnitude, but for a graph constructed using the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximant the error only improves by three orders of magnitude.

As part of this project, we developed a Julia package, GraphMatFun.jl, which offers tools to generate and manipulate computational graphs, modify their topology, optimize their coefficients, and generate C, Julia, and MATLAB code to evaluate them efficiently at scalar and matrix arguments. The software is released under the terms of the MIT license and is freely available on GitHub.

## CPFloat: A C library for emulating low-precision arithmetic

Computing machines with floating-point capabilities started appearing in the late 1940s, and despite being regarded as an optional feature for decades to come, floating-point hardware equipped a number of the mainframe computers produced in the 1960s and 1970s. At the time, there was no consensus on what features a number system ought to have, and the arithmetics implemented by any two vendors could differ widely in terms of word sizes, floating-point representations, rounding behaviour, and accuracy of the operations implemented in hardware. This lack of standardization was a major hindrance to the portability of the source code, as it made it significantly harder to write and maintain architecture-independent code.

The way out of this predicament was the establishment of the IEEE 754-1985 standard, which dictated word sizes, representations, and accuracy requirements for two default floating-point formats: single and double precision. For about three decades, hardware vendors conformed with the standard, and the large majority of general-purpose processing units came equipped with IEEE-compliant arithmetic units. Motivated by the low-precision requirements of machine learning applications, however, in recent years hardware manufacturers have started exploring the realm of low-precision arithmetics, commercialising a variety of highly optimised hardware units that pursue efficiency at the expense of accuracy. The high-performance computing community has attempted to channel the extreme performance of this hardware and utilise it in more traditional scientific computing applications, where more stringent accuracy needs typically require the use of higher precision.

The broad range of different formats available poses a serious challenge to those trying to develop mixed-precision algorithms, since studying the numerical behaviour of an algorithm in different working precisions may require access to a number of high-end hardware platforms, including supercomputer-grade CPUs and GPUs. In order to alleviate the problem, several software packages for simulating low-precision floating-point arithmetics in software have been developed. Some of these are described in the table below. All these solutions execute each arithmetic operation in hardware, and use the software layer only to round the results to the desired number of significant bits of precision.

If the hardware and the software-simulated floating-point format are both IEEE compliant, rounding requires only standard mathematical library functions, but a straightforward implementation of the rounding formulae may potentially cause two kinds of issues. From a theoretical point of view, handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly when evaluated in finite arithmetic. In terms of performance, on the other hand, the algorithms obtained in this way are not necessarily efficient, as they build upon library functions that, being designed to handle a broad range of cases, might not be optimised for the specific needs of floating-point rounding functions.

In MATLAB, low-precision arithmetics can be simulated using the chop function, developed by Higham and Pranesh and discussed in a previous post. Mantas Mikaitis and I decided to port the functionalities of this MATLAB implementation to a lower level language, in order to obtain better performance. The scope of the project has progressively broadened, and we have recently released the first version of CPFloat, an open-source C library that offers a wide range of routines for rounding arrays of binary32 or binary64 numbers to lower precision. Our code exploits the bit-level representation of the underlying floating-point formats, and performs only low-level bit manipulation and integer arithmetic without relying on costly library calls.

The project is hosted in a GitHub repository where, along with the C code, we provide a MEX interface for MATLAB and Octave which is fully compatible with the chop function. The algorithms that underlie our implementation, as well as the design choices and the testing infrastructure of CPFloat are discussed in detail in a recent EPrint. In the numerical experiments shown there, CPFloat brings a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, our library is currently the most efficient and complete software for experimenting with custom low-precision floating-point arithmetic available in any language.

## Pivoted Matrices with Tunable Condition Number and the HPL-AI Benchmark

Traditionally, the fastest supercomputers in the world are ranked according to their performance on HPL, a portable implementation of the High-Performance Computing Linpack Benchmark for distributed memory systems. This software package measures the rate of execution, expressed in binary64 floating-point operations per second, at which a computer solves a large dense linear system using LU factorization with row partial pivoting. This metric is an excellent predictor of the performance a machine will achieve on compute-bound tasks executed in binary64 arithmetic, but the test problem is not representative of typical artificial intelligence computations, where lower precision is typically considered satisfactory.

In an endeavour to consider both workloads at once, the Innovative Computing Laboratory (ICL) at the University of Tennesse, Knoxville, developed the HPL-AI Mixed-Precision Benchmark. The reference implementation of the benchmark solves a dense linear system $Ax = b$ of order $n$ to binary64 accuracy by complementing a low-precision direct solver with a high-precision refinement strategy. The code generates $A$ and $b$ in binary64, computes the LU factorization without pivoting of $A$ using only binary32 arithmetic, finds an approximate binary32 solution $\widetilde x$ using forward and back substitution, and finally solves the linear system with GMRES in binary64 arithmetic, using $\widetilde x$ as starting vector and the low-precision LU factors of $A$ as preconditioners.

In order to obtain the best possible rate of execution, the coefficient matrix $A$ must be dense and, as the benchmark is expected to be needed for $n > 10^7$, it should also be easy to generate at scale on a distributed memory environment. Furthermore, we require that $A$ have growth factor of order 1, so as to guarantee the stability of Gaussian elimination without pivoting, and be not too ill-conditioned, in order to ensure the convergence of GMRES. Currently, the reference implementation relies on a randomly generated matrix that is diagonally dominant by rows. This choice ensures the stability of LU factorization without pivoting, but the matrix thus constructed requires an amount of communication that depends on $n$, and turns out to be extremely well conditioned, with an $\infty$-norm condition number just above 4 for large $n$.

In our EPrint Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization, Nick Higham and I present a new parametric class of dense square matrices $A(\alpha,\beta)$ for which LU factorization without pivoting is stable. The parameters $\alpha$ and $\beta$ that yield a specified $\infty$-norm condition number can be computed efficiently, and once they have been chosen the matrix $A(\alpha,\beta)$ can be formed from an explicit formula in $O(n^2)$ floating-point operations.

We also discuss several adaptations aimed at further improving the suitability of the generated matrix as a test problem for the HPL-AI benchmark, as our main goal is to provide a robust and efficient matrix generator for it. In particular, we explain how scaling the matrix can help transition from binary32 to binary16, in order to take advantage of the hardware accelerators for low precision such as the tensor cores that equip recent NVIDIA GPUs, and we discuss how the matrices can be tweaked to make the entries in the LU factors less predictable and slow down the convergence of GMRES.

## Extreme-Scale Test Matrices with Specified 2-Norm Condition Number

The supercomputers in the TOP500 list are ranked using the High-Performance Linpack (HPL) Benchmark, which gauges the performance of a computer by solving a large dense linear system by Gaussian elimination with partial pivoting. The size of the coefficient matrix depends on the computational power of the machine being assessed, because more powerful systems require larger benchmark problems in order to reach their peak performance.

The test matrices used by HPL have random entries uniformly distributed on the interval $(-1/2,1/2]$. The 2-norm condition number of such matrices depends on their size, and can potentially be very large for the matrices required by today’s most powerful computers: the largest linear systems solved on Summit, the machines that leads the November 2019 TOP500 ranking, have order $10^7$, and even larger systems will be needed to benchmark the coming generations of exascale systems.

An $n \times n$ matrix with specified 2-norm condition number can be generated as $A := U \varSigma V^T$, where $U$ and $V$ are random real orthogonal $n \times n$ matrices from the Haar distribution (a natural uniform probability distribution on the orthogonal matrices) and $\varSigma$ is diagonal with nonnegative entries $\sigma_1\ge \cdots \ge \sigma_n \ge 0$. It is well known that $A$ has 2-norm condition number $\kappa_2(A) = \sigma_1/\sigma_n$ if $\sigma_n \neq 0$ and $\kappa(A) = \infty$ otherwise. This technique, which is used by the gallery('randsvd', ...) function, requires $2n^3$ floating-point operations to generate a test matrix of order $n$, and would thus be impractical in an extreme-scale setting.

In our recent EPrint Generating extreme-scale matrices with specified singular values or condition numbers, Nick Higham and I present four methods that, by giving up the requirement that the matrices $U$ and $V$ be from the Haar distribution, reduce the cost of the approach above from cubic to quadratic. The matrices generated retain a number of desirable properties that make them a suitable choice for testing linear solvers at scale.

These cheaper algorithms are particularly well suited to distributed-memory environments, since all communication between the nodes can be avoided at the price of a negligible increase in the overall computational cost. They are appealing to MATLAB users too, as the following example demostrates.

n = 10000; kappa = 1e6; mode = 2; rng(1)
% gallery('randsvd',...)
fprintf('gallery(''randsvd'',...):                %5.2f seconds elapsed.\n',...
timeit(@()gallery('randsvd',n,kappa,mode,[],[],1)));
% Algorithm 3.1 in our EPrint.
method = 1; matrix = 0;
fprintf('Alg. 3.1 with Haar U:                  %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 3.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
% Algorithm 4.1 in our EPrint.
method = 3; matrix = 0;
fprintf('Alg. 4.1 with Haar U:                  %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 4.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));


In the listing above, randsvdfast is an implementation of our algorithms available on the MATLAB Central File Exchange. Setting the argument matrix to 0 tells our function to pick $U$ from the Haar distribution, whereas setting it to 2 causes $U$ to be chosen in a non-random way. Running this code in MATLAB 2019b on a machine equipped with an Intel processor I5-6500 running at 3.20GHz the script produces the output

gallery('randsvd',...):                79.52 seconds elapsed.
Alg. 3.1 with Haar U:                  19.70 seconds elapsed.
Alg. 3.1 with U=gallery('orthog',n,2):  1.90 seconds elapsed.
Alg. 4.1 with Haar U:                  19.28 seconds elapsed.
Alg. 4.1 with U=gallery('orthog',n,2):  1.43 seconds elapsed.


Therefore, for $n = 10{,}000$ randsvdfast is up to 56 times faster than gallery('randsvd',...).

## Determinants of Bohemian matrix families

The adjective “Bohemian” was used for the first time in a linear algebra context by Robert Corless and Steven Thornton to describe the eigenvalues of matrices whose entries are taken from a finite discrete set, usually of integers. The term is a partial acronym for “BOunded Height Matrix of Integers”, but the origin of the term was soon forgotten, and the expression “Bohemian matrix” is now widely accepted.

As Olga Taussky observed already in 1960, the study of matrices with integer elements is “very vast and very old”, with early work of Sylvester and Hadamard that dates back to the second half of the nineteenth century. These names are the first two in a long list of mathematicians that worked on what is now known as the “Hadamard conjecture”: for any positive integer $n$ multiple of 4, there exists an $n$ by $n$ matrix $H$, with entries $-1$ and $+1$, such that $HH^T = nI$.

If this is the best-known open problem surrounding Bohemian matrices, it is far from being the only one. During the 3-day workshop “Bohemian Matrices and Applications” that our group hosted in June last year, Steven Thornton released the Characteristic Polynomial Database, which collects the determinants and characteristic polynomials of billions of samples from certain families of structured as well as unstructured Bohemian matrices. All the available data led Steven to formulate a number of conjectures regarding the determinants of several families of Bohemian upper Hessenberg matrices.

Gian Maria Negri Porzio and I attended the workshop, and set ourselves the task of solving at least one of these open problems. In our recent preprint, we enumerate all the possible determinants of Bohemian upper Hessenberg matrices with ones on the subdiagonal. We consider also the special case of families with main diagonal fixed to zero, whose determinants turn out to be related to some generalizations of Fibonacci numbers. Many of the conjectures stated in the Characteristic Polynomial Database follow from our results.

## The Paterson–Stockmeyer method for evaluating polynomials and rational functions of matrices

According to Moler and Van Loan, truncating the Taylor series expansion to the exponential at 0 is the least effective of their nineteen dubious ways to compute the exponential of a matrix. Such an undoubtedly questionable approach can, however, become a powerful tool for evaluating matrix functions when used in conjunction with other strategies, such as, for example, the scaling and squaring technique. In fact, truncated Taylor series are just a special case of a much larger class of rational approximants, known as the Padé family, on which many state-of-the-art algorithms rely in order to compute matrix functions.

A customary choice for evaluating these approximants at a matrix argument is the Paterson–Stockmeyer method, an evaluation scheme that was originally proposed as an asymptotically optimal algorithm for evaluating polynomials of matrices, but generalizes quite naturally to rational functions, which are nothing but the solutions of a linear system whose coefficients and right-hand side are polynomials of the same matrix. This technique exploits a clever rewriting of a polynomial in $A$ as a polynomials in $A^s$, for some positive integer $s$, and overall requires about $2\sqrt{k}$ matrix products to evaluate a polynomial of degree $k$.

As shown in the figure, when the Paterson–Stockmeyer scheme is used the number of matrix multiplications required to evaluate a polynomial of degree $k$ grows slower than $k$ itself, with the result that evaluating polynomials of different degree will asymptotically have the same cost. For example, evaluating a polynomial of any degree between 43 and 49 requires 12 matrix multiplications, thus there is little point in considering an approximant of degree 43 when evaluating that of degree 49 has roughly the same cost but will in all likelihood deliver a more accurate result.

When designing algorithms to evaluate functions of matrices, one is interested in finding the optimal degrees, those marked with a red circle in the figure above, since they guarantee maximal accuracy for a given computational cost. When fixed precision is considered, finding all such degrees is not a problem: a backward error analysis can be used to determine the maximum degree that will ever be needed, $m_{max}$ say, and then looking at the plot is enough to find all the optimal degrees smaller than $m_{max}$. In order to deal with arbitrary precision arithmetic, however, a different strategy is needed, as depending on the working precision and the required tolerance, approximants of arbitrarily high degree may be needed. The new Eprint Optimality of the Paterson–Stockmeyer Method for Evaluating Matrix Polynomials and Rational Matrix Functions studies the cost of the Paterson–Stockmeyer method for polynomial evaluation and shows that a degree is optimal if and only if it is a quarter-square, that is, a number of the form $\lfloor n^2/4 \rfloor$ for some nonnegative integer $n$, where $\lfloor \cdot \rfloor$ is the floor function.

Similar results can be obtained for Paterson–Stockmeyer-like algorithms for evaluating diagonal Padé approximants, rational functions whose numerator and denominator have same degree. In that case, one can show that an approximant is optimal if and only if the degree of numerator and denominator is an eight-square, an integer of the form $\lfloor n^2/8 \rfloor$ for some $n \in \mathbb{N}$. In particular, for diagonal Padé approximants to the exponential, due to a symmetry of the coefficients of numerator and denominator, faster algorithms can be developed, and an explicit formula—not as nice as that in the two previous cases—can be derived for the optimal orders of these approximants.

## Computing the matrix exponential in arbitrary precision

Edmond Laguerre was the first to discuss, en passant in a paragraph of a long letter to Hermite, the exponential of a matrix. In fact, the French mathematician confined himself to defining $e^X$ via the Taylor series expansion of $e^z$ at $0$ and remarking that the scalar identity $e^{x+y} = e^x e^y$ does not generalise to matrices. Surprisingly perhaps, truncating the Taylor expansion is still today, just over 150 years later, one of the most effective way of approximating the matrix exponential, along with the use of a family of rational approximants discovered a couple of decades later by Henri Padé, a student of Hermite’s.

Despite the fact that the exponential is an entire function, Taylor and Padé approximants converge only locally, and may require an approximant of exceedingly high order for matrices of large norm. Therefore, scaling the input is often necessary, in order to ensure fast and accurate computations. This can be effectively achieved by means of a technique called scaling and squaring, first proposed in 1967 by Lawson, who noticed that by exploiting the identity $e^A = \bigl(e^{2^{-s}A}\bigr)^{2^s}$ one can compute the matrix exponential by evaluating one of the approximants above at $2^{-s}A$ and then squaring the result $s$ times.

The choice of $s$ and of the order of the approximant hinges on a delicate trade-off that has been thoroughly studied in the last 50 years. The most efficient algorithms are tailored to double precision, and rely on the computation of precision-dependent constants, a relatively expensive task that needs to be performed only once during the design stage of the algorithm, and produces algorithms that are very efficient at runtime.

These techniques, however, do not readily extend to arbitrary precision environments, where the working precision at which the computation will be performed is known only at runtime. In the EPrint An arbitrary precision scaling and squaring algorithm for the matrix exponential, Nick Higham and I show how the scaling and squaring method can be adapted to the computation of the matrix exponential in precisions higher than double.

The algorithm we propose combines a new analysis of the forward truncation error of Padé approximants to the exponential with an improved strategy for selecting the parameters required to perform the scaling and squaring step. We evaluate at runtime a bound that arises naturally from our analysis in order to check whether a certain choice of those parameters will deliver an accurate evaluation of the approximant or not. Moreover, we discuss how this technique can be implemented efficiently for truncated Taylor series and diagonal Padé approximants.

Top left: forward error of several algorithms on a test set of non-Hermitian matrices. Top right: corresponding performance profile. Bottom: legend for the other two plots. Variants of our algorithm are in the top row, existing implementations for computing the matrix exponential in arbitrary precision are in the bottom row.

We evaluate experimentally several variants of the proposed algorithm in a wide range of precisions, and find that they all behave in a forward stable fashion. In particular, the most accurate of our implementations outperforms existing algorithms for computing the matrix exponential in high precision.