## Computational Graphs for Matrix Functions

Numerical algorithms for evaluating a function $f$ at an $n \times n$ matrix $A$ are based on a variety of different approaches, but for a large class of methods the approximation is formed by relying on three basic operations:

• linear combination of matrices $Z \gets c_X X + c_Y Y$,
• matrix multiplication $Z \gets X \cdot Y$, and
• solution of a linear system with $n$ right-hand sides $X^{-1} \cdot Y$,

where $X$ and $Y$ are $n \times n$ matrices, and $c_X$ and $c_Y$ are scalars.

Algorithms that combine only these three basic building blocks are particularly attractive, as they correspond to functions that are easy to work with: if an expression for the scalar ($n = 1$) function $g$ features only linear combinations, multiplications, and inversions, and $g$ is defined on the spectrum of $A$, then a formula for $g(A)$ can be obtained by replacing all occurrences of $z$ in the formula for $g(z)$ with $A$.

The choice of considering only three operations may seem restrictive, but, in fact, we are looking at a large class of methods—a class that comprises most algorithms based on polynomial or rational approximation and includes, among others, the scaling and squaring algorithm for the matrix exponential and many iterative algorithms for matrix roots.

These algorithms can be interpreted as graphs, and in particular as directed acyclic graphs (DAGs). Each node in the graph represents a matrix: the inputs are stored in the leaves (nodes without incoming edges), the internal nodes (with both incoming and outgoing edges) hold the intermediate results, and the output of the computation can be read off the root nodes (nodes without outgoing edges). In such a graph, the arcs represent data dependencies and tell us in what order the matrices ought to be computed for the algorithm to return the desired result. Two examples of such computational graphs are in the figure below.

Computational graphs are arguably an elegant way of looking at matrix functions and at algorithms to compute them. In recent work, we have shown that they can also be a practical tool to study existing techniques, improve them, and ultimately derive new ones. The accuracy of an algorithm for evaluating a matrix function is determined by the accuracy of its scalar counterpart, thus the design of new methods for functions of matrices can be seen as a scalar-valued optimization problem. Computational graphs provide a particularly compelling way of carrying out the optimization, as the scalar derivatives can be evaluated automatically by exploiting the structure of the graph. This technique is akin to backpropagation, an approach for training neural networks customarily used in machine learning.

The workflow to design by optimization a new algorithm to approximate $f$ consists of three steps.

First, we choose the topology of the graph, that is, we fix how many linear combinations, matrix multiplications, and matrix inversions the algorithm is allowed to use, and in which order these operations are to be combined. In practice, we take an existing algorithm for evaluating $f$ at a matrix argument, we recast it as a computational graph, and we add nodes representing linear combinations wherever possible. These extra nodes are beneficial, because they increase the number of available degrees of freedom without changing the computational cost of the underlying algorithm.

Next, we optimize the coefficients of the graph. Let us denote by $g$ the function represented by the computational graph. We write $g(z; c)$ to stress that the value of $g$ at $z$ depends on the set of parameters $c$, which is a vector containing the coefficients of all the linear combinations that appear in the computational graph. These coefficients are the scalars $c_X$ and $c_Y$ that appear in all linear combinations of the form $c_X X + c_Y Y$ in the graph, and they are the quantities that we want to optimize. The goal of the optimization is to find a vector of coefficients $c^*$ such that $g(\cdot; c^*)$ approximates a given function $f$ in a domain $\Omega \subset \mathbb C$. In practice, we set $\Omega$ to a disc where $f$ and $g$ are both analytic, so that we only need to minimize the error $|g(z,c) - f(z)|$ over (a discretization of) the boundary of $\Omega$. This produces a surrogate nonlinear least-square problem, which we solve by means of the Gauss–Newton algorithm, a simple choice that already produces good results.

Finally, we use error analysis to bound the error of $g$ as an approximant to $f$. By relying on higher precision, we are able to guarantee that within $\Omega$ the forward and backward error are below a tolerance parameter $\varepsilon$. By choosing a suitable set of domains $\Omega_1 \subset \ldots \subset \Omega_\ell$, we can obtain a set of functions $g_1$, , $g_\ell$ that approximate $f$ to a specified accuracy and at increasing computational cost. Given these functions and a matrix $A$, we can then choose, inexpensively and at runtime, the function $g_i$ that represents the best trade-off between accuracy and speed, and we can use it to compute $f(A) \approx g_i(A)$.

Numerical results show that this process has the potential to improve state-of-the-art algorithms when $f$ is the matrix exponential or the square root of a matrix. But care is needed in assessing the quality of the new algorithms, as the accuracy of the new algorithms depends on both the topology of the graph and the quality of the starting guess. This is illustrated in the figure below for two algorithms for the computation of $\sqrt{z+1}$ within the disk of radius $1/2$ centered at 0: optimizing the coefficients of a graph representing four steps of the Denman–Beavers iteration brings down the maximum absolute approximation error by over nine orders of magnitude, but for a graph constructed using the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximant the error only improves by three orders of magnitude.

As part of this project, we developed a Julia package, GraphMatFun.jl, which offers tools to generate and manipulate computational graphs, modify their topology, optimize their coefficients, and generate C, Julia, and MATLAB code to evaluate them efficiently at scalar and matrix arguments. The software is released under the terms of the MIT license and is freely available on GitHub.