Computational Graphs for Matrix Functions

Posted on November 8, 2022 by Massimiliano Fasi Leave a comment

Numerical algorithms for evaluating a function $f$ at an $n \times n$ matrix $A$ are based on a variety of different approaches, but for a large class of methods the approximation is formed by relying on three basic operations:

linear combination of matrices $Z \gets c_X X + c_Y Y$ ,
matrix multiplication $Z \gets X \cdot Y$ , and
solution of a linear system with $n$ right-hand sides $X^{-1} \cdot Y$ ,

where $X$ and $Y$ are $n \times n$ matrices, and $c_X$ and $c_Y$ are scalars.

Algorithms that combine only these three basic building blocks are particularly attractive, as they correspond to functions that are easy to work with: if an expression for the scalar ( $n = 1$ ) function $g$ features only linear combinations, multiplications, and inversions, and $g$ is defined on the spectrum of $A$ , then a formula for $g(A)$ can be obtained by replacing all occurrences of $z$ in the formula for $g(z)$ with $A$ .

The choice of considering only three operations may seem restrictive, but, in fact, we are looking at a large class of methods—a class that comprises most algorithms based on polynomial or rational approximation and includes, among others, the scaling and squaring algorithm for the matrix exponential and many iterative algorithms for matrix roots.

These algorithms can be interpreted as graphs, and in particular as directed acyclic graphs (DAGs). Each node in the graph represents a matrix: the inputs are stored in the leaves (nodes without incoming edges), the internal nodes (with both incoming and outgoing edges) hold the intermediate results, and the output of the computation can be read off the root nodes (nodes without outgoing edges). In such a graph, the arcs represent data dependencies and tell us in what order the matrices ought to be computed for the algorithm to return the desired result. Two examples of such computational graphs are in the figure below.

Two examples of computational graphs. — Examples of computational graphs: ten steps of the Denman–Beavers iteration for the square root (left) and the scaling and squaring algorithm for the matrix exponential using a degree-13 rational approximant and five squaring steps (right). Blue arcs represent linear combinations, red arcs represent matrix multiplications, and black dash-dotted arcs represent system solves. The input nodes have a white background, and all other nodes have a the same color as the arc of the operation that produces them. The output node (result of the computation) has a double border. Nodes with only one visible parent are the result of either a squaring ( $A^2$ , $A^4$ , $A^6$ , $Z^{2}$ , $Z^{4}$ , $Z^{8}$ , $Z^{16}$ , and $Z^{32}$ ) or a trivial operation involving the identity matrix: a matrix inversion, represented as $C=A^{-1}I$ , or a multiplication by a scalar, represented as $C = \alpha A + 0 I$

Examples of computational graphs: ten steps of the Denman–Beavers iteration for the square root (left) and the scaling and squaring algorithm for the matrix exponential using a degree-13 rational approximant and five squaring steps (right). Blue arcs represent linear combinations, red arcs represent matrix multiplications, and black dash-dotted arcs represent system solves. The input nodes have a white background, and all other nodes have a the same color as the arc of the operation that produces them. The output node (result of the computation) has a double border. Nodes with only one visible parent are the result of either a squaring ( $A^2$ , $A^4$ , $A^6$ , $Z^{2}$ , $Z^{4}$ , $Z^{8}$ , $Z^{16}$ , and $Z^{32}$ ) or a trivial operation involving the identity matrix: a matrix inversion, represented as $C=A^{-1}I$ , or a multiplication by a scalar, represented as $C = \alpha A + 0 I$

Computational graphs are arguably an elegant way of looking at matrix functions and at algorithms to compute them. In recent work, we have shown that they can also be a practical tool to study existing techniques, improve them, and ultimately derive new ones. The accuracy of an algorithm for evaluating a matrix function is determined by the accuracy of its scalar counterpart, thus the design of new methods for functions of matrices can be seen as a scalar-valued optimization problem. Computational graphs provide a particularly compelling way of carrying out the optimization, as the scalar derivatives can be evaluated automatically by exploiting the structure of the graph. This technique is akin to backpropagation, an approach for training neural networks customarily used in machine learning.

The workflow to design by optimization a new algorithm to approximate $f$ consists of three steps.

First, we choose the topology of the graph, that is, we fix how many linear combinations, matrix multiplications, and matrix inversions the algorithm is allowed to use, and in which order these operations are to be combined. In practice, we take an existing algorithm for evaluating $f$ at a matrix argument, we recast it as a computational graph, and we add nodes representing linear combinations wherever possible. These extra nodes are beneficial, because they increase the number of available degrees of freedom without changing the computational cost of the underlying algorithm.

Next, we optimize the coefficients of the graph. Let us denote by $g$ the function represented by the computational graph. We write $g(z; c)$ to stress that the value of $g$ at $z$ depends on the set of parameters $c$ , which is a vector containing the coefficients of all the linear combinations that appear in the computational graph. These coefficients are the scalars $c_X$ and $c_Y$ that appear in all linear combinations of the form $c_X X + c_Y Y$ in the graph, and they are the quantities that we want to optimize. The goal of the optimization is to find a vector of coefficients $c^*$ such that $g(\cdot; c^*)$ approximates a given function $f$ in a domain $\Omega \subset \mathbb C$ . In practice, we set $\Omega$ to a disc where $f$ and $g$ are both analytic, so that we only need to minimize the error $|g(z,c) - f(z)|$ over (a discretization of) the boundary of $\Omega$ . This produces a surrogate nonlinear least-square problem, which we solve by means of the Gauss–Newton algorithm, a simple choice that already produces good results.

Finally, we use error analysis to bound the error of $g$ as an approximant to $f$ . By relying on higher precision, we are able to guarantee that within $\Omega$ the forward and backward error are below a tolerance parameter $\varepsilon$ . By choosing a suitable set of domains $\Omega_1 \subset \ldots \subset \Omega_\ell$ , we can obtain a set of functions $g_1$ , …, $g_\ell$ that approximate $f$ to a specified accuracy and at increasing computational cost. Given these functions and a matrix $A$ , we can then choose, inexpensively and at runtime, the function $g_i$ that represents the best trade-off between accuracy and speed, and we can use it to compute $f(A) \approx g_i(A)$ .

Numerical results show that this process has the potential to improve state-of-the-art algorithms when $f$ is the matrix exponential or the square root of a matrix. But care is needed in assessing the quality of the new algorithms, as the accuracy of the new algorithms depends on both the topology of the graph and the quality of the starting guess. This is illustrated in the figure below for two algorithms for the computation of $\sqrt{z+1}$ within the disk of radius $1/2$ centered at 0: optimizing the coefficients of a graph representing four steps of the Denman–Beavers iteration brings down the maximum absolute approximation error by over nine orders of magnitude, but for a graph constructed using the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximant the error only improves by three orders of magnitude.

Absolute forward error of four schemes for approximating the function $f(x) = \sqrt{x+1}$ . The plots on the left correspond to four steps of the Denman–Beavers iteration (top) and to the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximants (bottom). The plots on the right correspond to graphs with optimized coefficients (the computational graphs on the left in the same row were used as starting value for the optimization). The blue circles have radius $1/2$ and $z$ -value equal to the maximum relative error within the disk: this value drops from $1.5 \cdot 10^{-6}$ to $1.4 \cdot 10^{-15}$ for the graphs in the first line, but only from $1.9 \cdot 10^{-6}$ to $1.5 \cdot 10^{-9}$ for those in the second.

Absolute forward error of four schemes for approximating the function $f(x) = \sqrt{x+1}$ . The plots on the left correspond to four steps of the Denman–Beavers iteration (top) and to the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximants (bottom). The plots on the right correspond to graphs with optimized coefficients (the computational graphs on the left in the same row were used as starting value for the optimization). The blue circles have radius $1/2$ and $z$ -value equal to the maximum relative error within the disk: this value drops from $1.5 \cdot 10^{-6}$ to $1.4 \cdot 10^{-15}$ for the graphs in the first line, but only from $1.9 \cdot 10^{-6}$ to $1.5 \cdot 10^{-9}$ for those in the second.

As part of this project, we developed a Julia package, GraphMatFun.jl, which offers tools to generate and manipulate computational graphs, modify their topology, optimize their coefficients, and generate C, Julia, and MATLAB code to evaluate them efficiently at scalar and matrix arguments. The software is released under the terms of the MIT license and is freely available on GitHub.

tagged with julia, MATLAB, matrix function

publications

Numerical Linear Algebra Group

Department of Mathematics, The University of Manchester

Computational Graphs for Matrix Functions

Leave a comment Cancel reply

Computational Graphs for Matrix Functions

Share this:

Related

Leave a comment Cancel reply