Category Archives: publications

Computational Graphs for Matrix Functions

Numerical algorithms for evaluating a function f at an n \times n matrix A are based on a variety of different approaches, but for a large class of methods the approximation is formed by relying on three basic operations:

  • linear combination of matrices Z \gets c_X X + c_Y Y,
  • matrix multiplication Z \gets X \cdot Y, and
  • solution of a linear system with n right-hand sides X^{-1} \cdot Y,

where X and Y are n \times n matrices, and c_X and c_Y are scalars.

Algorithms that combine only these three basic building blocks are particularly attractive, as they correspond to functions that are easy to work with: if an expression for the scalar (n = 1) function g features only linear combinations, multiplications, and inversions, and g is defined on the spectrum of A, then a formula for g(A) can be obtained by replacing all occurrences of z in the formula for g(z) with A.

The choice of considering only three operations may seem restrictive, but, in fact, we are looking at a large class of methods—a class that comprises most algorithms based on polynomial or rational approximation and includes, among others, the scaling and squaring algorithm for the matrix exponential and many iterative algorithms for matrix roots.

These algorithms can be interpreted as graphs, and in particular as directed acyclic graphs (DAGs). Each node in the graph represents a matrix: the inputs are stored in the leaves (nodes without incoming edges), the internal nodes (with both incoming and outgoing edges) hold the intermediate results, and the output of the computation can be read off the root nodes (nodes without outgoing edges). In such a graph, the arcs represent data dependencies and tell us in what order the matrices ought to be computed for the algorithm to return the desired result. Two examples of such computational graphs are in the figure below.

Two examples of computational graphs.
Examples of computational graphs: ten steps of the Denman–Beavers iteration for the square root (left) and the scaling and squaring algorithm for the matrix exponential using a degree-13 rational approximant and five squaring steps (right). Blue arcs represent linear combinations, red arcs represent matrix multiplications, and black dash-dotted arcs represent system solves. The input nodes have a white background, and all other nodes have a the same color as the arc of the operation that produces them. The output node (result of the computation) has a double border. Nodes with only one visible parent are the result of either a squaring (A^2, A^4, A^6, Z^{2}, Z^{4}, Z^{8}, Z^{16}, and Z^{32}) or a trivial operation involving the identity matrix: a matrix inversion, represented as C=A^{-1}I, or a multiplication by a scalar, represented as C = \alpha A + 0 I

Computational graphs are arguably an elegant way of looking at matrix functions and at algorithms to compute them. In recent work, we have shown that they can also be a practical tool to study existing techniques, improve them, and ultimately derive new ones. The accuracy of an algorithm for evaluating a matrix function is determined by the accuracy of its scalar counterpart, thus the design of new methods for functions of matrices can be seen as a scalar-valued optimization problem. Computational graphs provide a particularly compelling way of carrying out the optimization, as the scalar derivatives can be evaluated automatically by exploiting the structure of the graph. This technique is akin to backpropagation, an approach for training neural networks customarily used in machine learning.

The workflow to design by optimization a new algorithm to approximate f consists of three steps.

First, we choose the topology of the graph, that is, we fix how many linear combinations, matrix multiplications, and matrix inversions the algorithm is allowed to use, and in which order these operations are to be combined. In practice, we take an existing algorithm for evaluating f at a matrix argument, we recast it as a computational graph, and we add nodes representing linear combinations wherever possible. These extra nodes are beneficial, because they increase the number of available degrees of freedom without changing the computational cost of the underlying algorithm.

Next, we optimize the coefficients of the graph. Let us denote by g the function represented by the computational graph. We write g(z; c) to stress that the value of g at z depends on the set of parameters c, which is a vector containing the coefficients of all the linear combinations that appear in the computational graph. These coefficients are the scalars c_X and c_Y that appear in all linear combinations of the form c_X X + c_Y Y in the graph, and they are the quantities that we want to optimize. The goal of the optimization is to find a vector of coefficients c^* such that g(\cdot; c^*) approximates a given function f in a domain \Omega \subset \mathbb C. In practice, we set \Omega to a disc where f and g are both analytic, so that we only need to minimize the error |g(z,c) - f(z)| over (a discretization of) the boundary of \Omega. This produces a surrogate nonlinear least-square problem, which we solve by means of the Gauss–Newton algorithm, a simple choice that already produces good results.

Finally, we use error analysis to bound the error of g as an approximant to f. By relying on higher precision, we are able to guarantee that within \Omega the forward and backward error are below a tolerance parameter \varepsilon. By choosing a suitable set of domains \Omega_1 \subset \ldots \subset \Omega_\ell, we can obtain a set of functions g_1, , g_\ell that approximate f to a specified accuracy and at increasing computational cost. Given these functions and a matrix A, we can then choose, inexpensively and at runtime, the function g_i that represents the best trade-off between accuracy and speed, and we can use it to compute f(A) \approx g_i(A).

Numerical results show that this process has the potential to improve state-of-the-art algorithms when f is the matrix exponential or the square root of a matrix. But care is needed in assessing the quality of the new algorithms, as the accuracy of the new algorithms depends on both the topology of the graph and the quality of the starting guess. This is illustrated in the figure below for two algorithms for the computation of \sqrt{z+1} within the disk of radius 1/2 centered at 0: optimizing the coefficients of a graph representing four steps of the Denman–Beavers iteration brings down the maximum absolute approximation error by over nine orders of magnitude, but for a graph constructed using the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximant the error only improves by three orders of magnitude.

Absolute forward error of four schemes for approximating the function f(x) = \sqrt{x+1}. The plots on the left correspond to four steps of the Denman–Beavers iteration (top) and to the Paterson–Stockmeyer evaluation of the degree-13 truncated Taylor approximants (bottom). The plots on the right correspond to graphs with optimized coefficients (the computational graphs on the left in the same row were used as starting value for the optimization). The blue circles have radius 1/2 and z-value equal to the maximum relative error within the disk: this value drops from 1.5 \cdot 10^{-6} to 1.4 \cdot 10^{-15} for the graphs in the first line, but only from 1.9 \cdot 10^{-6} to 1.5 \cdot 10^{-9} for those in the second.

As part of this project, we developed a Julia package, GraphMatFun.jl, which offers tools to generate and manipulate computational graphs, modify their topology, optimize their coefficients, and generate C, Julia, and MATLAB code to evaluate them efficiently at scalar and matrix arguments. The software is released under the terms of the MIT license and is freely available on GitHub.

Exploiting the MATLAB Language in Implementing a Matrix Collection

“Concise code” may not be commonly used or defined, but as in languages, conciseness—the quality of communicating information in as few words as is feasible—can be a good guide in programming. Conciseness of code, which could mean short and self-explanatory lines or blocks of code, is easier to achieve in higher level languages such as Python or MATLAB, and more difficult when writing, for example, in RISC assembly languages with short instructions each doing very little.

For developing the recently released matrix collection, Anymatrix [1,2] (with Nick Higham), we used MATLAB and exploited its string handling routines to do complicated string manipulations in a few lines of code. This post describes some details of the implementation and further information can be found in the user’s guide [3].

Groups of Matrices

Anymatrix enforces the organization of matrices into groups, which means a new matrix has to be placed either into one of the existing groups or a new group. For example, the built-in groups are contest, core, gallery, hadamard, matlab, nessie, and regtools. Each of these come with around 20 matrices, which can be fixed-sized matrices or matrix generators.

Matrices are identified by <group_name>/<matrix_name>. For example, core/beta is a matrix called beta in the core group. Matrix names are their file names minus the extension .m. Group names are directory names of the groups. Since the IDs of matrices are derived from the directory and file names, uniqueness is guaranteed. This also means that multiple matrices can have the same name as long as they are in different groups.

Around the time we started working on this collection, MATLAB R2020b introduced pattern, to search and match strings conveniently. An example usage in Anymatrix is as follows.

% Matrix ID pattern
matrix_ID_pat = ...
  asManyOfPattern(alphanumericsPattern | ...
    characterListPattern("_") | ...
    characterListPattern("-")) + ...
  '/' + ...
  asManyOfPattern(alphanumericsPattern | ...
    characterListPattern("_") | ...

This code creates a pattern for matrix IDs, which matches sequences of letter, digit, underscore or hyphen characters, followed by a single forward slash, followed again by the sequence. This sort of pattern can then be used, for example, to match strings (using matches) and extract substrings (using extract):

>> S = '...core/beta...';
>> extract(S, matrix_ID_pat)
ans =
  1×1 cell array

The pattern functionality helped in writing concise code. We used it in various places of Anymatrix, for example in detecting matrix IDs in the commands and extracting substrings in the Boolean search expressions to separate properties from brackets and operators and/or/not.

Properties of Matrices

Matrix properties can be specified in two ways, either as an assignment properties = {...} inside the matrix .m files, or in the am_properties.m files in group folders. With the latter option we may specify properties of multiple matrices in one assignment. Anymatrix contains a list of recognized properties, such as ill conditioned or banded. When Anymatrix scans the file structure, it gives a warning if a matrix contains an unrecognized property.

Properties can be searched by Boolean expressions with operators and/or/not and brackets for specifying precedence. For example:

>> anymatrix('p', 'integer and ill conditioned')
ans =
  3×1 cell array
    {'core/wilson'     }
    {'matlab/pascal'   }

The following function that transforms the supplied Boolean expression into an expression that can be evaluated in MATLAB, demonstrates a lot of MATLAB’s features that we made use of.

function IDs = search_by_properties(expression)
  IDs = {};
  % Replace 'and', 'or', and 'not' by corresponding MATLAB symbols.
  expression = replace(expression, ' and ', ' & ');
  expression = replace(expression, ' or ', ' | ');
  expression = replace(expression, ' not ', ' ~');
  expression = replace(expression, '(not ', '(~');
  if startsWith(expression, 'not')
    expression = expression(4:length(expression));
    expression(1) = '~';
  % Assume properties are made up letters, can include a hyphen
  % or a white space character, and there is no case sensitivity.
  pat = (lettersPattern + whitespacePattern + lettersPattern) ...
        | (lettersPattern + characterListPattern('-') ...
          + lettersPattern) ...
        | lettersPattern;
  % Extract properties from the logical expression and replace
  % them by function calls to test for membership.
  ex_props = extract(expression, pat);
  ind = 1;
  new_expression = '';
  for p = ex_props.'
    mod_prop = strcat('ismember(''', ...
        strrep(lower(p{1}), '-', ' '), ...
        ''', strrep(lower(matrix_properties{1}), ''-'', '' ''))');
    trunc_exp = expression(ind:end);
    % Find where the property is in the expression.
    prop_index = strfind(trunc_exp, p{1});
    % Take anything before the property and append the modified
    % version of the property.
    new_expression = strcat(new_expression, ...
        trunc_exp(1:prop_index(1)-1), ...
    % Move the index after the property that was replaced.
    ind = ind + prop_index(1) + length(p{1}) - 1;
  new_expression = strcat(new_expression, expression(ind:end));

  % Find matrices whose properties satisfy the specified logical
  % expression.
  k = 0;
  for matrix_properties = properties.'
    k = k + 1;
    % Test if the expression is true for this matrix
    % and add it's ID.
    if eval(new_expression)
      IDs = [IDs; matrix_IDs{k}];

On lines 4 to 11 we replace the operators and/or/not by the corresponding MATLAB ones &/|/~. We then extract properties from the expression, making use of patterns, modify them to test for membership of the properties and place them back into the expression (lines 14 to 38). Finally, on lines 42 to 50 we use eval to run the Boolean expression as MATLAB code for every matrix, and return a list of suitable matrices.

In the example command anymatrix('p', 'integer and ill conditioned'), the expression 'integer and ill conditioned' is transformed internally to 'ismember('integer', strrep(lower(matrix_properties{1}), '-', ' ')) & ismember('ill conditioned', strrep(lower(matrix_properties{1}), '-', ' '))' which can then be evaluated using eval. Notice that Anymatrix replaces hyphens by white spaces to allow two ways to specify properties, and converts properties to lower case to avoid case sensitivity. This way, ill conditioned and Ill-conditioned, for example, are equivalent properties in Anymatrix.

Testing in Anymatrix

There are two ways to implement testing in Anymatrix. One way is to add a function in file test_run.m in the group folder which can be invoked by an appropriate Anymatrix command. Another way is to test matrices for their properties. This is provided in the directory testing/ in the root folder of Anymatrix.

It is worth noting that not all supported properties can be tested, so only a subset in the built-in collection are. Each property that is feasible to test has a corresponding test_<property>.m file. Functions in these files must return true or false, given a matrix. Anymatrix utilizes MATLAB’s unit testing framework and automatically generates function-based unit tests for every matrix in the collection. When a new matrix or a new property to an existent matrix is added, the unit testing framework picks that up automatically when run. The following is an example function-based unit test that is automatically generated by Anymatrix. It tests the properties of a set of the core/fourier matrices with arbitrary dimensions.

function test_core_fourier(testcase)
  A = anymatrix('core/fourier',3);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',5);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',8);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',10);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',15);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',24);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',25);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',30);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',31);
  anymatrix_check_props(A, 'core/fourier', testcase);

Above, the function anymatrix_check_props looks for files test_<property>.m and runs those tests that it finds on the given matrix:

for prop = P.'
  test_func_name = strcat('test_', ...
    strrep(strrep(lower(prop{1}), ...
    '-', '_'), ' ', '_'));
  if isfile(strcat(root_path, '/private/', test_func_name, '.m'))
    handle = str2func(test_func_name);
    verifyTrue(testcase, handle(M), ...
      strcat("Matrix ", matrix_ID, " is not ", prop{1}, "."));

Using this, Anymatrix adds a level of almost automatic testing invoked by adding new matrices or properties to matrices. A good example of this is the MATLAB built-in gallery group, which we made available in Anymatrix. By appending each matrix with properties we can check that MATLAB’s gallery matrices have the expected properties.


Anymatrix is both a collection of matrices and a tool to organize them. It is worth noting that much of the infrastructure is not specific to organizing matrices and so it can be reused to organize into groups, search, and access through IDs any kind of objects appended with properties.

Anymatrix 1.0 was released in October 2021, but the development continues. We welcome contributions to the collection either in the form of remote groups that can be downloaded into Anymatrix given a git url (an example is matrices-mp-cosm group by X. Liu), or suggestions to include matrices or groups in the built-in collection. We also welcome reports of bugs or recommendations for improvements to the clarity, correctness and conciseness of the source code.


[1] N. J. Higham and M. Mikaitis. Anymatrix: An Extensible MATLAB Matrix Collection. MIMS EPrint 2021.16, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Oct. 2021 (to appear in Numer. Algorithms).

[2] N. J. Higham. Anymatrix: An Extensible MATLAB Matrix Collection. Nov. 2021.

[3] N. J. Higham and M. Mikaitis. Anymatrix: An Extensible MATLAB Matrix Collection. Users’ Guide. MIMS EPrint 2021.15, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Oct. 2021.

Plotting Circles to Test Low Precision Arithmetics and Rounding Modes

As part of the 28th IEEE Symposium of Computer Arithmetic (held in July 2021), a special issue of IEEE Transactions on Emerging Topics in Computing on “Emerging and Impacting Trends on Computer Arithmetic” was organized, which has recently been published [1]. With Massimiliano Fasi, we have contributed an article [2] (available Open Access) on simulating stochastic rounding [3] in floating-point arithmetics, and about which we wrote in this blog before [4].

One interesting numerical example which we used to test various arithmetics and rounding modes [Sec. VIII C2, 4] is an ordinary differential equation (ODE)

u'(t)=v(t), \\ v'(t)=-u(t).

With the initial conditions u(0)=1 and v(0)=0, the solution of this ODE lies on the unit circle in the uv-plane [p. 51, 5]. When solving it numerically our is aim to obtain the solution that lies as close as possible to the unit circle. A similar example is approximating orbits of asteroids in Physics [Sec. II A, 8], although it is more complicated since it additionally includes velocity and acceleration.

Using a forward Euler scheme we obtain:

u_{k+1}=u_k+hv_k, \\ v_{k+1}=v_k-hu_k,

where h=\frac{2\pi}{n} is a step size. Now, solving this with n=32 in bfloat16 arithmetic [6] with round to nearest (RN) and stochastic rounding in arithmetic operations, we obtain the solution in Figure 1 (dotted line—exact unit circle, dashed—RN, solid—SR), which spirals out as previously shown [p. 51, 5]. To improve this we could use a different solver, such as Trapezium or Leapfrog [p.51, 5], but for the purposes of testing arithmetics rather than solvers we stick with Euler and try different step sizes.

Figure 1: Unit circle ODE solutions with n=32. The x-axis represents u and the y-axis represents v. Dotted line—exact unit circle, dashed—RN, and solid—SR.

Next, we tried n=2^9 and the plot is shown on the left hand side of Figure 2. Visually, this timestep size for the bfloat16 arithmetic with RN looks ideal since it quite closely approximates the unit circle and the solution does not spiral out as before. SR performs quite well here as well but noticeably worse than RN.

Next, we further reduce the timestep to n=2^{11} and the solution with that is plotted on the right hand side of Figure 2. In this case the solution with RN has been affected by rounding errors—the approximation of the unit circle looks visually as an octagon rather than a circle!

So what happened there? In the paper, we explain this through the problem of stagnation in floating-point arithmetic. It happens when many addends to some relatively large value are small enough so that they are all lost to rounding and the sum stays at some initial or intermediately reached value. In this case, once we start at an initial point u=1 and v=0, we expect that both u and v will start to decrease. However, only v is doing that since in u_{k+1}=u_k+hv_k the term hv_k is too small initially to change u_k.

The same pattern repeats during the whole solution and u and v keep switching places in suffering from stagnation. Since SR is immune to stagnation [7], this issue does not appear.

Figure 2: Unit circle ODE solutions with n=2^{9} (left) and n=2^{11}.

Finally, we reduce the timestep by a further factor of 100 (Figure 3). This revealed that the ODE solution solved in bfloat16 hardly moves away from the initial conditions with RN, but is still quite accurately computing the approximation of the unit circle with SR.

Figure 3: Unit circle ODE solutions with n=2^{13}.

The unit circle ODE constitutes an easy to run experiment to observe stagnation in floating-point arithmetic, which is usually done through recursive summation (also in our paper, [Sec. VIII A-B, 2]), and is a good visually convenient benchmark for testing low precision arithmetics and alternative rounding modes, such as stochastic rounding, and perhaps for teaching students about floating point and stagnation.

Further detail on the unit circle experiments, as well as other experiments with different ODEs and solvers, can be found in Section VIII of our paper [2]. The MATLAB code for the experiments is available at


[1] M. Joldes, F. Lamberti, and A. Nannarelli, Special Section on “Emerging and Impacting Trends on Computer Arithmetic”. IEEE Trans. Emerg. Topics Comput., 9:3. Sep. 2021

[2] M. Fasi and M. Mikaitis, Algorithms for Stochastically Rounded Elementary Arithmetic Operations in IEEE 754 Floating-Point Arithmetic. IEEE Trans. Emerg. Topics Comput., 9:3. Sep. 2021

[3] N. J. Higham, What is Stochastic Rounding?. Jul. 2020

[4] M. Mikaitis, Simulating Stochastically Rounded Floating-Point Arithmetic Efficiently. Nov. 2020

[5] N. J. Higham, “Goals of applied mathematical research” in The Princeton Companion to Applied Mathematics, N. J. Higham, M. R. Dennis, P. Glen- dinning, P. A. Martin, F. Santosa, and J. Tanner, Eds. Princeton, NJ, USA: Princeton Univ. Press, 2015, pp. 48–55.

[6] N. J. Higham, What is Bfloat16 Arithmetic?. Jun. 2020

[7] M. P. Connolly, N. J. Higham, and T. Mary, Stochastic rounding and its probabilistic backward error analysis. SIAM J. Sci. Comput., 43(1), A566–A585. Feb. 2021

[8] D. A. Fauxa and J. Godolphin, The floating point: Tales of the unexpected. Am. J. Phys., 89 (8). Aug. 2021

CPFloat: A C library for emulating low-precision arithmetic

Detail of a replica of the mechanical computer Z1 on display at the German Museum of Techonolgy in Berlin. This machine, developed by Konrad Zuse between 1936 and 1938, could perform floating-point computations using a 24-bit representation. “Mechanischer Computer Z1, Nachbildung” by Stiftung Deutsches Technikmuseum Berlin is licensed under CC BY 4.0.

Computing machines with floating-point capabilities started appearing in the late 1940s, and despite being regarded as an optional feature for decades to come, floating-point hardware equipped a number of the mainframe computers produced in the 1960s and 1970s. At the time, there was no consensus on what features a number system ought to have, and the arithmetics implemented by any two vendors could differ widely in terms of word sizes, floating-point representations, rounding behaviour, and accuracy of the operations implemented in hardware. This lack of standardization was a major hindrance to the portability of the source code, as it made it significantly harder to write and maintain architecture-independent code.

The way out of this predicament was the establishment of the IEEE 754-1985 standard, which dictated word sizes, representations, and accuracy requirements for two default floating-point formats: single and double precision. For about three decades, hardware vendors conformed with the standard, and the large majority of general-purpose processing units came equipped with IEEE-compliant arithmetic units. Motivated by the low-precision requirements of machine learning applications, however, in recent years hardware manufacturers have started exploring the realm of low-precision arithmetics, commercialising a variety of highly optimised hardware units that pursue efficiency at the expense of accuracy. The high-performance computing community has attempted to channel the extreme performance of this hardware and utilise it in more traditional scientific computing applications, where more stringent accuracy needs typically require the use of higher precision.

The broad range of different formats available poses a serious challenge to those trying to develop mixed-precision algorithms, since studying the numerical behaviour of an algorithm in different working precisions may require access to a number of high-end hardware platforms, including supercomputer-grade CPUs and GPUs. In order to alleviate the problem, several software packages for simulating low-precision floating-point arithmetics in software have been developed. Some of these are described in the table below. All these solutions execute each arithmetic operation in hardware, and use the software layer only to round the results to the desired number of significant bits of precision.

Available software packages for simulating low-precision floating-pointarithmetic. The first three columns reports the name of the package, the main pro-gramming language in which the software is written, and what storage formats are supported. The following three columns describe the parameters of the target formats: whether the number of bits of precision in the significand is arbitrary (A) or limited to the number of bits in the significand of the storage format (R); whether the exponentrange can be arbitrary (A), must be the same as the storage format (R), or a sub-range thereof (S); whether the target format supports subnormal numbers (S), does not support them (N), supports them only for builtin types (P), or supports them but allows the user to switch the functionality off (F). The following column lists the floating-point formats that are built into the system. The last five columns indicate whetherthe software supports round-to-nearest wih ties-to-even (RNE), ties-to-zero (RNZ),or ties-to-away (RNA), the three directed rounding modes of the IEEE 754 standardround-toward-zero (RZ), round-to-+∞ and round-to-−∞ (RUD), round-to-odd (RO), and the two variants of stochastic rounding discussed in Section 3 (SR). The abbreviations bf16, tf32, fp16, fp32, and fp64 denote the formats bfloat16, TensorFloat-32, binary16, binary32, and binary64, respectively.

If the hardware and the software-simulated floating-point format are both IEEE compliant, rounding requires only standard mathematical library functions, but a straightforward implementation of the rounding formulae may potentially cause two kinds of issues. From a theoretical point of view, handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly when evaluated in finite arithmetic. In terms of performance, on the other hand, the algorithms obtained in this way are not necessarily efficient, as they build upon library functions that, being designed to handle a broad range of cases, might not be optimised for the specific needs of floating-point rounding functions.

In MATLAB, low-precision arithmetics can be simulated using the chop function, developed by Higham and Pranesh and discussed in a previous post. Mantas Mikaitis and I decided to port the functionalities of this MATLAB implementation to a lower level language, in order to obtain better performance. The scope of the project has progressively broadened, and we have recently released the first version of CPFloat, an open-source C library that offers a wide range of routines for rounding arrays of binary32 or binary64 numbers to lower precision. Our code exploits the bit-level representation of the underlying floating-point formats, and performs only low-level bit manipulation and integer arithmetic without relying on costly library calls.

The project is hosted in a GitHub repository where, along with the C code, we provide a MEX interface for MATLAB and Octave which is fully compatible with the chop function. The algorithms that underlie our implementation, as well as the design choices and the testing infrastructure of CPFloat are discussed in detail in a recent EPrint. In the numerical experiments shown there, CPFloat brings a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, our library is currently the most efficient and complete software for experimenting with custom low-precision floating-point arithmetic available in any language.

Related articles

Simulating Stochastically Rounded Floating-Point Arithmetic Efficiently

Most general-purpose computing devices implement floating-point arithmetic defined in the IEEE 754-1985 standard (revised in 2008 and 2019) [1]. The standard provides various compulsory and recommended operations in order to prepare the hardware for as wide a space of applications as possible, as well as to assure reproducibility of numerical calculations across different devices. Hardware that is IEEE 754-compliant usually includes at least addition/subtraction, multiplication, division and square root operations. Each of these can be performed with various rounding modes, chosen by the user from round-to-nearest ties-to-even, round-toward-zero, round-toward-infinity, and round-toward-minus-infinity. Round-to-nearest provides the lowest error bound possible and thus is the default choice for most, but the other rounding modes find uses in various applications as well.

Given hardware with the five arithmetic operations and four rounding modes described above, how can we simulate other rounding modes efficiently when needed? Rounding modes that are not included in the IEEE 754 standard but are of interest in various research domains are, for example: round-to-odd [2], round-to-nearest ties-to-zero [3], stochastic rounding with 50/50 chance of rounding to either of the neighbouring values [4], or stochastic rounding with probabilities proportional to the distances between the exact value and the neighbouring floating-point values [5, 6]. The last of these is also used in fixed-point arithmetic: for example in machine learning [7] and ordinary differential equation (ODE) solvers [8].

In a recent preprint with Massimiliano Fasi [9] we address the problem of simulating the proportional variant of stochastic rounding of operations in precision k, when there is no higher precision available in hardware (therefore k is the working precision as well as the target precision for rounding). This rounding mode requires the round-off bits captured to be used in calculating the probabilities, but they are lost in IEEE 754-compliant operations of present hardware devices—an operation is performed and only the rounded result is returned in a register. A straightforward approach would be to perform operations in higher precision than k and derive the probabilities from the high precision answer. However, this approach can result in low efficiency code because we would need to use software implementation of higher precision arithmetic operations; for example the MPFR package or other similar software packages developed for the purpose of computing in higher precisions than the working precision.

Our solution is to use error-free floating-point transformations [10] for addition/subtraction and multiplication, and similar transformations (yet not error free) for the division and square root, to compute the distances between the exact number and the two closest floating-point values. This can be done by having the original arguments and the rounded answer from the hardware operation, and knowing what rounding mode was used for producing it. Our algorithms presented in the paper cover two possible cases: when the rounding mode of the hardware operations is round-to-nearest ties-to-even (default) or a combination of the other three rounding modes. The former strategy results in slightly more complicated algorithms since round-to-nearest can round to both sides of the exact number, but the latter is slower due to costs associated with changing the rounding modes—this is true at least with Intel processors which we used for testing.

Table 1: Throughput (in Mop/s) of the C implementations of the algorithms discussed in the paper [9]. The parameter p represents the number of significant digits in the fraction of the MPFR numbers being used; algorithms that do not use MPFR have a missing value in the corresponding row. The baseline for the speedup is the mean thoughtput of the MPFR variant that uses 113 bits to perform the same operation.

In simulating stochastically rounded binary64 (double precision) arithmetic operations on a computer equipped with IEEE 754-compliant operations in that precision, our algorithms achieved between 7.3 and 19 times higher throughput of arithmetic operations compared with the simulation that uses MPFR (Table 1), while not making any compromises in accuracy. In the preprint we also evaluate our algorithms in various applications where stochastic rounding is expected to bring advantages: summation of floating-point numbers and ODE solvers. In summary, our algorithms can provide significant speedup for programs that use stochastic rounding operations in working precision.

It is worth noting that the algorithms explored in [9] should not be used if higher precision than the target precision (the precision of which stochastic rounding we would like to simulate) is available in hardware. In such cases, the higher precision should be used if possible, and the high precision results from arithmetic operations should be rounded stochastically to the target precision using different algorithms: such as the ones used in the MATLAB chop [11] and CPFloat [12] packages (a separate blog post on CPFloat can be found here).


[1] 754-2019 – IEEE Standard for Floating-Point Arithmetic. July 2019.
[2] Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. S. Boldo and G. Melquind. IEEE Trans. Comp., February 2008.
[3] Emulating Round-to-Nearest-Ties-to-Zero “augmented” Floating-Point Operations Using Round-to-Nearest-Ties-to-Even Arithmetic. S. Boldo, C. Q. Lauter, and J.-M. Muller. IEEE Trans. Comp., June 2020.
[4] CADNA: A library for estimating round-off error propagation. F. Jézéquel and J.-M. Chesneaux. Comp. Phys. Comm., 2008.
[5] Stochastic Rounding and its Probabilistic Backward Error Analysis. M. Connolly, N. J. Higham, and T. Mary. April 2020 (updated August 2020). To appear in SIAM J. Sci. Comp.
[6] Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision. M. Croci and M. B. Giles. October 2020.
[7] Deep Learning with Limited Numerical Precision. S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Int. Conf. Machin. Learn., 2015.
[8] Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. M. Hopkins, M. Mikaitis, D. R. Lester, and S. B. Furber. Phil. Tran. A. Roy. Soc., January, 2020.
[9] Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. M. Fasi and M. Mikaitis. April 2020 (updated October 2020).
[10] Handbook of Floating-Point Arithmetic. J.-M. Muller, N. Brunie, F. de Dinechin, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. 2nd ed. Birkhäuser Basel, 2018.
[11] Simulating Low Precision Floating-Point Arithmetic. N. J. Higham and S. Pranesh. SIAM J. Sci. Comp., October 2019.
[12] CPFloat: A C library for emulating low-precision arithmetic. M. Fasi and M. Mikaitis. October 2020.

Related articles

What Is Stochastic Rounding? by Nicholas J. Higham
What Is IEEE Standard Arithmetic? by Nicholas J. Higham
What Is Floating-Point Arithmetic? by Nicholas J. Higham
What Is Rounding? by Nicholas J. Higham

Parallel sparse Linear solvers: Double versus Single precision

It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems. Existing work on the performance of mixed precision algorithms for solving sparse linear systems has shown that a speedup up to 2x can be expected, but these studies are limited to single core experiments. In this EPrint, Nick Higham, Françoise Tisseur, Craig Lucas and I evaluate the benefits of using single precision arithmetic in solving a double precision sparse linear systems on multi-core processors and GPUs, focusing on the key components of LU factorization and matrix–vector products.

We find that single precision sparse LU factorization is prone to a severe loss of performance due to the intrusion of subnormal numbers. This phenomenon of LU factorization generating subnormal numbers does not appear to have been observed before. In fact, the elements at the (k +1)st stage of Gaussian elimination are generated from the formula a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik} a_{kj}^{(k)}, \quad m_{ik} = a_{ik}^{(k)}/a_{kk}^{(k)}, where m_{ik} is a multiplier. If A is a dense matrix of normalized floating-point numbers with norm of order 1, it is extremely unlikely that any of the a_{ij}^{(k)} will become subnormal. However, for sparse matrices we can identify a mechanism whereby fill-in cascades down a column and small multipliers combine multiplicatively. This mechanism is described, with an illustration for better understanding, in the EPrint, as well as strategies to automatically flush subnormals to zero, to avoid the performance penalties.

The figure below shows the speedup of single precision sparse LU factorization over double precision using the sparse direct solver MUMPS on 10 Intel Skylake cores. We used 36 sparse matrices from the SuiteSparse Matrix Collection, from which 21 matrices are of medium size with 700, 000 to 5, 000, 000 nonzero elements, and 15 larger matrices with 7,000,000 to 64,000,000 nonzeros. The orange bars represent the speeds achieved when subnormals are not flush to zero (FTZ) during the single precision LU factorization. For approximately 30% of the matrices, the orange bars are below the threshold of 1. This means instead of accelerating the computation, single precision slows it down compared with the double sparse LU decomposition. This has been corrected by flushing subnormals to zero as illustrated by the blue bars.

Single precision speedup over double precision for sparse LU factorization
using MUMPS on 10 Intel Skylake cores.

As for the overall performance, our experiments show that the anticipated speedup of 2x over a double precision LU factorization is obtained only for the very largest of our test problems. In the figure above, the matrices that exceed the speedup of 1.5x are exclusively of very large size. This result is not atypical, as two other sparse direct solvers considered in our work, PARDISO and SuperLU show a similar result. This contrasts with dense linear systems, where a 2x speedup is often achieved even for matrices of size as small as 200 \times 200 when single precision is used in place of double precision.

The speedups observed in our parallel experiments are clearly lower than the speedups reported in the existing works based on single core experiments. To understand this contrast, it is important to recall that sparse matrix factorization algorithms have a pre-processing step called reordering and analysis, before the actual numerical factorization. In most of the sparse direct solvers, the reordering and analysis step is serial or has a limited parallelism. Therefore, while the time spent in numerical factorization step decreases with increasing number of cores, the reordering and analysis time stagnates and can becomes a performance bottleneck. In addition, as the reordering and analysis step does not involve floating point computation, lowering the precision, from double to single, has a little impact on the speedup except in the case of very large matrices where the numerical factorization step remains dominant despite the use of multiple cores.

For iterative solvers, our results show that the performance gain in computing or applying incomplete factorization preconditioners in single precision is not appealing enough to justify the accuracy sacrifice, but we have observed a speedup of 1.5x from matrix–vector product kernels by using single precision. In future work, we will explore new approaches to integrate efficiently single precision matrix–vector product kernels and single precision preconditioners in double precision iterative solvers without accuracy loss.

Mixed Precision LU Factorization on GPU Tensor Cores

Modern GPUs are equipped with mixed precision units called tensor cores that offer the capability of computing matrix–matrix products both at very high performance and with high accuracy. GPU tensor cores have been used to accelerate various numerical linear algebra algorithms. Among these, LU factorization is a natural candidate, since it can be written so as to heavily rely on matrix–matrix operations.

Haidar et al. were the first to propose an algorithm to compute the LU factorization of a general dense matrix exploiting GPU tensor cores, and obtained up to a 4x-5x speedup in the solution of various linear systems. Moreover, they also showed that the accuracy boost delivered by tensor cores could significantly improve the convergence of iterative methods preconditioned by the computed LU factors.

In a recent paper, Pierre Blanchard, Nick Higham, Florent Lopez, Srikara Pranesh, and I performed the rounding error analysis of an LU factorization algorithm exploiting what we called block FMAs—a generalization of tensor cores that also includes similar mixed precision units such as Google’s TPUs. We explained how block FMAs can be chained together to accumulate the result of intermediate computations in the higher precision (single precision for tensor cores), which allows for a significant accuracy boost.

However, these algorithms shared a common limitation: the matrix was stored in single, rather than half, precision. Indeed, if the matrix is stored in half precision, the accuracy boost is lost and the computed LU factors are essentially identical to those obtained by a standard LU factorization run entirely in half precision. This represents a lost opportunity, because storing the matrix in half precision not only halves the memory footprint, but also greatly reduces data movement, which brings a significant performance boost.

In a recent preprint, Florent Lopez and I propose a new LU factorization algorithm that is able to store the matrix in half precision without incurring such a significant accuracy loss. There are two main ingredients to this new algorithm. First, we switch from the standard right-looking scheme to a left-looking one, which, together with single precision buffers of controlled size, allows us to perform the update operations accurately. Second, we doubly partition the matrix so as to accurately perform the panel factorizations by exploiting tensor cores.

In our experiments, we factorize a range of test matrices using an NVIDIA V100 GPU. The two figures below summarize our main results by comparing three LU factorization algorithms: the standard algorithm (of Haidar et al. and Blanchard et al.) with the matrix stored in either single or half precision, and the new algorithm.

The figure on the left reports the backward error obtained by solving a linear system Ax=b for a dense n\times n matrix with random entries. The figure on the right displays the performance in GigaFLOPS. Unlike the standard algorithm, the new one does not incur a significant loss of accuracy by storing the matrix in half precision. In doing so, it greatly reduces data movement which leads to significant speedup: the new algorithm can be up to twice faster than the standard one.


The need to compute matrix functions arises in many applications in science and engineering. Specialized methods exist for evaluating particular matrix functions, for example, the well-known scaling and squaring algorithm for the matrix exponential, Newton’s method for matrix sign function, and the inverse scaling and squaring method for the matrix logarithm. For some functions a specialized method is not available, in which case a general purpose algorithm is needed. The Schur-Parlett algorithm computes a general function f of a matrix, with the function dependence restricted to the evaluation of f on the diagonal blocks of the reordered and blocked Schur form. It evaluates f on the nontrivial diagonal blocks via a Taylor series, so it requires the derivatives of f and it also requires the Taylor series to have a sufficiently large radius of convergence. However, the derivatives are not always available or accurately computable.

In our recent preprint, Nick Higham and I present a version of the Schur-Parlett algorithm that requires only function values and uses higher precision arithmetic to evaluate f on the diagonal blocks of order greater than 2 (if there are any) of the reordered and blocked Schur form. Our algorithm is inspired by Davies’s randomized approximate diagonalization method, but we show that the measure of error that underlies randomized approximate diagonalization makes it not a reliable numerical method for computing matrix functions. The key idea in our algorithm is to compute by diagonalization the function of a small random diagonal perturbation of each triangular block, where the perturbation ensures that diagonalization will succeed. Multiprecision algorithms have already been developed for the matrix exponential and the matrix logarithm, and those algorithms are tightly coupled to the functions in question, whereas here we place no restrictions on the function. And the algorithm we propose is not restricted only to a working precision of double since its framework is precision independent. We test our algorithm in double precision on a variety of test problems, and the results show that the algorithm generally behaves in a numerically stable fashion. We apply our algorithm to the matrix Mittag-Leffler function and show that it yields results of accuracy similar to, and in some cases much greater than, the state of the art algorithm for this function.

In the following example we employ the algorithm to compute in double precision the matrix Airy function, and compare the result with that produced by a spectral decomposition method (which breaks down if A is defective, and will be inaccurate if A is close to being defective).

We observe that the normwise relative difference between our algorithm and the spectral decomposition method is tiny. This is just one of the many matrix functions that no specialized algorithm is available for, and the MATLAB \texttt{funm} function, which implements the Schur-Parlett algorithm, is not able to compute. Our new algorithm greatly expands the class of readily computable matrix functions, and its MATLAB code is available here.

Pivoted Matrices with Tunable Condition Number and the HPL-AI Benchmark

The RIKEN Center for Computational Science in Kobe, Japan, home to the supercomputer Fugaku. This machine, scheduled to become fully operational in 2021, heads the June 2020 edition of the TOP500 list and achieved an execution rate of 1.421 x 1018 (mixed-precision) floating-point operations per second (1.421 exaflops) on the HPL-AI Mixed-Precision Benchmark by solving a system of order 13,516,800.

Traditionally, the fastest supercomputers in the world are ranked according to their performance on HPL, a portable implementation of the High-Performance Computing Linpack Benchmark for distributed memory systems. This software package measures the rate of execution, expressed in binary64 floating-point operations per second, at which a computer solves a large dense linear system using LU factorization with row partial pivoting. This metric is an excellent predictor of the performance a machine will achieve on compute-bound tasks executed in binary64 arithmetic, but the test problem is not representative of typical artificial intelligence computations, where lower precision is typically considered satisfactory.

In an endeavour to consider both workloads at once, the Innovative Computing Laboratory (ICL) at the University of Tennesse, Knoxville, developed the HPL-AI Mixed-Precision Benchmark. The reference implementation of the benchmark solves a dense linear system Ax = b of order n to binary64 accuracy by complementing a low-precision direct solver with a high-precision refinement strategy. The code generates A and b in binary64, computes the LU factorization without pivoting of A using only binary32 arithmetic, finds an approximate binary32 solution \widetilde x using forward and back substitution, and finally solves the linear system with GMRES in binary64 arithmetic, using \widetilde x as starting vector and the low-precision LU factors of A as preconditioners.

In order to obtain the best possible rate of execution, the coefficient matrix A must be dense and, as the benchmark is expected to be needed for n > 10^7, it should also be easy to generate at scale on a distributed memory environment. Furthermore, we require that A have growth factor of order 1, so as to guarantee the stability of Gaussian elimination without pivoting, and be not too ill-conditioned, in order to ensure the convergence of GMRES. Currently, the reference implementation relies on a randomly generated matrix that is diagonally dominant by rows. This choice ensures the stability of LU factorization without pivoting, but the matrix thus constructed requires an amount of communication that depends on n, and turns out to be extremely well conditioned, with an \infty-norm condition number just above 4 for large n.

In our EPrint Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization, Nick Higham and I present a new parametric class of dense square matrices A(\alpha,\beta) for which LU factorization without pivoting is stable. The parameters \alpha and \beta that yield a specified \infty-norm condition number can be computed efficiently, and once they have been chosen the matrix A(\alpha,\beta) can be formed from an explicit formula in O(n^2) floating-point operations.

We also discuss several adaptations aimed at further improving the suitability of the generated matrix as a test problem for the HPL-AI benchmark, as our main goal is to provide a robust and efficient matrix generator for it. In particular, we explain how scaling the matrix can help transition from binary32 to binary16, in order to take advantage of the hardware accelerators for low precision such as the tensor cores that equip recent NVIDIA GPUs, and we discuss how the matrices can be tweaked to make the entries in the LU factors less predictable and slow down the convergence of GMRES.

Numerical Behaviour of Tensor Cores

For many years, the arithmetic operations available on most hardware were +, -, *, /, \sqrt{}. More recently (\sim 20 years ago) the FMA (fused multiply-add) operation also became prevalent in general purpose devices such as CPUs and GPUs. Software builds on top of these operations, for example compilers use a series of these and other hardware instructions to implement more complex functions and algorithms such as the exponential function, matrix multiply, or the algorithms for pseudo random number generation. These arithmetic operations have been standardized by the IEEE 754 floating-point standard since 1985 and most current systems are compliant with it.

Recently, because of the increasing adoption of machine learning, general purpose devices started to include inner product or matrix multiply-accumulate (MMA) operations in hardware. This is a generalization of a scalar FMA to vectors and matrices. Since it is performed in hardware, the expected speedup is achieved due to parallelism — instead of using a few FMA hardware units sequentially to multiply matrices as is the case in software, all of the elements of a matrix of some size are computed in parallel. Using the inner product and matrix multiply-accumulate operations in hardware, compute-bound applications that have high-intensity usage of them are sped up significantly.

Figure 1: Mixed-precision matrix muliply-accumulate operation on 4×4 matrices performed by the tensor cores.

NVIDIA GPUs are widely used for machine learning and other applications. In the latest TOP500 supercomputer list published in June 2020 [1], 112 computers are equipped with NVIDIA graphics cards. The main feature of the recent NVIDIA GPUs is hundreds of arithmetic units for performing the MMA operation — NVIDIA calls these tensor cores. Various applications outside machine learning are being explored in order to utilize very high arithmetic throughput that can be achieved when using MMAs in hardware [2]. Table 1 lists three recent NVIDIA architectures as well as other hardware devices with an MMA operation on chip. The NVIDIA V100 has the first version and the T4 has the second version of tensor cores; in these devices tensor cores work on matrices of size 4 \times 4 in mixed-precision of fp16 and fp32, as shown in Figure 1. The most recent revision of tensor cores in the A100 updated both the precisions available and the dimensions of input/output matrices. See the NVIDIA V100 and NVIDIA A100 whitepapers for more details.

Table 1: Processing units or architectures equipped with mixed-precision matrix multiply-accumulate accelerators. The notation m \times k \times n refers to the matrix multiply-accumulate operation where C and D are m \times n matrices, and A and B have size m \times k and k \times n, respectively.

The MMA operation is not standardized by the IEEE 754, therefore various numerical features of tensor cores are not clear. Knowing such features as the support for subnormal numbers, order of operations, and rounding modes and normalization of significands in various parts of the MMA computation can be important, for example when doing error analysis of algorithms that utilize MMAs in order to derive error bounds [3]. As the application space utilizing tensor cores is growing beyond machine learning, understanding of the numerical behaviour of tensor cores will become increasingly useful. Even before such hardware is put to use in some applications, one might want to simulate the behaviour of tensor cores in order to develop numerical codes targeted to tensor cores, on a conventional hardware. Furthermore, there is also a question of differences in numerical behaviour between tensor cores and software MMA using the standard FMA hardware calls, for example differences that would appear when transitioning the existent software to use tensor cores.

Motivated by this, in our recent preprint [4] we investigate an experimental testing methodology of the tensor cores. The method consists of the following steps.

  1. Identify a numerical feature that needs to be tested, for example, the rounding mode in the 5-operand adders that accumulate products in MMA.
  2. Identify possible implementations, for example round to nearest or round toward zero.
  3. Find floating-point inputs that would result in different outputs for each possible hardware behaviour in 2.
  4. Observe outputs and make a conclusion based on expected outputs for each possible behaviour in 2.

Using this approach, we identified the following list of numerical features of the NVIDIA tensor cores.

  • Subnormal floating-point numbers are fully supported, both on the inputs and outputs.
  • 5-operand adders accumulate 5 addends (4 products from AB and a value from C) starting from the largest in magnitude.
  • Round toward zero, rather than round to nearest (default in IEEE 754-compliant arithmetic), in the 5-operand adders is implemented.
  • Different normalization behaviour from the MMA implemented in software (tensor cores normalize the answer of the whole 5-element dot product at the end rather than after each addition of products).
  • Inner products without intermediate normalization are shown to be non-monotonic in rare cases (this result is more general than tensor cores, since to the best of our knowledge, most hardware implementations do not normalize on every addition due to lower hardware costs).

Our conclusion is that in the current version, the tensor cores on V100 and T4 (the A100 is not yet available to us) do not replicate the behaviour of the MMA implemented with IEEE 754 compliant FMA hardware operations. These numerical behaviours are expected in a hardware MMA optimized for reducing hardware cost and most likely is motivated by a fact that machine learning applications usually are claimed to not need all the IEEE 754 features and high precision in general. These results provide the parameters that can be used in rounding error analysis of tensor cores [3] which can be useful when developing numerical software.

Our CUDA code to test numerical features of tensor cores is available here.



[2] A. Abdelfattah et al. A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic. July 2020. Published online.

[3] P. Blanchard, N. J. Higham , F. Lopez, T. Mary, and S. Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput. May 2020.

[4] M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. Numerical Behavior of NVIDIA Tensor Cores. July 2020; revised October 2020. MIMS Eprint, published online.

Related articles

« Older Entries