## Nick Higham Named 2020 ACM Fellow

Professor Nick Higham has been named among the 2020 Association for Computing Machinery (ACM) Fellows, who are recognised for work underpinning contemporary computing.

The accomplishments of the 2020 ACM Fellows have driven innovations that have ushered in significant improvements across many areas of technology, industry, and personal life.

Nick has been recognised for his contributions to numerical linear algebra, numerical stability analysis, and communication of mathematics.

He is among 95 ACM Fellows, representing universities, corporations and research centres around the world, who are celebrated for their wide-ranging and fundamental contributions in areas including artificial intelligence, cloud computing, computer graphics, virtual reality, and more.

The ACM Fellows programme recognises the top 1% of ACM members for their outstanding accomplishments in computing and information technology and/or outstanding service to ACM and the larger computing community. Fellows are nominated by their peers, with nominations reviewed by a distinguished selection committee.

ACM President Gabriele Kotsis said: “The 2020 ACM Fellows have demonstrated excellence across many disciplines of computing. These men and women have made pivotal contributions to technologies that are transforming whole industries, as well as our personal lives. We fully expect that these new ACM Fellows will continue in the vanguard in their respective fields.”

## Numerical Linear Algebra Group Activities 2020

The Numerical Linear Algebra Group had a productive year in 2020, despite working remotely from March onwards because of the pandemic. This post summarizes what we got up to. Publications are not included here, but many of them can be found on MIMS EPrints under the category Numerical Analysis; see also these news stories about our publications.

Craig Lucas, Nick Higham, Xinye Chen, Steven Elsworth, Xiaobo Liu, Michael Connolly, Mantas Mikaitis, Len Freeman, Massimiliano Fasi, Pierre Blanchard, Sven Hammarling, Asad Raza Aitor Mehasi Mehasi, Stephanie Lai, Gian Maria Negri Porzio, Thomas McSweeney, Mawussi Zounon, Françoise Tisseur, Srikara Pranesh, Yuqing Zhang, Eleni Vlachopoulou, March 2020.

## Software

We make our research codes available as open source, principally on GitHub; see the repositories of ConnollyFasiHighamLiuPranesh, Tisseur, and Zounon.

We also put MATLAB software on MATLAB Central File Exchange and on our own web sites, e.g., the Rational Krylov Toolbox (RKToolbox).

## PhD Students

We welcomed new PhD students Xinye Chen and Thomas Seleiro.

Steven Elsworth successfully defended his PhD thesis Rational Krylov Methods and Machine Learning Approaches to Time Series Forecasting in March 2020 .

Michael Connolly took an internship with MathWorks from July to September 2020.

## Postdoctoral Research Associates (PDRAs)

Mantas Mikaitis, previously an EPSRC Doctoral Prize Fellow in the group, is now working on the ICONIC project in the group.   During the year he successfully defended his PhD thesis Arithmetic Accelerators for a Digital Neuromorphic Processor in the Department of Computer Science.

Massimiliano Fasi left the group in September 2020 and is now working at Örebro University in Sweden.

Roberto Cahuantzi  was a member of the group from March to September 2020, working with Stefan Güttel.

## Recognition and Service

Jack Dongarra received the 2020 IEEE Computer Society’s Computer Pioneer Award.

Srikara Pranesh and Michael Connolly won first and second best poster prizes, respectively, at the SIAM UKIE Section Meeting, Edinburgh, January 2020.

Françoise Tisseur received the London Mathematics Society’s Fröhlich Prize.

Theo Mary received an honourable mention for the 2020 Householder Prize and the 2021 SIAG/LA Early Career Prize. He also received a grant from the Faculty of Engineering Sciences of Sorbonne University for a project on”Mixed precision algorithms for HPC”.

Sixteen publications by members of the NLA group feature among the top 40 most read articles in two SIAM journals, both leading venues for publications in numerical linear algebra.

Stefan  Güttel was awarded the 2021 James H. Wilkinson Prize in Numerical Analysis and Scientific Computing.

Nick Higham received the IMA Gold Medal 2020.

Theo Mary has been awarded the 2021 SIAG/LA Early Career Prize by the SIAM Activity Group on Linear Algebra.

## Grants

Stefan Güttel’s and Nick Higham’s Alan Turing Fellowships have been extended by one year to September 2021.

Stefan Güttel received a Small Project Grant from the Alan Turing Institute.

Nick Higham and Françoise Tisseur received funding for work on multi-precision algorithms from Lawrence Livermore National Laboratory under the Exascale Computing Project.

Nick Higham and Françoise Tisseur received funding from The MathWorks, Inc. to support a PhD student to work on exploiting multiprecision arithmetic.

Massimiliano Fasi is one of the participants of the 2020 INdAM-GNCS project “Low-rank methods for linear algebra problems with data-sparse structure” funded by the Scientific Computing Group of the Istituto Nazionale di Alta Matematica “Francesco Severi”.

## Theo Mary Awarded the 2021 SIAG/LA Early Career Prize

Dr Theo Mary, a CNRS researcher at Sorbonne University (Paris) and a former postdoctoral researcher in the Numerical Linear Algebra Group (2018-2019), has been awarded the SIAG/LA Early Career Prize by the SIAM Activity Group on Linear Algebra.

The SIAM Activity Group on Linear Algebra (SIAG/LA) awards the SIAG/LA Early Career Prize every three years to one post-PhD early career researcher in the field of applicable linear algebra for outstanding contributions to the field within six years of receiving the PhD or equivalent degree as of January 1 of the award year. The selection committee wishes to recognize Theo for his “significant contributions to linear algebra topics, including block low rank methods, software development, probabilistic rounding error analysis, mixed precision arithmetic, and backward error analysis.”

The prize will be awarded at the 2021 SIAM Conference on Applied Linear Algebra (LA21), to be held on a virtual platform May 17-21, 2021.

## Nick Higham Awarded the IMA Gold Medal 2020

Professor Nick Higham has been awarded the IMA Gold Medal 2020 by the Institute of Mathematics and its Applications. Institute Gold Medals are awarded every two years in recognition of outstanding contributions to mathematics and its applications.

This is the second such success for a member of the Department of Mathematics: Professor Fritz Ursell won the IMA Gold Medal 1994.

The full prize citation is available here.

Professor Nicholas J. Higham, University of Manchester

## Stefan Güttel awarded the 2021 SIAM James H. Wilkinson Prize

Dr Stefan Güttel will receive the 2021 James H. Wilkinson Prize in Numerical Analysis and Scientific Computing. This prestigious prize, established in 1979, is awarded every four years by the Society for Industrial and Applied Mathematics (SIAM). SIAM is the world’s largest professional association devoted to applied mathematics with over 14,000 individual members.

Dr Güttel’s research is in computational mathematics, and in particular, in efficient numerical algorithms for high-dimensional problems. The prize recognizes his contributions to the analysis, implementation, and application of rational and block Krylov methods. These methods have recently become very popular for the efficient solution of large eigenvalue problems, matrix equations, and in model order reduction.

Dr Güttel’s work is applicable to a wide range of problems in science and engineering. He maintains active collaborations with other Departments and companies such as Arup, Autotrader, AspenTech, N Brown Group, and Schlumberger-Doll Research. He also holds a fellowship with the Alan Turing Institute, the UK’s national institute for data science and artificial intelligence.

Dr Güttel will deliver a prize lecture at the 2021 SIAM Conference on Computational Science and Engineering (CSE21), to be held on a virtual platform in March 2021.

## NLA Group Articles Amongst Most Read In SIAM Journals

Sixteen publications by members of the NLA group feature among the top 40 most read articles in two SIAM journals, both leading venues for publications in numerical linear algebra.

The following articles are amongst the 20 most read in the SIAM Journal on Matrix Analysis and Applications:

The following articles are amongst the 20 most read in the SIAM Journal on Scientific Computing:

The full lists (updated daily) are available here and here

## CPFloat: A C library for emulating low-precision arithmetic

Computing machines with floating-point capabilities started appearing in the late 1940s, and despite being regarded as an optional feature for decades to come, floating-point hardware equipped a number of the mainframe computers produced in the 1960s and 1970s. At the time, there was no consensus on what features a number system ought to have, and the arithmetics implemented by any two vendors could differ widely in terms of word sizes, floating-point representations, rounding behaviour, and accuracy of the operations implemented in hardware. This lack of standardization was a major hindrance to the portability of the source code, as it made it significantly harder to write and maintain architecture-independent code.

The way out of this predicament was the establishment of the IEEE 754-1985 standard, which dictated word sizes, representations, and accuracy requirements for two default floating-point formats: single and double precision. For about three decades, hardware vendors conformed with the standard, and the large majority of general-purpose processing units came equipped with IEEE-compliant arithmetic units. Motivated by the low-precision requirements of machine learning applications, however, in recent years hardware manufacturers have started exploring the realm of low-precision arithmetics, commercialising a variety of highly optimised hardware units that pursue efficiency at the expense of accuracy. The high-performance computing community has attempted to channel the extreme performance of this hardware and utilise it in more traditional scientific computing applications, where more stringent accuracy needs typically require the use of higher precision.

The broad range of different formats available poses a serious challenge to those trying to develop mixed-precision algorithms, since studying the numerical behaviour of an algorithm in different working precisions may require access to a number of high-end hardware platforms, including supercomputer-grade CPUs and GPUs. In order to alleviate the problem, several software packages for simulating low-precision floating-point arithmetics in software have been developed. Some of these are described in the table below. All these solutions execute each arithmetic operation in hardware, and use the software layer only to round the results to the desired number of significant bits of precision.

If the hardware and the software-simulated floating-point format are both IEEE compliant, rounding requires only standard mathematical library functions, but a straightforward implementation of the rounding formulae may potentially cause two kinds of issues. From a theoretical point of view, handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly when evaluated in finite arithmetic. In terms of performance, on the other hand, the algorithms obtained in this way are not necessarily efficient, as they build upon library functions that, being designed to handle a broad range of cases, might not be optimised for the specific needs of floating-point rounding functions.

In MATLAB, low-precision arithmetics can be simulated using the chop function, developed by Higham and Pranesh and discussed in a previous post. Mantas Mikaitis and I decided to port the functionalities of this MATLAB implementation to a lower level language, in order to obtain better performance. The scope of the project has progressively broadened, and we have recently released the first version of CPFloat, an open-source C library that offers a wide range of routines for rounding arrays of binary32 or binary64 numbers to lower precision. Our code exploits the bit-level representation of the underlying floating-point formats, and performs only low-level bit manipulation and integer arithmetic without relying on costly library calls.

The project is hosted in a GitHub repository where, along with the C code, we provide a MEX interface for MATLAB and Octave which is fully compatible with the chop function. The algorithms that underlie our implementation, as well as the design choices and the testing infrastructure of CPFloat are discussed in detail in a recent EPrint. In the numerical experiments shown there, CPFloat brings a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, our library is currently the most efficient and complete software for experimenting with custom low-precision floating-point arithmetic available in any language.

## Simulating Stochastically Rounded Floating-Point Arithmetic Efficiently

Most general-purpose computing devices implement floating-point arithmetic defined in the IEEE 754-1985 standard (revised in 2008 and 2019) [1]. The standard provides various compulsory and recommended operations in order to prepare the hardware for as wide a space of applications as possible, as well as to assure reproducibility of numerical calculations across different devices. Hardware that is IEEE 754-compliant usually includes at least addition/subtraction, multiplication, division and square root operations. Each of these can be performed with various rounding modes, chosen by the user from round-to-nearest ties-to-even, round-toward-zero, round-toward-infinity, and round-toward-minus-infinity. Round-to-nearest provides the lowest error bound possible and thus is the default choice for most, but the other rounding modes find uses in various applications as well.

Given hardware with the five arithmetic operations and four rounding modes described above, how can we simulate other rounding modes efficiently when needed? Rounding modes that are not included in the IEEE 754 standard but are of interest in various research domains are, for example: round-to-odd [2], round-to-nearest ties-to-zero [3], stochastic rounding with 50/50 chance of rounding to either of the neighbouring values [4], or stochastic rounding with probabilities proportional to the distances between the exact value and the neighbouring floating-point values [5, 6]. The last of these is also used in fixed-point arithmetic: for example in machine learning [7] and ordinary differential equation (ODE) solvers [8].

In a recent preprint with Massimiliano Fasi [9] we address the problem of simulating the proportional variant of stochastic rounding of operations in precision $k$, when there is no higher precision available in hardware (therefore $k$ is the working precision as well as the target precision for rounding). This rounding mode requires the round-off bits captured to be used in calculating the probabilities, but they are lost in IEEE 754-compliant operations of present hardware devices—an operation is performed and only the rounded result is returned in a register. A straightforward approach would be to perform operations in higher precision than $k$ and derive the probabilities from the high precision answer. However, this approach can result in low efficiency code because we would need to use software implementation of higher precision arithmetic operations; for example the MPFR package or other similar software packages developed for the purpose of computing in higher precisions than the working precision.

Our solution is to use error-free floating-point transformations [10] for addition/subtraction and multiplication, and similar transformations (yet not error free) for the division and square root, to compute the distances between the exact number and the two closest floating-point values. This can be done by having the original arguments and the rounded answer from the hardware operation, and knowing what rounding mode was used for producing it. Our algorithms presented in the paper cover two possible cases: when the rounding mode of the hardware operations is round-to-nearest ties-to-even (default) or a combination of the other three rounding modes. The former strategy results in slightly more complicated algorithms since round-to-nearest can round to both sides of the exact number, but the latter is slower due to costs associated with changing the rounding modes—this is true at least with Intel processors which we used for testing.

In simulating stochastically rounded binary64 (double precision) arithmetic operations on a computer equipped with IEEE 754-compliant operations in that precision, our algorithms achieved between 7.3 and 19 times higher throughput of arithmetic operations compared with the simulation that uses MPFR (Table 1), while not making any compromises in accuracy. In the preprint we also evaluate our algorithms in various applications where stochastic rounding is expected to bring advantages: summation of floating-point numbers and ODE solvers. In summary, our algorithms can provide significant speedup for programs that use stochastic rounding operations in working precision.

It is worth noting that the algorithms explored in [9] should not be used if higher precision than the target precision (the precision of which stochastic rounding we would like to simulate) is available in hardware. In such cases, the higher precision should be used if possible, and the high precision results from arithmetic operations should be rounded stochastically to the target precision using different algorithms: such as the ones used in the MATLAB chop [11] and CPFloat [12] packages (a separate blog post on CPFloat can be found here).

### References

[1] 754-2019 – IEEE Standard for Floating-Point Arithmetic. July 2019.
[2] Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. S. Boldo and G. Melquind. IEEE Trans. Comp., February 2008.
[3] Emulating Round-to-Nearest-Ties-to-Zero “augmented” Floating-Point Operations Using Round-to-Nearest-Ties-to-Even Arithmetic. S. Boldo, C. Q. Lauter, and J.-M. Muller. IEEE Trans. Comp., June 2020.
[4] CADNA: A library for estimating round-off error propagation. F. Jézéquel and J.-M. Chesneaux. Comp. Phys. Comm., 2008.
[5] Stochastic Rounding and its Probabilistic Backward Error Analysis. M. Connolly, N. J. Higham, and T. Mary. April 2020 (updated August 2020). To appear in SIAM J. Sci. Comp.
[6] Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision. M. Croci and M. B. Giles. October 2020.
[7] Deep Learning with Limited Numerical Precision. S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Int. Conf. Machin. Learn., 2015.
[8] Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. M. Hopkins, M. Mikaitis, D. R. Lester, and S. B. Furber. Phil. Tran. A. Roy. Soc., January, 2020.
[9] Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. M. Fasi and M. Mikaitis. April 2020 (updated October 2020).
[10] Handbook of Floating-Point Arithmetic. J.-M. Muller, N. Brunie, F. de Dinechin, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. 2nd ed. Birkhäuser Basel, 2018.
[11] Simulating Low Precision Floating-Point Arithmetic. N. J. Higham and S. Pranesh. SIAM J. Sci. Comp., October 2019.
[12] CPFloat: A C library for emulating low-precision arithmetic. M. Fasi and M. Mikaitis. October 2020.

### Related articles

What Is Stochastic Rounding? by Nicholas J. Higham
What Is IEEE Standard Arithmetic? by Nicholas J. Higham
What Is Floating-Point Arithmetic? by Nicholas J. Higham
What Is Rounding? by Nicholas J. Higham

## Parallel sparse Linear solvers: Double versus Single precision

It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems. Existing work on the performance of mixed precision algorithms for solving sparse linear systems has shown that a speedup up to 2x can be expected, but these studies are limited to single core experiments. In this EPrint, Nick Higham, Françoise Tisseur, Craig Lucas and I evaluate the benefits of using single precision arithmetic in solving a double precision sparse linear systems on multi-core processors and GPUs, focusing on the key components of LU factorization and matrix–vector products.

We find that single precision sparse LU factorization is prone to a severe loss of performance due to the intrusion of subnormal numbers. This phenomenon of LU factorization generating subnormal numbers does not appear to have been observed before. In fact, the elements at the $(k +1)st$ stage of Gaussian elimination are generated from the formula $a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik} a_{kj}^{(k)}, \quad m_{ik} = a_{ik}^{(k)}/a_{kk}^{(k)}$, where $m_{ik}$ is a multiplier. If $A$ is a dense matrix of normalized floating-point numbers with norm of order 1, it is extremely unlikely that any of the $a_{ij}^{(k)}$ will become subnormal. However, for sparse matrices we can identify a mechanism whereby fill-in cascades down a column and small multipliers combine multiplicatively. This mechanism is described, with an illustration for better understanding, in the EPrint, as well as strategies to automatically flush subnormals to zero, to avoid the performance penalties.

The figure below shows the speedup of single precision sparse LU factorization over double precision using the sparse direct solver MUMPS on 10 Intel Skylake cores. We used 36 sparse matrices from the SuiteSparse Matrix Collection, from which 21 matrices are of medium size with 700, 000 to 5, 000, 000 nonzero elements, and 15 larger matrices with 7,000,000 to 64,000,000 nonzeros. The orange bars represent the speeds achieved when subnormals are not flush to zero (FTZ) during the single precision LU factorization. For approximately 30% of the matrices, the orange bars are below the threshold of 1. This means instead of accelerating the computation, single precision slows it down compared with the double sparse LU decomposition. This has been corrected by flushing subnormals to zero as illustrated by the blue bars.

As for the overall performance, our experiments show that the anticipated speedup of 2x over a double precision LU factorization is obtained only for the very largest of our test problems. In the figure above, the matrices that exceed the speedup of 1.5x are exclusively of very large size. This result is not atypical, as two other sparse direct solvers considered in our work, PARDISO and SuperLU show a similar result. This contrasts with dense linear systems, where a 2x speedup is often achieved even for matrices of size as small as $200 \times 200$ when single precision is used in place of double precision.

The speedups observed in our parallel experiments are clearly lower than the speedups reported in the existing works based on single core experiments. To understand this contrast, it is important to recall that sparse matrix factorization algorithms have a pre-processing step called reordering and analysis, before the actual numerical factorization. In most of the sparse direct solvers, the reordering and analysis step is serial or has a limited parallelism. Therefore, while the time spent in numerical factorization step decreases with increasing number of cores, the reordering and analysis time stagnates and can becomes a performance bottleneck. In addition, as the reordering and analysis step does not involve floating point computation, lowering the precision, from double to single, has a little impact on the speedup except in the case of very large matrices where the numerical factorization step remains dominant despite the use of multiple cores.

For iterative solvers, our results show that the performance gain in computing or applying incomplete factorization preconditioners in single precision is not appealing enough to justify the accuracy sacrifice, but we have observed a speedup of 1.5x from matrix–vector product kernels by using single precision. In future work, we will explore new approaches to integrate efficiently single precision matrix–vector product kernels and single precision preconditioners in double precision iterative solvers without accuracy loss.

## Mixed Precision LU Factorization on GPU Tensor Cores

Modern GPUs are equipped with mixed precision units called tensor cores that offer the capability of computing matrix–matrix products both at very high performance and with high accuracy. GPU tensor cores have been used to accelerate various numerical linear algebra algorithms. Among these, LU factorization is a natural candidate, since it can be written so as to heavily rely on matrix–matrix operations.

Haidar et al. were the first to propose an algorithm to compute the LU factorization of a general dense matrix exploiting GPU tensor cores, and obtained up to a 4x-5x speedup in the solution of various linear systems. Moreover, they also showed that the accuracy boost delivered by tensor cores could significantly improve the convergence of iterative methods preconditioned by the computed LU factors.

In a recent paper, Pierre Blanchard, Nick Higham, Florent Lopez, Srikara Pranesh, and I performed the rounding error analysis of an LU factorization algorithm exploiting what we called block FMAs—a generalization of tensor cores that also includes similar mixed precision units such as Google’s TPUs. We explained how block FMAs can be chained together to accumulate the result of intermediate computations in the higher precision (single precision for tensor cores), which allows for a significant accuracy boost.

However, these algorithms shared a common limitation: the matrix was stored in single, rather than half, precision. Indeed, if the matrix is stored in half precision, the accuracy boost is lost and the computed LU factors are essentially identical to those obtained by a standard LU factorization run entirely in half precision. This represents a lost opportunity, because storing the matrix in half precision not only halves the memory footprint, but also greatly reduces data movement, which brings a significant performance boost.

In a recent preprint, Florent Lopez and I propose a new LU factorization algorithm that is able to store the matrix in half precision without incurring such a significant accuracy loss. There are two main ingredients to this new algorithm. First, we switch from the standard right-looking scheme to a left-looking one, which, together with single precision buffers of controlled size, allows us to perform the update operations accurately. Second, we doubly partition the matrix so as to accurately perform the panel factorizations by exploiting tensor cores.

In our experiments, we factorize a range of test matrices using an NVIDIA V100 GPU. The two figures below summarize our main results by comparing three LU factorization algorithms: the standard algorithm (of Haidar et al. and Blanchard et al.) with the matrix stored in either single or half precision, and the new algorithm.

The figure on the left reports the backward error obtained by solving a linear system $Ax=b$ for a dense $n\times n$ matrix with random entries. The figure on the right displays the performance in GigaFLOPS. Unlike the standard algorithm, the new one does not incur a significant loss of accuracy by storing the matrix in half precision. In doing so, it greatly reduces data movement which leads to significant speedup: the new algorithm can be up to twice faster than the standard one.