## Stochastic Rounding Has Unconditional Probabilistic Error Bounds

In the IEEE standard 754 for binary floating-point arithmetic, four rounding modes are defined.

• Round to nearest.
• Round towards $0$.
• Round towards $+\infty$.
• Round towards $-\infty$.

Recently stochastic rounding, an old idea that dates back to the beginning of the digital computer era, has gained popularity, most notably in deep learning. While the rounding modes defined in the IEEE standard are deterministic, stochastic rounding is inherently random. We can define two modes of stochastic rounding. Consider the figure below, where we have a real number $x$ and adjacent floating-point numbers $a$ and $b$. In what we call mode 1 stochastic rounding, we round $x$ to either $a$ or $b$ with equal probability. In mode 2 stochastic rounding, we round to $a$ (or $b$) with probability proportional to $1$ minus the distance of $x$ from $a$ (or $b$). So in the example shown, for mode 2 we are more likely to round to $b$ than to $a$.

In our recent EPrint Stochastic Rounding and its Probabilistic Backward Error Analysis, Nick Higham, Theo Mary and I generalized previous probabilistic error analysis results [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835] to stochastic rounding.

We show that stochastic rounding (specifically mode 2) has the property of producing rounding errors $\delta_k$ that are random variables satisfying

• mean zero: $\mathbb{E}(\delta_k) = 0$,
• mean independent: $\mathbb{E}(\delta_k \mid \delta_{k-1}, \dots, \delta_1) = \mathbb{E}(\delta_k)$.

Here, $\mathbb{E}$ denotes the expectation. A key consequence of these results is that we can replace the worst case error bound proportional to $\gamma_n = nu + O(u^2)$, ubiquitous in backward error analyses, by a more informative probabilistic bound proportional to $\widetilde{\gamma}_n = \sqrt{n}u + O(u^2)$. What is a rule of thumb for round to nearest becomes a rule for stochastic rounding: it is proved that our rounding errors satisfy the above properties.

In the figure below, we compute the inner product $s = x^Ty$ of vectors sampled uniformly from $[0,1]$ in both round to nearest and stochastic rounding in IEEE half precision arithmetic. Shown is the backward error for each value of $n$ and the bounds $\gamma_n$ and $\widetilde{\gamma}_n$.

For increasing problem size $n$, with round to nearest the error no longer satisfies the $\sqrt{n}u$ bound. This is due to the phenomenon we call stagnation, and low precisions are particularly susceptible to it. As we compute recursively $s_i = s_{i-1} + x_iy_i$, eventually the partial sum $s_{i-1}$ can grow so large that the update $x_iy_i$ is less than half of the spacing of floating-point numbers around $s_{i-1}$ and under round to nearest the partial sum does not increase. This means we produce negative rounding errors, which are of course not mean zero. Stochastic rounding avoids this issue by rounding randomly. In fact we can prove that $\mathbb{E}(\widehat{s}) = s$, that is, the expected value of the computed result under stochastic rounding is the exact result.

## Extreme-Scale Test Matrices with Specified 2-Norm Condition Number

The supercomputers in the TOP500 list are ranked using the High-Performance Linpack (HPL) Benchmark, which gauges the performance of a computer by solving a large dense linear system by Gaussian elimination with partial pivoting. The size of the coefficient matrix depends on the computational power of the machine being assessed, because more powerful systems require larger benchmark problems in order to reach their peak performance.

The test matrices used by HPL have random entries uniformly distributed on the interval $(-1/2,1/2]$. The 2-norm condition number of such matrices depends on their size, and can potentially be very large for the matrices required by today’s most powerful computers: the largest linear systems solved on Summit, the machines that leads the November 2019 TOP500 ranking, have order $10^7$, and even larger systems will be needed to benchmark the coming generations of exascale systems.

An $n \times n$ matrix with specified 2-norm condition number can be generated as $A := U \varSigma V^T$, where $U$ and $V$ are random real orthogonal $n \times n$ matrices from the Haar distribution (a natural uniform probability distribution on the orthogonal matrices) and $\varSigma$ is diagonal with nonnegative entries $\sigma_1\ge \cdots \ge \sigma_n \ge 0$. It is well known that $A$ has 2-norm condition number $\kappa_2(A) = \sigma_1/\sigma_n$ if $\sigma_n \neq 0$ and $\kappa(A) = \infty$ otherwise. This technique, which is used by the gallery('randsvd', ...) function, requires $2n^3$ floating-point operations to generate a test matrix of order $n$, and would thus be impractical in an extreme-scale setting.

In our recent EPrint Generating extreme-scale matrices with specified singular values or condition numbers, Nick Higham and I present four methods that, by giving up the requirement that the matrices $U$ and $V$ be from the Haar distribution, reduce the cost of the approach above from cubic to quadratic. The matrices generated retain a number of desirable properties that make them a suitable choice for testing linear solvers at scale.

These cheaper algorithms are particularly well suited to distributed-memory environments, since all communication between the nodes can be avoided at the price of a negligible increase in the overall computational cost. They are appealing to MATLAB users too, as the following example demostrates.

n = 10000; kappa = 1e6; mode = 2; rng(1)
% gallery('randsvd',...)
fprintf('gallery(''randsvd'',...):                %5.2f seconds elapsed.\n',...
timeit(@()gallery('randsvd',n,kappa,mode,[],[],1)));
% Algorithm 3.1 in our EPrint.
method = 1; matrix = 0;
fprintf('Alg. 3.1 with Haar U:                  %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 3.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
% Algorithm 4.1 in our EPrint.
method = 3; matrix = 0;
fprintf('Alg. 4.1 with Haar U:                  %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 4.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
timeit(@()randsvdfast(n,kappa,mode,method,matrix)));


In the listing above, randsvdfast is an implementation of our algorithms available on the MATLAB Central File Exchange. Setting the argument matrix to 0 tells our function to pick $U$ from the Haar distribution, whereas setting it to 2 causes $U$ to be chosen in a non-random way. Running this code in MATLAB 2019b on a machine equipped with an Intel processor I5-6500 running at 3.20GHz the script produces the output

gallery('randsvd',...):                79.52 seconds elapsed.
Alg. 3.1 with Haar U:                  19.70 seconds elapsed.
Alg. 3.1 with U=gallery('orthog',n,2):  1.90 seconds elapsed.
Alg. 4.1 with Haar U:                  19.28 seconds elapsed.
Alg. 4.1 with U=gallery('orthog',n,2):  1.43 seconds elapsed.


Therefore, for $n = 10{,}000$ randsvdfast is up to 56 times faster than gallery('randsvd',...).

## Sharper Probabilistic Backward Error Bounds

Most backward error bounds for numerical linear algebra algorithms are of the form $nu$, for a machine precision $u$ and a problem size $n$. The dependence on $n$ of these bounds is known to be pessimistic: together with Nick Higham, our recent probabilistic analysis [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], which assumes rounding errors to be independent random variables of mean zero, proves that $n$ can be replaced by a small multiple of $\sqrt{n}$ with high probability. However, even these smaller bounds can still be pessimistic, as the figure below illustrates.

The figure plots the backward error for summation (in single precision) of $n$ floating-point numbers randomly sampled from a uniform distribution. For numbers in the $[0,1]$ distribution, the bound $\sqrt{n}u$ is almost sharp and accurately predicts the error growth. However, for the $[-1,1]$ distribution, the error is much smaller, seemingly not growing with $n$. This strong dependence of the backward error on the data cannot be explained by the existing bounds, which do not depend on the values of the data.

In our recent preprint, we perform a new probabilistic analysis that combines a probabilistic model of the rounding errors with a second probabilistic model of the data. Our analysis reveals a strong dependence of the backward error on the mean of the data $\mu$: indeed, our new backward error bounds are proportional to $\mu\sqrt{n}u + u$. Therefore, for data with small or zero mean, these new bounds are much sharper as they bound the backward error by a small multiple of the machine precision independent of the problem size $n$.

Motivated by this observation, we also propose new algorithms that transform the data to have zero mean, so as to benefit from these more favorable bounds. We implement this idea for matrix multiplication and show that our new algorithm can produce significantly more accurate results than standard matrix multiplication.

## Issues with Rounding in the GCC Implementation of the ISO 18037:2008 Standard Fixed-Point Arithmetic

Embedded systems are based on low-power, low-performance processors and can be found in various medical devices, smart watches, various communication devices, cars, planes, mobile phones and many other places. These systems come in a hardware and software package optimized for specific computational tasks and most commonly have real-time constraints. As these systems usually have energy usage and cost constraints too, sophisticated numerical hardware that can process floating-point data is not included, but rather only integer arithmetic, which is simpler in terms of area and power of the processors.

ISO 18037:2008 is a standard for embedded C programming language support. It lays out various rules that C compilers should support to make embedded systems easier to program using a high-level language. One of the most important definitions in this standard is fixed-point arithmetic data types and operations. Support for fixed-point arithmetic is highly desirable, since if it is not provided integers with scaling factors have to be used, which makes code hard to maintain and debug and most commonly requires assembler level changes or completely new implementations for each different platform.

The GCC compiler provides some support of the fixed-point arithmetic defined in this standard for ARM processors. However, in my recent technical report (https://arxiv.org/abs/2001.01496) I demonstrated various numerical pitfalls that programmers of embedded systems based on ARM and using GCC can get into. The issues demonstrated include

• larger than half machine epsilon errors in rounding decimal constants to fixed-point data types,
• errors in conversions between different data types,
• incorrect pre-rounding of arguments of mixed-format arithmetic operations before the operation is performed, and
• lack of rounding of the outputs of arithmetic operations.

These findings can be used to improve the accuracy of various embedded numerical libraries that might be using this compiler. To demonstrate one of the issues, here is a piece of test code:

The multiplication operation is a mixed-format operation, since it multiplies an unsigned long fract argument with an accum argument, therefore it is subject to prerounding of the unsigned long fract argument as described in the report. Since the comparison step in the if () sees that the argument a is larger than zero and b larger than 1, the code is executed with a hope that c will not be set to zero. However, in the arithmetic operation, a is incorrectly pre-rounded to 0, which causes c = 0*b, an unexpected outcome and a bug that is hard to detect and fix.

## Solving Block Low-Rank Linear Systems by LU Factorization is Numerically Stable

In many applications requiring the solution of a linear system $Ax=b$, the matrix $A$ possesses a block low-rank (BLR) property: most of its off-diagonal blocks are of low numerical rank and can therefore be well approximated by low-rank matrices. This property arises for example in the solution of discretized partial differential equations, because the numerical rank of a given off-diagonal block is closely related to the distance between the two subdomains associated with this block. BLR matrices have been exploited to significantly reduce the cost of solving $Ax=b$, in particular in the context of sparse direct solvers such as MUMPS, PaStiX, and STRUMPACK.

However, the impact of these low-rank approximations on the numerical stability of the solution in floating-point arithmetic has not been previously analyzed. The difficulty of such an analysis lies in the fact that, unlike for classical algorithms without low-rank approximations, there are two kinds of errors to analyze: floating-point errors (which depend on the unit roundoff $u$), and low-rank truncation errors (which depend on the the low-rank threshold $\varepsilon$, the parameter controlling the accuracy of the blockwise low-rank approximations). Moreover, the two kinds of error cannot easily be isolated in the analysis, because the BLR compression and factorization stages are often interlaced.

In our recent preprint, with Nick Higham, we present rounding error analysis for the solution of a linear system by LU factorization of BLR matrices. Assuming that a stable pivoting scheme is used, we prove backward stability: the relative backward error is bounded by a modest constant times $\varepsilon$, and does not depend on the unit roundoff $u$ as long as $u$ is safely smaller than $\varepsilon$. This is a very desirable theoretical guarantee that can now be given to the users, who can therefore control the numerical behavior of BLR solvers simply by setting $\varepsilon$ to the target accuracy. We illustrate this key result in the figure below, which shows a strong correlation between the backward error and the $\varepsilon$ threshold for 26 matrices from the SuiteSparse collection coming from a wide range of real-life applications.

## Determinants of Bohemian matrix families

The adjective “Bohemian” was used for the first time in a linear algebra context by Robert Corless and Steven Thornton to describe the eigenvalues of matrices whose entries are taken from a finite discrete set, usually of integers. The term is a partial acronym for “BOunded Height Matrix of Integers”, but the origin of the term was soon forgotten, and the expression “Bohemian matrix” is now widely accepted.

As Olga Taussky observed already in 1960, the study of matrices with integer elements is “very vast and very old”, with early work of Sylvester and Hadamard that dates back to the second half of the nineteenth century. These names are the first two in a long list of mathematicians that worked on what is now known as the “Hadamard conjecture”: for any positive integer $n$ multiple of 4, there exists an $n$ by $n$ matrix $H$, with entries $-1$ and $+1$, such that $HH^T = nI$.

If this is the best-known open problem surrounding Bohemian matrices, it is far from being the only one. During the 3-day workshop “Bohemian Matrices and Applications” that our group hosted in June last year, Steven Thornton released the Characteristic Polynomial Database, which collects the determinants and characteristic polynomials of billions of samples from certain families of structured as well as unstructured Bohemian matrices. All the available data led Steven to formulate a number of conjectures regarding the determinants of several families of Bohemian upper Hessenberg matrices.

Gian Maria Negri Porzio and I attended the workshop, and set ourselves the task of solving at least one of these open problems. In our recent preprint, we enumerate all the possible determinants of Bohemian upper Hessenberg matrices with ones on the subdiagonal. We consider also the special case of families with main diagonal fixed to zero, whose determinants turn out to be related to some generalizations of Fibonacci numbers. Many of the conjectures stated in the Characteristic Polynomial Database follow from our results.

## A Class of Fast and Accurate Summation Algorithms

Summing $n$ numbers is a key computational task at the heart of many numerical algorithms. When performed in floating-point arithmetic, summation is subject to rounding errors: for a machine precision $u$, the error bound for the most basic summation algorithms, such as recursive summation, is proportional to $nu$.

Nowadays, with the growing interest in low floating-point precisions and ever increasing $n$ in applications, such error bounds have become unacceptably large. While summation algorithms leading to smaller error bounds are known (compensated summation is an example), they are computationally expensive.

In our recent preprint, Pierre Blanchard, Nick Higham and I propose a class of fast and accurate summation algorithms called FABsum. We show that FABsum has an error bound of the form $bu+O(u^2)$, where $b$ is a block size, which is independent of $n$ to first order. As illustrated by the figure below, which plots the measured error using single precision as a function of $n$, FABsum can deliver substantially more accurate results than recursive summation. Moreover, FABsum can be easily incorporated in high-performance numerical linear algebra kernels in order to boost accuracy with only a modest drop in performance, as we demonstrate in the paper with the PLASMA library.