## Simulating Stochastically Rounded Floating-Point Arithmetic Efficiently

Most general-purpose computing devices implement floating-point arithmetic defined in the IEEE 754-1985 standard (revised in 2008 and 2019) . The standard provides various compulsory and recommended operations in order to prepare the hardware for as wide a space of applications as possible, as well as to assure reproducibility of numerical calculations across different devices. Hardware that is IEEE 754-compliant usually includes at least addition/subtraction, multiplication, division and square root operations. Each of these can be performed with various rounding modes, chosen by the user from round-to-nearest ties-to-even, round-toward-zero, round-toward-infinity, and round-toward-minus-infinity. Round-to-nearest provides the lowest error bound possible and thus is the default choice for most, but the other rounding modes find uses in various applications as well.

Given hardware with the five arithmetic operations and four rounding modes described above, how can we simulate other rounding modes efficiently when needed? Rounding modes that are not included in the IEEE 754 standard but are of interest in various research domains are, for example: round-to-odd , round-to-nearest ties-to-zero , stochastic rounding with 50/50 chance of rounding to either of the neighbouring values , or stochastic rounding with probabilities proportional to the distances between the exact value and the neighbouring floating-point values [5, 6]. The last of these is also used in fixed-point arithmetic: for example in machine learning  and ordinary differential equation (ODE) solvers .

In a recent preprint with Massimiliano Fasi  we address the problem of simulating the proportional variant of stochastic rounding of operations in precision $k$, when there is no higher precision available in hardware (therefore $k$ is the working precision as well as the target precision for rounding). This rounding mode requires the round-off bits captured to be used in calculating the probabilities, but they are lost in IEEE 754-compliant operations of present hardware devices—an operation is performed and only the rounded result is returned in a register. A straightforward approach would be to perform operations in higher precision than $k$ and derive the probabilities from the high precision answer. However, this approach can result in low efficiency code because we would need to use software implementation of higher precision arithmetic operations; for example the MPFR package or other similar software packages developed for the purpose of computing in higher precisions than the working precision.

Our solution is to use error-free floating-point transformations  for addition/subtraction and multiplication, and similar transformations (yet not error free) for the division and square root, to compute the distances between the exact number and the two closest floating-point values. This can be done by having the original arguments and the rounded answer from the hardware operation, and knowing what rounding mode was used for producing it. Our algorithms presented in the paper cover two possible cases: when the rounding mode of the hardware operations is round-to-nearest ties-to-even (default) or a combination of the other three rounding modes. The former strategy results in slightly more complicated algorithms since round-to-nearest can round to both sides of the exact number, but the latter is slower due to costs associated with changing the rounding modes—this is true at least with Intel processors which we used for testing. Table 1: Throughput (in Mop/s) of the C implementations of the algorithms discussed in the paper . The parameter p represents the number of significant digits in the fraction of the MPFR numbers being used; algorithms that do not use MPFR have a missing value in the corresponding row. The baseline for the speedup is the mean thoughtput of the MPFR variant that uses 113 bits to perform the same operation.

In simulating stochastically rounded binary64 (double precision) arithmetic operations on a computer equipped with IEEE 754-compliant operations in that precision, our algorithms achieved between 7.3 and 19 times higher throughput of arithmetic operations compared with the simulation that uses MPFR (Table 1), while not making any compromises in accuracy. In the preprint we also evaluate our algorithms in various applications where stochastic rounding is expected to bring advantages: summation of floating-point numbers and ODE solvers. In summary, our algorithms can provide significant speedup for programs that use stochastic rounding operations in working precision.

It is worth noting that the algorithms explored in  should not be used if higher precision than the target precision (the precision of which stochastic rounding we would like to simulate) is available in hardware. In such cases, the higher precision should be used if possible, and the high precision results from arithmetic operations should be rounded stochastically to the target precision using different algorithms: such as the ones used in the MATLAB chop  and CPFloat  packages (a separate blog post on CPFloat can be found here).

### References

 754-2019 – IEEE Standard for Floating-Point Arithmetic. July 2019.
 Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. S. Boldo and G. Melquind. IEEE Trans. Comp., February 2008.
 Emulating Round-to-Nearest-Ties-to-Zero “augmented” Floating-Point Operations Using Round-to-Nearest-Ties-to-Even Arithmetic. S. Boldo, C. Q. Lauter, and J.-M. Muller. IEEE Trans. Comp., June 2020.
 CADNA: A library for estimating round-off error propagation. F. Jézéquel and J.-M. Chesneaux. Comp. Phys. Comm., 2008.
 Stochastic Rounding and its Probabilistic Backward Error Analysis. M. Connolly, N. J. Higham, and T. Mary. April 2020 (updated August 2020). To appear in SIAM J. Sci. Comp.
 Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision. M. Croci and M. B. Giles. October 2020.
 Deep Learning with Limited Numerical Precision. S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Int. Conf. Machin. Learn., 2015.
 Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. M. Hopkins, M. Mikaitis, D. R. Lester, and S. B. Furber. Phil. Tran. A. Roy. Soc., January, 2020.
 Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. M. Fasi and M. Mikaitis. April 2020 (updated October 2020).
 Handbook of Floating-Point Arithmetic. J.-M. Muller, N. Brunie, F. de Dinechin, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. 2nd ed. Birkhäuser Basel, 2018.
 Simulating Low Precision Floating-Point Arithmetic. N. J. Higham and S. Pranesh. SIAM J. Sci. Comp., October 2019.
 CPFloat: A C library for emulating low-precision arithmetic. M. Fasi and M. Mikaitis. October 2020.

### Related articles

What Is Stochastic Rounding? by Nicholas J. Higham
What Is IEEE Standard Arithmetic? by Nicholas J. Higham
What Is Floating-Point Arithmetic? by Nicholas J. Higham
What Is Rounding? by Nicholas J. Higham

## Numerical Behaviour of Tensor Cores

For many years, the arithmetic operations available on most hardware were $+, -, *, /, \sqrt{}$. More recently ( $\sim 20$ years ago) the FMA (fused multiply-add) operation also became prevalent in general purpose devices such as CPUs and GPUs. Software builds on top of these operations, for example compilers use a series of these and other hardware instructions to implement more complex functions and algorithms such as the exponential function, matrix multiply, or the algorithms for pseudo random number generation. These arithmetic operations have been standardized by the IEEE 754 floating-point standard since 1985 and most current systems are compliant with it.

Recently, because of the increasing adoption of machine learning, general purpose devices started to include inner product or matrix multiply-accumulate (MMA) operations in hardware. This is a generalization of a scalar FMA to vectors and matrices. Since it is performed in hardware, the expected speedup is achieved due to parallelism — instead of using a few FMA hardware units sequentially to multiply matrices as is the case in software, all of the elements of a matrix of some size are computed in parallel. Using the inner product and matrix multiply-accumulate operations in hardware, compute-bound applications that have high-intensity usage of them are sped up significantly. Figure 1: Mixed-precision matrix muliply-accumulate operation on 4×4 matrices performed by the tensor cores.

NVIDIA GPUs are widely used for machine learning and other applications. In the latest TOP500 supercomputer list published in June 2020 , 112 computers are equipped with NVIDIA graphics cards. The main feature of the recent NVIDIA GPUs is hundreds of arithmetic units for performing the MMA operation — NVIDIA calls these tensor cores. Various applications outside machine learning are being explored in order to utilize very high arithmetic throughput that can be achieved when using MMAs in hardware . Table 1 lists three recent NVIDIA architectures as well as other hardware devices with an MMA operation on chip. The NVIDIA V100 has the first version and the T4 has the second version of tensor cores; in these devices tensor cores work on matrices of size $4 \times 4$ in mixed-precision of fp16 and fp32, as shown in Figure 1. The most recent revision of tensor cores in the A100 updated both the precisions available and the dimensions of input/output matrices. See the NVIDIA V100 and NVIDIA A100 whitepapers for more details. Table 1: Processing units or architectures equipped with mixed-precision matrix multiply-accumulate accelerators. The notation refers to the matrix multiply-accumulate operation where and are matrices, and and have size and , respectively.

The MMA operation is not standardized by the IEEE 754, therefore various numerical features of tensor cores are not clear. Knowing such features as the support for subnormal numbers, order of operations, and rounding modes and normalization of significands in various parts of the MMA computation can be important, for example when doing error analysis of algorithms that utilize MMAs in order to derive error bounds . As the application space utilizing tensor cores is growing beyond machine learning, understanding of the numerical behaviour of tensor cores will become increasingly useful. Even before such hardware is put to use in some applications, one might want to simulate the behaviour of tensor cores in order to develop numerical codes targeted to tensor cores, on a conventional hardware. Furthermore, there is also a question of differences in numerical behaviour between tensor cores and software MMA using the standard FMA hardware calls, for example differences that would appear when transitioning the existent software to use tensor cores.

Motivated by this, in our recent preprint  we investigate an experimental testing methodology of the tensor cores. The method consists of the following steps.

1. Identify a numerical feature that needs to be tested, for example, the rounding mode in the 5-operand adders that accumulate products in MMA.
2. Identify possible implementations, for example round to nearest or round toward zero.
3. Find floating-point inputs that would result in different outputs for each possible hardware behaviour in 2.
4. Observe outputs and make a conclusion based on expected outputs for each possible behaviour in 2.

Using this approach, we identified the following list of numerical features of the NVIDIA tensor cores.

• Subnormal floating-point numbers are fully supported, both on the inputs and outputs.
• 5-operand adders accumulate 5 addends (4 products from $AB$ and a value from $C$) starting from the largest in magnitude.
• Round toward zero, rather than round to nearest (default in IEEE 754-compliant arithmetic), in the 5-operand adders is implemented.
• Different normalization behaviour from the MMA implemented in software (tensor cores normalize the answer of the whole 5-element dot product at the end rather than after each addition of products).
• Inner products without intermediate normalization are shown to be non-monotonic in rare cases (this result is more general than tensor cores, since to the best of our knowledge, most hardware implementations do not normalize on every addition due to lower hardware costs).

Our conclusion is that in the current version, the tensor cores on V100 and T4 (the A100 is not yet available to us) do not replicate the behaviour of the MMA implemented with IEEE 754 compliant FMA hardware operations. These numerical behaviours are expected in a hardware MMA optimized for reducing hardware cost and most likely is motivated by a fact that machine learning applications usually are claimed to not need all the IEEE 754 features and high precision in general. These results provide the parameters that can be used in rounding error analysis of tensor cores  which can be useful when developing numerical software.

Our CUDA code to test numerical features of tensor cores is available here.

### References

 A. Abdelfattah et al. A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic. July 2020. Published online.

 P. Blanchard, N. J. Higham , F. Lopez, T. Mary, and S. Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput. May 2020.

 M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. Numerical Behavior of NVIDIA Tensor Cores. July 2020; revised October 2020. MIMS Eprint, published online.

## Issues with Rounding in the GCC Implementation of the ISO 18037:2008 Standard Fixed-Point Arithmetic

Embedded systems are based on low-power, low-performance processors and can be found in various medical devices, smart watches, various communication devices, cars, planes, mobile phones and many other places. These systems come in a hardware and software package optimized for specific computational tasks and most commonly have real-time constraints. As these systems usually have energy usage and cost constraints too, sophisticated numerical hardware that can process floating-point data is not included, but rather only integer arithmetic, which is simpler in terms of area and power of the processors.

ISO 18037:2008 is a standard for embedded C programming language support. It lays out various rules that C compilers should support to make embedded systems easier to program using a high-level language. One of the most important definitions in this standard is fixed-point arithmetic data types and operations. Support for fixed-point arithmetic is highly desirable, since if it is not provided integers with scaling factors have to be used, which makes code hard to maintain and debug and most commonly requires assembler level changes or completely new implementations for each different platform.

The GCC compiler provides some support of the fixed-point arithmetic defined in this standard for ARM processors. However, in my recent technical report (https://arxiv.org/abs/2001.01496) I demonstrated various numerical pitfalls that programmers of embedded systems based on ARM and using GCC can get into. The issues demonstrated include

• larger than half machine epsilon errors in rounding decimal constants to fixed-point data types,
• errors in conversions between different data types,
• incorrect pre-rounding of arguments of mixed-format arithmetic operations before the operation is performed, and
• lack of rounding of the outputs of arithmetic operations.

These findings can be used to improve the accuracy of various embedded numerical libraries that might be using this compiler. To demonstrate one of the issues, here is a piece of test code:

The multiplication operation is a mixed-format operation, since it multiplies an unsigned long fract argument with an accum argument, therefore it is subject to prerounding of the unsigned long fract argument as described in the report. Since the comparison step in the if () sees that the argument a is larger than zero and b larger than 1, the code is executed with a hope that c will not be set to zero. However, in the arithmetic operation, a is incorrectly pre-rounded to 0, which causes c = 0*b, an unexpected outcome and a bug that is hard to detect and fix.