Author Archives: Mantas Mikaitis

Numerical Behaviour of Tensor Cores

For many years, the arithmetic operations available on most hardware were +, -, *, /, \sqrt{}. More recently (\sim 20 years ago) the FMA (fused multiply-add) operation also became prevalent in general purpose devices such as CPUs and GPUs. Software builds on top of these operations, for example compilers use a series of these and other hardware instructions to implement more complex functions and algorithms such as the exponential function, matrix multiply, or the algorithms for pseudo random number generation. These arithmetic operations have been standardized by the IEEE 754 floating-point standard since 1985 and most current systems are compliant with it.

Recently, because of the increasing adoption of machine learning, general purpose devices started to include inner product or matrix multiply-accumulate (MMA) operations in hardware. This is a generalization of a scalar FMA to vectors and matrices. Since it is performed in hardware, the expected speedup is achieved due to parallelism — instead of using a few FMA hardware units sequentially to multiply matrices as is the case in software, all of the elements of a matrix of some size are computed in parallel. Using the inner product and matrix multiply-accumulate operations in hardware, compute-bound applications that have high-intensity usage of them are sped up significantly.

Figure 1: Mixed-precision matrix muliply-accumulate operation on 4×4 matrices performed by the tensor cores.

NVIDIA GPUs are widely used for machine learning and other applications. In the latest TOP500 supercomputer list published in June 2020 [1], 112 computers are equipped with NVIDIA graphics cards. The main feature of the recent NVIDIA GPUs is hundreds of arithmetic units for performing the MMA operation — NVIDIA calls these tensor cores. Various applications outside machine learning are being explored in order to utilize very high arithmetic throughput that can be achieved when using MMAs in hardware [2]. Table 1 lists three recent NVIDIA architectures as well as other hardware devices with an MMA operation on chip. The NVIDIA V100 has the first version and the T4 has the second version of tensor cores; in these devices tensor cores work on matrices of size 4 \times 4 in mixed-precision of fp16 and fp32, as shown in Figure 1. The most recent revision of tensor cores in the A100 updated both the precisions available and the dimensions of input/output matrices. See the NVIDIA V100 and NVIDIA A100 whitepapers for more details.

Table 1: Processing units or architectures equipped with mixed-precision matrix multiply-accumulate accelerators. The notation m \times k \times n refers to the matrix multiply-accumulate operation where C and D are m \times n matrices, and A and B have size m \times k and k \times n, respectively.

The MMA operation is not standardized by the IEEE 754, therefore various numerical features of tensor cores are not clear. Knowing such features as the support for subnormal numbers, order of operations, and rounding modes and normalization of significands in various parts of the MMA computation can be important, for example when doing error analysis of algorithms that utilize MMAs in order to derive error bounds [3]. As the application space utilizing tensor cores is growing beyond machine learning, understanding of the numerical behaviour of tensor cores will become increasingly useful. Even before such hardware is put to use in some applications, one might want to simulate the behaviour of tensor cores in order to develop numerical codes targeted to tensor cores, on a conventional hardware. Furthermore, there is also a question of differences in numerical behaviour between tensor cores and software MMA using the standard FMA hardware calls, for example differences that would appear when transitioning the existent software to use tensor cores.

Motivated by this, in our recent preprint [4] we investigate an experimental testing methodology of the tensor cores. The method consists of the following steps.

  1. Identify a numerical feature that needs to be tested, for example, the rounding mode in the 5-operand adders that accumulate products in MMA.
  2. Identify possible implementations, for example round to nearest or round toward zero.
  3. Find floating-point inputs that would result in different outputs for each possible hardware behaviour in 2.
  4. Observe outputs and make a conclusion based on expected outputs for each possible behaviour in 2.

Using this approach, we identified the following list of numerical features of the NVIDIA tensor cores.

  • Subnormal floating-point numbers are fully supported, both on the inputs and outputs.
  • 5-operand adders accumulate 5 addends (4 products from AB and a value from C) starting from the largest in magnitude.
  • Round toward zero, rather than round to nearest (default in IEEE 754-compliant arithmetic), in the 5-operand adders is implemented.
  • Different normalization behaviour from the MMA implemented in software (tensor cores normalize the answer of the whole 5-element dot product at the end rather than after each addition of products).
  • Inner products without intermediate normalization are shown to be non-monotonic in rare cases (this result is more general than tensor cores, since to the best of our knowledge, most hardware implementations do not normalize on every addition due to lower hardware costs).

Our conclusion is that in the current version, the tensor cores on V100 and T4 (the A100 is not yet available to us) do not replicate the behaviour of the MMA implemented with IEEE 754 compliant FMA hardware operations. These numerical behaviours are expected in a hardware MMA optimized for reducing hardware cost and most likely is motivated by a fact that machine learning applications usually are claimed to not need all the IEEE 754 features and high precision in general. These results provide the parameters that can be used in rounding error analysis of tensor cores [3] which can be useful when developing numerical software.

Our CUDA code to test numerical features of tensor cores is available here.



[2] A. Abdelfattah et al. A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic. July 2020. Published online.

[3] P. Blanchard, N. J. Higham , F. Lopez, T. Mary, and S. Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput. May 2020.

[4] M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. Numerical Behavior of NVIDIA Tensor Cores. July 2020. MIMS Eprint, published online.

Related articles

Issues with Rounding in the GCC Implementation of the ISO 18037:2008 Standard Fixed-Point Arithmetic

Embedded systems are based on low-power, low-performance processors and can be found in various medical devices, smart watches, various communication devices, cars, planes, mobile phones and many other places. These systems come in a hardware and software package optimized for specific computational tasks and most commonly have real-time constraints. As these systems usually have energy usage and cost constraints too, sophisticated numerical hardware that can process floating-point data is not included, but rather only integer arithmetic, which is simpler in terms of area and power of the processors.

ISO 18037:2008 is a standard for embedded C programming language support. It lays out various rules that C compilers should support to make embedded systems easier to program using a high-level language. One of the most important definitions in this standard is fixed-point arithmetic data types and operations. Support for fixed-point arithmetic is highly desirable, since if it is not provided integers with scaling factors have to be used, which makes code hard to maintain and debug and most commonly requires assembler level changes or completely new implementations for each different platform.

The GCC compiler provides some support of the fixed-point arithmetic defined in this standard for ARM processors. However, in my recent technical report ( I demonstrated various numerical pitfalls that programmers of embedded systems based on ARM and using GCC can get into. The issues demonstrated include

  • larger than half machine epsilon errors in rounding decimal constants to fixed-point data types,
  • errors in conversions between different data types,
  • incorrect pre-rounding of arguments of mixed-format arithmetic operations before the operation is performed, and
  • lack of rounding of the outputs of arithmetic operations.

These findings can be used to improve the accuracy of various embedded numerical libraries that might be using this compiler. To demonstrate one of the issues, here is a piece of test code:

The multiplication operation is a mixed-format operation, since it multiplies an unsigned long fract argument with an accum argument, therefore it is subject to prerounding of the unsigned long fract argument as described in the report. Since the comparison step in the if () sees that the argument a is larger than zero and b larger than 1, the code is executed with a hope that c will not be set to zero. However, in the arithmetic operation, a is incorrectly pre-rounded to 0, which causes c = 0*b, an unexpected outcome and a bug that is hard to detect and fix.