Numerical Behaviour of Tensor Cores

For many years, the arithmetic operations available on most hardware were +, -, *, / and \sqrt{}. More recently (\sim 20 years ago) the FMA (fused multiply-add) operation also became prevalent in general purpose devices such as CPUs and GPUs. Software builds on top of these operations: for example, compilers use sequences of these and other hardware instructions to implement more complex functions and algorithms, such as the exponential function, matrix multiplication, or pseudorandom number generation. These arithmetic operations have been standardized in the IEEE 754 floating-point standard since 1985, and most current systems comply with it.
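In other words, the FMA computes a multiplication and an addition with a single rounding at the end, \mathrm{FMA}(a,b,c) = \mathrm{fl}(ab + c), where \mathrm{fl}(\cdot) denotes rounding to the working precision; the exact product ab is not rounded before the addition.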

Recently, because of the increasing adoption of machine learning, general purpose devices have started to include inner product or matrix multiply-accumulate (MMA) operations in hardware. These are generalizations of the scalar FMA to vectors and matrices. Since the operation is performed in hardware, the speedup comes from parallelism: instead of using a few FMA hardware units sequentially to multiply matrices, as a software implementation does, all of the elements of a matrix of some fixed size are computed in parallel. Compute-bound applications that make heavy use of inner products and matrix multiply-accumulate operations are therefore sped up significantly.
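Concretely, the MMA operation computes D = C + AB, that is, d_{ij} = c_{ij} + \sum_{k} a_{ik} b_{kj} for each entry of D. A software MMA evaluates each of these sums as a chain of scalar FMAs, rounding after every FMA, whereas a hardware MMA computes many such dot products in parallel and, as discussed below, may round and normalize them differently.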

Figure 1: Mixed-precision matrix multiply-accumulate operation on 4×4 matrices performed by the tensor cores.

NVIDIA GPUs are widely used for machine learning and other applications. In the latest TOP500 supercomputer list, published in June 2020 [1], 112 computers are equipped with NVIDIA graphics cards. A key feature of recent NVIDIA GPUs is the hundreds of arithmetic units for performing the MMA operation, which NVIDIA calls tensor cores. Various applications outside machine learning are being explored in order to exploit the very high arithmetic throughput that can be achieved when using MMAs in hardware [2]. Table 1 lists three recent NVIDIA architectures as well as other hardware devices with an MMA operation on chip. The NVIDIA V100 has the first version of tensor cores and the T4 has the second; in these devices the tensor cores work on matrices of size 4 \times 4 in a mixed precision of fp16 and fp32, as shown in Figure 1. The most recent revision of the tensor cores, in the A100, updated both the available precisions and the dimensions of the input/output matrices. See the NVIDIA V100 and NVIDIA A100 whitepapers for more details.

Table 1: Processing units or architectures equipped with mixed-precision matrix multiply-accumulate accelerators. The notation m \times k \times n refers to the matrix multiply-accumulate operation D = C + AB in which C and D are m \times n matrices and A and B have size m \times k and k \times n, respectively.

The MMA operation is not standardized by IEEE 754, and therefore various numerical features of tensor cores are not clear. Knowing such features as the support for subnormal numbers, the order of operations, and the rounding modes and normalization of significands in various parts of the MMA computation can be important, for example when performing error analysis of algorithms that use MMAs in order to derive error bounds [3]. As the space of applications utilizing tensor cores grows beyond machine learning, an understanding of the numerical behaviour of tensor cores will become increasingly useful. Even before such hardware is put to use in an application, one might want to simulate the behaviour of tensor cores on conventional hardware in order to develop numerical codes targeted at tensor cores. Furthermore, there is also the question of differences in numerical behaviour between tensor cores and a software MMA built from standard FMA hardware operations, for example differences that would appear when transitioning existing software to tensor cores.

Motivated by this, in our recent preprint [4] we develop an experimental methodology for testing the numerical features of tensor cores. The method consists of the following steps.

  1. Identify a numerical feature that needs to be tested, for example, the rounding mode in the 5-operand adders that accumulate products in MMA.
  2. Identify possible implementations, for example round to nearest or round toward zero.
  3. Find floating-point inputs that would produce a different output under each of the possible hardware behaviours identified in step 2.
  4. Run the MMA on those inputs, observe the outputs, and draw a conclusion by comparing them with the output expected under each behaviour in step 2 (a sketch of steps 3 and 4 for the rounding mode test follows this list).
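As an illustration of steps 3 and 4, the following MATLAB sketch constructs inputs that distinguish round to nearest from round toward zero in the fp32 accumulator: the exact result of the MMA lies strictly between two adjacent fp32 numbers, so the two rounding modes give different outputs. These are illustrative values, not the test vectors of [4]; the fp16 inputs are written in single precision only to record their values, and the final step assumes access to a tensor core MMA (for example through the CUDA WMMA API).

% Candidate inputs for probing the rounding mode of the fp32 accumulator.
% All entries of A and B are exactly representable in fp16, so the three
% nonzero products (1, 2^-24 and 2^-25) are exact.
A = zeros(4,'single');  B = zeros(4,'single');  C = zeros(4,'single');
A(1,1) = 1;      B(1,1) = 1;        % product 1
A(1,2) = 2^-12;  B(2,1) = 2^-12;    % product 2^-24
A(1,3) = 2^-12;  B(3,1) = 2^-13;    % product 2^-25
% The exact (1,1) entry of C + A*B is 1 + 1.5*2^-24, which lies strictly
% between the adjacent fp32 numbers 1 and 1 + 2^-23.
expected_rn = single(1) + single(2^-23);   % round to nearest gives 1 + 2^-23
expected_rz = single(1);                   % round toward zero gives 1
% Step 4: compute D = C + A*B on the tensor cores and compare D(1,1) with
% the two expected values above.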

Using this approach, we identified the following list of numerical features of the NVIDIA tensor cores.

  • Subnormal floating-point numbers are fully supported, both on the inputs and outputs.
  • The 5-operand adders accumulate their five addends (four products from AB and one value from C) starting from the largest in magnitude.
  • The 5-operand adders use round toward zero rather than round to nearest (the default in IEEE 754-compliant arithmetic).
  • The normalization behaviour differs from that of an MMA implemented in software: tensor cores normalize the result of the whole 5-element dot product at the end, rather than after each addition of a product.
  • Inner products without intermediate normalization are shown to be non-monotonic in rare cases (this result applies more widely than to tensor cores, since, to the best of our knowledge, most hardware implementations avoid normalizing after every addition in order to reduce hardware cost).

Our conclusion is that, in the current version, the tensor cores on the V100 and T4 (the A100 is not yet available to us) do not replicate the behaviour of an MMA implemented with IEEE 754-compliant FMA hardware operations. These numerical behaviours are to be expected in a hardware MMA optimized to reduce hardware cost, and are most likely motivated by the view that machine learning applications do not need all of the IEEE 754 features, or high precision in general. These results provide the parameters that can be used in the rounding error analysis of tensor cores [3], which can be useful when developing numerical software.

Our CUDA code to test numerical features of tensor cores is available here.

References

[1] https://www.top500.org/lists/top500/2020/06/.

[2] A. Abdelfattah et al. A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic. July 2020. Published online.

[3] P. Blanchard, N. J. Higham, F. Lopez, T. Mary, and S. Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput., May 2020.

[4] M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. Numerical Behavior of NVIDIA Tensor Cores. July 2020; revised October 2020. MIMS Eprint, published online.

Related articles

Françoise Tisseur Awarded the LMS Fröhlich Prize

Professor Françoise Tisseur has been awarded the London Mathematical Society’s Fröhlich Prize for “her important and highly innovative contributions to the analysis, perturbation theory, and numerical solution of nonlinear eigenvalue problems”.

The Fröhlich Prize is awarded every two years in memory of Albrecht Fröhlich, FRS. The prize is awarded for original and extremely innovative work in any branch of mathematics.

The full prize citation is available here.


Nick Higham Interviewed in The Actuary Magazine

Professor Nick Higham is featured in the June 2020 issue of The Actuary Magazine, the magazine of the Institute and Faculty of Actuaries.

In an interview with the editor, Dan Georgescu, Nick talks about the development in 2002 of his nearest correlation matrix algorithm, which is widely used in finance and data science.

The magazine is freely readable on the Issuu platform.

More details about the nearest correlation matrix problem and algorithms for solving it are available on Nick’s blog.


Stochastic Rounding Has Unconditional Probabilistic Error Bounds

In the IEEE standard 754 for binary floating-point arithmetic, four rounding modes are defined.

  • Round to nearest.
  • Round towards 0.
  • Round towards +\infty.
  • Round towards -\infty.

Recently stochastic rounding, an old idea that dates back to the beginning of the digital computer era, has gained popularity, most notably in deep learning. While the rounding modes defined in the IEEE standard are deterministic, stochastic rounding is inherently random. We can define two modes of stochastic rounding. Consider the figure below, where we have a real number x and adjacent floating-point numbers a and b. In what we call mode 1 stochastic rounding, we round x to either a or b with equal probability. In mode 2 stochastic rounding, we round to a (or b) with probability proportional to 1 minus the distance of x from a (or b). So in the example shown, for mode 2 we are more likely to round to b than to a.
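More precisely, if a \le x \le b and h = b - a is the spacing, mode 2 rounds x to b with probability (x-a)/h and to a with probability (b-x)/h, so the nearer of the two candidates is the more likely one; mode 1 simply takes both probabilities to be 1/2.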

In our recent EPrint Stochastic Rounding and its Probabilistic Backward Error Analysis, Nick Higham, Theo Mary and I generalized previous probabilistic error analysis results [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835] to stochastic rounding.

We show that stochastic rounding (specifically mode 2) has the property of producing rounding errors \delta_k that are random variables satisfying

  • mean zero: \mathbb{E}(\delta_k) = 0,
  • mean independent: \mathbb{E}(\delta_k \mid \delta_{k-1}, \dots, \delta_1) = \mathbb{E}(\delta_k).

Here, \mathbb{E} denotes the expectation. A key consequence of these results is that we can replace the worst-case error bound proportional to \gamma_n = nu + O(u^2), ubiquitous in backward error analyses, by a more informative probabilistic bound proportional to \widetilde{\gamma}_n = \sqrt{n}u + O(u^2). What is only a rule of thumb for round to nearest becomes a rule for stochastic rounding: for round to nearest the two properties above are a modelling assumption, whereas for mode 2 stochastic rounding we prove that the rounding errors satisfy them.

In the figure below, we compute the inner product s = x^Ty of vectors sampled uniformly from [0,1], using IEEE half precision arithmetic with both round to nearest and stochastic rounding. Shown is the backward error for each value of n together with the bounds \gamma_n and \widetilde{\gamma}_n.

For increasing problem size n, with round to nearest the error eventually exceeds the \sqrt{n}u bound. This is due to the phenomenon we call stagnation, to which low precisions are particularly susceptible. As we compute recursively s_i = s_{i-1} + x_iy_i, the partial sum s_{i-1} can eventually grow so large that the update x_iy_i is smaller than half of the spacing of the floating-point numbers around s_{i-1}, and under round to nearest the partial sum does not increase. This means we produce negative rounding errors, which are certainly not of mean zero. Stochastic rounding avoids this issue by rounding randomly. In fact we can prove that \mathbb{E}(\widehat{s}) = s, that is, the expected value of the computed result under stochastic rounding is the exact result.
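The following toy MATLAB sketch illustrates this behaviour. It is not the code behind the figure: instead of half precision it rounds onto a fixed grid of spacing h, which is enough to show round to nearest stagnating while mode 2 stochastic rounding tracks the exact sum in expectation.

h = 2^-10;                              % grid spacing of the toy format
rn  = @(x) round(x/h)*h;                % round to nearest grid point
sr2 = @(x) floor(x/h + rand)*h;         % mode 2: round up with probability frac(x/h)
n = 1e5;  t = h/4;                      % each addend is a quarter of the grid spacing
s_rn = 0;  s_sr = 0;
for i = 1:n
    s_rn = rn(s_rn + t);                % stagnates: t is below half the spacing
    s_sr = sr2(s_sr + t);               % unbiased: expected value is the exact sum
end
fprintf('exact %.4g, round to nearest %.4g, stochastic %.4g\n', n*t, s_rn, s_sr);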

Online Seminar Series on Numerical Linear Algebra

Dr Stefan Güttel is one of the organizers of a new online series of seminars on Numerical Linear Algebra. The talks are broadcast live on Zoom and YouTube every Wednesday at 4pm Central European Summer Time (CEST). The programme is available on the seminar website. At the time of writing, 942 people from 79 countries have signed up to the seminar mailing list.

The other organizers are Melina Freitag (University of Potsdam), Daniel Kressner (EPF Lausanne), Jörg Liesen (TU Berlin), Valeria Simoncini (University of Bologna), and Bart Vandereycken (University of Geneva).

The first talk in the series was given by Professor Nick Higham, titled Are Numerical Linear Algebra Algorithms Accurate at Extreme Scale and at Low Precisions? The slides for the talk are available at this link and the video is below.

Extreme-Scale Test Matrices with Specified 2-Norm Condition Number

Row H of the compute racks of Summit, the most powerful supercomputer in the TOP500 list since June 2018.

The supercomputers in the TOP500 list are ranked using the High-Performance Linpack (HPL) Benchmark, which gauges the performance of a computer by solving a large dense linear system by Gaussian elimination with partial pivoting. The size of the coefficient matrix depends on the computational power of the machine being assessed, because more powerful systems require larger benchmark problems in order to reach their peak performance.

The test matrices used by HPL have random entries uniformly distributed on the interval (-1/2,1/2]. The 2-norm condition number of such matrices depends on their size, and can potentially be very large for the matrices required by today’s most powerful computers: the largest linear systems solved on Summit, the machine that leads the November 2019 TOP500 ranking, have order 10^7, and even larger systems will be needed to benchmark the coming generations of exascale systems.

An n \times n matrix with specified 2-norm condition number can be generated as A := U \varSigma V^T, where U and V are random real orthogonal n \times n matrices from the Haar distribution (a natural uniform probability distribution over the orthogonal matrices) and \varSigma is diagonal with nonnegative entries \sigma_1\ge \cdots \ge \sigma_n \ge 0. It is well known that A has 2-norm condition number \kappa_2(A) = \sigma_1/\sigma_n if \sigma_n \neq 0 and \kappa_2(A) = \infty otherwise. This technique, which is used by the gallery('randsvd', ...) function, requires 2n^3 floating-point operations to generate a test matrix of order n, and would thus be impractical in an extreme-scale setting.
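As a rough MATLAB sketch of this cubic-cost construction (not the gallery('randsvd',...) code), one can draw U and V from the Haar distribution by computing the QR factorization of a standard Gaussian matrix and fixing the signs, and use, for example, geometrically spaced singular values:

n = 1000;  kappa = 1e6;
sigma = kappa.^(-(0:n-1)/(n-1));                    % sigma_1 = 1, sigma_n = 1/kappa
[U,R] = qr(randn(n));  U = U*diag(sign(diag(R)));   % sign fix gives the Haar distribution
[V,R] = qr(randn(n));  V = V*diag(sign(diag(R)));
A = U*diag(sigma)*V';                               % two n-by-n products: O(n^3) flops
fprintf('cond(A) = %.2e, target %.2e\n', cond(A), kappa);

Both the QR factorizations and the two matrix products cost O(n^3) flops, which is the cost our new algorithms avoid.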

In our recent EPrint Generating extreme-scale matrices with specified singular values or condition numbers, Nick Higham and I present four methods that, by giving up the requirement that the matrices U and V be from the Haar distribution, reduce the cost of the approach above from cubic to quadratic. The matrices generated retain a number of desirable properties that make them a suitable choice for testing linear solvers at scale.

These cheaper algorithms are particularly well suited to distributed-memory environments, since all communication between the nodes can be avoided at the price of a negligible increase in the overall computational cost. They are appealing to MATLAB users too, as the following example demonstrates.

n = 10000; kappa = 1e6; mode = 2; rng(1)
% gallery('randsvd',...)
fprintf('gallery(''randsvd'',...):                %5.2f seconds elapsed.\n',...
        timeit(@()gallery('randsvd',n,kappa,mode,[],[],1)));
% Algorithm 3.1 in our EPrint.
method = 1; matrix = 0;
fprintf('Alg. 3.1 with Haar U:                  %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 3.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
% Algorithm 4.1 in our EPrint.
method = 3; matrix = 0;
fprintf('Alg. 4.1 with Haar U:                  %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 4.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));

In the listing above, randsvdfast is an implementation of our algorithms, available on the MATLAB Central File Exchange. Setting the argument matrix to 0 tells our function to pick U from the Haar distribution, whereas setting it to 2 causes U to be chosen in a non-random way. Running this code in MATLAB 2019b on a machine equipped with an Intel i5-6500 processor running at 3.20GHz produces the output

gallery('randsvd',...):                79.52 seconds elapsed.
Alg. 3.1 with Haar U:                  19.70 seconds elapsed.
Alg. 3.1 with U=gallery('orthog',n,2):  1.90 seconds elapsed.
Alg. 4.1 with Haar U:                  19.28 seconds elapsed.
Alg. 4.1 with U=gallery('orthog',n,2):  1.43 seconds elapsed.

Therefore, for n = 10{,}000 randsvdfast is up to 56 times faster than gallery('randsvd',...).

2020 NLA group photo

This year’s group photo was taken on March 5, 2020 after the NLA group meeting. Most group members are in the photo; those missing include Jack Dongarra, Stefan Güttel, Ramaseshan Kannan and Marcus Webb. 

The banner on this website has also been updated with the new group photo.  A high resolution version of the photo is available here.


By row from the back: Craig Lucas, Nick Higham, Xinye Chen, Steven Elsworth, Xiaobo (Bob) Liu, Michael Connolly, Mantas Mikaitis, Len Freeman, Massimiliano Fasi, Pierre Blanchard, Sven Hammarling, Asad Raza Aitor Mehasi Mehasi, Stephanie Lai, Gian Maria Negri Porzio, Thomas McSweeney, Mawussi Zounon, Françoise Tisseur, Srikara Pranesh, Yuqing (Mila) Zhang, Eleni Vlachopoulou.

NLA Group at the SIAM Conference on Parallel Processing for Scientific Computing

Several members of the group attended the SIAM Conference on Parallel Processing for Scientific Computing held in Seattle on February 12-15, 2020.

The presentations given are as follows:

Nick Higham and Srikara Pranesh also organised a two part mini-symposium (Advances in Algorithms Exploiting Low Precision Floating-Point Arithmetic, MS10 and MS21) at the conference. 

SIAM PP20

Max Fasi, Mantas Mikaitis, Mawussi Zounon, Sri Pranesh, and Theo Mary at the SIAM Conference on Parallel Processing for Scientific Computing, Seattle, Washington, February 12-15, 2020.

Conference Celebrating the 70th Birthday of Jack Dongarra

by Sven Hammarling, Nick Higham, and Françoise Tisseur


Jack Dongarra

July 18, 2020 is the 70th birthday of Professor Jack Dongarra, who holds appointments at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester.

Jack has made seminal contributions to algorithms for numerical linear algebra and the design and development of high performance mathematical software for machines ranging from workstations to the largest parallel computers. His recent honours include election as a Foreign Member of the Royal Society and receipt of the SIAM/ACM Prize in Computational Science and Engineering (2019) and the IEEE Computer Society Computer Pioneer Award (2020).

To celebrate Jack’s birthday we are organizing a conference New Directions in Numerical Linear Algebra and High Performance Computing: Celebrating the 70th Birthday of Jack Dongarra at The University of Manchester, July 17, 2020.  Registration is now open and we welcome submission of posters.

Sharper Probabilistic Backward Error Bounds

Most backward error bounds for numerical linear algebra algorithms are of the form nu, for a machine precision u and a problem size n. The dependence on n of these bounds is known to be pessimistic: our recent probabilistic analysis with Nick Higham [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], which assumes rounding errors to be independent random variables of mean zero, proves that n can be replaced by a small multiple of \sqrt{n} with high probability. However, even these smaller bounds can still be pessimistic, as the figure below illustrates.

The figure plots the backward error for the summation (in single precision) of n floating-point numbers randomly sampled from a uniform distribution. For numbers sampled from [0,1], the bound \sqrt{n}u is almost sharp and accurately predicts the error growth. However, for the [-1,1] distribution the error is much smaller, and seemingly does not grow with n. This strong dependence of the backward error on the data cannot be explained by the existing bounds, which do not depend on the values of the data.
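The effect is easy to reproduce. The sketch below (not the code used for the figure) computes the backward error of recursive summation in single precision for data drawn from [0,1] and from [-1,1]; the problem sizes and other details of the plotted experiment may differ.

rng(1);  u = double(eps('single'))/2;            % unit roundoff of fp32
for n = 10.^(4:7)
    for data = {rand(n,1), 2*rand(n,1) - 1}      % mean 1/2 data and mean 0 data
        xs = single(data{1});  xd = double(xs);  % exact values being summed
        s = xs(1);
        for i = 2:n, s = s + xs(i); end          % recursive summation in fp32
        bwd = abs(double(s) - sum(xd))/sum(abs(xd));   % backward error
        fprintf('n = %8d   backward error %9.2e   sqrt(n)*u = %9.2e\n', ...
                n, bwd, sqrt(n)*u);
    end
end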

In our recent preprint, we perform a new probabilistic analysis that combines a probabilistic model of the rounding errors with a second probabilistic model of the data. Our analysis reveals a strong dependence of the backward error on the mean of the data \mu: indeed, our new backward error bounds are proportional to \mu\sqrt{n}u + u. Therefore, for data with small or zero mean, these new bounds are much sharper as they bound the backward error by a small multiple of the machine precision independent of the problem size n.

Motivated by this observation, we also propose new algorithms that transform the data to have zero mean, so as to benefit from these more favorable bounds. We implement this idea for matrix multiplication and show that our new algorithm can produce significantly more accurate results than standard matrix multiplication.
