Author Archives: Mantas Mikaitis

SIAM PP22 Minisymposium on Understanding and Exploiting Mixed-Precision Accelerators for High-Performance Computing

NVIDIA A100 (source: https://www.nvidia.com/en-gb/data-center/a100/).

This double minisymposium, organized by Mantas Mikaitis and Nick Higham, took place during SIAM Conference on Parallel Processing for Scientific Computing (PP22) which happened virtually on 23-26 February, 2022. Here we provide slides of some of the talks that were delivered during this minisymposium.

Abstract: The growth of domain-specific hardware devices, such as low- and mixed-precision Matrix-Multiply Accumulate (MMA) accelerators (for example Tensor Processing Units and Tensor Cores), motivates several strands of research in scientific computing. First, algorithm designers aim to benefit from the speedup these hardware devices make possible by adapting algorithms, or parts of them, to run in low or mixed precisions. Second, we need to understand the low level details of how the devices implement floating-point arithmetic and to what extent they satisfy floating-point arithmetic standards. Third, new rounding error analysis is being developed to further support the task of finding the best ways to use the accelerators in order to maximize the accuracy of the results. This minisymposium gathers researchers in scientific computing, numerical analysis, and the standardization and testing of floating-point arithmetics to report the latest research on applying and understanding the MMA hardware.

Numerical Behavior of GPU Matrix Multiply-Accumulate Hardware. Massimiliano Fasi, Örebro University, Sweden; Nicholas J. Higham, Mantas Mikaitis, and Srikara Pranesh, The University of Manchester, United Kingdom; Florent Lopez, ANSYS, Inc., U.S.; Theo Mary, Sorbonne Universités and CNRS, France. Abstract.

Mixed Precision in Linear Algebra. Jack J. Dongarra, University of Tennessee and Oak Ridge National Laboratory, U.S.

Challenges of Mixed-Precision Fast Fourier Transforms from the Instruction Level to at Scale Computations. Lukasz Ligowski, NVIDIA, U.S. Abstract.

Double-Word Arithmetic and Accurate Calculation of Euclidean Norms. Vincent Lefèvre, Inria Paris, France; Nicolas Louvet, Université Claude Bernard Lyon 1, France; Jean-Michel Muller, CNRS, France; Joris Picot, Ecole Normale Superieure de Lyon, France; Laurence Rideau, Inria Sophia Antipolis, France. Abstract.

Online and Offline Precision Tuning in Hardware Accelerators. George A. Constantinides, Imperial College London, United Kingdom. Abstract.

Reducing Data Movement in Mixed Precision LU Factorization with GPU Tensor Cores. Atef Dorai, LIP-ENS Lyon, France; Roman Iakymchuk, Sorbonne Université, France and Fraunhofer ITWM, Germany; Florent Lopez, ANSYS, Inc., U.S.; Theo Mary, Sorbonne Universités and CNRS, France. Abstract.

BLIS: Mixing, Matching, and Extending Precision. Greg Henry, Intel Corporation, U.S.; Devin Matthews, Southern Methodist University, U.S.; Maggie E. Myers,  Devangi N. Parikh, Robert A. van de Geijn, and Field G. Van Zee, University of Texas at Austin, U.S. Abstract.

Fluid Simulations Accelerated with 16 bit: Approaching 4x Speedup on A64FX by Squeezing ShallowWaters.jl into Float16. Milan Kloewer, University of Oxford, United Kingdom; Sam Hatfield, European Center for Medium-Range Weather Forecasts (ECMWF) ; Matteo Croci, University of Oxford, United Kingdom; Peter D. Dueben, European Weather Centre, United Kingdom; Tim Palmer, University of Oxford, United Kingdom. Abstract.

Exploiting the MATLAB Language in Implementing a Matrix Collection

“Concise code” may not be commonly used or defined, but as in languages, conciseness—the quality of communicating information in as few words as is feasible—can be a good guide in programming. Conciseness of code, which could mean short and self-explanatory lines or blocks of code, is easier to achieve in higher level languages such as Python or MATLAB, and more difficult when writing, for example, in RISC assembly languages with short instructions each doing very little.

For developing the recently released matrix collection, Anymatrix [1,2] (with Nick Higham), we used MATLAB and exploited its string handling routines to do complicated string manipulations in a few lines of code. This post describes some details of the implementation and further information can be found in the user’s guide [3].

Groups of Matrices

Anymatrix enforces the organization of matrices into groups, which means a new matrix has to be placed either into one of the existing groups or a new group. For example, the built-in groups are contest, core, gallery, hadamard, matlab, nessie, and regtools. Each of these come with around 20 matrices, which can be fixed-sized matrices or matrix generators.

Matrices are identified by <group_name>/<matrix_name>. For example, core/beta is a matrix called beta in the core group. Matrix names are their file names minus the extension .m. Group names are directory names of the groups. Since the IDs of matrices are derived from the directory and file names, uniqueness is guaranteed. This also means that multiple matrices can have the same name as long as they are in different groups.

Around the time we started working on this collection, MATLAB R2020b introduced pattern, to search and match strings conveniently. An example usage in Anymatrix is as follows.

% Matrix ID pattern
matrix_ID_pat = ...
  asManyOfPattern(alphanumericsPattern | ...
    characterListPattern("_") | ...
    characterListPattern("-")) + ...
  '/' + ...
  asManyOfPattern(alphanumericsPattern | ...
    characterListPattern("_") | ...
    characterListPattern("-"));

This code creates a pattern for matrix IDs, which matches sequences of letter, digit, underscore or hyphen characters, followed by a single forward slash, followed again by the sequence. This sort of pattern can then be used, for example, to match strings (using matches) and extract substrings (using extract):

>> S = '...core/beta...';
>> extract(S, matrix_ID_pat)
ans =
  1×1 cell array
    {'core/beta'}

The pattern functionality helped in writing concise code. We used it in various places of Anymatrix, for example in detecting matrix IDs in the commands and extracting substrings in the Boolean search expressions to separate properties from brackets and operators and/or/not.

Properties of Matrices

Matrix properties can be specified in two ways, either as an assignment properties = {...} inside the matrix .m files, or in the am_properties.m files in group folders. With the latter option we may specify properties of multiple matrices in one assignment. Anymatrix contains a list of recognized properties, such as ill conditioned or banded. When Anymatrix scans the file structure, it gives a warning if a matrix contains an unrecognized property.

Properties can be searched by Boolean expressions with operators and/or/not and brackets for specifying precedence. For example:

>> anymatrix('p', 'integer and ill conditioned')
ans =
  3×1 cell array
    {'core/wilson'     }
    {'gallery/dramadah'}
    {'matlab/pascal'   }

The following function that transforms the supplied Boolean expression into an expression that can be evaluated in MATLAB, demonstrates a lot of MATLAB’s features that we made use of.

function IDs = search_by_properties(expression)
  IDs = {};
  % Replace 'and', 'or', and 'not' by corresponding MATLAB symbols.
  expression = replace(expression, ' and ', ' & ');
  expression = replace(expression, ' or ', ' | ');
  expression = replace(expression, ' not ', ' ~');
  expression = replace(expression, '(not ', '(~');
  if startsWith(expression, 'not')
    expression = expression(4:length(expression));
    expression(1) = '~';
  end
  % Assume properties are made up letters, can include a hyphen
  % or a white space character, and there is no case sensitivity.
  pat = (lettersPattern + whitespacePattern + lettersPattern) ...
        | (lettersPattern + characterListPattern('-') ...
          + lettersPattern) ...
        | lettersPattern;
  % Extract properties from the logical expression and replace
  % them by function calls to test for membership.
  ex_props = extract(expression, pat);
  ind = 1;
  new_expression = '';
  for p = ex_props.'
    mod_prop = strcat('ismember(''', ...
        strrep(lower(p{1}), '-', ' '), ...
        ''', strrep(lower(matrix_properties{1}), ''-'', '' ''))');
    trunc_exp = expression(ind:end);
    % Find where the property is in the expression.
    prop_index = strfind(trunc_exp, p{1});
    % Take anything before the property and append the modified
    % version of the property.
    new_expression = strcat(new_expression, ...
        trunc_exp(1:prop_index(1)-1), ...
        mod_prop);
    % Move the index after the property that was replaced.
    ind = ind + prop_index(1) + length(p{1}) - 1;
  end
  new_expression = strcat(new_expression, expression(ind:end));

  % Find matrices whose properties satisfy the specified logical
  % expression.
  k = 0;
  for matrix_properties = properties.'
    k = k + 1;
    % Test if the expression is true for this matrix
    % and add it's ID.
    if eval(new_expression)
      IDs = [IDs; matrix_IDs{k}];
    end
  end
end

On lines 4 to 11 we replace the operators and/or/not by the corresponding MATLAB ones &/|/~. We then extract properties from the expression, making use of patterns, modify them to test for membership of the properties and place them back into the expression (lines 14 to 38). Finally, on lines 42 to 50 we use eval to run the Boolean expression as MATLAB code for every matrix, and return a list of suitable matrices.

In the example command anymatrix('p', 'integer and ill conditioned'), the expression 'integer and ill conditioned' is transformed internally to 'ismember('integer', strrep(lower(matrix_properties{1}), '-', ' ')) & ismember('ill conditioned', strrep(lower(matrix_properties{1}), '-', ' '))' which can then be evaluated using eval. Notice that Anymatrix replaces hyphens by white spaces to allow two ways to specify properties, and converts properties to lower case to avoid case sensitivity. This way, ill conditioned and Ill-conditioned, for example, are equivalent properties in Anymatrix.

Testing in Anymatrix

There are two ways to implement testing in Anymatrix. One way is to add a function in file test_run.m in the group folder which can be invoked by an appropriate Anymatrix command. Another way is to test matrices for their properties. This is provided in the directory testing/ in the root folder of Anymatrix.

It is worth noting that not all supported properties can be tested, so only a subset in the built-in collection are. Each property that is feasible to test has a corresponding test_<property>.m file. Functions in these files must return true or false, given a matrix. Anymatrix utilizes MATLAB’s unit testing framework and automatically generates function-based unit tests for every matrix in the collection. When a new matrix or a new property to an existent matrix is added, the unit testing framework picks that up automatically when run. The following is an example function-based unit test that is automatically generated by Anymatrix. It tests the properties of a set of the core/fourier matrices with arbitrary dimensions.

function test_core_fourier(testcase)
  A = anymatrix('core/fourier',3);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',5);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',8);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',10);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',15);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',24);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',25);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',30);
  anymatrix_check_props(A, 'core/fourier', testcase);
  A = anymatrix('core/fourier',31);
  anymatrix_check_props(A, 'core/fourier', testcase);
end

Above, the function anymatrix_check_props looks for files test_<property>.m and runs those tests that it finds on the given matrix:

for prop = P.'
  test_func_name = strcat('test_', ...
    strrep(strrep(lower(prop{1}), ...
    '-', '_'), ' ', '_'));
  if isfile(strcat(root_path, '/private/', test_func_name, '.m'))
    handle = str2func(test_func_name);
    verifyTrue(testcase, handle(M), ...
      strcat("Matrix ", matrix_ID, " is not ", prop{1}, "."));
  end
end

Using this, Anymatrix adds a level of almost automatic testing invoked by adding new matrices or properties to matrices. A good example of this is the MATLAB built-in gallery group, which we made available in Anymatrix. By appending each matrix with properties we can check that MATLAB’s gallery matrices have the expected properties.

Summary

Anymatrix is both a collection of matrices and a tool to organize them. It is worth noting that much of the infrastructure is not specific to organizing matrices and so it can be reused to organize into groups, search, and access through IDs any kind of objects appended with properties.

Anymatrix 1.0 was released in October 2021, but the development continues. We welcome contributions to the collection either in the form of remote groups that can be downloaded into Anymatrix given a git url (an example is matrices-mp-cosm group by X. Liu), or suggestions to include matrices or groups in the built-in collection. We also welcome reports of bugs or recommendations for improvements to the clarity, correctness and conciseness of the source code.

References

[1] N. J. Higham and M. Mikaitis. Anymatrix: An Extensible MATLAB Matrix Collection. MIMS EPrint 2021.16, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Oct. 2021 (to appear in Numer. Algorithms).

[2] N. J. Higham. Anymatrix: An Extensible MATLAB Matrix Collection. Nov. 2021.

[3] N. J. Higham and M. Mikaitis. Anymatrix: An Extensible MATLAB Matrix Collection. Users’ Guide. MIMS EPrint 2021.15, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Oct. 2021.

Plotting Circles to Test Low Precision Arithmetics and Rounding Modes

As part of the 28th IEEE Symposium of Computer Arithmetic (held in July 2021), a special issue of IEEE Transactions on Emerging Topics in Computing on “Emerging and Impacting Trends on Computer Arithmetic” was organized, which has recently been published [1]. With Massimiliano Fasi, we have contributed an article [2] (available Open Access) on simulating stochastic rounding [3] in floating-point arithmetics, and about which we wrote in this blog before [4].

One interesting numerical example which we used to test various arithmetics and rounding modes [Sec. VIII C2, 4] is an ordinary differential equation (ODE)

u'(t)=v(t), \\ v'(t)=-u(t).

With the initial conditions u(0)=1 and v(0)=0, the solution of this ODE lies on the unit circle in the uv-plane [p. 51, 5]. When solving it numerically our is aim to obtain the solution that lies as close as possible to the unit circle. A similar example is approximating orbits of asteroids in Physics [Sec. II A, 8], although it is more complicated since it additionally includes velocity and acceleration.

Using a forward Euler scheme we obtain:

u_{k+1}=u_k+hv_k, \\ v_{k+1}=v_k-hu_k,

where h=\frac{2\pi}{n} is a step size. Now, solving this with n=32 in bfloat16 arithmetic [6] with round to nearest (RN) and stochastic rounding in arithmetic operations, we obtain the solution in Figure 1 (dotted line—exact unit circle, dashed—RN, solid—SR), which spirals out as previously shown [p. 51, 5]. To improve this we could use a different solver, such as Trapezium or Leapfrog [p.51, 5], but for the purposes of testing arithmetics rather than solvers we stick with Euler and try different step sizes.

Figure 1: Unit circle ODE solutions with n=32. The x-axis represents u and the y-axis represents v. Dotted line—exact unit circle, dashed—RN, and solid—SR.

Next, we tried n=2^9 and the plot is shown on the left hand side of Figure 2. Visually, this timestep size for the bfloat16 arithmetic with RN looks ideal since it quite closely approximates the unit circle and the solution does not spiral out as before. SR performs quite well here as well but noticeably worse than RN.

Next, we further reduce the timestep to n=2^{11} and the solution with that is plotted on the right hand side of Figure 2. In this case the solution with RN has been affected by rounding errors—the approximation of the unit circle looks visually as an octagon rather than a circle!

So what happened there? In the paper, we explain this through the problem of stagnation in floating-point arithmetic. It happens when many addends to some relatively large value are small enough so that they are all lost to rounding and the sum stays at some initial or intermediately reached value. In this case, once we start at an initial point u=1 and v=0, we expect that both u and v will start to decrease. However, only v is doing that since in u_{k+1}=u_k+hv_k the term hv_k is too small initially to change u_k.

The same pattern repeats during the whole solution and u and v keep switching places in suffering from stagnation. Since SR is immune to stagnation [7], this issue does not appear.

Figure 2: Unit circle ODE solutions with n=2^{9} (left) and n=2^{11}.

Finally, we reduce the timestep by a further factor of 100 (Figure 3). This revealed that the ODE solution solved in bfloat16 hardly moves away from the initial conditions with RN, but is still quite accurately computing the approximation of the unit circle with SR.

Figure 3: Unit circle ODE solutions with n=2^{13}.

The unit circle ODE constitutes an easy to run experiment to observe stagnation in floating-point arithmetic, which is usually done through recursive summation (also in our paper, [Sec. VIII A-B, 2]), and is a good visually convenient benchmark for testing low precision arithmetics and alternative rounding modes, such as stochastic rounding, and perhaps for teaching students about floating point and stagnation.

Further detail on the unit circle experiments, as well as other experiments with different ODEs and solvers, can be found in Section VIII of our paper [2]. The MATLAB code for the experiments is available at https://github.com/mmikaitis/stochastic-rounding-evaluation.

References

[1] M. Joldes, F. Lamberti, and A. Nannarelli, Special Section on “Emerging and Impacting Trends on Computer Arithmetic”. IEEE Trans. Emerg. Topics Comput., 9:3. Sep. 2021

[2] M. Fasi and M. Mikaitis, Algorithms for Stochastically Rounded Elementary Arithmetic Operations in IEEE 754 Floating-Point Arithmetic. IEEE Trans. Emerg. Topics Comput., 9:3. Sep. 2021

[3] N. J. Higham, What is Stochastic Rounding?. Jul. 2020

[4] M. Mikaitis, Simulating Stochastically Rounded Floating-Point Arithmetic Efficiently. Nov. 2020

[5] N. J. Higham, “Goals of applied mathematical research” in The Princeton Companion to Applied Mathematics, N. J. Higham, M. R. Dennis, P. Glen- dinning, P. A. Martin, F. Santosa, and J. Tanner, Eds. Princeton, NJ, USA: Princeton Univ. Press, 2015, pp. 48–55.

[6] N. J. Higham, What is Bfloat16 Arithmetic?. Jun. 2020

[7] M. P. Connolly, N. J. Higham, and T. Mary, Stochastic rounding and its probabilistic backward error analysis. SIAM J. Sci. Comput., 43(1), A566–A585. Feb. 2021

[8] D. A. Fauxa and J. Godolphin, The floating point: Tales of the unexpected. Am. J. Phys., 89 (8). Aug. 2021

Simulating Stochastically Rounded Floating-Point Arithmetic Efficiently

Most general-purpose computing devices implement floating-point arithmetic defined in the IEEE 754-1985 standard (revised in 2008 and 2019) [1]. The standard provides various compulsory and recommended operations in order to prepare the hardware for as wide a space of applications as possible, as well as to assure reproducibility of numerical calculations across different devices. Hardware that is IEEE 754-compliant usually includes at least addition/subtraction, multiplication, division and square root operations. Each of these can be performed with various rounding modes, chosen by the user from round-to-nearest ties-to-even, round-toward-zero, round-toward-infinity, and round-toward-minus-infinity. Round-to-nearest provides the lowest error bound possible and thus is the default choice for most, but the other rounding modes find uses in various applications as well.

Given hardware with the five arithmetic operations and four rounding modes described above, how can we simulate other rounding modes efficiently when needed? Rounding modes that are not included in the IEEE 754 standard but are of interest in various research domains are, for example: round-to-odd [2], round-to-nearest ties-to-zero [3], stochastic rounding with 50/50 chance of rounding to either of the neighbouring values [4], or stochastic rounding with probabilities proportional to the distances between the exact value and the neighbouring floating-point values [5, 6]. The last of these is also used in fixed-point arithmetic: for example in machine learning [7] and ordinary differential equation (ODE) solvers [8].

In a recent preprint with Massimiliano Fasi [9] we address the problem of simulating the proportional variant of stochastic rounding of operations in precision k, when there is no higher precision available in hardware (therefore k is the working precision as well as the target precision for rounding). This rounding mode requires the round-off bits captured to be used in calculating the probabilities, but they are lost in IEEE 754-compliant operations of present hardware devices—an operation is performed and only the rounded result is returned in a register. A straightforward approach would be to perform operations in higher precision than k and derive the probabilities from the high precision answer. However, this approach can result in low efficiency code because we would need to use software implementation of higher precision arithmetic operations; for example the MPFR package or other similar software packages developed for the purpose of computing in higher precisions than the working precision.

Our solution is to use error-free floating-point transformations [10] for addition/subtraction and multiplication, and similar transformations (yet not error free) for the division and square root, to compute the distances between the exact number and the two closest floating-point values. This can be done by having the original arguments and the rounded answer from the hardware operation, and knowing what rounding mode was used for producing it. Our algorithms presented in the paper cover two possible cases: when the rounding mode of the hardware operations is round-to-nearest ties-to-even (default) or a combination of the other three rounding modes. The former strategy results in slightly more complicated algorithms since round-to-nearest can round to both sides of the exact number, but the latter is slower due to costs associated with changing the rounding modes—this is true at least with Intel processors which we used for testing.

Table 1: Throughput (in Mop/s) of the C implementations of the algorithms discussed in the paper [9]. The parameter p represents the number of significant digits in the fraction of the MPFR numbers being used; algorithms that do not use MPFR have a missing value in the corresponding row. The baseline for the speedup is the mean thoughtput of the MPFR variant that uses 113 bits to perform the same operation.

In simulating stochastically rounded binary64 (double precision) arithmetic operations on a computer equipped with IEEE 754-compliant operations in that precision, our algorithms achieved between 7.3 and 19 times higher throughput of arithmetic operations compared with the simulation that uses MPFR (Table 1), while not making any compromises in accuracy. In the preprint we also evaluate our algorithms in various applications where stochastic rounding is expected to bring advantages: summation of floating-point numbers and ODE solvers. In summary, our algorithms can provide significant speedup for programs that use stochastic rounding operations in working precision.

It is worth noting that the algorithms explored in [9] should not be used if higher precision than the target precision (the precision of which stochastic rounding we would like to simulate) is available in hardware. In such cases, the higher precision should be used if possible, and the high precision results from arithmetic operations should be rounded stochastically to the target precision using different algorithms: such as the ones used in the MATLAB chop [11] and CPFloat [12] packages (a separate blog post on CPFloat can be found here).

References

[1] 754-2019 – IEEE Standard for Floating-Point Arithmetic. July 2019.
[2] Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. S. Boldo and G. Melquind. IEEE Trans. Comp., February 2008.
[3] Emulating Round-to-Nearest-Ties-to-Zero “augmented” Floating-Point Operations Using Round-to-Nearest-Ties-to-Even Arithmetic. S. Boldo, C. Q. Lauter, and J.-M. Muller. IEEE Trans. Comp., June 2020.
[4] CADNA: A library for estimating round-off error propagation. F. Jézéquel and J.-M. Chesneaux. Comp. Phys. Comm., 2008.
[5] Stochastic Rounding and its Probabilistic Backward Error Analysis. M. Connolly, N. J. Higham, and T. Mary. April 2020 (updated August 2020). To appear in SIAM J. Sci. Comp.
[6] Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision. M. Croci and M. B. Giles. October 2020.
[7] Deep Learning with Limited Numerical Precision. S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Int. Conf. Machin. Learn., 2015.
[8] Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. M. Hopkins, M. Mikaitis, D. R. Lester, and S. B. Furber. Phil. Tran. A. Roy. Soc., January, 2020.
[9] Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. M. Fasi and M. Mikaitis. April 2020 (updated October 2020).
[10] Handbook of Floating-Point Arithmetic. J.-M. Muller, N. Brunie, F. de Dinechin, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. 2nd ed. Birkhäuser Basel, 2018.
[11] Simulating Low Precision Floating-Point Arithmetic. N. J. Higham and S. Pranesh. SIAM J. Sci. Comp., October 2019.
[12] CPFloat: A C library for emulating low-precision arithmetic. M. Fasi and M. Mikaitis. October 2020.

Related articles

What Is Stochastic Rounding? by Nicholas J. Higham
What Is IEEE Standard Arithmetic? by Nicholas J. Higham
What Is Floating-Point Arithmetic? by Nicholas J. Higham
What Is Rounding? by Nicholas J. Higham

Numerical Behaviour of Tensor Cores

For many years, the arithmetic operations available on most hardware were +, -, *, /, \sqrt{}. More recently (\sim 20 years ago) the FMA (fused multiply-add) operation also became prevalent in general purpose devices such as CPUs and GPUs. Software builds on top of these operations, for example compilers use a series of these and other hardware instructions to implement more complex functions and algorithms such as the exponential function, matrix multiply, or the algorithms for pseudo random number generation. These arithmetic operations have been standardized by the IEEE 754 floating-point standard since 1985 and most current systems are compliant with it.

Recently, because of the increasing adoption of machine learning, general purpose devices started to include inner product or matrix multiply-accumulate (MMA) operations in hardware. This is a generalization of a scalar FMA to vectors and matrices. Since it is performed in hardware, the expected speedup is achieved due to parallelism — instead of using a few FMA hardware units sequentially to multiply matrices as is the case in software, all of the elements of a matrix of some size are computed in parallel. Using the inner product and matrix multiply-accumulate operations in hardware, compute-bound applications that have high-intensity usage of them are sped up significantly.

Figure 1: Mixed-precision matrix muliply-accumulate operation on 4×4 matrices performed by the tensor cores.

NVIDIA GPUs are widely used for machine learning and other applications. In the latest TOP500 supercomputer list published in June 2020 [1], 112 computers are equipped with NVIDIA graphics cards. The main feature of the recent NVIDIA GPUs is hundreds of arithmetic units for performing the MMA operation — NVIDIA calls these tensor cores. Various applications outside machine learning are being explored in order to utilize very high arithmetic throughput that can be achieved when using MMAs in hardware [2]. Table 1 lists three recent NVIDIA architectures as well as other hardware devices with an MMA operation on chip. The NVIDIA V100 has the first version and the T4 has the second version of tensor cores; in these devices tensor cores work on matrices of size 4 \times 4 in mixed-precision of fp16 and fp32, as shown in Figure 1. The most recent revision of tensor cores in the A100 updated both the precisions available and the dimensions of input/output matrices. See the NVIDIA V100 and NVIDIA A100 whitepapers for more details.

Table 1: Processing units or architectures equipped with mixed-precision matrix multiply-accumulate accelerators. The notation m \times k \times n refers to the matrix multiply-accumulate operation where C and D are m \times n matrices, and A and B have size m \times k and k \times n, respectively.

The MMA operation is not standardized by the IEEE 754, therefore various numerical features of tensor cores are not clear. Knowing such features as the support for subnormal numbers, order of operations, and rounding modes and normalization of significands in various parts of the MMA computation can be important, for example when doing error analysis of algorithms that utilize MMAs in order to derive error bounds [3]. As the application space utilizing tensor cores is growing beyond machine learning, understanding of the numerical behaviour of tensor cores will become increasingly useful. Even before such hardware is put to use in some applications, one might want to simulate the behaviour of tensor cores in order to develop numerical codes targeted to tensor cores, on a conventional hardware. Furthermore, there is also a question of differences in numerical behaviour between tensor cores and software MMA using the standard FMA hardware calls, for example differences that would appear when transitioning the existent software to use tensor cores.

Motivated by this, in our recent preprint [4] we investigate an experimental testing methodology of the tensor cores. The method consists of the following steps.

  1. Identify a numerical feature that needs to be tested, for example, the rounding mode in the 5-operand adders that accumulate products in MMA.
  2. Identify possible implementations, for example round to nearest or round toward zero.
  3. Find floating-point inputs that would result in different outputs for each possible hardware behaviour in 2.
  4. Observe outputs and make a conclusion based on expected outputs for each possible behaviour in 2.

Using this approach, we identified the following list of numerical features of the NVIDIA tensor cores.

  • Subnormal floating-point numbers are fully supported, both on the inputs and outputs.
  • 5-operand adders accumulate 5 addends (4 products from AB and a value from C) starting from the largest in magnitude.
  • Round toward zero, rather than round to nearest (default in IEEE 754-compliant arithmetic), in the 5-operand adders is implemented.
  • Different normalization behaviour from the MMA implemented in software (tensor cores normalize the answer of the whole 5-element dot product at the end rather than after each addition of products).
  • Inner products without intermediate normalization are shown to be non-monotonic in rare cases (this result is more general than tensor cores, since to the best of our knowledge, most hardware implementations do not normalize on every addition due to lower hardware costs).

Our conclusion is that in the current version, the tensor cores on V100 and T4 (the A100 is not yet available to us) do not replicate the behaviour of the MMA implemented with IEEE 754 compliant FMA hardware operations. These numerical behaviours are expected in a hardware MMA optimized for reducing hardware cost and most likely is motivated by a fact that machine learning applications usually are claimed to not need all the IEEE 754 features and high precision in general. These results provide the parameters that can be used in rounding error analysis of tensor cores [3] which can be useful when developing numerical software.

Our CUDA code to test numerical features of tensor cores is available here.

References

[1] https://www.top500.org/lists/top500/2020/06/.

[2] A. Abdelfattah et al. A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic. July 2020. Published online.

[3] P. Blanchard, N. J. Higham , F. Lopez, T. Mary, and S. Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput. May 2020.

[4] M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. Numerical Behavior of NVIDIA Tensor Cores. July 2020; revised October 2020. MIMS Eprint, published online.

Related articles

Issues with Rounding in the GCC Implementation of the ISO 18037:2008 Standard Fixed-Point Arithmetic

Embedded systems are based on low-power, low-performance processors and can be found in various medical devices, smart watches, various communication devices, cars, planes, mobile phones and many other places. These systems come in a hardware and software package optimized for specific computational tasks and most commonly have real-time constraints. As these systems usually have energy usage and cost constraints too, sophisticated numerical hardware that can process floating-point data is not included, but rather only integer arithmetic, which is simpler in terms of area and power of the processors.

ISO 18037:2008 is a standard for embedded C programming language support. It lays out various rules that C compilers should support to make embedded systems easier to program using a high-level language. One of the most important definitions in this standard is fixed-point arithmetic data types and operations. Support for fixed-point arithmetic is highly desirable, since if it is not provided integers with scaling factors have to be used, which makes code hard to maintain and debug and most commonly requires assembler level changes or completely new implementations for each different platform.

The GCC compiler provides some support of the fixed-point arithmetic defined in this standard for ARM processors. However, in my recent technical report (https://arxiv.org/abs/2001.01496) I demonstrated various numerical pitfalls that programmers of embedded systems based on ARM and using GCC can get into. The issues demonstrated include

  • larger than half machine epsilon errors in rounding decimal constants to fixed-point data types,
  • errors in conversions between different data types,
  • incorrect pre-rounding of arguments of mixed-format arithmetic operations before the operation is performed, and
  • lack of rounding of the outputs of arithmetic operations.

These findings can be used to improve the accuracy of various embedded numerical libraries that might be using this compiler. To demonstrate one of the issues, here is a piece of test code:

The multiplication operation is a mixed-format operation, since it multiplies an unsigned long fract argument with an accum argument, therefore it is subject to prerounding of the unsigned long fract argument as described in the report. Since the comparison step in the if () sees that the argument a is larger than zero and b larger than 1, the code is executed with a hope that c will not be set to zero. However, in the arithmetic operation, a is incorrectly pre-rounded to 0, which causes c = 0*b, an unexpected outcome and a bug that is hard to detect and fix.