Author Archives: Srikara Pranesh

SIAM CSE21 MINISYMPOSIUM ON “Mixed Precision Algorithms for High Performance Scientific Computing”

The biannual SIAM Conference on Computational Science and Engineering (CSE) was conducted virtually between March 1 to 5, 2021. Theo Mary and I organised a two-part minisymposium on recent algorithmic and software advances of mixed precision methods in scientific computing. Below are the links to the slides of the talk.

Minisymposium description: The increasing support of lower precision arithmetics in hardware, such as fp16 and bfloat16, provides new opportunities for high performance scientific computing. However, even though low precision arithmetics can provide significant speed, communication, and energy benefits, their use in scientific computing poses the challenge of preserving the accuracy and stability of the computation. To address this issue, a variety of mixed precision algorithms that combine low and high precisions have emerged.

Simulating Low Precision Floating-Point Arithmetics

In earlier blog posts, I wrote about the benefits of using half precision arithmetic (fp16) and about the problems of overflow and underflow in fp16 and how to avoid them. But how can one experiment with fp16, or other low precision formats such as bfloat16, in order to study how algorithms behave in these arithmetics? (For an accessible introduction to fp16 and bfloat16 see the blog post by Nick Higham.)

As of now, fp16 is supported by several GPUs, but these are specialist devices and they can be very expensive.  Moreover, architectures that support bfloat16 have not yet not been released. Therefore software that simulates these floating-point formats is needed.

In our latest EPrint, Nick Higham and I investigate algorithms for simulating fp16, bfloat16 and other low precision formats. We have also written a MATLAB function chop that can be incorporated into other MATLAB codes to simulate low precision arithmetic.  It can easily be used to study the effect of low precision formats on various algorithms.

Imagine a hypothetical situation where the computer can just represent integers. Then the question is how do we represent numbers like 4/3? An obvious answer would be to represent it via the integer closest to it, 1 in this case. However, one will have to come with a convention to handle the case where the number is in the centre. Now replace the integer in the example with floating-point numbers, and a similar question arises. This process of converting any given number to a floating-point number is called  rounding. If we adopt a rule where we choose the closest floating-point number (as above), then we formally call it as ‘round to nearest’. There are other ways to round as well, and different rounding modes can yield different results for the same code. However meddling with the parameters of a floating-point format without a proper understanding of their consequences can be a recipe for disaster. Cleve Moler in his blog on sub-normal numbers makes this point by warning ‘don’t try this at home’. The MATLAB software we have written provides a safe environment to experiment with the effects of changing any parameter of a floating-point format (such as rounding modes and support of subnormal numbers) on the output of a code. All the technical details can be found in the Eprint and our MATLAB codes.

Scaling a Matrix to Exploit Half Precision Arithmetic

In my previous post, I gave a general introduction to the importance of using half precision (fp16) in the solution of linear systems of equations. In this post I will focus on one specific obstacle in using fp16 for general scientific computing: handling very large numbers and very small numbers.

To clarify the meaning of very large and very small it is helpful to draw an analogy with a ruler, which we use it on daily basis to measure length.

The minimum length which one can measure with the ruler shown in the picture (Image credits splashmath)  is 1 mm (millimeter), and this is referred to as the least count of this measuring instrument. The maximum length that can be measured is 10.5 cm (centimeters).  If  we have a pencil whose length falls between 5.6 cm and 5.7 cm, then we decide if it is 5.6 or 5.7 based on to which it is closer to. A similar process also happens in a computer!

Drawing parallels with the example above, we use a similar ruler to measure in a computer, but we measure numbers rather than lengths, and this ruler is called a “floating point system”. Just like a ruler there is a minimum number which the floating point system can measure, and any number less than that is treated as zero. This process of numbers becoming zero in a floating point system because they are very small is called “underflow”. Next, any number which is too large to be measured by a floating point system is made infinity, and this process is called “overflow”. Finally just like a scale, any number is represented by a number closest to it in the floating point format, and this process is called as ‘rounding’.  For a detailed rigorous and accessible introduction to floating point arithmetic, I would refer interested readers to this blog post by Cleve Moler.

There are four standard ways of measuring numbers, and they are called half precision, single precision, double precision, and quadruple precision. They are in the increasing order of maximum number they can represent, and decreasing order of the minimum number they can represent. For the sake of concreteness, lets consider half precision and double precision. The maximum number that can be represented by double precision is 1.80 × 10308, and for half precision is 65500. The maximum number which can be represented in double precision is enough for most of the problems arising in scientific computing. On the other hand 65500 is extremely small! For example, the modulus of elasticity of many metals is in the order of 109, and this is infinity in fp16! Similarly the minimum positive (and normalized) numbers are 1.18 × 10-38 for double precision, and 6.10 x 10-5 for fp16. This limitation in the range of numbers poses a serious limitation for using fp16 in scientific computing.

Summarising, we have a dichotomy between the computational efficiency and the limitation in representing large and small numbers in fp16. To address this, Prof. Nick Higham and I have developed an algorithm that  squeezes a matrix into the range of numbers that can be represented by fp16. We use the well known technique of diagonal scaling to scale all the matrix entries between -1 and +1, and next we multiply the matrix by θ x 65500 to make complete use of the range of numbers in fp16. θ < 1, and is used to avoid overflow in subsequent computation. The scaling algorithm proposed is not restricted to any particular application, but we concentrate on solution of system of linear equations. We employ GMRES-based iterative refinement, which has generated a lot of interest in the scientific computing community. The main contribution of this scaling algorithm is that it greatly expands the class of problems in which fp16 can be used. All the technical details and results of numerical experiments can be found in the following EPrint of the manuscript.


Half Precision Arithmetic in Numerical Linear Algebra

In whatever shape or form, solution of a linear system of equation is the workhorse for many applications in scientific computing. Therefore our ability to accurately and efficiently solve these linear systems plays a pivotal role in advancing the boundaries of science and technology. Here, efficiency is measured with respect to time—if we have an algorithm to solve a linear system quickly then we can aspire to solve larger and more difficult problems. In this regard the purpose of this post is to describe one of  the latest developments of numerical linear algebra in which our group is actively involved, specifically in the solution of a system of linear equations.

Metaphorically one can think of the algorithm used for the solution of a linear system as an engine in a motor vehicle and the underlying application as the body of the vehicle built around it. Throughout this post I will use this metaphor to explain the course of development and current trends in the algorithms for the solution of a linear system of equations.

In scientific computing there are two components:  developing an algorithm and implementing it on a computer. If we again draw parallels with the engine of a motor vehicle, algorithm developers do the job of designing the various components of the engine and criteria to check if various parts are working as they should, and people implementing on a computer do the job of manufacturing the parts, assembling them, and making sure the engine is working as expected. Finally the most important component for the engine to work is the fuel, and for mathematical algorithms it is numbers. In a computer these numbers are stored in a specific format, and they are called  floating point numbers. 

Until very recently computers were getting faster every year, and computer scientists were devising intelligent ways of using the existing algorithms to solve larger and larger problems in the new computers. One important point to note is that, even though the computers were becoming bigger, the underlying mathematics of the algorithms did not change. Again drawing parallel with the engine analogy, the engine became more powerful and therefore the motor vehicle built around it became bigger, but the basic design of the engine parts and the fuel used remained same. But soon this was about to change!

Traditionally double precision or single precision floating point numbers are used for computation. A double precision number occupies 64 bits of memory and a single precision number occupies 32 bits. Double and single precision numbers carry a lot of informations, but at the same time they create a lot of traffic jam in communication channels! If you think of a communication channel in a computer as a road connecting point A to point B, and since the width of the road is fixed, if we send too many trucks on the road they will cause traffic jam, even though they can carry a lot of goods. Therefore the natural solution is to use smaller vehicles instead of bigger vehicles, but this will drastically reduce the amount goods that one can transport. This exactly was the solution proposed to avoid the jam in communication channels, but at the cost of amount of information that can be transferred. This new floating point format is called half precision where a single half precision number occupies 16 bit of memory. The development of half precision as a floating point format was kick-started by  developments in machine learning, where it was found that, for accurate prediction, machine learning models did not require very accurate representation of the input data. Because of this development in the machine learning community hardware vendors such as NVIDIA and AMD started developing chips that support half precision. The world’s fastest supercomputer, SUMMIT at Oak Ridge National Lab, can perform 3.3 X 1018 operations in one seconds when half precision is used. However when a linear system of equations are solved on the same computer using the High Performance LINPACK benchmark it performs 122.3 X 1015 operations in one second, and note that here double precision floating point format is used. Therefore to capitalise on the recent advances in hardware, we need to exploit half precision in the solution of linear systems of equations.

summit-1Image credits Oak Ridge National Lab Summit Gallery


Returning back to our engine analogy, using half precision is like changing the fuel. As we all are aware, we cannot use diesel in a petrol engine, so the engine has to be redesigned or, in the context of linear systems, the algorithms have to be redesigned. This has precisely been one of the core research efforts in our group. To conclude we are at a very interesting point in the development of algorithms for the solution of linear systems of equations, where the emerging architectures provide interesting opportunities for numerical analysts to rethink old algorithms.

For recent work relevant in the group see

Further work is in progress and will be reported here soon.