CPFloat: A C library for emulating low-precision arithmetic

Detail of a replica of the mechanical computer Z1 on display at the German Museum of Techonolgy in Berlin. This machine, developed by Konrad Zuse between 1936 and 1938, could perform floating-point computations using a 24-bit representation. “Mechanischer Computer Z1, Nachbildung” by Stiftung Deutsches Technikmuseum Berlin is licensed under CC BY 4.0.

Computing machines with floating-point capabilities started appearing in the late 1940s, and despite being regarded as an optional feature for decades to come, floating-point hardware equipped a number of the mainframe computers produced in the 1960s and 1970s. At the time, there was no consensus on what features a number system ought to have, and the arithmetics implemented by any two vendors could differ widely in terms of word sizes, floating-point representations, rounding behaviour, and accuracy of the operations implemented in hardware. This lack of standardization was a major hindrance to the portability of the source code, as it made it significantly harder to write and maintain architecture-independent code.

The way out of this predicament was the establishment of the IEEE 754-1985 standard, which dictated word sizes, representations, and accuracy requirements for two default floating-point formats: single and double precision. For about three decades, hardware vendors conformed with the standard, and the large majority of general-purpose processing units came equipped with IEEE-compliant arithmetic units. Motivated by the low-precision requirements of machine learning applications, however, in recent years hardware manufacturers have started exploring the realm of low-precision arithmetics, commercialising a variety of highly optimised hardware units that pursue efficiency at the expense of accuracy. The high-performance computing community has attempted to channel the extreme performance of this hardware and utilise it in more traditional scientific computing applications, where more stringent accuracy needs typically require the use of higher precision.

The broad range of different formats available poses a serious challenge to those trying to develop mixed-precision algorithms, since studying the numerical behaviour of an algorithm in different working precisions may require access to a number of high-end hardware platforms, including supercomputer-grade CPUs and GPUs. In order to alleviate the problem, several software packages for simulating low-precision floating-point arithmetics in software have been developed. Some of these are described in the table below. All these solutions execute each arithmetic operation in hardware, and use the software layer only to round the results to the desired number of significant bits of precision.

Available software packages for simulating low-precision floating-pointarithmetic. The first three columns reports the name of the package, the main pro-gramming language in which the software is written, and what storage formats are supported. The following three columns describe the parameters of the target formats: whether the number of bits of precision in the significand is arbitrary (A) or limited to the number of bits in the significand of the storage format (R); whether the exponentrange can be arbitrary (A), must be the same as the storage format (R), or a sub-range thereof (S); whether the target format supports subnormal numbers (S), does not support them (N), supports them only for builtin types (P), or supports them but allows the user to switch the functionality off (F). The following column lists the floating-point formats that are built into the system. The last five columns indicate whetherthe software supports round-to-nearest wih ties-to-even (RNE), ties-to-zero (RNZ),or ties-to-away (RNA), the three directed rounding modes of the IEEE 754 standardround-toward-zero (RZ), round-to-+∞ and round-to-−∞ (RUD), round-to-odd (RO), and the two variants of stochastic rounding discussed in Section 3 (SR). The abbreviations bf16, tf32, fp16, fp32, and fp64 denote the formats bfloat16, TensorFloat-32, binary16, binary32, and binary64, respectively.

If the hardware and the software-simulated floating-point format are both IEEE compliant, rounding requires only standard mathematical library functions, but a straightforward implementation of the rounding formulae may potentially cause two kinds of issues. From a theoretical point of view, handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly when evaluated in finite arithmetic. In terms of performance, on the other hand, the algorithms obtained in this way are not necessarily efficient, as they build upon library functions that, being designed to handle a broad range of cases, might not be optimised for the specific needs of floating-point rounding functions.

In MATLAB, low-precision arithmetics can be simulated using the chop function, developed by Higham and Pranesh and discussed in a previous post. Mantas Mikaitis and I decided to port the functionalities of this MATLAB implementation to a lower level language, in order to obtain better performance. The scope of the project has progressively broadened, and we have recently released the first version of CPFloat, an open-source C library that offers a wide range of routines for rounding arrays of binary32 or binary64 numbers to lower precision. Our code exploits the bit-level representation of the underlying floating-point formats, and performs only low-level bit manipulation and integer arithmetic without relying on costly library calls.

The project is hosted in a GitHub repository where, along with the C code, we provide a MEX interface for MATLAB and Octave which is fully compatible with the chop function. The algorithms that underlie our implementation, as well as the design choices and the testing infrastructure of CPFloat are discussed in detail in a recent EPrint. In the numerical experiments shown there, CPFloat brings a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, our library is currently the most efficient and complete software for experimenting with custom low-precision floating-point arithmetic available in any language.

Related articles


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s