## Parallel sparse Linear solvers: Double versus Single precision

It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems. Existing work on the performance of mixed precision algorithms for solving sparse linear systems has shown that a speedup up to 2x can be expected, but these studies are limited to single core experiments. In this EPrint, Nick Higham, Françoise Tisseur, Craig Lucas and I evaluate the benefits of using single precision arithmetic in solving a double precision sparse linear systems on multi-core processors and GPUs, focusing on the key components of LU factorization and matrix–vector products.

We find that single precision sparse LU factorization is prone to a severe loss of performance due to the intrusion of subnormal numbers. This phenomenon of LU factorization generating subnormal numbers does not appear to have been observed before. In fact, the elements at the $(k +1)st$ stage of Gaussian elimination are generated from the formula $a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik} a_{kj}^{(k)}, \quad m_{ik} = a_{ik}^{(k)}/a_{kk}^{(k)}$, where $m_{ik}$ is a multiplier. If $A$ is a dense matrix of normalized floating-point numbers with norm of order 1, it is extremely unlikely that any of the $a_{ij}^{(k)}$ will become subnormal. However, for sparse matrices we can identify a mechanism whereby fill-in cascades down a column and small multipliers combine multiplicatively. This mechanism is described, with an illustration for better understanding, in the EPrint, as well as strategies to automatically flush subnormals to zero, to avoid the performance penalties.

The figure below shows the speedup of single precision sparse LU factorization over double precision using the sparse direct solver MUMPS on 10 Intel Skylake cores. We used 36 sparse matrices from the SuiteSparse Matrix Collection, from which 21 matrices are of medium size with 700, 000 to 5, 000, 000 nonzero elements, and 15 larger matrices with 7,000,000 to 64,000,000 nonzeros. The orange bars represent the speeds achieved when subnormals are not flush to zero (FTZ) during the single precision LU factorization. For approximately 30% of the matrices, the orange bars are below the threshold of 1. This means instead of accelerating the computation, single precision slows it down compared with the double sparse LU decomposition. This has been corrected by flushing subnormals to zero as illustrated by the blue bars. Single precision speedup over double precision for sparse LU factorization using MUMPS on 10 Intel Skylake cores.

As for the overall performance, our experiments show that the anticipated speedup of 2x over a double precision LU factorization is obtained only for the very largest of our test problems. In the figure above, the matrices that exceed the speedup of 1.5x are exclusively of very large size. This result is not atypical, as two other sparse direct solvers considered in our work, PARDISO and SuperLU show a similar result. This contrasts with dense linear systems, where a 2x speedup is often achieved even for matrices of size as small as $200 \times 200$ when single precision is used in place of double precision.

The speedups observed in our parallel experiments are clearly lower than the speedups reported in the existing works based on single core experiments. To understand this contrast, it is important to recall that sparse matrix factorization algorithms have a pre-processing step called reordering and analysis, before the actual numerical factorization. In most of the sparse direct solvers, the reordering and analysis step is serial or has a limited parallelism. Therefore, while the time spent in numerical factorization step decreases with increasing number of cores, the reordering and analysis time stagnates and can becomes a performance bottleneck. In addition, as the reordering and analysis step does not involve floating point computation, lowering the precision, from double to single, has a little impact on the speedup except in the case of very large matrices where the numerical factorization step remains dominant despite the use of multiple cores.

For iterative solvers, our results show that the performance gain in computing or applying incomplete factorization preconditioners in single precision is not appealing enough to justify the accuracy sacrifice, but we have observed a speedup of 1.5x from matrix–vector product kernels by using single precision. In future work, we will explore new approaches to integrate efficiently single precision matrix–vector product kernels and single precision preconditioners in double precision iterative solvers without accuracy loss.