Category Archives: publications

Mixed Precision LU Factorization on GPU Tensor Cores

Modern GPUs are equipped with mixed precision units called tensor cores that offer the capability of computing matrix–matrix products both at very high performance and with high accuracy. GPU tensor cores have been used to accelerate various numerical linear algebra algorithms. Among these, LU factorization is a natural candidate, since it can be written so as to heavily rely on matrix–matrix operations.

Haidar et al. were the first to propose an algorithm to compute the LU factorization of a general dense matrix exploiting GPU tensor cores, and obtained up to a 4x-5x speedup in the solution of various linear systems. Moreover, they also showed that the accuracy boost delivered by tensor cores could significantly improve the convergence of iterative methods preconditioned by the computed LU factors.

In a recent paper, Pierre Blanchard, Nick Higham, Florent Lopez, Srikara Pranesh, and I performed the rounding error analysis of an LU factorization algorithm exploiting what we called block FMAs—a generalization of tensor cores that also includes similar mixed precision units such as Google’s TPUs. We explained how block FMAs can be chained together to accumulate the result of intermediate computations in the higher precision (single precision for tensor cores), which allows for a significant accuracy boost.
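As a rough illustration of why higher precision accumulation matters, the following MATLAB sketch uses binary32 storage with binary64 accumulation as a stand-in for the binary16 storage with binary32 accumulation of tensor cores (the half precision type is not assumed to be available):

rng(1); k = 4096;                        % inner dimension: many chained 4x4x4 MMAs
A = single(randn(4, k)); B = single(randn(k, 4)); C = single(randn(4));
D_ref  = double(C) + double(A)*double(B);                 % reference, binary64
D_low  = double(C + A*B);                                 % accumulation in the storage format
D_high = double(single(double(C) + double(A)*double(B))); % accumulation in the higher precision
fprintf('accumulation in storage precision: error %.2e\n', norm(D_low  - D_ref, 1))
fprintf('accumulation in higher precision:  error %.2e\n', norm(D_high - D_ref, 1))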

However, these algorithms shared a common limitation: the matrix was stored in single, rather than half, precision. Indeed, if the matrix is stored in half precision, the accuracy boost is lost and the computed LU factors are essentially identical to those obtained by a standard LU factorization run entirely in half precision. This represents a lost opportunity, because storing the matrix in half precision not only halves the memory footprint, but also greatly reduces data movement, which brings a significant performance boost.

In a recent preprint, Florent Lopez and I propose a new LU factorization algorithm that is able to store the matrix in half precision without incurring such a significant accuracy loss. There are two main ingredients to this new algorithm. First, we switch from the standard right-looking scheme to a left-looking one, which, together with single precision buffers of controlled size, allows us to perform the update operations accurately. Second, we doubly partition the matrix so as to accurately perform the panel factorizations by exploiting tensor cores.
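The toy MATLAB sketch below illustrates only the first ingredient, the left-looking structure with a higher precision update buffer, with single and double precision standing in for half and single; it is an unblocked, unpivoted version and not the doubly partitioned algorithm of the preprint.

rng(0); n = 300;
A = single(randn(n) + n*eye(n));        % stored in the lower precision; safe without pivoting
L = zeros(n, 'single'); U = zeros(n, 'single');
for j = 1:n
    % left-looking update of column j, accumulated in a higher precision buffer
    U(1:j-1, j) = single(double(L(1:j-1, 1:j-1)) \ double(A(1:j-1, j)));
    buf = double(A(j:n, j)) - double(L(j:n, 1:j-1))*double(U(1:j-1, j));
    % factorization of the current column
    U(j, j)   = single(buf(1));
    L(j:n, j) = single(buf/buf(1));
end
norm(double(L)*double(U) - double(A), 1)/norm(double(A), 1)   % small residual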

In our experiments, we factorize a range of test matrices using an NVIDIA V100 GPU. The two figures below summarize our main results by comparing three LU factorization algorithms: the standard algorithm (of Haidar et al. and Blanchard et al.) with the matrix stored in either single or half precision, and the new algorithm.

The figure on the left reports the backward error obtained by solving a linear system Ax=b for a dense n\times n matrix with random entries. The figure on the right displays the performance in GigaFLOPS. Unlike the standard algorithm, the new one does not incur a significant loss of accuracy when the matrix is stored in half precision. In doing so, it greatly reduces data movement, which leads to a significant speedup: the new algorithm can be up to twice as fast as the standard one.

A Multiprecision Derivative-Free Schur-Parlett Algorithm for Computing Matrix Functions

The need to compute matrix functions arises in many applications in science and engineering. Specialized methods exist for evaluating particular matrix functions, for example, the well-known scaling and squaring algorithm for the matrix exponential, Newton’s method for the matrix sign function, and the inverse scaling and squaring method for the matrix logarithm. For some functions a specialized method is not available, in which case a general purpose algorithm is needed. The Schur-Parlett algorithm computes a general function f of a matrix, with the function dependence restricted to the evaluation of f on the diagonal blocks of the reordered and blocked Schur form. It evaluates f on the nontrivial diagonal blocks via a Taylor series, so it requires the derivatives of f and it also requires the Taylor series to have a sufficiently large radius of convergence. However, the derivatives are not always available or accurately computable.

In our recent preprint, Nick Higham and I present a version of the Schur-Parlett algorithm that requires only function values and uses higher precision arithmetic to evaluate f on the diagonal blocks of order greater than 2 (if there are any) of the reordered and blocked Schur form. Our algorithm is inspired by Davies’s randomized approximate diagonalization method, but we show that the measure of error underlying randomized approximate diagonalization makes it an unreliable numerical method for computing matrix functions. The key idea in our algorithm is to compute, by diagonalization, the function of a small random diagonal perturbation of each triangular block, where the perturbation ensures that the diagonalization will succeed. Multiprecision algorithms have already been developed for the matrix exponential and the matrix logarithm, but those algorithms are tightly coupled to the functions in question, whereas here we place no restrictions on the function. Moreover, the algorithm we propose is not restricted to double precision as the working precision, since its framework is precision independent. We test our algorithm in double precision on a variety of test problems, and the results show that the algorithm generally behaves in a numerically stable fashion. We apply it to the matrix Mittag-Leffler function and show that it yields results of accuracy similar to, and in some cases much greater than, the state of the art algorithm for this function.
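The following MATLAB sketch illustrates the key idea on a single triangular block, with f = exp; the perturbation size is purely illustrative, and the sketch works entirely in double precision, whereas the algorithm performs the diagonalization in a higher precision.

n = 4; T = triu(ones(n));               % a defective triangular block
d = 1e-4*randn(n, 1);                   % small random diagonal perturbation
[V, D] = eig(T + diag(d));              % diagonalization now succeeds
F = V*diag(exp(diag(D)))/V;             % f(T + diag(d)) via the eigendecomposition
norm(F - expm(T), 1)/norm(expm(T), 1)   % relative difference from a reference for f = exp
% In the working precision the ill conditioning of V limits the accuracy,
% which is why the algorithm carries out this diagonalization in a higher precision.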

In the following example we employ the algorithm to compute in double precision the matrix Airy function, and compare the result with that produced by a spectral decomposition method (which breaks down if A is defective, and will be inaccurate if A is close to being defective).
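A sketch of this comparison is given below; the spectral decomposition part is self-contained, while the call to our algorithm is left as a comment because the routine name funm_nd is an assumption and the test matrix is merely illustrative.

rng(1); n = 10; A = randn(n)/sqrt(n);      % an illustrative test matrix
[V, D] = eig(A);
F_spec = V*diag(airy(diag(D)))/V;          % the matrix Airy function Ai(A) by spectral decomposition
% F_alg = funm_nd(A, @airy);               % our derivative-free algorithm (routine name assumed)
% norm(F_alg - F_spec, 1)/norm(F_spec, 1)  % the normwise relative difference reported below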

We observe that the normwise relative difference between the result of our algorithm and that of the spectral decomposition method is tiny. The matrix Airy function is just one of many matrix functions for which no specialized algorithm is available and which the MATLAB funm function, which implements the Schur-Parlett algorithm, is not able to compute. Our new algorithm greatly expands the class of readily computable matrix functions, and its MATLAB code is available here.

Pivoted Matrices with Tunable Condition Number and the HPL-AI Benchmark

The RIKEN Center for Computational Science in Kobe, Japan, home to the supercomputer Fugaku. This machine, scheduled to become fully operational in 2021, heads the June 2020 edition of the TOP500 list and achieved an execution rate of 1.421 \times 10^{18} (mixed-precision) floating-point operations per second (1.421 exaflops) on the HPL-AI Mixed-Precision Benchmark by solving a system of order 13,516,800.

Traditionally, the fastest supercomputers in the world are ranked according to their performance on HPL, a portable implementation of the High-Performance Computing Linpack Benchmark for distributed memory systems. This software package measures the rate of execution, expressed in binary64 floating-point operations per second, at which a computer solves a large dense linear system using LU factorization with row partial pivoting. This metric is an excellent predictor of the performance a machine will achieve on compute-bound tasks executed in binary64 arithmetic, but the test problem is not representative of typical artificial intelligence computations, where lower precision is typically considered satisfactory.

In an endeavour to consider both workloads at once, the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville, developed the HPL-AI Mixed-Precision Benchmark. The reference implementation of the benchmark solves a dense linear system Ax = b of order n to binary64 accuracy by complementing a low-precision direct solver with a high-precision refinement strategy. The code generates A and b in binary64, computes the LU factorization without pivoting of A using only binary32 arithmetic, finds an approximate binary32 solution \widetilde x using forward and back substitution, and finally solves the linear system with GMRES in binary64 arithmetic, using \widetilde x as the starting vector and the low-precision LU factors of A as preconditioners.
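The following MATLAB sketch mimics this solution scheme (it is not the reference implementation): binary32 stands in for the low precision, the matrix is made diagonally dominant so that factorization without pivoting is safe, and the binary32 LU factors precondition GMRES run in binary64.

rng(0); n = 1000;
A = rand(n) - 0.5;  A = A + diag(sum(abs(A), 2));   % diagonally dominant by rows
b = rand(n, 1) - 0.5;
F = single(A);                                      % LU without pivoting in binary32
for k = 1:n-1
    F(k+1:n, k)     = F(k+1:n, k)/F(k, k);
    F(k+1:n, k+1:n) = F(k+1:n, k+1:n) - F(k+1:n, k)*F(k, k+1:n);
end
L = tril(double(F), -1) + eye(n);  U = triu(double(F));
x0 = U\(L\double(b));                               % low-accuracy initial solution
[x, flag, relres, iter] = gmres(A, b, [], 1e-12, 50, L, U, x0);
fprintf('GMRES: %d iterations, relative residual %.2e\n', iter(2), relres)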

In order to obtain the best possible rate of execution, the coefficient matrix A must be dense and, as the benchmark is expected to be needed for n > 10^7, it should also be easy to generate at scale on a distributed memory environment. Furthermore, we require that A have a growth factor of order 1, so as to guarantee the stability of Gaussian elimination without pivoting, and that it not be too ill conditioned, in order to ensure the convergence of GMRES. Currently, the reference implementation relies on a randomly generated matrix that is diagonally dominant by rows. This choice ensures the stability of LU factorization without pivoting, but generating the matrix requires an amount of communication that depends on n, and the matrix turns out to be extremely well conditioned, with an \infty-norm condition number just above 4 for large n.

In our EPrint Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization, Nick Higham and I present a new parametric class of dense square matrices A(\alpha,\beta) for which LU factorization without pivoting is stable. The parameters \alpha and \beta that yield a specified \infty-norm condition number can be computed efficiently, and once they have been chosen the matrix A(\alpha,\beta) can be formed from an explicit formula in O(n^2) floating-point operations.

We also discuss several adaptations aimed at further improving the suitability of the generated matrix as a test problem for the HPL-AI benchmark, as our main goal is to provide a robust and efficient matrix generator for it. In particular, we explain how scaling the matrix can help transition from binary32 to binary16, in order to take advantage of the hardware accelerators for low precision such as the tensor cores that equip recent NVIDIA GPUs, and we discuss how the matrices can be tweaked to make the entries in the LU factors less predictable and slow down the convergence of GMRES.

Numerical Behaviour of Tensor Cores

For many years, the arithmetic operations available on most hardware were +, -, *, /, \sqrt{}. More recently (\sim 20 years ago) the FMA (fused multiply-add) operation also became prevalent in general purpose devices such as CPUs and GPUs. Software builds on top of these operations: for example, compilers use a series of these and other hardware instructions to implement more complex functions and algorithms such as the exponential function, matrix multiplication, and pseudorandom number generators. These arithmetic operations have been standardized by the IEEE 754 floating-point standard since 1985 and most current systems are compliant with it.

Recently, because of the increasing adoption of machine learning, general purpose devices have started to include inner product or matrix multiply-accumulate (MMA) operations in hardware. These are generalizations of the scalar FMA to vectors and matrices. Because the operation is performed in hardware, the speedup comes from parallelism: instead of using a few FMA hardware units sequentially to multiply matrices, as is done in software, all of the elements of a fixed-size matrix product are computed in parallel. Compute-bound applications that make intensive use of inner products and matrix multiply-accumulates are therefore sped up significantly by these hardware units.

Figure 1: Mixed-precision matrix multiply-accumulate operation on 4×4 matrices performed by the tensor cores.

NVIDIA GPUs are widely used for machine learning and other applications. In the latest TOP500 supercomputer list published in June 2020 [1], 112 computers are equipped with NVIDIA graphics cards. The main feature of the recent NVIDIA GPUs is hundreds of arithmetic units for performing the MMA operation; NVIDIA calls these units tensor cores. Various applications outside machine learning are being explored in order to utilize the very high arithmetic throughput that can be achieved when using MMAs in hardware [2]. Table 1 lists three recent NVIDIA architectures as well as other hardware devices with an MMA operation on chip. The NVIDIA V100 has the first version and the T4 the second version of tensor cores; in these devices tensor cores work on matrices of size 4 \times 4 in a mix of fp16 and fp32 arithmetic, as shown in Figure 1. The most recent revision of tensor cores, in the A100, updated both the available precisions and the dimensions of the input/output matrices. See the NVIDIA V100 and NVIDIA A100 whitepapers for more details.

Table 1: Processing units or architectures equipped with mixed-precision matrix multiply-accumulate accelerators. The notation m \times k \times n refers to the matrix multiply-accumulate operation where C and D are m \times n matrices, and A and B have size m \times k and k \times n, respectively.

The MMA operation is not standardized by IEEE 754, and therefore various numerical features of tensor cores are not specified. Knowing such features as the support for subnormal numbers, the order of operations, and the rounding modes and normalization of significands in various parts of the MMA computation can be important, for example when performing the error analysis of algorithms that use MMAs in order to derive error bounds [3]. As the application space of tensor cores grows beyond machine learning, an understanding of their numerical behaviour will become increasingly useful. Even before such hardware is put to use in an application, one might want to simulate the behaviour of tensor cores on conventional hardware in order to develop numerical codes targeted at them. Furthermore, there is also the question of the differences in numerical behaviour between tensor cores and a software MMA built from standard FMA hardware operations, for example the differences that would appear when transitioning existing software to tensor cores.

Motivated by this, in our recent preprint [4] we develop an experimental methodology for testing these numerical features of tensor cores. The method consists of the following steps; a small sketch of steps 3 and 4 follows the list.

  1. Identify a numerical feature that needs to be tested, for example, the rounding mode in the 5-operand adders that accumulate products in MMA.
  2. Identify possible implementations, for example round to nearest or round toward zero.
  3. Find floating-point inputs that would result in different outputs for each possible hardware behaviour in 2.
  4. Observe outputs and make a conclusion based on expected outputs for each possible behaviour in 2.
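As an illustration of steps 3 and 4, the following MATLAB sketch constructs inputs, exactly representable in fp16, whose accumulation in binary32 gives different answers under round to nearest and round toward zero; comparing the entry of D returned by the device (obtained with the CUDA test code linked below) against the two candidate values reveals which mode is implemented.

a = [1, 1.5*2^-12, 0, 0];       % one row of the fp16 matrix A
b = [1; 2^-12; 0; 0];           % one column of the fp16 matrix B
c = 0;                          % the corresponding entry of C
exact = c + a*b;                % 1 + 1.5*2^-24, computed exactly in binary64
rn = single(exact);             % round to nearest gives 1 + 2^-23
rz = exact - rem(exact, 2^-23); % round toward zero gives 1
fprintf('RN: %.9g   RZ: %.9g\n', rn, rz)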

Using this approach, we identified the following list of numerical features of the NVIDIA tensor cores.

  • Subnormal floating-point numbers are fully supported, both on the inputs and outputs.
  • 5-operand adders accumulate 5 addends (4 products from AB and a value from C) starting from the largest in magnitude.
  • The 5-operand adders implement round toward zero, rather than round to nearest (the default in IEEE 754-compliant arithmetic).
  • The normalization behaviour differs from that of an MMA implemented in software: tensor cores normalize the result of the whole 5-element dot product at the end, rather than after each addition of a product.
  • Inner products without intermediate normalization are shown to be non-monotonic in rare cases (this result applies more generally than to tensor cores since, to the best of our knowledge, most hardware implementations do not normalize on every addition, in order to reduce hardware cost).

Our conclusion is that, in their current version, the tensor cores on the V100 and T4 (the A100 is not yet available to us) do not replicate the behaviour of an MMA implemented with IEEE 754 compliant FMA hardware operations. These behaviours are to be expected in a hardware MMA optimized to reduce hardware cost, and are most likely motivated by the fact that machine learning applications are usually claimed not to need all of the IEEE 754 features, or high precision in general. These results provide the parameters that can be used in the rounding error analysis of tensor cores [3], which can be useful when developing numerical software.

Our CUDA code to test numerical features of tensor cores is available here.

References

[1] https://www.top500.org/lists/top500/2020/06/.

[2] A. Abdelfattah et al. A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic. July 2020. Published online.

[3] P. Blanchard, N. J. Higham, F. Lopez, T. Mary, and S. Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput., May 2020.

[4] M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. Numerical Behavior of NVIDIA Tensor Cores. July 2020. MIMS Eprint, published online.


Stochastic Rounding Has Unconditional Probabilistic Error Bounds

In the IEEE standard 754 for binary floating-point arithmetic, four rounding modes are defined.

  • Round to nearest.
  • Round towards 0.
  • Round towards +\infty.
  • Round towards -\infty.

Recently stochastic rounding, an old idea that dates back to the beginning of the digital computer era, has gained popularity, most notably in deep learning. While the rounding modes defined in the IEEE standard are deterministic, stochastic rounding is inherently random. We can define two modes of stochastic rounding. Consider the figure below, where we have a real number x and adjacent floating-point numbers a and b. In what we call mode 1 stochastic rounding, we round x to either a or b with equal probability. In mode 2 stochastic rounding, we round to a (or b) with probability proportional to 1 minus the distance of x from a (or b). So in the example shown, for mode 2 we are more likely to round to b than to a.
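Mode 2 is straightforward to simulate; in the MATLAB sketch below the rounding is to the integer grid, which stands in for a floating-point precision.

sr = @(x) floor(x) + (rand(size(x)) < x - floor(x));   % mode 2 stochastic rounding to integers
x = 2.3;
mean(arrayfun(@(k) sr(x), 1:1e5))                      % approximately 2.3: unbiased in expectation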

In our recent EPrint Stochastic Rounding and its Probabilistic Backward Error Analysis, Nick Higham, Theo Mary and I generalized previous probabilistic error analysis results [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835] to stochastic rounding.

We show that stochastic rounding (specifically mode 2) has the property of producing rounding errors \delta_k that are random variables satisfying

  • mean zero: \mathbb{E}(\delta_k) = 0,
  • mean independent: \mathbb{E}(\delta_k \mid \delta_{k-1}, \dots, \delta_1) =   \mathbb{E}(\delta_k).

Here, \mathbb{E} denotes the expectation. A key consequence of these properties is that the worst-case error bound proportional to \gamma_n = nu + O(u^2), ubiquitous in backward error analyses, can be replaced by a more informative probabilistic bound proportional to \widetilde{\gamma}_n = \sqrt{n}u + O(u^2). What is only a rule of thumb for round to nearest becomes a rule for stochastic rounding: the probabilistic bound holds unconditionally, because the rounding errors are proved to satisfy the two properties above.

In the figure below, we compute the inner product s = x^Ty of vectors sampled uniformly from [0,1] in both round to nearest and stochastic rounding in IEEE half precision arithmetic. Shown is the backward error for each value of n and the bounds \gamma_n and \widetilde{\gamma}_n.

For increasing problem size n, with round to nearest the error no longer satisfies the \sqrt{n}u bound. This is due to the phenomenon we call stagnation, and low precisions are particularly susceptible to it. As we compute recursively s_i = s_{i-1} + x_iy_i, eventually the partial sum s_{i-1} can grow so large that the update x_iy_i is less than half of the spacing of floating-point numbers around s_{i-1} and under round to nearest the partial sum does not increase. This means we produce negative rounding errors, which are of course not mean zero. Stochastic rounding avoids this issue by rounding randomly. In fact we can prove that \mathbb{E}(\widehat{s}) = s, that is, the expected value of the computed result under stochastic rounding is the exact result.
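Stagnation, and its absence under stochastic rounding, are easy to reproduce with a toy model in which every addition is rounded to the integer grid (again a stand-in for a low precision floating-point format):

rng(1); n = 1e4; x = 0.4*rand(n, 1);      % every increment is below half the grid spacing
s_rn = 0; s_sr = 0;
for i = 1:n
    s_rn = round(s_rn + x(i));                    % round to nearest: stagnates at 0
    t = s_sr + x(i);
    s_sr = floor(t) + (rand < t - floor(t));      % mode 2 stochastic rounding
end
fprintf('exact %.1f, round to nearest %.1f, stochastic %.1f\n', sum(x), s_rn, s_sr)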

Extreme-Scale Test Matrices with Specified 2-Norm Condition Number

Row H of the compute racks of Summit, the most powerful supercomputer in the TOP500 list since June 2018.

The supercomputers in the TOP500 list are ranked using the High-Performance Linpack (HPL) Benchmark, which gauges the performance of a computer by solving a large dense linear system by Gaussian elimination with partial pivoting. The size of the coefficient matrix depends on the computational power of the machine being assessed, because more powerful systems require larger benchmark problems in order to reach their peak performance.

The test matrices used by HPL have random entries uniformly distributed on the interval (-1/2,1/2]. The 2-norm condition number of such matrices depends on their size, and can potentially be very large for the matrices required by today’s most powerful computers: the largest linear systems solved on Summit, the machine that leads the November 2019 TOP500 ranking, have order 10^7, and even larger systems will be needed to benchmark the coming generations of exascale systems.

An n \times n matrix with specified 2-norm condition number can be generated as A := U \varSigma V^T, where U and V are random real orthogonal n \times n matrices from the Haar distribution (a natural uniform probability distribution on the orthogonal matrices) and \varSigma is diagonal with nonnegative entries \sigma_1\ge \cdots \ge \sigma_n \ge 0. It is well known that A has 2-norm condition number \kappa_2(A) = \sigma_1/\sigma_n if \sigma_n \neq 0 and \kappa_2(A) = \infty otherwise. This technique, which is used by the gallery('randsvd', ...) function, requires 2n^3 floating-point operations to generate a test matrix of order n, and would thus be impractical in an extreme-scale setting.
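In MATLAB the classical construction can be sketched as follows; the QR factorization of a matrix with independent standard normal entries, with a sign correction on R, yields a Haar distributed orthogonal factor.

n = 500; kappa = 1e6;
sigma = logspace(0, -log10(kappa), n);                % singular values from 1 down to 1/kappa
[U, Ru] = qr(randn(n)); U = U*diag(sign(diag(Ru)));   % Haar distributed orthogonal U
[V, Rv] = qr(randn(n)); V = V*diag(sign(diag(Rv)));   % Haar distributed orthogonal V
A = U*diag(sigma)*V';
cond(A)                                               % approximately kappa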

In our recent EPrint Generating extreme-scale matrices with specified singular values or condition numbers, Nick Higham and I present four methods that, by giving up the requirement that the matrices U and V be from the Haar distribution, reduce the cost of the approach above from cubic to quadratic. The matrices generated retain a number of desirable properties that make them a suitable choice for testing linear solvers at scale.

These cheaper algorithms are particularly well suited to distributed-memory environments, since all communication between the nodes can be avoided at the price of a negligible increase in the overall computational cost. They are appealing to MATLAB users too, as the following example demonstrates.

n = 10000; kappa = 1e6; mode = 2; rng(1)
% gallery('randsvd',...)
fprintf('gallery(''randsvd'',...):                %5.2f seconds elapsed.\n',...
timeit(@()gallery('randsvd',n,kappa,mode,[],[],1)));
% Algorithm 3.1 in our EPrint.
method = 1; matrix = 0;
fprintf('Alg. 3.1 with Haar U:                  %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 3.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
% Algorithm 4.1 in our EPrint.
method = 3; matrix = 0;
fprintf('Alg. 4.1 with Haar U:                  %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));
matrix = 2;
fprintf('Alg. 4.1 with U=gallery(''orthog'',n,2): %5.2f seconds elapsed.\n',...
        timeit(@()randsvdfast(n,kappa,mode,method,matrix)));

In the listing above, randsvdfast is an implementation of our algorithms available on the MATLAB Central File Exchange. Setting the argument matrix to 0 tells our function to pick U from the Haar distribution, whereas setting it to 2 causes U to be chosen in a non-random way. Running this code in MATLAB R2019b on a machine equipped with an Intel Core i5-6500 processor running at 3.20GHz, the script produces the output

gallery('randsvd',...):                79.52 seconds elapsed.
Alg. 3.1 with Haar U:                  19.70 seconds elapsed.
Alg. 3.1 with U=gallery('orthog',n,2):  1.90 seconds elapsed.
Alg. 4.1 with Haar U:                  19.28 seconds elapsed.
Alg. 4.1 with U=gallery('orthog',n,2):  1.43 seconds elapsed.

Therefore, for n = 10{,}000 randsvdfast is up to 56 times faster than gallery('randsvd',...).

Sharper Probabilistic Backward Error Bounds

Most backward error bounds for numerical linear algebra algorithms are of the form nu, for a machine precision u and a problem size n. The dependence on n of these bounds is known to be pessimistic: our recent probabilistic analysis with Nick Higham [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], which assumes rounding errors to be independent random variables of mean zero, proves that n can be replaced by a small multiple of \sqrt{n} with high probability. However, even these smaller bounds can still be pessimistic, as the figure below illustrates.

The figure plots the backward error for summation (in single precision) of n floating-point numbers randomly sampled from a uniform distribution. For numbers in the [0,1] distribution, the bound \sqrt{n}u is almost sharp and accurately predicts the error growth. However, for the [-1,1] distribution, the error is much smaller, seemingly not growing with n. This strong dependence of the backward error on the data cannot be explained by the existing bounds, which do not depend on the values of the data.
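The effect is easy to reproduce with a small MATLAB experiment using recursive summation in binary32:

rng(0); n = 1e5; u = 2^-24;                        % unit roundoff of binary32
for kind = 1:2
    if kind == 1, x = rand(n, 1); else, x = 2*rand(n, 1) - 1; end
    s = sum(x);                                    % reference computed in binary64
    xs = single(x); shat = single(0);
    for i = 1:n, shat = shat + xs(i); end          % recursive summation in binary32
    err = abs(double(shat) - s)/sum(abs(x));       % backward error
    fprintf('data %d: backward error %.2e, sqrt(n)*u = %.2e\n', kind, err, sqrt(n)*u)
end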

In our recent preprint, we perform a new probabilistic analysis that combines a probabilistic model of the rounding errors with a second probabilistic model of the data. Our analysis reveals a strong dependence of the backward error on the mean of the data \mu: indeed, our new backward error bounds are proportional to \mu\sqrt{n}u + u. Therefore, for data with small or zero mean, these new bounds are much sharper as they bound the backward error by a small multiple of the machine precision independent of the problem size n.

Motivated by this observation, we also propose new algorithms that transform the data to have zero mean, so as to benefit from these more favorable bounds. We implement this idea for matrix multiplication and show that our new algorithm can produce significantly more accurate results than standard matrix multiplication.
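The following MATLAB sketch illustrates the zero-mean idea for C = AB (it is not the exact algorithm of the preprint): shift A and B to zero mean, form the shifted product in binary32, and add back the rank-one corrections, which are cheap and are accumulated in binary64.

rng(0); n = 500;
A = rand(n); B = rand(n);                        % data with nonzero mean
muA = mean(A(:)); muB = mean(B(:));
As = A - muA; Bs = B - muB;                      % zero-mean data
C1 = double(single(A)*single(B));                % standard product in binary32
C2 = double(single(As)*single(Bs)) ...           % shifted product in binary32
   + muA*repmat(sum(Bs, 1), n, 1) ...            % rank-one corrections in binary64
   + muB*repmat(sum(As, 2), 1, n) ...
   + n*muA*muB;
C = A*B;                                         % reference in binary64
fprintf('standard: %.2e   zero-mean: %.2e\n', norm(C1 - C, 1)/norm(C, 1), norm(C2 - C, 1)/norm(C, 1))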

Issues with Rounding in the GCC Implementation of the ISO 18037:2008 Standard Fixed-Point Arithmetic

Embedded systems are based on low-power, low-performance processors and can be found in medical devices, smart watches, communication devices, cars, planes, mobile phones and many other places. These systems come in a hardware and software package optimized for specific computational tasks and most commonly have real-time constraints. As these systems usually have energy usage and cost constraints too, sophisticated numerical hardware that can process floating-point data is not included; only integer arithmetic is provided, which is simpler in terms of processor area and power.

ISO 18037:2008 is a standard for embedded C programming language support. It lays out various rules that C compilers should follow to make embedded systems easier to program in a high-level language. One of the most important parts of the standard is its definition of fixed-point data types and arithmetic operations. Support for fixed-point arithmetic is highly desirable: without it, integers with scaling factors have to be used, which makes code hard to maintain and debug and most commonly requires assembler-level changes, or completely new implementations, for each different platform.

The GCC compiler provides some support for the fixed-point arithmetic defined in this standard on ARM processors. However, in my recent technical report (https://arxiv.org/abs/2001.01496) I demonstrated various numerical pitfalls that programmers of embedded systems based on ARM and using GCC can run into. The issues demonstrated include

  • larger than half machine epsilon errors in rounding decimal constants to fixed-point data types,
  • errors in conversions between different data types,
  • incorrect pre-rounding of arguments of mixed-format arithmetic operations before the operation is performed, and
  • lack of rounding of the outputs of arithmetic operations.

These findings can be used to improve the accuracy of various embedded numerical libraries that might be using this compiler. To demonstrate one of the issues, here is a piece of test code:
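The snippet below is a reconstruction of the kind of test described (not the report's exact listing), with illustrative constants; it assumes GCC's fixed-point extensions on an ARM target.

#include <stdio.h>

int main(void)
{
    unsigned long fract a = 0.0000001ulr;  /* small but nonzero: the resolution is 2^-32 */
    accum b = 2.5k;
    accum c = 0.0k;

    if (a > 0 && b > 1) {
        /* mixed-format multiplication of an unsigned long fract by an accum; */
        /* as described below, a is pre-rounded to 0 before the operation     */
        c = a * b;
    }
    printf("c = %f\n", (double)c);
    return 0;
}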

The multiplication operation is a mixed-format operation, since it multiplies an unsigned long fract argument by an accum argument, and it is therefore subject to pre-rounding of the unsigned long fract argument, as described in the report. Since the comparison in the if () sees that the argument a is larger than zero and b is larger than 1, the code is executed with the hope that c will not be set to zero. However, in the arithmetic operation a is incorrectly pre-rounded to 0, which causes c = 0*b, an unexpected outcome and a bug that is hard to detect and fix.

Solving Block Low-Rank Linear Systems by LU Factorization is Numerically Stable

A BLR matrix associated with the root separator of a 128^3 Poisson problem. The darker the blocks, the higher their numerical rank.

In many applications requiring the solution of a linear system Ax=b, the matrix A possesses a block low-rank (BLR) property: most of its off-diagonal blocks are of low numerical rank and can therefore be well approximated by low-rank matrices. This property arises for example in the solution of discretized partial differential equations, because the numerical rank of a given off-diagonal block is closely related to the distance between the two subdomains associated with this block. BLR matrices have been exploited to significantly reduce the cost of solving Ax=b, in particular in the context of sparse direct solvers such as MUMPS, PaStiX, and STRUMPACK.
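The compression step is easy to sketch in MATLAB (this is only blockwise compression, not a BLR LU factorization); the test matrix, the inverse of a tridiagonal Laplacian, has off-diagonal blocks of very low rank.

n = 256; b = 32; epsilon = 1e-8;
A = inv(full(gallery('tridiag', n)));            % off-diagonal blocks have low rank
Ablr = A;
for i = 1:b:n
    for j = 1:b:n
        if i == j, continue, end                 % keep diagonal blocks dense
        blk = A(i:i+b-1, j:j+b-1);
        [U, S, V] = svd(blk);
        k = sum(diag(S) > epsilon*S(1, 1));      % numerical rank at the threshold epsilon
        Ablr(i:i+b-1, j:j+b-1) = U(:, 1:k)*S(1:k, 1:k)*V(:, 1:k)';
    end
end
norm(A - Ablr, 'fro')/norm(A, 'fro')             % compression error, controlled by epsilon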

However, the impact of these low-rank approximations on the numerical stability of the solution in floating-point arithmetic has not been previously analyzed. The difficulty of such an analysis lies in the fact that, unlike for classical algorithms without low-rank approximations, there are two kinds of errors to analyze: floating-point errors (which depend on the unit roundoff u), and low-rank truncation errors (which depend on the low-rank threshold \varepsilon, the parameter controlling the accuracy of the blockwise low-rank approximations). Moreover, the two kinds of error cannot easily be isolated in the analysis, because the BLR compression and factorization stages are often interlaced.

In our recent preprint with Nick Higham, we present a rounding error analysis of the solution of linear systems by LU factorization of BLR matrices. Assuming that a stable pivoting scheme is used, we prove backward stability: the relative backward error is bounded by a modest constant times \varepsilon, and does not depend on the unit roundoff u as long as u is safely smaller than \varepsilon. This is a very desirable theoretical guarantee that can now be given to users, who can therefore control the numerical behavior of BLR solvers simply by setting \varepsilon to the target accuracy. We illustrate this key result in the figure below, which shows a strong correlation between the backward error and the \varepsilon threshold for 26 matrices from the SuiteSparse collection coming from a wide range of real-life applications.

Determinants of Bohemian Matrix Families

Density plots of eigenvalues of matrices from various Bohemian families. Collection of images by Steven Thornton. To find out more about these figures, or to create your own, check out the gallery at the Bohemian Matrices website.

The adjective “Bohemian” was used for the first time in a linear algebra context by Robert Corless and Steven Thornton to describe the eigenvalues of matrices whose entries are taken from a finite discrete set, usually of integers. The term is a partial acronym for “BOunded Height Matrix of Integers”, but the origin of the term was soon forgotten, and the expression “Bohemian matrix” is now widely accepted.

As Olga Taussky observed as early as 1960, the study of matrices with integer elements is “very vast and very old”, with early work of Sylvester and Hadamard dating back to the second half of the nineteenth century. These names are the first two in a long list of mathematicians who worked on what is now known as the “Hadamard conjecture”: for any positive integer n that is a multiple of 4, there exists an n by n matrix H, with entries -1 and +1, such that HH^T = nI.
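For n = 4 the property is easily checked in MATLAB:

H = hadamard(4)   % a 4 x 4 matrix with entries +1 and -1
H*H'              % equals 4*eye(4), as required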

While this is the best-known open problem surrounding Bohemian matrices, it is far from being the only one. During the three-day workshop “Bohemian Matrices and Applications” that our group hosted in June last year, Steven Thornton released the Characteristic Polynomial Database, which collects the determinants and characteristic polynomials of billions of samples from certain families of structured as well as unstructured Bohemian matrices. All the available data led Steven to formulate a number of conjectures regarding the determinants of several families of Bohemian upper Hessenberg matrices.

Gian Maria Negri Porzio and I attended the workshop, and set ourselves the task of solving at least one of these open problems. In our recent preprint, we enumerate all the possible determinants of Bohemian upper Hessenberg matrices with ones on the subdiagonal. We also consider the special case of families with main diagonal fixed to zero, whose determinants turn out to be related to some generalizations of Fibonacci numbers. Many of the conjectures stated in the Characteristic Polynomial Database follow from our results.
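The smallest case can be checked by brute force in MATLAB (this is not the enumeration technique of the preprint): all 3 \times 3 upper Hessenberg matrices with entries in \{-1, 0, 1\} on and above the diagonal and ones on the subdiagonal.

n = 3; vals = [-1 0 1];
pos = find(triu(ones(n)));                 % entries on and above the diagonal
dets = zeros(3^numel(pos), 1);
for k = 0:numel(dets)-1
    t = dec2base(k, 3, numel(pos)) - '0';  % base-3 digits selecting the entries
    A = diag(ones(n-1, 1), -1);            % ones on the subdiagonal
    A(pos) = vals(t + 1);
    dets(k+1) = round(det(A));
end
unique(dets)'                              % the attainable determinants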
