A problem specific configuration of a multigrid solver is often the most efficient tool for the solution of sparse linear equation systems. However, neither solvers with best numerical convergence nor solvers with best parallel efficiency are the best choice for an accurate and fast solution in practice. The fastest solvers require a delicate balance between their numerical and hardware characteristics. Balancing both aspects we can even parallelize strong sequential preconditioners with large parallel speedup and hardly any loss in numerical performance. In this way GMG can also solve ill-conditioned systems.
Besides extracting massive parallelism from initially sequential preconditioners there is the second big challenge of fine-grained data access and placement, which is crucial for high performance solvers. We present innovative on-the-fly data sorting techniques for Gauss-Seidel and tridiagonal preconditioners for maximum bandwidth utilization . The entire approach is combined with mixed-precision executionand evaluated extensively on anisotropic grids. In a second step we have parallelized even stronger preconditiner and applied them to yet more ill-conditioned problems, including strong operator anisotropies .
Although geometric multigrid is not as flexible as algebraic multigrid, we conclude that with advanced smoothers GMG can also solve many ill-conditioned problems efficiently.
Here we discusses the parallel solution of ill-conditioned problems of about 1 million unknowns on a single GPU. This is an critical builiding block in our library for solving much larger distributed memory problems with GPU-cluster computing.
1. On the left data flow and placement in shared memory in the standard cyclic reduction. Hats stand for updated equations. On the right data placement in our bank-conflict free implementation of cyclic reduction with on-the-fly data rearrangement. The data flow (arrow connections) is exactly the same as on the left, but we have omitted most arrows to improve clarity. In both figures circles in the same column occupy the same memory address, i.e. in-place updates are performed.
2. Total runtime efficiency of the GMG solvers with strong smoothers, on the CPU (left) and the GPU (right), for an elliptic problem with operator and domain anisotropies with sizes (2^L+1) x (2^L+1), logarithmic scale.