Iterative stencil computations are ubiquitous in scientific computing, and the exponentially growing number of cores in current processors leads to a bandwidth wall: limited off-chip bandwidth severely restricts their performance. In this project we aim to overcome this problem with new algorithms that scale with the aggregate cache bandwidth rather than with the system bandwidth.
Naive implementations of stencil computations suffer heavily from system bandwidth limitations. Various cache-blocking techniques have been developed to exploit spatial data locality in this case. However, the optimization of a single stencil application remains, by construction, bounded by the peak system bandwidth. In view of the exponentially growing discrepancy between peak system bandwidth and peak computational performance, this is a severe limitation for all current multi-core devices, and even more so for future many-core devices.
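To make the bandwidth problem concrete, consider a minimal naive sweep for a 1D three-point averaging stencil (an illustrative stand-in for the stencils discussed here, not code from the project): every timestep streams the entire grid through the cache, so performance is bound by how fast data moves over the memory bus.

```python
def naive_sweep(u, timesteps):
    """Naive 1D 3-point Jacobi-style sweep with fixed boundaries.

    Each timestep reads and writes the whole grid, so for grids larger
    than the cache every sweep goes out to main memory.
    """
    n = len(u)
    for _ in range(timesteps):
        v = u[:]                      # double buffer for this timestep
        for i in range(1, n - 1):
            v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = v
    return u
```

Spatial cache blocking reorders the inner loop but cannot change the fact that each timestep touches the full grid once.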
In order to surpass system bandwidth limitations, we must turn to iterative stencil computations and reduce the off-chip traffic significantly by exploiting temporal data locality across multiple iterations with so-called time-skewing schemes. The idea is to look at the entire iteration space formed by space and time, the space-time, and divide it into tiles that can be executed quickly in cache. Schemes that explicitly use the cache sizes to construct the tiles are called cache-aware; those that do not use them but can still utilize the caches efficiently are called cache-oblivious.
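As an illustration of a cache-oblivious space-time traversal, the following sketch applies the classic trapezoidal decomposition of Frigo and Strumpen to the 1D three-point averaging stencil; it is a minimal model of the idea, not the project's CORALS implementation. The recursion cuts a space-time trapezoid in space when it is wide and in time when it is tall, so the leaves of the recursion eventually fit in some level of cache without the cache size ever appearing in the code:

```python
def make_kernel(buf, n):
    """Kernel for a 1D 3-point stencil on two buffers toggled by parity."""
    def kernel(t, x):
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        if 0 < x < n - 1:
            dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
        else:
            dst[x] = src[x]           # fixed boundary, copied forward
    return kernel

def walk(t0, t1, x0, dx0, x1, dx1, kernel):
    """Recursive trapezoid decomposition (stencil slope 1)."""
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(t0, x)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # wide trapezoid: cut in space into two smaller trapezoids
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(t0, t1, x0, dx0, xm, -1, kernel)
            walk(t0, t1, xm, -1, x1, dx1, kernel)
        else:
            # tall trapezoid: cut in time into lower and upper halves
            s = dt // 2
            walk(t0, t0 + s, x0, dx0, x1, dx1, kernel)
            walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1, kernel)

def cache_oblivious_sweep(u, timesteps):
    n = len(u)
    buf = [u[:], u[:]]
    walk(0, timesteps, 0, 0, n, 0, make_kernel(buf, n))
    return buf[timesteps % 2]
```

Different parts of the domain thus advance through time at different speeds, yet every grid point passes through all timesteps in a dependency-respecting order.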
While time-skewing schemes have been known for a long time, their practical benefits have so far been small. This project has demonstrated the tremendous performance of efficiently implemented time-skewing schemes, with speedups of up to 10x over an optimized naive space-time traversal.
In detail, the project has introduced a novel cache-oblivious time-skewing scheme, CORALS, and a novel cache-aware time-skewing scheme, CATS. The latter can be cast into a particularly simple form that gains most of the benefits with little code complexity. It also possesses certain invariance properties that allow comparing and predicting the performance of iterative stencil computations on past, current, and future architectures. Finally, the schemes have been enhanced to nuCORALS and nuCATS by also accounting for the characteristics of NUMA systems.
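The flavor of a cache-aware scheme can be conveyed with an even simpler construction than the diamonds CATS actually uses: a skewed parallelogram tiling, where tiles of a fixed width shift left by one cell per timestep and the tile width would be chosen so that a tile's working set fits in cache. The sketch below is an illustrative simplification under that assumption, not the CATS algorithm itself, and `tile_width` is a free parameter rather than a derived cache bound:

```python
def cache_aware_sweep(u, timesteps, tile_width):
    """Skewed (parallelogram) time tiling of a 1D 3-point stencil.

    Each tile is processed over all timesteps before the next tile
    starts, so a tile-sized working set can stay resident in cache.
    """
    n = len(u)
    buf = [u[:], u[:]]                 # toggled by timestep parity
    # Tiles skew left by one cell per timestep; the extra origins past
    # n cover the cells that the skew exposes at late timesteps.
    for x0 in range(0, n + timesteps, tile_width):
        for t in range(timesteps):
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            lo = max(x0 - t, 0)
            hi = min(x0 + tile_width - t, n)
            for x in range(lo, hi):
                if 0 < x < n - 1:
                    dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
                else:
                    dst[x] = src[x]    # fixed boundary, copied forward
    return buf[timesteps % 2]
```

The diamond tiles of CATS refine this idea: unlike parallelograms, neighboring diamonds in the same row carry no dependence on each other, which enables the parallel execution with only local synchronization shown in Fig. 2.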
The performance results are impressive (Fig. 3), scaling well with the aggregate cache bandwidth of all cores rather than with the system bandwidth, which does not increase automatically with the number of cores.
1. On the left, the naive space-time traversal: the entire spatial domain executes one timestep after another in sync; the progress of the space-time coverage is shown in color. On the right, the hierarchical, cache-oblivious space-time traversal: different parts of the spatial domain progress in time at different speeds, and only at the end does the entire spatial domain reach the same timestep.
2. Cache-aware space-time traversal with diamonds. On the left, the parallel processing of diamonds by three threads with local synchronization between the diamonds. On the right, the projection view from the left is expanded into a 3D space-time, showing one diamond tube with a skewed traversal surface.
3. Strong scalability of a constant stencil for 1 to 32 threads on a 500^3 (left) and a 160^3 (right) domain of doubles over 100 timesteps on the Xeon X7550 (4 sockets with 8 cores at 2.0 GHz), compared against a naive NUMA-aware traversal, the PluTo locality optimizer 0.7.0, and the Pochoir stencil compiler 0.5.