
Sparse Format Specialization
The processing time of sparse representations of local discrete operators depends heavily on the storage format. The requirements of high accuracy, minimal memory footprint, high data locality, parallelism-friendly layout, wide applicability, and easy modifiability contradict one another, so only case-specific choices lead to satisfactory results.
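
To make the trade-off concrete, the following minimal sketch multiplies the same tridiagonal operator with a vector in two storage formats: general-purpose CSR (compressed sparse row) and a diagonal (DIA) layout that only fits banded operators. The function names and the 1D Laplacian test operator are illustrative assumptions, not the formats developed in the project.

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in values/col_idx
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x for CSR storage: works for any sparsity pattern,
    but reads x through an index array (indirect access)."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y


def dia_matvec(offsets, diagonals, x):
    """y = A x for diagonal storage: diagonals[d][i] holds A[i][i + offsets[d]].
    No index array is needed, but only banded operators fit this layout."""
    n = len(x)
    y = [0.0] * n
    for off, diag in zip(offsets, diagonals):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += diag[i] * x[j]
    return y


# Hypothetical test operator: 1D Laplacian stencil [-1, 2, -1] on 5 points.
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

y_csr = csr_matvec(*dense_to_csr(A), x)
y_dia = dia_matvec([-1, 0, 1], [[-1.0] * n, [2.0] * n, [-1.0] * n], x)
```

Both calls compute the same product. The CSR version stores one column index per nonzero and handles arbitrary patterns, whereas the DIA version stores no indices and accesses memory contiguously, at the price of being useful only when the nonzeros lie on a few diagonals.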

Parallel Adaptive Data Structures
While GPUs and other highly parallel devices excel at processing regularly structured data, their large SIMD width and high number of cores quickly lead to inefficiencies in fine-grained branches and complex synchronization. However, the adaptive data structures that cause such problems are indispensable for capturing multiscale phenomena. We must rethink our data arrangement in order to reconcile parallel and adaptive requirements.

Algebraic multigrid is one of the main tools in science and industry for solving large sparse linear systems of equations. However, an efficient all-level parallelization of the algebraic multigrid method on a GPU cluster requires many innovations in parallel numerical schemes, graph algorithms, and work scheduling.

★Efficient parallel GMG for ill-conditioned systems★ Neither the solvers with the best numerical convergence nor the solvers with the best parallel efficiency are the best choice for the fast solution of PDE problems in practice. The fastest solvers require a delicate balance between their numerical and hardware characteristics. By balancing both aspects, we can parallelize even completely sequential preconditioners with large parallel speedup and hardly any loss in numerical performance.

Data layout has a tremendous impact on performance, in particular on high-throughput devices. But traditional languages force the programmer to specify the memory layout of multi-valued data containers in a syntax that makes it impossible to change the layout later without modifying all accesses to the container. Data layout abstractions allow switching between array-of-structs and struct-of-arrays layouts with a single parameter while keeping the traditional syntax.
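
The idea of a layout abstraction can be sketched as follows. The class names `LayoutContainer` and `_Element`, the `layout` parameter, and the flat `data` buffer are hypothetical and chosen only for illustration; the project's actual abstraction targets compiled languages, where the choice is resolved at compile time. The point of the sketch is that `c[i].x` stays the same while one parameter decides where the value lives in memory.

```python
class LayoutContainer:
    """Container of records with named fields over one flat buffer.
    The layout parameter selects array-of-structs ("aos") or
    struct-of-arrays ("soa") without changing the access syntax."""

    def __init__(self, fields, size, layout="aos"):
        if layout not in ("aos", "soa"):
            raise ValueError("layout must be 'aos' or 'soa'")
        self.fields = tuple(fields)
        self.size = size
        self.layout = layout
        self.data = [0.0] * (size * len(self.fields))

    def _offset(self, i, name):
        if name not in self.fields:
            raise AttributeError(name)
        f = self.fields.index(name)
        if self.layout == "aos":
            return i * len(self.fields) + f   # x0 y0 x1 y1 ...
        return f * self.size + i              # x0 x1 ... y0 y1 ...

    def __getitem__(self, i):
        return _Element(self, i)


class _Element:
    """Proxy that maps attribute access onto the flat buffer."""
    __slots__ = ("_c", "_i")

    def __init__(self, container, index):
        object.__setattr__(self, "_c", container)
        object.__setattr__(self, "_i", index)

    def __getattr__(self, name):
        return self._c.data[self._c._offset(self._i, name)]

    def __setattr__(self, name, value):
        self._c.data[self._c._offset(self._i, name)] = value


def fill(points):
    """Client code: identical for both layouts."""
    for i in range(points.size):
        points[i].x = float(i)
        points[i].y = 10.0 * i


aos = LayoutContainer(("x", "y"), 3, layout="aos")
soa = LayoutContainer(("x", "y"), 3, layout="soa")
fill(aos)
fill(soa)
```

After `fill`, both containers return the same values through `points[i].x` and `points[i].y`, but `aos.data` interleaves the fields of each record while `soa.data` stores each field contiguously, which is the arrangement that wide SIMD units and GPUs usually prefer.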

★Stencil algorithms breaking the memory wall★ Iterative stencil computations are ubiquitous in scientific computing, and the exponential growth in the number of cores in current processors leads to a bandwidth wall, where limited off-chip bandwidth severely restricts their performance. In this project we aim to overcome this problem with new algorithms that scale mainly with the aggregate cache bandwidth rather than the system bandwidth.
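
One well-known way to shift the load from main memory to cache is temporal blocking: several time steps are applied to a small tile while it stays in fast memory, instead of sweeping the whole array once per time step. The sketch below uses overlapped tiles with a halo of `steps` cells on a 1D three-point averaging stencil. It is an assumption-laden illustration of the general principle, not the algorithm developed in this project; the names `jacobi_step`, `naive`, and `time_blocked` are hypothetical.

```python
def jacobi_step(u):
    """One 3-point averaging sweep; the two end values are kept fixed."""
    if len(u) < 2:
        return list(u)
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def naive(u, steps):
    """Reference: every time step streams the whole array once."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def time_blocked(u, steps, tile=8):
    """Advance each tile by all time steps before moving to the next tile.
    Each tile is loaded with a halo of `steps` cells per side; values that
    become invalid creep inward one cell per step and never reach the tile."""
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(0, start - steps)   # halo clipped at the domain boundary
        hi = min(n, end + steps)
        local = u[lo:hi]             # small working set, meant to fit in cache
        for _ in range(steps):
            local = jacobi_step(local)
        out[start:end] = local[start - lo:end - lo]
    return out


# Hypothetical input: a single spike in a field of zeros.
u0 = [0.0] * 40
u0[20] = 1.0
reference = naive(u0, 4)
blocked = time_blocked(u0, 4, tile=8)
```

The blocked version reads each tile from the large array once for all four time steps, at the cost of recomputing the halo cells redundantly. More elaborate schemes avoid this redundancy, but the data reuse inside the tile is the common ingredient.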
2005-2009


★Minimally invasive acceleration of legacy code on GPU clusters★ A single GPU already offers two levels of parallelism, but, as with CPUs, the demand for higher performance and larger problem sizes leads to the use of GPU clusters, in which every cluster node is equipped with GPUs. This adds intra-node and inter-node parallelism. The main challenges for these heterogeneous systems are the enormous discrepancy in bandwidth between the two finer and the two coarser levels of parallelism, and the integration of all four levels into legacy code.

★Double accuracy with single-precision GPUs and FPGAs★ To obtain a result of high accuracy, it is not necessary to compute all intermediate results in high precision. Mixed-precision methods apply high-precision computations only where necessary and save space or time without decreasing the accuracy of the final solution.
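
A classic instance of this idea is mixed-precision iterative refinement for a linear system: the expensive solve runs in single precision, and only the residual and the solution update are kept in double precision. The sketch below emulates single precision on the CPU by rounding through the `struct` module; the helper names, the small test system, and the choice of unpivoted Gaussian elimination as the inner solver are assumptions made for illustration, not the project's implementation.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination without pivoting, with every intermediate
    result rounded to single precision (stands in for the fast device)."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=5):
    """Iterative refinement: low-precision solves, high-precision correction."""
    x = solve_low_precision(A, b)
    for _ in range(iterations):
        r = residual(A, x, b)              # high precision, cheap
        d = solve_low_precision(A, r)      # low precision, expensive part
        x = [xi + di for xi, di in zip(x, d)]
    return x


# Hypothetical, well-conditioned test system.
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x_single = solve_low_precision(A, b)
x_mixed = refine(A, b)
err_single = max(abs(r) for r in residual(A, x_single, b))
err_mixed = max(abs(r) for r in residual(A, x_mixed, b))
```

For a well-conditioned system like this one, each refinement step reduces the error by roughly the single-precision rounding level, so a few steps bring the residual `err_mixed` down to double-precision level even though every solve was carried out in single precision.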

GPUs process data of uniform resolution very quickly with massively data-parallel execution. But even massive parallelism cannot compete with adaptive methods when the data size grows cubically under uniform refinement. This project develops parallel refinement strategies for grids and particles that allow higher resolution to be introduced in only parts of the computational domain.

Scientific simulations have higher accuracy requirements than multimedia processing applications. With the introduction of optimized floating-point units in graphics processors and reconfigurable hardware, these devices are now also attractive as powerful scientific coprocessors.
2000-2004


This project investigates how the enormous parallelism of reconfigurable hardware can be harnessed to accelerate PDE solvers. Both fine-grained and coarse-grained architectures are examined. The performance is very convincing, but complex problems require higher-level programming languages for these devices.

Although graphics processing units (GPUs) are still very restricted in data handling, some strategies allow processing to be focused on data-dependent regions of interest. Thus, computer vision algorithms that require computations on changing regions of interest can already benefit from the high GPU performance. Current implementations comprise the Generalized Hough Transform, skeleton computation, and motion estimation.

★Pioneering work on PDE solvers with GPUs★ The data parallelism in typical image processing algorithms is very well suited to data-stream-based architectures. PDE-based methods for image denoising, segmentation, and registration have thus been accelerated on graphics cards.

The choice of visualization methods and parameters is already part of the interpretation of the data, as it emphasizes certain structures and subdues others. This can have the positive effect of uncovering otherwise inconceivable relations in the data, but it may also produce false evidence. Combinations of multiple methods and data-based parameter controls try to limit this danger.