Since the 1990s, graphics cards have developed rapidly from primitive drawing devices into a major computing resource in the PC. High-end graphics processing units (GPUs) already have far more transistors than a typical CPU. Moreover, they devote the majority of these transistors to computation, whereas a large percentage of a CPU is occupied by caches.
We started research on implementations of partial differential equation (PDE) solvers in graphics hardware in 2000. At that time GPUs were very restricted in the precision of their number formats and in their programmability. Their main advantage was a much higher memory bandwidth than that of the PC's main memory. Many image processing applications have exactly these requirements and tolerate these restrictions. They involve large image data that must be transferred quickly, and they do not need ultimate precision for exact computations but rather a faithful reproduction of the qualitative image evolution known from the continuous PDE model. For nonlinear diffusion, these qualitative features are the decreasing diffusivity in areas of large gradients and the smoothing in image regions that are expected to lie away from edges (Fig. 1). For the level set evolution, they are the fast front propagation in homogeneous regions and the deceleration of the front at segment borders (Fig. 2).
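The qualitative behaviour of nonlinear diffusion described above can be illustrated with a minimal CPU sketch. It is not the graphics hardware implementation: the Perona-Malik-type diffusivity, the explicit 1D scheme, and the names `diffusivity`, `diffusion_step`, `nonlinear_diffusion`, `tau` and `lam` are all assumptions made for illustration.

```python
def diffusivity(grad, lam):
    """Assumed Perona-Malik-type diffusivity: near 1 for small gradients,
    decreasing towards 0 as the gradient grows beyond lam."""
    return 1.0 / (1.0 + (grad / lam) ** 2)


def diffusion_step(u, tau, lam):
    """One explicit time step of 1D nonlinear diffusion, zero-flux boundaries."""
    n = len(u)
    # Flux across the link between samples i and i+1.
    flux = []
    for i in range(n - 1):
        grad = u[i + 1] - u[i]
        flux.append(diffusivity(grad, lam) * grad)
    new_u = []
    for i in range(n):
        right = flux[i] if i < n - 1 else 0.0
        left = flux[i - 1] if i > 0 else 0.0
        new_u.append(u[i] + tau * (right - left))
    return new_u


def nonlinear_diffusion(u, steps, tau=0.2, lam=0.1):
    """Run several explicit steps; tau <= 0.5 keeps this 1D scheme stable."""
    for _ in range(steps):
        u = diffusion_step(u, tau, lam)
    return u


# A noisy step edge: small fluctuations on both sides of one large jump.
signal = [0.0, 0.02, -0.02, 0.01, 0.0, 1.0, 1.02, 0.98, 1.01, 1.0]
smoothed = nonlinear_diffusion(signal, steps=30)
print([round(v, 3) for v in smoothed])
```

In this sketch the small fluctuations inside each plateau are smoothed quickly because the diffusivity there is close to 1, while the large jump between the plateaus is transported only slowly because the diffusivity across it is small.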
With the advent of floating point units in DirectX9 GPUs, not only has the number format changed but also the balance between the bandwidth of the video memory and the computational power of the GPU. Previously, the bandwidth basically sufficed to provide all processing elements of the GPU with individual data. As on microprocessors, however, floating point computation on current GPUs implies a bandwidth shortage. As a consequence, operations with higher computational intensity should be executed, i.e. several operations should be performed on each data item that is read. The greatly increased programmability also supports the implementation of more complex algorithms and allows the incorporation of more advanced numerical methods. The task of image registration requires finding a deformation between two images that minimizes a certain energy, e.g. one measuring intensity differences. We implemented a cascaded gradient flow PDE for the minimization of the energy (Fig. 3). The algorithm operates on a multiscale representation given by a multigrid hierarchy with several scales per grid. Efficient multigrid solvers and an adaptive timestep control accelerate the solution. Without high-level programming languages, this complexity could hardly be realized on GPUs.
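The idea of a gradient flow with adaptive timestep control can be sketched as follows. The text does not specify the control strategy, so this sketch assumes an Armijo-type rule (halve the timestep until the energy decreases sufficiently, then try a larger one in the next step). The function `gradient_flow`, its parameters, and the toy quadratic energy are hypothetical stand-ins; they are not the registration energy or the multigrid scheme of the actual implementation.

```python
def gradient_flow(energy, gradient, u, tau=1.0, steps=100, sigma=0.5):
    """Descend along -gradient(u) with an assumed Armijo-type adaptive timestep.

    Returns the final iterate and the history of accepted energy values.
    """
    e = energy(u)
    history = [e]
    for _ in range(steps):
        g = gradient(u)
        g_norm2 = sum(gi * gi for gi in g)
        if g_norm2 < 1e-20:
            break  # gradient vanished: stationary point reached
        while True:
            trial = [ui - tau * gi for ui, gi in zip(u, g)]
            e_trial = energy(trial)
            if e_trial <= e - sigma * tau * g_norm2:
                break  # sufficient decrease: accept this timestep
            tau *= 0.5  # timestep too large: shrink and retry
            if tau < 1e-12:
                return u, history
        u, e = trial, e_trial
        history.append(e)
        tau *= 2.0  # be optimistic: try a larger timestep next time
    return u, history


# Toy energy (an assumption): weighted squared distance to a target vector.
weights = [1.0, 10.0]
target = [1.0, -2.0]


def toy_energy(u):
    return sum(w * (ui - ti) ** 2 for w, ui, ti in zip(weights, u, target))


def toy_gradient(u):
    return [2.0 * w * (ui - ti) for w, ui, ti in zip(weights, u, target)]


minimizer, energies = gradient_flow(toy_energy, toy_gradient, [0.0, 0.0], steps=500)
print(minimizer, len(energies))
```

The design choice illustrated here is that the timestep is never fixed in advance: it grows while the energy keeps decreasing and shrinks as soon as a step would overshoot, which gives a monotone energy decay without manual tuning.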
Figures
2. Parallel segmentation of tree branches computed in DirectX7 graphics hardware.
3. Elimination of a possible acquisition artifact computed in DirectX9 graphics hardware.
