Data layout has a tremendous impact on performance, particularly on high-throughput devices. Traditional languages, however, force the programmer to fix the memory layout of multi-valued data containers in a syntax that cannot be changed later without modifying every access to the container. Data layout abstractions allow switching between array-of-structs (AoS) and struct-of-arrays (SoA) layouts with a single parameter while keeping the traditional syntax.
Paper [1] presents an abstraction layer that allows switching between the AoS and SoA layouts in C++ without changing the data access syntax. A few changes to the structure and container definitions allow for easy performance comparison of AoS vs. SoA on existing AoS code. The abstraction retains the more intuitive AoS syntax 'container[index].component' for data access, yet switches between the AoS and SoA layouts with a single template parameter in the container type definition, on both the CPU and the GPU. In this way, code development becomes independent of the data layout, and performance improves by choosing the layout that matches the application's access pattern.
The abstraction offers an 'ASX::Array' class, which provides static allocation based on a size that is known at compile-time, i.e., the equivalent of a normal array in C. For cases requiring dynamic allocation, the ASX library also provides the class 'ASX::Vector', which uses a constructor to allocate memory in a manner similar to STL vectors.
The previous solution relies on advanced features of C++, a language that is not supported on most accelerators. An alternative approach [2] develops a concise macro-based solution that requires only support for structures and unions and can therefore be used in OpenCL, a widely supported programming language for parallel processors. This abstraction is not quite as powerful and easy to adapt as the C++ solution [1], but it also offers flexible containers that can switch layout at compile time with a single parameter. Thus, even in C/OpenCL one can develop high-performance code without an a priori commitment to a particular data layout.
Figures
1. On the left, the physical data placement of the AoS and SoA layouts. Highlighted are the positions accessed when the third component of multiple elements is read in parallel; this is a typical SIMD access pattern. On the right, a comparison of native AoS code and ASX CUDA/C++ code that supports both the AoS and SoA layouts. The syntax for accessing the elements and the container does not change, which makes the transition from native AoS code to the flexible ASX code particularly easy.
2. ASX CUDA [1] memory performance on two large containers in which each element consists of four floats. Access patterns vary from linear element indexing on the left, through increasingly irregular access in the middle, to a completely random permutation on the right. All four combinations of the two data layouts (AoS and SoA) and two parallelization strategies (horizontal and vertical) are presented. The mismatched combinations (dashed lines) clearly perform worse. The matching combinations (solid lines) differ by 3.7x on the left and 3.3x on the right, which demonstrates the large speedups that can be gained by choosing the correct data layout.
3. ASX OpenCL example on multiple GPUs [2]. Computation of the bootstrap distribution of the mean of a sample $A$, which consists of a variable number of multi-valued elements of four floats each. Different layouts (AoS, SoA) for the input and output and different parallelization strategies (horizontal, vertical) are compared.
Bibliography
Software
Version | License | C++/CUDA | C/OpenCL
---|---|---|---
ASX_2_1_0.tgz | | |