General-purpose programming on GPU

From GPU to GPGPU

Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault

DMI — Università di Catania
Sezione di Catania — INGV

From GPU to GPGPU

Inner workings of a GPU

Typical operations run on GPU: projections, interpolations, texture mapping. Mathematically speaking: matrix/vector products and algebraic operations.

Shaders have to repeat these operations for each vertex in the scene or for each pixel in the image ⇒ hardware is optimized to run these operations on many elements (vertices/pixels) at the same time.

Stream processing

Concept: stream processing. A series of operations (kernel) acts on each element of a data set (stream). Example: a loop that operates over each element of a vector.

float stuff[num_els];
for (unsigned int i=0; i < num_els; ++i) {
    /* do something on stuff[i] */
}

The kernel is the body of the loop, the stream is the data vector stuff[].

If the same sets of operations can act independently on each element, we have uniform streaming: Single Instruction Multiple Data (SIMD) approach. Parallel computing hardware is optimized for this.
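The kernel/stream split can be made explicit in plain C. The following is a minimal CPU-only sketch, not GPU code: the names `scale_kernel` and `run_stream` are hypothetical, chosen for illustration. The point is that the kernel touches only element `i`, so no iteration depends on another.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel: the operation applied to ONE element of the stream.
 * It reads and writes only element i, so calls are independent. */
static void scale_kernel(float *stream, size_t i, float factor)
{
    stream[i] = stream[i] * factor;
}

/* Driver: applies the kernel to every element of the stream.
 * Here it is a sequential loop; on SIMD hardware the iterations
 * could run at the same time because none depends on another. */
static void run_stream(float *stream, size_t num_els, float factor)
{
    assert(stream != NULL || num_els == 0);
    for (size_t i = 0; i < num_els; ++i)
        scale_kernel(stream, i, factor);
}
```

Calling `run_stream(stuff, num_els, 2.0f)` doubles every element of `stuff[]`. A kernel that read `stream[i - 1]` would no longer be uniform streaming in the sense used here.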

Most 3D graphics operations are uniform stream processing, and GPU hardware is optimized for them ⇒ GPUs are high-performance parallel processing units.

GPU vs CPU

What if we could use this principle to offload more computations to the GPU, especially when the GPU is not being used for intensive rendering?

General-purpose Programming on the GPU (GPGPU): use of the GPU as a (math) coprocessor.

How does it work?

Math problem ⇒ find equivalent rendering problem ⇒ code 3D scene, appropriate vertex and fragment shading ⇒ render scene ⇒ read back the image ⇒ interpret it as the result.

Very limited:

Clumsy!
Limited understanding (and use!) of the underlying hardware.
Continuous transfers between CPU and GPU.

Yet quite effective: could simulate simple particle systems with 1,000,000 particles at 20 fps (2004).

Libraries developed to assist in scientific programming using OpenGL and Direct3D: BrookGPU (Stanford U.), Sh (U. of Waterloo).

Reference: GPGPU website

GPGPU (for real)

End of 2006/Beginning of 2007: ATI and NVIDIA release new hardware and software architectures with native support for GPGPU

ATI CTM (Close To Metal)/Stream:

CAL (Computing Abstraction Layer): lower-level interface to hardware
Brook+: higher-level interface
Hardware: R600 GPU class and later

NVIDIA CUDA (Compute Unified Device Architecture):

CUDA driver API: lower-level interface to hardware
CUDA runtime API: higher-level interface
Hardware: G80 GPU class and later

Similar hardware principles.

each GPU has one or more SIMD multiprocessors
each multiprocessor executes multiple threads at the same time in lockstep
distinct multiprocessors execute thread batches ('warps' in CUDA-speak, 'wavefronts' in CTM-speak) in parallel but independently from each other

Similar software principles. Example for a vector sum:

prepare input streams:
- create vector1, vector2
- upload vector1, vector2 to the GPU
launch kernel that sums two locations in a third location
- kernel gets executed on each pair of locations (vector1[i], vector2[i]) in parallel
- maximum efficiency if thread batches do not diverge
- threads in different thread batches can diverge without loss of performance
post-process result
- download it to the CPU, if needed
- or launch new kernels on it, without doing useless memory transfers
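The three phases above can be sketched on the host alone. This is a hedged illustration in plain C, not CUDA or CTM code: the "device" buffers are ordinary heap allocations, `memcpy` stands in for the CPU/GPU transfers, and the loop stands in for the parallel kernel launch. The names `vecsum_kernel` and `vecsum_offload` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Kernel body: logical thread i sums one pair of locations. */
static void vecsum_kernel(const float *a, const float *b, float *c, size_t i)
{
    c[i] = a[i] + b[i];
}

/* Returns 0 on success, -1 if a "device" allocation fails. */
static int vecsum_offload(const float *vector1, const float *vector2,
                          float *result, size_t n)
{
    assert(vector1 != NULL && vector2 != NULL && result != NULL && n > 0);
    size_t bytes = n * sizeof(float);

    /* 1. Prepare input streams: allocate stand-in device buffers, upload. */
    float *d_v1 = malloc(bytes);
    float *d_v2 = malloc(bytes);
    float *d_out = malloc(bytes);
    if (d_v1 == NULL || d_v2 == NULL || d_out == NULL) {
        free(d_v1); free(d_v2); free(d_out);
        return -1;
    }
    memcpy(d_v1, vector1, bytes);
    memcpy(d_v2, vector2, bytes);

    /* 2. Launch the kernel: one logical thread per pair of locations.
     * On a GPU these iterations would run in parallel thread batches. */
    for (size_t i = 0; i < n; ++i)
        vecsum_kernel(d_v1, d_v2, d_out, i);

    /* 3. Post-process: download the result (skip this step if another
     * kernel consumes d_out directly, to avoid a useless transfer). */
    memcpy(result, d_out, bytes);

    free(d_v1); free(d_v2); free(d_out);
    return 0;
}
```

The kernel has no branches, so all threads of a batch follow the same path; this is the non-divergent case the list describes as most efficient.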

Hardware differences (NV G200, ATI R700):

number of multiprocessors (30 for NVIDIA, 10 for ATI)
number of ALUs per multiprocessor (8 for NVIDIA, 80/64 ATI)
type of ALUs (1-way RISC for NVIDIA, 5-way VLIW for ATI)
hardware (ATI) vs emulated (NVIDIA) support for double precision
ALU frequency (~300MHz for NVIDIA, ~200MHz for ATI)

NVIDIA: multiprocessor decodes one instruction, instruction goes 4 times over the 8 ALUs for different data ⇒ 32 threads per warp.

ATI: multiprocessor decodes one 5-way VLIW instruction, instruction goes 4 times over the 80 ALUs for different data ⇒ 64 threads per wavefront.
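The batch sizes follow from simple arithmetic on the figures quoted above. The sketch below is one reading of those figures, with hypothetical function names: it assumes that on a VLIW design the ALUs grouped under one instruction word serve a single thread, so 80 ALUs with 5-way VLIW give 16 thread lanes.

```c
#include <assert.h>

/* Threads per batch = thread lanes on a multiprocessor times the number
 * of passes one decoded instruction makes over those lanes.
 * Assumption: the vliw_width ALUs under one instruction word serve
 * one thread, so they count as a single lane. */
static unsigned threads_per_batch(unsigned alus, unsigned vliw_width,
                                  unsigned passes)
{
    assert(vliw_width > 0 && alus % vliw_width == 0);
    unsigned lanes = alus / vliw_width;
    return lanes * passes;
}

/* Total ALUs on the chip. */
static unsigned total_alus(unsigned multiprocessors, unsigned alus_per_mp)
{
    return multiprocessors * alus_per_mp;
}
```

With the numbers from the slides, `threads_per_batch(8, 1, 4)` gives 32 (NVIDIA warp) and `threads_per_batch(80, 5, 4)` gives 64 (ATI wavefront), while `total_alus(30, 8)` is 240 and `total_alus(10, 80)` is 800.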

NVIDIA: very fine thread parallelization granularity, complex scheduler, simpler compiler.

ATI: coarser thread granularity, very simple scheduler, complex compiler.

NVIDIA: lots of GPU die dedicated to fixed functions (simpler 3D rendering components, not used for GPGPU).

ATI: most of the GPU die dedicated to shaders (the parts used for the GPGPU).

ATI hardware is technically superior (1200 GFLOPS vs 600 GFLOPS). However, the following factors put NVIDIA CUDA in the lead:

poor ATI SDK documentation
clumsy and unstable interfaces on initial releases
aggressive marketing for CUDA from NVIDIA

In short: NVIDIA has better software and documentation, ATI has better hardware, but the better hardware is much more difficult to exploit.

Standardization for stream programming resulted in the OpenCL (Open Computing Language) spec. It is to be expected that interfaces will converge to OpenCL, although differences in the higher-level runtime interfaces will probably remain.

OpenCL is very similar to low-level CUDA programming, but CUDA also offers an easier high-level interface, and we will start by learning this.

Programming for the GPU

Many of the peculiarities (with their upsides and downsides) of GPUs as computing platforms are tightly related to their origin as sophisticated renderers of animated 3D scenes. We enumerate them here, and will discuss them in more detail during the course of the project.

Memory types

Registers: R/W memory areas, local to each thread. The total number of registers on a card is fixed, and divided among the threads. Limits the complexity of kernels and the number of concurrent threads. Fast, not cached.
Shared: small R/W memory area (on-chip), local to each thread block and shared by the threads within it. Fast, not cached.
Local: per-thread memory. Slow. Not cached. Holds data that cannot fit in registers.
Global: the ‘actual’ GPU memory (minus the part dedicated to the graphics engine, e.g. to render the current scene or GUI screen). R/W, it holds the input and output streams. Slow. Not cached.
Constant: R/O (GPU-side), R/W (CPU-side). Useful for small global constants (e.g. fluid properties in CFD). Slow. Cached. As fast as register access if all threads in a warp access the same datum.
Texture: R/O (GPU-side), bound CPU-side. Special access patterns: builtin interpolation and clipping, ‘spatial’ caching. Slow. Cached.

Memory access patterns

Memory latency: multiple warps per MP ⇒ memory access latency for some warps can be covered by computation time on other warps … if there are lots of computations and few memory accesses!

Aligned and coalesced memory requests reduce the number of accesses and improve latency.

Aligned: 32, 64 or 128 bits (reading a float4 is faster than reading a float3, so wasting one 32-bit word might be more efficient)

Coalesced: sequential, contiguous, aligned
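A rough way to see why these three properties matter is to count how many memory segments one warp touches. The sketch below is a deliberately simplified model, not the exact hardware rule of any card: it assumes 32 threads per warp, 4-byte words, and 128-byte segments, and the function name `segments_touched` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE     32   /* threads per warp, as stated in the text */
#define WORD_BYTES    4    /* one 32-bit word, e.g. a float */
#define SEGMENT_BYTES 128  /* assumed size of one memory transaction */

/* Counts the distinct segments touched when thread t of a warp reads
 * the word at index first_word + t * stride_words. Addresses grow
 * with t, so a new segment shows up as a change of segment index. */
static unsigned segments_touched(size_t first_word, size_t stride_words)
{
    assert(stride_words >= 1);
    unsigned count = 0;
    size_t prev = 0;
    for (size_t t = 0; t < WARP_SIZE; ++t) {
        size_t byte = (first_word + t * stride_words) * WORD_BYTES;
        size_t seg = byte / SEGMENT_BYTES;
        if (t == 0 || seg != prev) {
            ++count;
            prev = seg;
        }
    }
    return count;
}
```

In this model `segments_touched(0, 1)` is 1 (sequential, contiguous, aligned), `segments_touched(1, 1)` is 2 (same pattern, misaligned by one word), `segments_touched(0, 3)` is 3 (stride of three words, as when each thread reads the first field of a float3), and `segments_touched(0, 32)` is 32 (every thread in its own segment).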

For shared memory: access as fast as register access if no bank conflicts happen. Bank conflicts happen if different threads access different addresses in the same bank; if all threads access the same address, the value is broadcast and no conflict occurs.
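The effect of the access stride on bank conflicts can be estimated with a small model. The sketch below assumes 16 banks of 32-bit words with consecutive words in consecutive banks, and 16 threads served together; both numbers are assumptions for illustration and vary between hardware generations. The function name `conflict_degree` is hypothetical.

```c
#include <assert.h>

#define NUM_BANKS  16  /* assumption: 16 banks, one 32-bit word wide */
#define GROUP_SIZE 16  /* assumption: threads served in one request */

/* Worst-case number of threads in the group that hit the same bank
 * when thread t accesses word t * stride_words. A result of 1 means
 * conflict-free; a result of k means the access is serialized k ways. */
static unsigned conflict_degree(unsigned stride_words)
{
    unsigned hits[NUM_BANKS] = {0};
    unsigned worst = 0;

    if (stride_words == 0)
        return 1;  /* every thread reads the same word: broadcast */

    for (unsigned t = 0; t < GROUP_SIZE; ++t) {
        unsigned bank = (t * stride_words) % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    assert(worst >= 1 && worst <= GROUP_SIZE);
    return worst;
}
```

Under these assumptions a stride of 1 or 3 is conflict-free, a stride of 2 gives a 2-way conflict, a stride of 8 an 8-way conflict, and a stride of 16 sends all 16 threads to the same bank. Odd strides are conflict-free here because they share no factor with the bank count.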

Write access: threads should write to different locations. If two threads write to the same location, at least one write is guaranteed to succeed (so the datum will be properly updated, but we don't know by whom).

More recent cards: atomic write operations (increment, decrement, add, compare, etc). Access is slow! (e.g. just counting the number of interactions in a particle system slows down the simulation by about 10-20%)

GPU saturation / scalability as computing platforms

Optimal GPU performance is achieved by:

computationally dense kernels
optimal memory access patterns
GPU saturation (all cores are always occupied doing something)

Computationally dense kernels: kernels that have a high ratio of computations to memory accesses. GPUs are high-performance parallel computing platforms, not memory transfer platforms.
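The ratio can be written down as floating-point operations per byte moved. The helper below is a minimal sketch with a hypothetical name; it only does the division, and the operation and byte counts have to be estimated by reading the kernel.

```c
#include <assert.h>

/* Computations-to-memory-access ratio, in floating-point operations
 * per byte read from or written to global memory. */
static double flops_per_byte(double flops, double bytes_moved)
{
    assert(bytes_moved > 0.0);
    return flops / bytes_moved;
}
```

For the vector sum, each element costs one addition and moves three 4-byte floats (two reads, one write), so `flops_per_byte(1.0, 12.0)` is about 0.083: a memory-bound kernel. A hypothetical kernel doing 100 operations on the same 12 bytes would reach about 8.3 and would be a much better fit for the hardware.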

Optimal memory access patterns: coalesced, no bank conflict. Design your data structures appropriately.

GPU saturation: keep the ALUs busy! You will not see computations scale correspondingly otherwise (e.g. a particle system with fewer than 5000 particles will run just the same on a G80 (first generation NVIDIA) and on a G200 (third generation NVIDIA), because the G200 is not saturated).

Final introductory remarks

Modern GPUs are high-performance parallel computing devices.

OpenCL and its precursors (CTM/Stream and CUDA) allow their use as high-end math coprocessors with relative ease of development.

Widespread ‘hardcore’ gaming keeps the price per gigaflop of these GPUs quite low ⇒ HPC for scientific applications is cheap because of this!

However: gaming is still the main (commercial) reason behind the existence and development of GPUs. We cannot expect improvements in the platforms if they benefit only scientific (or other forms of generic) computing.

Examples:

NVIDIA's support for double precision is an afterthought
NVIDIA has more fixed function hardware
FORTRAN is not supported (GPU kernels are always written in C)
