GPU programming

Fundamental concepts

Premise

The fundamental concepts of GPGPU are independent of the specific hardware (and thus common to CUDA and OpenCL), aside from differences in terminology.

Host, device, compute unit

OpenCL system architecture: a host with one or more connected devices. Each device has its own memory and its own compute units.

Host

runs the main program, which controls the devices with appropriate commands;

Device

where the main computations are executed; examples: GPU, DSP;

Compute Unit (CU)

device components that actually run the computations;
example: multiprocessors of a GPU;

OpenCL system

A host (purple) connected to 4 devices (green)

Kernel, work-group, work-item

kernel

functions (sequences of instructions) to be executed on a device; multiple instances of a kernel may be launched concurrently;

work-item (CUDA thread)

a work-item executes a single kernel instance;

launch grid (OpenCL NDRange)

the collection of all work-items issued in one invocation of a kernel; may be 1D, 2D or 3D for ease of indexing;

work-group (CUDA block)

group of work-items; each work-group executes on a compute unit; work-items in a work-group can cooperate (synchronize and exchange information).
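As a hedged OpenCL C sketch (using the built-in index queries introduced later in these notes; the kernel name is made up for illustration), this is how grid, work-group and work-item indices relate in a 1D launch:

kernel void who_am_i(global int *out)
{
    size_t gid = get_global_id(0);  /* index within the whole launch grid */
    size_t lid = get_local_id(0);   /* index within the work-group */
    size_t grp = get_group_id(0);   /* which work-group this item belongs to */

    /* with a zero grid offset: gid == grp * get_local_size(0) + lid */
    out[gid] = (int)(grp * get_local_size(0) + lid);
}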

Stream processing (parenthesis)

Stream processing

Stream processing

a simple form of (embarrassingly) data-parallel computation, with reduced (or no) explicit management of the compute units.

Requirements: the computation must be expressible as the same operation(s) applied independently to each element of the data stream(s).

Example: Mandelbrot fractal

Mandelbrot fractal (image source: Wikipedia)

Example: Mandelbrot fractal

Locus of complex points c such that the sequence c, c^2 + c, (c^2 + c)^2 + c, ... (i.e. z_{n+1} = z_n^2 + c with z_0 = c) does not diverge. (Diverging points are often colored by escape velocity.)

Example: Mandelbrot fractal

#include <complex.h> /* C99 complex types: float complex, I, cabsf() */

int nrows, ncols, maxiter;
float xmin, xmax, ymin, ymax;
int *bitmap;
for (int row = 0; row < nrows; ++row) {
  for (int col = 0; col < ncols; ++col) {
    float complex z, c;
    int iter;
    c = (xmin + col*(xmax - xmin)/ncols)
      + (ymin + row*(ymax - ymin)/nrows)*I;
    z = c;
    for (iter = 0; iter < maxiter; ++iter) {
      z = z*z + c; /* z_{n+1} = z_n^2 + c */
      if (cabsf(z) > 2)
        break;
    }
    /* 1 for points that escaped (diverged), 0 for points in the set */
    bitmap[row*ncols + col] = (iter < maxiter);
  }
}

Heart of the Mandelbrot

What’s the heart of the algorithm?

This would be the body of the mandelbrot() kernel.

The heart of the algorithm

Compute kernels are the ‘heart’ of the algorithm, to be applied independently to each element of the input data streams (e.g. in Mandelbrot, to each position), producing independent elements of the output data streams (e.g. in Mandelbrot, the resulting pixels).

Stream processing (bis)

Programmer perspective

run kernel mandelbrot on a 2D launch grid with nrows×ncols work-items;

Hardware (or driver) perspective

issue as many work-items as possible at a time to process the data set.

No (or very little) manual (programmer) management of the creation and destruction of the work-items (or the hardware threads they belong to).

Know thyself

How does each work-item know which one it is (e.g. which row, col)?

Hardware perspective

hardware- (or driver-) provided counters.

Programmer perspective

language-specific function or variable (e.g. get_global_id() in OpenCL)
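To tie these pieces together, here is a hedged sketch of what the mandelbrot() kernel could look like in OpenCL C (an illustrative sketch, not the course's reference code); the host would launch it on a 2D grid of at least ncols×nrows work-items:

kernel void
mandelbrot(global int *bitmap,
           int ncols, int nrows, int maxiter,
           float xmin, float xmax, float ymin, float ymax)
{
    /* each work-item handles one pixel, identified by its global ID */
    int col = get_global_id(0);
    int row = get_global_id(1);
    if (col >= ncols || row >= nrows)
        return;

    float2 c = (float2)(xmin + col*(xmax - xmin)/ncols,
                        ymin + row*(ymax - ymin)/nrows);
    float2 z = c;
    int iter;
    for (iter = 0; iter < maxiter; ++iter) {
        /* complex square: (x + iy)^2 = (x^2 - y^2) + 2xy i */
        z = (float2)(z.x*z.x - z.y*z.y, 2*z.x*z.y) + c;
        if (dot(z, z) > 4.0f) /* |z| > 2 */
            break;
    }
    bitmap[row*ncols + col] = (iter < maxiter);
}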

Back to Fundamentals
(closed parenthesis)

Memory

A device has different kinds of memory:

global memory

main device memory; accessible by all work-items in a kernel; persistent across kernel launches;

constant memory

read-only on device, written by host; persistent across kernel launches;

images (CUDA texture)

global memory accessed via the GPU texture engines; offers multi-dimensional indexing, coordinate normalization, data normalization, spatial caching, (bi)linear interpolation;

Memory (bis)

local memory (CUDA shared memory)

reserved work-group memory; accessible by all work-items in a work-group; volatile (scope: kernel launch);

private memory

private to each work-item; volatile; held in the registers of the compute unit, unless too much is being used, in which case it spills into global memory (register spills are what CUDA calls local memory) and performance suffers.
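A hedged sketch (not from the slides) of where these memory spaces appear in an OpenCL C kernel; the kernel name and logic are made up for illustration, and the size of the local buffer would be set by the host via clSetKernelArg with a NULL pointer:

/* constant memory: read-only on device, initialized at program scope */
constant float coeff[4] = { 1.0f, 0.5f, 0.25f, 0.125f };

kernel void
memory_spaces_demo(global const float *in,   /* global memory */
                   global float *out,
                   local float *tile)        /* local (shared) memory */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float v = in[gid];            /* private memory (ideally registers) */

    tile[lid] = v;                /* stage data in local memory ... */
    barrier(CLK_LOCAL_MEM_FENCE); /* ... and synchronize the work-group */

    out[gid] = coeff[lid % 4] * tile[lid];
}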

Program structure

The typical GPGPU program follows this structure:

selects

the device(s) to use (the CUDA runtime can do this behind the scenes; in OpenCL it must be done manually);

uploads

the initial data for the problem to the device(s);

runs

one or more kernel(s) on the device(s);

downloads

the final results from the device(s).
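A hedged sketch of this select/upload/run/download flow using the OpenCL host API (error checking and resource release omitted; the kernel name process and the parameters source, n, h_in, h_out are hypothetical):

#include <CL/cl.h>

void run_process(const char *source, size_t n,
                 const float *h_in, float *h_out)
{
    /* select: pick a platform and a GPU device, create context and queue */
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* upload: copy the initial data to device buffers */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n*sizeof(float), NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n*sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n*sizeof(float), h_in, 0, NULL, NULL);

    /* run: build the device code and launch the kernel on n work-items */
    cl_program prg = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
    clBuildProgram(prg, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prg, "process", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_out);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_in);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* download: read back the final results (blocking read) */
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n*sizeof(float), h_out, 0, NULL, NULL);
}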

Host/device communication (parenthesis)

Host and device bandwidths

Consider a modern mid-range or high-end setup.

Host memory bandwidth: around 50 GB/s.

Device memory bandwidth: 300–500 GB/s.

What about data transfers between the host and the device?

The bottleneck

PCI-E 3.0 x16: less than 16 GB/s even in the ideal case.

Even the upcoming NVLink (64 GB/s theoretical peak) offers about an order of magnitude less bandwidth than device VRAM.

Don’t do it

EVERY TIME YOU TRANSFER DATA BETWEEN HOST AND DEVICE DOMO KILLS A KITTEN

If you really must …

Program structure reprise (closed parenthesis)

Program structure (bis)

A GPGPU program has two parts:

host code

runs on the CPU (starting from main()), manages the resources, handles I/O, tells the device(s) to run kernels;

device code

collection of kernels that are uploaded to the device(s), issued by the host code.

Single vs separate sources

single source

the same source code holds both the host and the device code; requires a specific compiler to produce an executable that includes both the host and device part of the program;

separate source

host code sources are separate from device code sources; host code is produced by (any) host compiler (gcc, clang, msvc, icc, etc); device code is compiled through a distinct library, either offline (before program execution) or at run-time.

Classic host program

Classic host program compilation

Single-source GPGPU program

Single-source GPGPU program compilation (e.g. CUDA)

Separate-source GPGPU program

Separate-source GPGPU program compilation (e.g. OpenCL)

Single source pros and cons

Single source

Pro
  • less code to write;
  • consistency between host and device data types;
Con
  • needs specific compiler (harder toolchain integration, e.g. for MPI);
  • specific compiler may not support all host compilers;
  • compiles only for specific device(s);
  • ‘selective compile’ of device code impossible.

Separate source pros and cons

Separate source

Pro
  • can use any host compiler (easy to integrate in existing toolchains);
  • compiles for any device (and only the one used);
  • possibility to select kernels to compile at runtime;
Con
  • more code to write (to compile and run kernels);
  • care must be taken to preserve consistency between host and device for complex data types.
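As an illustration of the last point, a hedged sketch (with a hypothetical body_t type) of one way to keep a struct layout consistent between device and host: the explicit padding keeps both sides at 32 bytes instead of relying on each compiler's implicit padding.

/* device code (OpenCL C source): */
typedef struct {
    float4 pos;      /* 16 bytes, 16-byte aligned */
    float  mass;
    float  pad[3];   /* explicit padding up to 32 bytes */
} body_t;

/* host code (C, using the sized types from CL/cl.h): */
#include <CL/cl.h>
typedef struct {
    cl_float4 pos;
    cl_float  mass;
    cl_float  pad[3];
} body_t;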

Stream computing and GPUs

Hardware design

GPUs are designed for stream computing

SIMD vs SPMD/SIMT: SIMD

on CPU

you use the SIMD units by writing the equivalent of

float4 a, b, c;
a = /* compute or load vector */;
b = /* compute or load vector */;
c = a + b;

Limitations: must be customized for specific hardware to exploit the full SIMD width (e.g. to add 32 pairs of floats, you need 8 float4 adds with SSE, 4 float8 adds with AVX, 2 float16 adds with AVX-512).
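For instance, a hedged sketch (not from the slides) of the same 32-element addition written with explicit x86 intrinsics; the AVX variant needs AVX enabled at compile time (e.g. -mavx):

#include <immintrin.h>

void add32_sse(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 32; i += 4)   /* 8 adds, 4 floats at a time (SSE) */
        _mm_storeu_ps(c + i,
            _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
}

void add32_avx(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 32; i += 8)   /* 4 adds, 8 floats at a time (AVX) */
        _mm256_storeu_ps(c + i,
            _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
}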

SIMD vs SPMD/SIMT: SPMD/SIMT

on GPU

you write what a single lane does:

float a, b, c;
a = /* compute or load scalar */;
b = /* compute or load scalar */;
c = a + b;

and ask the GPU to run this as many times as needed; the hardware takes care of distributing the load over its compute units.
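For example, a hedged OpenCL C sketch of the same element-wise addition written SPMD-style; the launch grid (one work-item per element) supplies the width:

kernel void
vadd(global const float *a, global const float *b, global float *c)
{
    int i = get_global_id(0);   /* which lane am I? */
    c[i] = a[i] + b[i];
}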

Wide cores

A GPU multiprocessor (MP) has multiple processing elements (PE), but they do not act independently from each other (similar to SIMD lanes, but with hardware-managed scheduling). Hardware and semantic details depend on architecture.

Limited flow control: optimized for predication, lane masking, no irreducible control flow (spaghetti code), no or limited recursion.

Note

GPUs run groups of work-items in lock-step. The group (warp in NVIDIA-speak, wavefront in AMD-speak) has a fixed size which is the SIMT width of the hardware. Even if you launch a single work-item, a full warp/wavefront will run!

NVIDIA: warp size 32

Tesla (Compute Capability 1)
  • 8 PE/MP, 4 cycles per warp, 1 scheduler/MP;
Fermi (Compute Capability 2)
  • 2×16 PE/MP, 2 cycles per warp, 2 schedulers/MP
  • 3×16 PE/MP, 2 cycles per warp, 2 schedulers/MP,
    dual-issue;
Kepler (Compute Capability 3)
  • 6×32 PE/MP, 1 cycle per warp, 4 schedulers/MP,
    dual-issue;
Maxwell (Compute Capability 5)
  • 4×32 PE/MP, 1 cycle per warp, 4 schedulers/MP,
    dual-issue only for compute + load/store.

AMD: wavefront size 64

A very different architecture: an Ultra-Threaded Dispatch Processor (UTDP) schedules wavefronts across all compute units.

TeraScale 1, 2 (VLIW5)

16×5 PE/MP, 4 cycles per wavefront

TeraScale 3 (VLIW4)

16×4 PE/MP, 4 cycles per wavefront

Graphic Core Next (scalar)

4×16 PE/MP, 4 cycles per wavefront, 4 wavefronts at a time

High-bandwidth, high-latency

GPU memory controllers: high bandwidth, high latency.

High latency

a transaction might take 100s of cycles to complete.

High bandwidth

a single transaction can serve a whole subgroup (warp/wavefront) under optimal conditions.

Optimal conditions

a “pixel”- or “vertex”-worth of data per work-item, for contiguous groups of “pixels”. How does this translate for compute?

Optimal data types

Preferred data types: 32-bit, 64-bit or 128-bit types, e.g. int, float, or vector types such as

struct float2 {
    float x; float y;
};

and

struct float4 {
    float x; float y; float z; float w;
};

Try to keep it aligned and contiguous.

Optimal data layout

Supported data types have alignment constraints, typically equal to the type size.

Data alignment
  • a float (size: 4 bytes) is read with a single transaction if its first byte is at an address which is a multiple of 4 (“4-byte boundary”)
  • a float2 is 2×float, but single transaction only if at 8-byte boundary.
  • a float4 is 4×float, but single transaction only if at 16-byte boundary.
  • what about a float3? (hint: ‘it depends’)

Optimal data layout (bis)

Subgroup alignment

a subgroup (warp/wavefront) is served optimally when it reads contiguous data whose first datum lies on a boundary that is a multiple of the transaction (segment) size, i.e. the data size times the number of values the hardware can serve in a single transaction (e.g. a warp of 32 work-items each reading a 4-byte float: the first float should sit on a 128-byte boundary).

Transaction model examples

Consider for example a class of older GPUs without a global memory cache:

NVIDIA GPUs with Compute Capability 1.x

half-warp granularity for transactions (16 work-items). A single transaction loads:

  • 16 4-byte words in a 64-byte segment (e.g. float)
  • 16 8-byte words in a 128-byte segment (e.g. float2)

Two transactions load:

  • 16 16-byte words in two consecutive 128-byte segments (e.g. float4)

Other architectures may have (more or less effective) global memory cache, which may help hide inefficient layout choices. But beware of relying on this too much.

Data layout hints

Preserve alignment
  • e.g., prefer float4 to float3 even if the fourth component is not needed;
Preserve contiguity

prefer structure of arrays (SoA) over arrays of structures (AoS);

Do not rely on caching

sometimes, bypassing the cache and using a good layout is better than exploiting the cache!

(Note: do use the cache(s) as much as possible, but lay out your data optimally regardless.)

Memory layout examples

Set up a particle system.

Each particle has (at least) the following properties:
  • position (3 floats);
  • velocity (3 floats).

How do you proceed?

The worst possible way

typedef struct particle {
    float3 pos;
    float3 vel;
} particle_t;

kernel void
iteration(global particle_t *particles)
{
    int workitem = get_global_id(0);
    float3 pos = particles[workitem].pos;
    float3 vel = particles[workitem].vel;
    /* do stuff */
}

Results in inefficient, repeated loads.

Worst-case: 4 transactions per work-item.

Avoid repeated loads

typedef struct particle {
    float3 pos;
    float3 vel;
} particle_t;

kernel void
iteration(global particle_t *particles)
{
    int workitem = get_global_id(0);
    particle_t pdata = particles[workitem];
    /* do stuff */
}

May improve over the previous one, depending on presence/efficiency of GPU caches. The whole struct can be loaded as three float2 loads, if the compiler is smart enough. (Note that it cannot be loaded as a float4+float2 due to alignment constraints.)

Worst-case: 3 transactions per work-item.

Align to float4

typedef struct particle {
    float4 pos; /* ignore 4th component */
    float4 vel; /* ignore 4th component */
} particle_t;

kernel void
iteration(global particle_t *particles)
{
    int workitem = get_global_id(0);
    particle_t pdata = particles[workitem];
    /* do stuff */
}

As previous, but compiler knows that it can load each element of the struct as a float4. Better if you only need to load pos or vel. Cache usage might be worse.

Worst-case: 2 transactions per work-item.

The efficient way

kernel void
iteration(
    global float4 * restrict posArray,
    global float4 * restrict velArray)
{
    int workitem = get_global_id(0);
    float4 pos = posArray[workitem]; /* ignore 4th component */
    float4 vel = velArray[workitem]; /* ignore 4th component */
    /* do stuff */
}

All work-items in a subgroup load pos in a single transaction and vel in another (single) transaction: two transactions load all the data for the whole warp/wavefront.

Hint

you can store extra info in the 4th component, e.g. mass and density, if you need additional properties.

Why is AoS inefficient?

Particle system loading patterns

Asynchronicity

command queue (CUDA stream)

implicit or explicit list of commands that the host issues on a device; each device is controlled by one (or more) command queues.

Commands are executed asynchronously: the host queues the commands, the device runs them as soon as possible, and the host can do other work in the meantime.

event

markers associated with commands to allow the host to synchronize with the devices.

The host can wait on events, query event status, or wait until the device has completed all commands.
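A hedged host-side sketch of these options in OpenCL (error checking omitted; q and k are assumed to be a valid command queue and kernel):

cl_event ev;
size_t gws = 1024;

/* the enqueue returns immediately; the kernel runs asynchronously */
clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, &ev);

/* ... the host can do other work here ... */

clWaitForEvents(1, &ev);    /* block until the kernel has completed */

/* or query the event status without blocking: */
cl_int status;
clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
               sizeof(status), &status, NULL);

/* or wait for everything queued so far: */
clFinish(q);

clReleaseEvent(ev);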

CUDA vs OpenCL

Host API differences

CUDA
  • full device control, including special features;
  • raw pointers to device buffers;
  • single-source, separate source possible (but not easy);
  • context and queue management can be implicit or manual;
  • large launch grids must be split manually;
OpenCL
  • device abstraction, vendors can expose special features via extensions;
  • abstract buffers, no raw device pointers (but see SVM in 2.0; see the sketch after this list);
  • separate source only;
  • contexts and queues must be managed manually;
  • automatic (platform) control of launch grid splitting.
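To make the pointer/buffer difference concrete, a hedged sketch of allocating and filling a device array with each API (error checking omitted; ctx, q, n and h_buf are assumed to already exist):

/* CUDA runtime (host code, needs cuda_runtime.h): a raw device pointer */
float *d_buf;
cudaMalloc((void **)&d_buf, n * sizeof(float));
cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

/* OpenCL (host code, needs CL/cl.h): an opaque cl_mem buffer object */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                            n * sizeof(float), NULL, NULL);
clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                     h_buf, 0, NULL, NULL);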

Device language differences

CUDA C++
  • based on C++98 (preliminary support for C++11 in CUDA 7.5);
  • no built-in standard library for vector math
    (but see nvVector.h in the CUDA samples);
  • no built-in syntax for vector selection and swizzling;
  • device intrinsics exposed;
  • global work-item ID must be assembled from hardware registers, no support for launch grid offsets;
OpenCL C
  • based on C99 (C++14 proposed for OpenCL 2.1);
  • extensive built-in standard library for vector math;
  • extended syntax for vector selection and swizzling (see the sketch after this list);
  • vendors may expose device intrinsics via extensions;
  • built-in methods to obtain any local and global ID, including support for launch grid offset.
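A hedged sketch of the OpenCL C vector selection/swizzle syntax (the kernel is made up purely for illustration):

kernel void swizzle_demo(global float4 *out)
{
    float4 v   = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float2 xy  = v.xy;          /* component selection */
    float4 rev = v.wzyx;        /* swizzle: reversed components */
    v.odd = (float2)(0.0f);     /* assign to the odd-indexed components (y, w) */
    out[get_global_id(0)] = rev + (float4)(xy, v.s2, 0.0f); /* .s2 is the third component */
}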

(end)