General-purpose programming on GPU

Benchmarking, profiling and optimization

Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault

DMI — Università di Catania
Sezione di Catania — INGV


To benchmark a kernel execution, we use events. Events are like checkpointing stopwatches in GPU streams.

Defining events:

float runtime;
cudaEvent_t start, stop;

Using events:

cudaEventRecord(start, 0);
/* launch kernel */
cudaEventRecord(stop, 0);
cudaEventElapsedTime(&runtime, start, stop);

Destroying events:


Optimization tips

The GPU offers the best performance when all cores are computing. Optimization revolves around three main points:

Maximize parallelism

Using all the cores requires a large number of blocks (at least as many as there are MP on the GPU). On devices with capability 2.x, you can launch two kernels.

If you need to serialize (threads must depend on what other threads do), try to keep dependent threads on the same block: you can then use __syncthreads() instead of requiring separate kernel launches.

Hiding latency exploit parallelism between GPU components: if there are warps that can be dispatched while other warps wait, the latency is covered.

Latency hiding requires knowledge of how the instructions are dispatched, which depends on the device capability:

A typical instruction takes 22 cycles to complete. This means that 6 warps (capability 1.x) or 22 warps (capability 2.x) are the minimum to hide standard sequential execution (an instruction depending on the result of the previous instruction).

Global and local memory access takes 400 to 800 cycles. Warps needed to hide this latency depend on the density between instructions that do not depend on such memory and instructions that do depend on them.

When using synchronization, all warps in a block are stalled and their latency is that of the worst thread. This is covered by having more than one block per MP.

The occupancy (resident warps/maximum resident warps) can be computed using the CUDA Occupancy Calculator. A higher occupancy helps hiding synchronization latencies.

Minimizing memory latency

Memory accesses are managed in transactions. A single transaction accesses 32, 64 or 128 bytes, aligned to their size (e.g. 128-byte transactions must be aligned at a 128-byte boundary).

Memory access instructions are automatically managed in transactions. The number of transactions needed for memory accesses depend on:

A single memory instruction is issued to access powers-of-two bytes (1 to 16) which are naturally aligned. All other accesses are split in multiple instructions.

cudaMalloc() returns memory which is aligned at 256 bytes.

Row alignment for linearized 2D arrays can be improved using cudaMallocPitch():

data_t *devData; /* e.g. float *devData */
size_t pitch;
cudaMallocPitch(&devData, &pitch, numCols*sizeof(data_t), numRows);

The pitch is in bytes. Element (row, col) is then accessed as ((data_t*)((char*)devData + row*pitch))[col].

Copy this memory using cudaMemcpy2D():

cudaMemcpy2D(dest, destpitch, src, srcpitch, row_width, numrows, direction);

where destpitch, srcpitch and row_width are in bytes and indicate respectively:

Coalescing rules

Capability 1.0 and 1.1:

Capability 1.2 and 1.3:

Capability 2.0 and 2.1:

cudaFuncSetCacheConfig(kernelName, preference) where preference is one of:

Aligned, sequential

Aligned, non-sequential

Misaligned, sequential

(Images courtesy of the CUDA C Programming Guide)

Shared memory and bank conflicts

Shared memory is divided into banks. 32-bit (4-byte) words are assigned consecutively to banks. The bandwidth is 32 bits per two clock cycles.

Shared memory is typically as fast as the registers. Bank conflicts (accesses to the same bank) introduce a latency.

Each half-warp issues memory requests independently. Threads in the first half cannot have bank conflicts with threads in the second half.

Broadcasting: if all threads access the same word, the data is broadcast (no bank conflicts).

Common access: 32-bit strided access (float, float2, float3, float4 etc).

Bank conflicts happen if the stride is even (e.g. float2, float4), no bank conflict happens if stride is odd (e.g. float, float3).

8-bit and 16-bit access: all accesses cause bank conflicts unless the data is interleaved.

64-bit accesses or higher: bank conflicts cannot be avoided.

Capability 2.x: 32 banks instead of 16 banks. Memory access to the same 32-bit word are broadcast (no bank conflicts for 8-bit and 16-bit access). 64-bit accesses are split into non-conflict 32-bit accesses. 128-bit accesses cause 2-way bank conflicts.

(See pictures in the programming guide.)

Maximizing instruction throughput

Devices with capability 1.x have fast 24-bit integer multiply. __mul24 can be used instead of the standard product when the result does not overflow. Devices with capability 2.x have fast 32-bit integer multiply, and __mul24 is actually slower.

Some compiler flags control the use of functions such as logarithm, exponential, and trig functions. (--use_fast_math and the booleans --ftz=, --prec-div=, --prec-sqrt= which change treatment of denormals and the precision of division and square roots).

Optimization recommendations

  1. find ways to parallelize sequential code
  2. minimize data transfers between the host and the device
  3. adjust kernel launch configuration to maximize device utilization
  4. ensure global memory accesses are coalesced
  5. replace global memory accesses with shared memory accesses whenever possible
  6. avoid bank conflicts in shared memory
  7. avoid different execution paths within the same warp


CUDA profiler. Hands-on experience.

An example

Smoothing an image (averaging each pixel with its immediate neighbors).

To help reduce timing fluctuations, we launch the kernel 3 times and average the time. An extra kernel launch before the timing removes the kernel loading time.

Naive approach

Source: routines to load/save PAM files, source code

Timing on a GeForce 9600M GT: 57ms.

(Timings are for a 664x1024 RGB image).

Caching using shared memory

Threads in the same block use much of the same data. We can preload all data in shared memory and thus load each pixel data once.

Source: routines to load/save PAM files, source code

Timing on a GeForce 9600M GT: 19ms.

We have a 3x speed-up. The profiler shows that we have about 6 times less loads from global memory, but about 2 times more divergent branches.

Aligning the pixel data

The profiler shows that all reads are uncoalesced (gld coalesced is 0). This is because we read ushort3 data (RGB, no alpha channel). We can force the data to be aligned optimally by padding pixel data so that it always has an even number of channels.

Source: routines to load/save PAM files with padding, source code

Timing on a GeForce 9600M GT: 16ms.

Aligning the pixel data and caching the loads

We can still use shared memory to cache the data. However, since more memory is now used, we must use smaller blocks

Source: routines to load/save PAM files with padding, source code

Timing on a GeForce 9600M GT: 13ms.

Aligning the image rows

Not all rows are properly aligned, because each row has 664 pixels. Indeed, working with the transposed image (1024x664) gives better timings: 13ms without shared memory, 12ms with shared memory.

We can align all rows properly using pitched memory allocation.

Source: routines to load/save PAM files with padding, source code

Timing on a GeForce 9600M GT: 16ms.

Aligning rows and caching

We can combine the previous optimization with the shared memory caching. Again, we have to reduce the block size.

Source: routines to load/save PAM files with padding, source code

Timing on a GeForce 9600M GT: 12ms.

A note on shared memory initialization

There are a number of possible approaches to loading data into shared memory. The one shown in the mentioned files (load central block, load borders) is not the most efficient.

Animated graph: shared memory loading pattern

Final remarks

The timings on these examples are done on a GeForce 9600M GT, which has capability 1.1 and therefore the worst possible coalescing capability. On different hardware, different strategies would be more appropriate.

Optimizing memory access on lower hardware gives a much more significant benefit than relying on shared memory for caching. On 2.0 hardware and higher, we would not try using the shared memory at all, and expoit the L1 caching instead.