General-purpose programming on GPU

First steps in CUDA

Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault

DMI — Università di Catania
Sezione di Catania — INGV

Vector types

CUDA has built-in support for vector types: multi-dimensional data with 1 to 4 components, addressed by .x, .y, .z, .w. Some definitions:

struct uchar1
  unsigned char x;

struct __align__(4) ushort2
  unsigned short x, y;

struct uint3
  unsigned int x, y, z;

struct __builtin_align__(16) float4
  float x, y, z, w;

You can make them available in CPU code by including the appropriate header:

#include <vector_types.h>

Example usages:

Texture memory

3D graphics uses textures to ‘draw’ fancy stuff on 3D surfaces. Texture allocation and reading are also made available to the computing part of a GPU and can improve memory access or reduce computations in some use cases.

Kernels only have read-only access to texture memory. Reading a texel (texture element) is done with a fetch at given coordinates on a texture reference.

Before it can be used, a texture reference must be bound to a memory region. Binding is done by the CPU. Multiple texture references can be bound to the same or to overlapping memory areas.

A texture reference can have one, two or three dimensions. They can be bound either to standard linear memory addresses, or to special memory allocations called CUDA Arrays.

Texture references can only be used for integers (signed and unsigned, 8-, 16- or 32-bit wide), floats, and for the corresponding 1-, 2- and 4-component (but not 3-component) vector types.

Normalization of the values (mapping the value range to [0.0, 1.0] or [-1.0, 1.0]) can also be done automatically by texture references for 8-bit or 16-bit signed and unsigned integers. In this mode, for example, an unsigned 8-bit value of 0xcd (decimal 205) will be fetched as 0.803921569f (205/255).

Texture coordinates are floats in the range [0, N) where N is the texture size in that dimension. Example: a 64×32 texture will have coordinates [0,63]×[0,31]. Textures can be set to use normalized coordinates, mapping the actual size to [0, 1) in all dimensions.

Out-of-bounds coordinates are clamped, i.e. replaced with the closest in-bound coordinate. Example: a fetch for (-3.3, 33.3) on the previous texture would retrieve (0, 31). With normalized coordinates, textures can be set to wrap out of bounds coordinates. A fetch for (1.25, -1.25) would retrieve (0.25, 0.75).

Coordinates that do not fall exactly on a texel can return either the nearest neighbour or a value interpolated linearly from the neighbouring texels.

Some high-level examples. A 1-dimensional texture of float elements, returning the corresponding element type:

texture<float, 1, cudaReadModeElementType> posTex;

A 2-dimensional texture of char4 elements, returning a float4 with components in [-1.0, 1.0]:

texture<char4, 2, cudaReadModeNormalizedFloat> pixTex;

Texture references are always static (they have file scope) and must be global (i.e. do not declare them inside a function or structure).

The texture<> construct is a high-level interface to the structure

struct textureReference {
    int                          normalized;
    enum cudaTextureFilterMode   filterMode;
    enum cudaTextureAddressMode  addressMode[3];
    struct cudaChannelFormatDesc channelDesc;

that allow you to choose if you want to normalize coordinates (normalized = 1), interpolate coordinates (filterMode = cudaFilterModeLinear) or set out-of-bounds coordinates to wrap in any particular direction (addressMode[i] = cudaAddressModeWrap).

Binding textures

Examples for binding textures to linear memory:

texture<float, 1> oneTex;
texture<float, 2> twoTex;
float *dVector;

cudaMalloc(&dVector, width*height*sizeof(float));
cudaBindTexture(NULL, oneTex, dVector, vecsize);
cudaBindTexture2D(NULL, twoTex, dVector, twoTex.channelDesc,
                          width, height, pitch);

In the 2D case it is necessary to specify the pitch, i.e. the byte length of a row. This is typically width*sizeof(element), but may be larger is the rows are padded in memory to ensure a given alignment.

The first parameter is used to retrieve the offset that must be used to access elements, but it's only needed when the memory was not allocated with cudaMalloc (e.g. a texture pointing to a subset of an existing memory area) to comply with the memory alignment requirements of textures.

Transposing an image, with and without textures:

Further texture examples: increasing the depth of an image, and various address modes

PAM is a simple but flexible uncompressed bitmap format developed within the netpbm image manipulation toolkit. ImageMagick can be used to display and convert PAM images.

CUDA Arrays

CUDA arrays are memory areas dedicate to textures. They are read-only for the GPU (and are only accessible through texture fetches), and can be written to by the CPU using cudaMemcpyToArray. Devices with capability 2.0 or higher can write to CUDA arrays using surfaces, a read-write feature similar to textures.

Allocating a CUDA array requires the specification of a channel format description matching the one used by the texture that needs to be mapped on it.


texture<> someTex;
cudaArray *dArray;
cudaMallocArray(&dArray, &someTex.channelDesc, width, height);

Use (can copy at an offset wo, ho in the array):

cudaMemcpyToArray(dArray, wo, ho, source, size, cudaMemcpyHostToDevice);
cudaBindTextureToArray(someTex, dArray);



The channel format description

The cudaChannelFormatDesc describes the format of a texture element.

struct cudaChannelFormatDesc {
    int x, y, z, w;
    enum cudaChannelFormatKind f;

where x, y, z, w are set to the number of bits for each component, and f is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat.

Example, for float texels we could create a channel with

cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

while for short4 texels this would be

cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindSigned);