General-purpose programming on GPU

CUDA driver interface

Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault

DMI — Università di Catania
Sezione di Catania — INGV

Driver interface

CUDA runtime API: higher-level interface; initialization, context and module management are implicit, and kernels are launched through the <<<...>>> syntax, which requires nvcc.

CUDA driver API: lower-level interface; nothing is implicit, and the host code can be compiled by any C compiler, without nvcc.

Using the driver API, the program can (and actually must): initialize the driver, choose the device(s), create and manage its own contexts, load the modules containing the kernels, set the kernel parameters and launch the grids explicitly.

Contexts

A context is an environment through which the program manages the GPU resources: memory allocation, module loading, kernel launches.

A program must create at least one context, but it can create more than one (on the same or on different GPUs):

CUresult cuCtxCreate(CUcontext *ctx, unsigned int flags, CUdevice dev);

Each CPU thread has an associated stack of contexts. The topmost context is the current context, which most driver API functions work on. Manage the stack with

CUresult cuCtxPopCurrent(CUcontext *ctx); // pop the current context off the stack, making it "floating", and return it in ctx
CUresult cuCtxPushCurrent(CUcontext ctx); // push the floating context ctx onto the stack, making it current

Contexts are reference-counted; threads manage the reference counting with:

CUresult cuCtxAttach(CUcontext *ctx, unsigned int flags);
CUresult cuCtxDetach(CUcontext ctx);

A context whose reference count drops to 0 is destroyed. A context with reference count 1 can be destroyed explicitly, by the thread for which it is current, with

CUresult cuCtxDestroy(CUcontext ctx);
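A minimal sketch of the typical life cycle (error checking omitted; assumes a single GPU):

#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);                 /* initialize the driver API: must be the first call */
    cuDeviceGet(&dev, 0);      /* handle to the first GPU */
    cuCtxCreate(&ctx, 0, dev); /* create a context; it becomes current for this thread */

    /* ... allocate memory, load modules, launch kernels ... */

    cuCtxDestroy(ctx);         /* reference count is 1 and the context is current: destroy it */
    return 0;
}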

Memory management

Device memory pointers have their own type (CUdeviceptr). No information is available about the data type pointed to.

Allocation/deallocation:

CUresult cuMemAlloc(CUdeviceptr *devptr, size_t size);
CUresult cuMemAllocPitch(CUdeviceptr *devptr, size_t *pitch, size_t width, size_t height, unsigned int elementSize);
CUresult cuMemFree(CUdeviceptr devptr);
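For example, a sketch allocating a linear buffer of 1024 floats and a pitched 640x480 matrix of floats (error checking omitted; assumes a current context):

CUdeviceptr dbuf, dmat;
size_t pitch;

cuMemAlloc(&dbuf, 1024 * sizeof(float));   /* linear buffer */
cuMemAllocPitch(&dmat, &pitch, 640 * sizeof(float), 480, sizeof(float));
/* each row of dmat actually occupies pitch bytes (pitch >= 640*sizeof(float)) */

cuMemFree(dmat);
cuMemFree(dbuf);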

Memory setting and copying come in synchronous and asynchronous versions; only the synchronous versions are shown here.


Memset (pointer, value, number of elements):

CUresult cuMemsetD8    (CUdeviceptr dst, unsigned char uc, size_t N)
CUresult cuMemsetD16   (CUdeviceptr dst, unsigned short us, size_t N)
CUresult cuMemsetD32   (CUdeviceptr dst, unsigned int ui, size_t N)

2-D memset (pointer, pitch, value, width and height in elements):

CUresult cuMemsetD2D16 (CUdeviceptr dst, size_t pitch, unsigned short us, size_t Width, size_t Height)
CUresult cuMemsetD2D32 (CUdeviceptr dst, size_t pitch, unsigned int ui, size_t Width, size_t Height)
CUresult cuMemsetD2D8  (CUdeviceptr dst, size_t pitch, unsigned char uc, size_t Width, size_t Height)
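For instance, to zero the buffers allocated in the previous sketch (a float zero has an all-zero bit pattern, so a 32-bit memset with value 0 works; counts are in elements, not bytes):

cuMemsetD32(dbuf, 0, 1024);               /* 1024 32-bit elements */
cuMemsetD2D32(dmat, pitch, 0, 640, 480);  /* 640x480 32-bit elements, pitched rows */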

Copy:

CUresult cuMemcpyDtoD(CUdeviceptr dst, CUdeviceptr src, size_t size);
CUresult cuMemcpyDtoH(void *dst, CUdeviceptr src, size_t size);
CUresult cuMemcpyHtoD(CUdeviceptr dst, const void *src, size_t size);
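A typical round trip, again with the dbuf buffer from the allocation sketch (error checking omitted):

float hbuf[1024];
/* ... fill hbuf on the host ... */

cuMemcpyHtoD(dbuf, hbuf, sizeof(hbuf));   /* upload */
/* ... launch a kernel that updates dbuf ... */
cuMemcpyDtoH(hbuf, dbuf, sizeof(hbuf));   /* download the results */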

2-D copy:

CUresult cuMemcpy2D(const CUDA_MEMCPY2D *pattern);

typedef struct CUDA_MEMCPY2D_st {
    unsigned int srcXInBytes, srcY;
    CUmemorytype srcMemoryType;
    /* assign only one of these, depending on the type of the source */
        const void *srcHost;
        CUdeviceptr srcDevice;
        CUarray srcArray;

    unsigned int srcPitch;

    unsigned int dstXInBytes, dstY;
    CUmemorytype dstMemoryType;
    /* assign only one of these, depending on the type of the destination */
        void *dstHost;
        CUdeviceptr dstDevice;
        CUarray dstArray;

    unsigned int dstPitch;

    unsigned int WidthInBytes;
    unsigned int Height;
} CUDA_MEMCPY2D;

typedef enum CUmemorytype_enum {
    CU_MEMORYTYPE_HOST = 0x01,
    CU_MEMORYTYPE_DEVICE = 0x02,
    CU_MEMORYTYPE_ARRAY = 0x03
} CUmemorytype;
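For example, a sketch copying a 640x480 host matrix of floats into the pitched allocation dmat from the earlier sketch; fields that do not apply can be left at zero:

static float hmat[480][640];     /* host matrix, rows contiguous; static to avoid a large stack allocation */
CUDA_MEMCPY2D cp;

memset(&cp, 0, sizeof(cp));      /* memset from <string.h> */

cp.srcMemoryType = CU_MEMORYTYPE_HOST;
cp.srcHost       = hmat;
cp.srcPitch      = 640 * sizeof(float);

cp.dstMemoryType = CU_MEMORYTYPE_DEVICE;
cp.dstDevice     = dmat;
cp.dstPitch      = pitch;        /* as returned by cuMemAllocPitch */

cp.WidthInBytes  = 640 * sizeof(float);
cp.Height        = 480;

cuMemcpy2D(&cp);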

Module management

Modules are libraries that contain kernels, constant memory declarations and texture declarations.

They can be stored in external files, or embedded in the program as sequences of bytes (images).

They can be in cubin (architecture-specific binary) or PTX (virtual assembly) format. cubin and PTX files are generated by nvcc. PTX data must be compiled into the architecture-specific binary format before launch; this is done just-in-time by the driver when the module is loaded.

(There is also a "fat cubin" format that contains multiple cubin versions of the same device code, for different architectures.)


Loading a module:

/* external module */
CUresult cuModuleLoad(CUmodule *module, const char *fname);
/* load from a sequence of bytes */
CUresult cuModuleLoadData(CUmodule *module, const void *image);
/* load from a sequence of bytes, with custom just-in-time compilation options */
CUresult cuModuleLoadDataEx(CUmodule *module, const void *image, unsigned int numOptions, CUjit_option *options, void **optionValues);
/* load from a fat cubin */
CUresult cuModuleLoadFatBinary(CUmodule *module, const void *fatCubin);

Unloading a module:

CUresult cuModuleUnload(CUmodule hmod);
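A sketch of the typical usage, assuming the device code was compiled by nvcc into a file named kernels.cubin (hypothetical name):

CUmodule mod;

cuModuleLoad(&mod, "kernels.cubin");  /* load into the current context */
/* ... get function/global/texture handles and launch kernels ... */
cuModuleUnload(mod);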

Accessing data from a module:

/* get the address and size of a global symbol (constant) */
CUresult cuModuleGetGlobal (CUdeviceptr *dptr, size_t *bytes, CUmodule hmod, const char *name)
/* get a texture reference */
CUresult cuModuleGetTexRef (CUtexref *pTexRef, CUmodule hmod, const char *name)
/* get a handle to a function */
CUresult cuModuleGetFunction (CUfunction *hfunc, CUmodule hmod, const char *name)
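Continuing the sketch above (the names someKernel and scale_factor are hypothetical; someKernel is also the function used in the parameter-setting example below):

CUfunction someKernel;
CUdeviceptr d_scale;
size_t scale_bytes;

cuModuleGetFunction(&someKernel, mod, "someKernel");            /* a __global__ function */
cuModuleGetGlobal(&d_scale, &scale_bytes, mod, "scale_factor"); /* a __constant__ symbol */

float scale = 2.0f;
cuMemcpyHtoD(d_scale, &scale, sizeof(scale));                   /* initialize the constant */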

Kernel execution

Launching a kernel requires the following steps: obtaining a handle to the function from its module (cuModuleGetFunction), setting the kernel parameters, setting the block shape (and, if needed, the dynamic shared memory size), and finally launching the grid.

Setting parameters

Functions to set parameters:

/* add a float parameter */
CUresult cuParamSetf (CUfunction hfunc, int offset, float value)
/* add an int parameter */
CUresult cuParamSeti (CUfunction hfunc, int offset, unsigned int value)
/* add a parameter of arbitrary type and size */
CUresult cuParamSetv (CUfunction hfunc, int offset, void *ptr, unsigned int numbytes)

/* set the size of parameter block */
CUresult cuParamSetSize (CUfunction hfunc, unsigned int numbytes)

Each parameter must be set at the appropriate offset, dictated by its alignment requirements. NVIDIA suggests the following macro:

#define ALIGN_UP(offset, alignment) \
   (offset) = ((offset) + (alignment)-1) & ~((alignment)-1)

to update the offset so that it has the correct alignment. You would then use something like:

int offset = 0;

ALIGN_UP(offset, __alignof(dDst));
cuParamSetv(someKernel, offset, &dDst, sizeof(dDst)); /* dDst: a CUdeviceptr parameter */
offset += sizeof(dDst);

ALIGN_UP(offset, __alignof(width));
cuParamSeti(someKernel, offset, width);               /* width: an int parameter */
offset += sizeof(width);

cuParamSetSize(someKernel, offset);

Blocks and grids

The block shape is set with

CUresult cuFuncSetBlockShape(CUfunction funct, int x, int y, int z);

and the kernel is launched with one of

/* 1x1x1 grid */
CUresult cuLaunch(CUfunction f);
/* WxHx1 grid, blocks waiting for previous calls to return */
CUresult cuLaunchGrid(CUfunction f, int W, int H);
/* WxHx1 grid on a different stream, queue execution without blocking */
CUresult cuLaunchGridAsync(CUfunction f, int W, int H, CUstream hStream);
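Putting it together, a sketch launching someKernel (obtained in the module example above) on a 1-D problem of size n, with 256 threads per block (n is assumed to be defined; error checking omitted):

int threads = 256;
int blocks  = (n + threads - 1) / threads;   /* enough blocks to cover n elements */

cuFuncSetBlockShape(someKernel, threads, 1, 1);
cuLaunchGrid(someKernel, blocks, 1);
cuCtxSynchronize();                          /* wait for completion, if needed */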

Other kernel-related functions

You also have

CUresult cuFuncSetSharedSize (CUfunction hfunc, unsigned int bytes);

to set the size of the dynamically allocated (extern) shared memory, and

CUresult cuFuncSetCacheConfig (CUfunction hfunc, CUfunc_cache config);

to set the L1 cache/shared memory preference on devices of compute capability 2.0 and above; the same preference can be set context-wide using

CUresult cuCtxSetCacheConfig (CUfunc_cache config);

Finally, you can get information about the kernel with

CUresult cuFuncGetAttribute(int *pi, CUfunction_attribute attrib, CUfunction hfunc);

where attrib is one of:

CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK

Maximum threads per block (depends on kernel properties and device capabilities)

CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES

Static shared memory size

CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES

Constant memory required by the kernel

CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES

Local memory required by the kernel, per thread (very bad for performance)

CU_FUNC_ATTRIBUTE_NUM_REGS

Registers per thread used by the kernel

CU_FUNC_ATTRIBUTE_PTX_VERSION

Virtual architecture for which the PTX was compiled (e.g., if it's less than 1.3 it won't use doubles)

CU_FUNC_ATTRIBUTE_BINARY_VERSION

Binary architecture for which the binary was compiled (as above)
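For example, to check how many registers per thread a kernel uses and the largest block it supports:

int regs, maxThreads;

cuFuncGetAttribute(&regs, CU_FUNC_ATTRIBUTE_NUM_REGS, someKernel);
cuFuncGetAttribute(&maxThreads, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, someKernel);
printf("someKernel: %d registers/thread, at most %d threads/block\n", regs, maxThreads); /* printf from <stdio.h> */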

Concluding remarks

The low-level interface is much more complex to use. Most of its features are available in the runtime API, though. For example, the <<<...>>> syntax is converted by nvcc into runtime API calls.
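For instance, a launch such as kernel<<<grid, block>>>(x) expands, roughly, into the following (a sketch in terms of the runtime execution-control functions; the code actually emitted by nvcc differs in the details):

cudaConfigureCall(grid, block, 0, 0);  /* grid/block dimensions, shared memory, stream */
cudaSetupArgument(&x, sizeof(x), 0);   /* one call per argument, at its aligned offset */
cudaLaunch(kernel);                    /* entry point of the __global__ function */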

Also, the kernel attributes can be recovered with cudaFuncGetAttributes, which retrieves all attributes in a cudaFuncAttributes structure.
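A sketch (kernel is a __global__ function):

struct cudaFuncAttributes attr;

cudaFuncGetAttributes(&attr, kernel);
printf("static shared memory: %zu bytes, registers per thread: %d\n",
       attr.sharedSizeBytes, attr.numRegs);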


Why use the driver API then?

These concepts are useful when thinking about general-purpose computing on GPUs in a more general way, not bound to CUDA, as we will see with OpenCL.