The CUDA runtime API is the high-level interface (the cuda* functions): initialization, context and module management happen implicitly.
The CUDA driver API is the low-level interface (the cu* functions).
Using the driver API, the program can (and actually must) explicitly initialize the GPU, manage contexts, load modules, and configure and launch kernels.
A context is an environment through which the program manages the GPU resources: memory allocation, module loading, kernel launches.
A program must create at least one context, but it can create more than one (on the same or on different GPUs):
CUresult cuCtxCreate(CUcontext *ctx, unsigned int flags, CUdevice dev);
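A minimal initialization sequence might look like this (a sketch with error checking omitted; device 0 is an arbitrary choice):
#include <cuda.h>

CUdevice dev;
CUcontext ctx;
cuInit(0);                 /* must precede any other driver API call */
cuDeviceGet(&dev, 0);      /* handle to the first GPU */
cuCtxCreate(&ctx, 0, dev); /* the new context becomes current for this thread */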
Each CPU thread has an associated stack of contexts. The topmost context is the current context, which most driver API functions work on. Manage the stack with
CUresult cuCtxPopCurrent(CUcontext *ctx); // pop the current context off the stack, returning it in ctx; it becomes "floating"
CUresult cuCtxPushCurrent(CUcontext ctx); // push the floating context ctx onto the stack, making it current
Contexts are reference-counted; threads manage the reference counting with:
CUresult cuCtxAttach(CUcontext *ctx, unsigned int flags);
CUresult cuCtxDetach(CUcontext ctx);
A context whose reference count drops to 0 is destroyed. A context with reference count 1 can also be destroyed explicitly, by the thread for which it is current, with
CUresult cuCtxDestroy(CUcontext ctx);
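A sketch of the context life cycle in a single thread (dev is the device handle from the sketch above; in a real program the floating context would typically be pushed by a different thread):
CUcontext ctx, floating;
cuCtxCreate(&ctx, 0, dev);   /* created and pushed: current for this thread */
cuCtxPopCurrent(&floating);  /* now floating: no thread has it as current */
cuCtxPushCurrent(floating);  /* current again (any thread could do this) */
cuCtxDestroy(floating);      /* refcount 1 and current: can be destroyed */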
Device memory pointers have their own type (CUdeviceptr). No information is available about the data type pointed to.
Allocation/deallocation:
CUresult cuMemAlloc(CUdeviceptr *devptr, size_t size);
CUresult cuMemAllocPitch(CUdeviceptr *devptr, size_t *pitch, size_t widthInBytes, size_t height, unsigned int elementSize);
CUresult cuMemFree(CUdeviceptr devptr);
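For example (a sketch; the sizes are arbitrary and error checking is omitted):
CUdeviceptr dbuf, dimg;
size_t pitch;
cuMemAlloc(&dbuf, 1024 * sizeof(float));  /* 1024 floats, linear */
/* 640x480 floats; each row is padded to a pitch chosen by the driver */
cuMemAllocPitch(&dimg, &pitch, 640 * sizeof(float), 480, sizeof(float));
cuMemFree(dimg);
cuMemFree(dbuf);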
Memory setting and copying have synchronous and asynchronous versions; only the synchronous versions are shown here.
Memset (pointer, value, number of elements):
CUresult cuMemsetD8 (CUdeviceptr dst, unsigned char uc, size_t N)
CUresult cuMemsetD16 (CUdeviceptr dst, unsigned short us, size_t N)
CUresult cuMemsetD32 (CUdeviceptr dst, unsigned int ui, size_t N)
2-D memset (pointer, pitch, value, width and height in elements):
CUresult cuMemsetD2D8 (CUdeviceptr dst, size_t pitch, unsigned char uc, size_t Width, size_t Height)
CUresult cuMemsetD2D16 (CUdeviceptr dst, size_t pitch, unsigned short us, size_t Width, size_t Height)
CUresult cuMemsetD2D32 (CUdeviceptr dst, size_t pitch, unsigned int ui, size_t Width, size_t Height)
Copy:
CUresult cuMemcpyDtoD(CUdeviceptr dst, CUdeviceptr src, size_t size);
CUresult cuMemcpyDtoH(void *dst, CUdeviceptr src, size_t size);
CUresult cuMemcpyHtoD(CUdeviceptr dst, const void *src, size_t size);
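A round trip combining allocation, memset and copies (a sketch, error checking omitted):
float host[256];
CUdeviceptr dptr;
cuMemAlloc(&dptr, sizeof(host));
cuMemsetD32(dptr, 0, 256);               /* clear 256 32-bit words */
cuMemcpyHtoD(dptr, host, sizeof(host));  /* upload */
cuMemcpyDtoH(host, dptr, sizeof(host));  /* download */
cuMemFree(dptr);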
2-D copy:
CUresult cuMemcpy2D(const CUDA_MEMCPY2D *pattern);
typedef struct CUDA_MEMCPY2D_st {
    unsigned int srcXInBytes, srcY;
    CUmemorytype srcMemoryType;
    /* assign only one of these, depending on the type of the source */
    const void *srcHost;
    CUdeviceptr srcDevice;
    CUarray srcArray;
    unsigned int srcPitch;
    unsigned int dstXInBytes, dstY;
    CUmemorytype dstMemoryType;
    /* assign only one of these, depending on the type of the destination */
    void *dstHost;
    CUdeviceptr dstDevice;
    CUarray dstArray;
    unsigned int dstPitch;
    unsigned int WidthInBytes;
    unsigned int Height;
} CUDA_MEMCPY2D;
typedef enum CUmemorytype_enum {
    CU_MEMORYTYPE_HOST = 0x01,
    CU_MEMORYTYPE_DEVICE = 0x02,
    CU_MEMORYTYPE_ARRAY = 0x03
} CUmemorytype;
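For instance, copying a width x height image of floats from host memory to pitched device memory could be sketched as follows (hostImage, devImage, devPitch, width and height are hypothetical names; memset is from <string.h>, and the unused fields must be zeroed):
CUDA_MEMCPY2D cp;
memset(&cp, 0, sizeof(cp));          /* zero the fields we do not set */
cp.srcMemoryType = CU_MEMORYTYPE_HOST;
cp.srcHost = hostImage;
cp.srcPitch = width * sizeof(float); /* host rows are densely packed */
cp.dstMemoryType = CU_MEMORYTYPE_DEVICE;
cp.dstDevice = devImage;
cp.dstPitch = devPitch;              /* pitch returned by cuMemAllocPitch */
cp.WidthInBytes = width * sizeof(float);
cp.Height = height;
cuMemcpy2D(&cp);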
Modules are libraries that contain kernels, constant memory declarations and texture declarations.
They can be stored in files (cubin or ptx) or embedded in the program as sequences of bytes.
cubin and ptx files are generated by nvcc; ptx data must be compiled into an architecture-specific binary format before launch.
(There is also a "fat cubin" format that contains multiple cubin versions of the same device code, for different architectures.)
Loading a module:
/* load from a file (cubin or ptx) */
CUresult cuModuleLoad(CUmodule *module, const char *fname);
/* load from a sequence of bytes */
CUresult cuModuleLoadData(CUmodule *module, const void *image);
/* load from a sequence of bytes, with custom just-in-time compilation options */
CUresult cuModuleLoadDataEx(CUmodule *module, const void *image, unsigned int numOptions, CUjit_option *options, void **optionValues);
/* load from a fat cubin */
CUresult cuModuleLoadFatBinary(CUmodule *module, const void *fatCubin);
Unloading a module:
CUresult cuModuleUnload(CUmodule hmod);
Accessing data from a module:
/* get the address and size of a global symbol (constant) */
CUresult cuModuleGetGlobal (CUdeviceptr *dptr, size_t *bytes, CUmodule hmod, const char *name)
/* get a texture reference */
CUresult cuModuleGetTexRef (CUtexref *pTexRef, CUmodule hmod, const char *name)
/* get a handle to a function */
CUresult cuModuleGetFunction (CUfunction *hfunc, CUmodule hmod, const char *name)
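A typical sequence (a sketch; the file name, kernel name and symbol name are hypothetical):
CUmodule mod;
CUfunction kernel;
CUdeviceptr dcoeffs;
size_t nbytes;
cuModuleLoad(&mod, "kernels.cubin");                 /* compiled with nvcc */
cuModuleGetFunction(&kernel, mod, "myKernel");       /* a __global__ function */
cuModuleGetGlobal(&dcoeffs, &nbytes, mod, "coeffs"); /* a __constant__ symbol */
/* ... set parameters and launch, as described below ... */
cuModuleUnload(mod);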
Launching a kernel requires the following steps: get a handle to the kernel (with cuModuleGetFunction), set its parameters, set the block shape, and launch it.
Functions to set parameters:
/* add a float parameter */
CUresult cuParamSetf (CUfunction hfunc, int offset, float value)
/* add an int parameter */
CUresult cuParamSeti (CUfunction hfunc, int offset, unsigned int value)
/* add a parameter of arbitrary type and size */
CUresult cuParamSetv (CUfunction hfunc, int offset, void *ptr, unsigned int numbytes)
/* set the size of parameter block */
CUresult cuParamSetSize (CUfunction hfunc, unsigned int numbytes)
Each parameter must be set at the appropriate offset, dictated by alignment requirements. NVIDIA suggests the following macro:
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment)-1) & ~((alignment)-1)
to update the offset so that it has the correct alignment. You would then use something like:
int offset = 0;
/* device pointer parameter dDst */
ALIGN_UP(offset, __alignof(dDst));
cuParamSetv(someKernel, offset, &dDst, sizeof(dDst));
offset += sizeof(dDst);
/* integer parameter width */
ALIGN_UP(offset, __alignof(width));
cuParamSeti(someKernel, offset, width);
offset += sizeof(width);
cuParamSetSize(someKernel, offset);
The block shape is set with
CUresult cuFuncSetBlockShape(CUfunction funct, int x, int y, int z);
and the kernel is launched with one of
/* 1x1x1 grid */
CUresult cuLaunch(CUfunction f);
/* WxHx1 grid, blocks waiting for previous calls to return */
CUresult cuLaunchGrid(CUfunction f, int W, int H);
/* WxHx1 grid on a different stream, queue execution without blocking */
CUresult cuLaunchGridAsync(CUfunction f, int W, int H, CUstream hStream);
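Continuing the parameter example above, a complete launch might be sketched as follows (someKernel and width as before; height is hypothetical, and 16x16 blocks are an arbitrary choice):
cuFuncSetBlockShape(someKernel, 16, 16, 1);  /* 256 threads per block */
/* enough blocks to cover a width x height domain */
cuLaunchGrid(someKernel, (width + 15) / 16, (height + 15) / 16);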
You also have
CUresult cuFuncSetSharedSize (CUfunction hfunc, unsigned int bytes);
to set the size of the dynamically allocated (extern __shared__) shared memory, and
CUresult cuFuncSetCacheConfig (CUfunction hfunc, CUfunc_cache config);
to set the L1 cache/shared memory preference on devices of compute capability 2.0 and above, which can also be set context-wide using
CUresult cuCtxSetCacheConfig (CUfunc_cache config);
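For example (a sketch; whether preferring shared memory or L1 helps depends on the kernel):
cuFuncSetCacheConfig(someKernel, CU_FUNC_CACHE_PREFER_SHARED); /* favor shared memory for this kernel */
cuCtxSetCacheConfig(CU_FUNC_CACHE_PREFER_L1);                  /* or set a context-wide preference */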
Finally, you can get information about the kernel with
CUresult cuFuncGetAttribute(int *pi, CUfunction_attribute attrib, CUfunction hfunc);
where attrib is one of:
CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK: maximum threads per block (depends on kernel properties and device capabilities)
CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES: static shared memory size
CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES: constant memory required by the kernel
CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES: local memory required by the kernel, per thread (very bad for performance)
CU_FUNC_ATTRIBUTE_NUM_REGS: registers per thread used by the kernel
CU_FUNC_ATTRIBUTE_PTX_VERSION: virtual architecture for which the PTX was compiled (e.g., if it is less than 1.3 it won't use doubles)
CU_FUNC_ATTRIBUTE_BINARY_VERSION: binary architecture for which the binary was compiled (as above)
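For example, to check the register pressure and the resulting limit on threads per block (a sketch, assuming <stdio.h> is included):
int regs, maxThreads;
cuFuncGetAttribute(&regs, CU_FUNC_ATTRIBUTE_NUM_REGS, someKernel);
cuFuncGetAttribute(&maxThreads, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, someKernel);
printf("%d registers/thread, up to %d threads/block\n", regs, maxThreads);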
The low-level interface is much more complex to use. Most of its features are available in the runtime API, though. For example, the <<<...>>> syntax is converted by nvcc into runtime API calls:
cudaConfigureCall to define the block, grid and shared memory;
cudaSetupArgument to define the arguments for the kernel call;
cudaLaunch to actually launch the kernel set up with the above.
Also, the kernel attributes can be recovered with cudaFuncGetAttributes, which retrieves all attributes in a cudaFuncAttributes structure.
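For example, the runtime equivalent of the attribute query above could be sketched as follows (myKernel is a hypothetical __global__ function; <stdio.h> is assumed):
struct cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, myKernel);
printf("%d registers/thread, %zu bytes of static shared memory\n",
       attr.numRegs, attr.sharedSizeBytes);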
Why use the driver API then?
These concepts are useful when thinking about general-purpose computing on GPUs in a more general way, not bound to CUDA, as we will see with OpenCL.