General-purpose programming on GPU

OpenCL

Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault

DMI — Università di Catania
Sezione di Catania — INGV

OpenCL, the Open Computing Language

OpenCL: a language and an API for cross-platform, hardware-independent parallel programming. It is composed of two parts:

a device language

OpenCL C (based on C99), used to write the kernels that run on the compute devices

a host API

used by the application to query platforms and devices, create contexts and command queues, manage memory and launch kernels

A registered trademark of Apple, OpenCL is developed by a consortium (the Khronos Group) with contributions from AMD, NVIDIA, IBM and Intel.


The OpenCL programming language is based on C99. Restrictions with respect to C99: no recursion, no function pointers, no variable-length arrays, and the standard C library is not available.

Additions to C99: address space qualifiers (__global, __local, __constant, __private), the __kernel function qualifier, vector data types, and a large set of built-in functions.
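As a minimal sketch (not from the original slides), the following kernel already shows several of the additions: the __kernel and __global qualifiers, a built-in indexing function and a vector type:

__kernel void saxpy(__global float4 *y, __global const float4 *x, float a)
{
    const size_t i = get_global_id(0);  /* one work-item per vector element */
    y[i] = a * x[i] + y[i];             /* scalar * vector and vector + vector */
}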

Cross-platform and heterogeneous programming

OpenCL ICD: Installable Client Driver. Vendor-provided component that interfaces between the OpenCL library and the actual hardware.

Example: NVIDIA, AMD and Intel each provide their own ICD (see the diagram below).

Multiple ICDs can coexist on the same machine, and the programmer can choose which one(s) to use.


[Figure: the OpenCL ICD structure — the application uses the OpenCL library, which dispatches to the vendor ICDs (NVIDIA, AMD, Intel), which in turn drive NVIDIA GPUs, ATI GPUs and x86(-64) CPUs respectively.]

Task-based vs data-based parallelism

OpenCL offers two flavors of parallelism:

data-based parallelism

as in CUDA, the same (set of) operation(s) is executed in parallel over a large set of data

task-based parallelism

multiple independent tasks are launched in parallel on independent sets of data

Or you can mix-n-match, with multiple data-parallel tasks.

From CUDA to OpenCL

CUDA vs OpenCL terminology
    CUDA               OpenCL
    thread             work-item
    block              work-group
    global memory      global memory
    constant memory    constant memory
    shared memory      local memory
    local memory       private memory

CUDA vs OpenCL syntax
    CUDA                      OpenCL
    __global__ (function)     __kernel
    __device__ (function)     not needed
    __constant__ (variable)   __constant
    __device__ (variable)     __global
    __shared__ (variable)     __local


CUDA vs OpenCL thread/work-item indexing
    CUDA                             OpenCL
    gridDim                          get_num_groups()
    blockDim                         get_local_size()
    blockIdx                         get_group_id()
    threadIdx                        get_local_id()
    threadIdx + blockIdx*blockDim    get_global_id()
    gridDim*blockDim                 get_global_size()


The equivalent of __syncthreads() from CUDA is

void barrier(cl_mem_fence_flags flags);

that waits on all work-items in a work-group. The additional flags are a combination of CLK_LOCAL_MEM_FENCE and CLK_GLOBAL_MEM_FENCE and are used to also guarantee that all accesses to local and global memory (respectively) are completed.
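As an illustration (a sketch, not from the original slides), a kernel that reverses each work-group's chunk of an array through local memory needs a barrier between writing and reading the local buffer:

__kernel void reverse_blocks(__global float *data, __local float *tile)
{
    const size_t lid = get_local_id(0);
    const size_t lsize = get_local_size(0);
    const size_t gid = get_global_id(0);

    tile[lid] = data[gid];              /* each work-item loads one element */
    barrier(CLK_LOCAL_MEM_FENCE);       /* wait until the whole tile is loaded */
    data[gid] = tile[lsize - 1 - lid];  /* read it back in reverse order */
}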

If the only requirement is for correct memory access ordering, one can use the memory fence instructions:

void mem_fence(cl_mem_fence_flags flags);
void read_mem_fence(cl_mem_fence_flags flags);
void write_mem_fence(cl_mem_fence_flags flags);

OpenCL also defines functions for asynchronous copies between global and local memory:

event_t async_work_group_copy(
    __local gentype *dst, const __global gentype *src,
    size_t numelements, event_t event);
event_t async_work_group_copy(
    __global gentype *dst, const __local gentype *src,
    size_t numelements, event_t event);
void wait_group_events(int num_events, event_t *event);

(compare with the use of shared memory as a software-managed cache on CUDA devices of compute capability 1.x)

gentype means a generic type supported by OpenCL.
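For example (an illustrative sketch, assuming each work-group processes one contiguous tile of the array), a kernel can prefetch its tile into local memory, wait for the copy, update it, and copy it back:

__kernel void scale_tile(__global float *data, __local float *tile, float factor)
{
    const size_t lsize = get_local_size(0);
    __global float *chunk = data + get_group_id(0) * lsize;

    event_t evt = async_work_group_copy(tile, chunk, lsize, 0);
    wait_group_events(1, &evt);         /* wait for the global-to-local copy */

    tile[get_local_id(0)] *= factor;    /* work on the local copy */
    barrier(CLK_LOCAL_MEM_FENCE);       /* make sure the tile is fully updated */

    evt = async_work_group_copy(chunk, tile, lsize, 0);
    wait_group_events(1, &evt);         /* wait for the local-to-global copy */
}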


Aside from the C99 scalar types, OpenCL supports vector data types, but only with power-of-two components. Examples: char4, float2, uint8, short16.

These types are only available in OpenCL device code. The host application can use the corresponding cl_ types (e.g. cl_long4).

Components in vector types are accessed using .[xyzw] or .s[0-9a-f].

More than one component can be accessed at a time:

float4 pos = (float4)(1.0f, 2.0f, 3.0f, 4.0f);

float2 bog = pos.wy; // (4.0f, 2.0f)

float4 dup = pos.xxyy; // (1.0f, 1.0f, 2.0f, 2.0f)

lvalues must not have repeated components: you can assign to pos.xz but not to pos.xx.


s-notation:

float8 fviii; float16 fxvi;

fxvi.sad = fviii.s01;

assigns the first two components of fviii to components sa and sd (indices 10 and 13) of fxvi.

There are also the suffixes .lo, .hi, .odd and .even:

float4 vf;
float2 low = vf.lo; // = vf.xy
float2 high = vf.hi; // = vf.zw
float2 ev = vf.even; // = vf.s02 = vf.xz
float2 od = vf.odd; // = vf.s13 = vf.yw

OpenCL supports most common operations on all vector types, both component-by-component with two vector operands, and component-with-scalar with a vector and a scalar operand.

Types are promoted automatically with rules similar to those of C99. Explicit conversion can be done with convert_<type>[_sat][_roundingMode](), where type is the destination type, the optional _sat modifier saturates out-of-range values, and the optional rounding mode is one of:

_rte

round to nearest even

_rtz

round toward zero (the default for conversions to integer types)

_rtp

round toward positive infinity

_rtn

round toward negative infinity

Examples:
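A few illustrative conversions (not from the original slides):

float4 f = (float4)(0.5f, 1.5f, 300.0f, -1.0f);
int4   i = convert_int4(f);         /* (0, 1, 300, -1): default rounding toward zero */
uchar4 c = convert_uchar4_sat(f);   /* (0, 1, 255, 0): out-of-range values saturate */
int4   r = convert_int4_rte(f);     /* (0, 2, 300, -1): round to nearest even */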

OpenCL also provides reinterpret casts: as_uint(1.0f) returns 0x3f800000 (the binary representation of the floating-point value 1.0).

A vector example:

float4 f, g;
int4 is_less = f < g;
f = as_float4(is_less & as_int4(f));

keeps each component of f that was less than the corresponding component of g and sets the others to 0 (vector comparisons return -1, i.e. all bits set, where true and 0 where false, so the bitwise AND preserves or clears the bits of f).


Additional built-in functions: math functions (sin, cos, exp, ...), geometric functions (dot, cross, length, normalize), common functions (clamp, min, max, mix), and relational functions (any, all, select).


CUDA vs OpenCL API objects
    CUDA                                  OpenCL
    CUdevice                              cl_device_id
    CUcontext                             cl_context
    CUmodule                              cl_program
    CUfunction                            cl_kernel
    CUdeviceptr                           cl_mem
    not available (streams come close)    cl_command_queue

OpenCL application structure

An OpenCL application typically:

selects a platform and the devices to use;
creates a context and a command queue per device;
creates and builds the programs, and extracts the kernels;
allocates memory buffers and sets the kernel arguments;
enqueues memory copies and kernel launches, synchronizing with events;
releases the allocated resources when done.

Differences from CUDA: kernels are normally compiled at run time from source (or loaded as device-specific binaries), all work is submitted through explicit command queues, and synchronization relies on events rather than streams.

Device types

When querying for devices or creating contexts, the programmer can choose the type of device:

CL_DEVICE_TYPE_DEFAULT

the default device type for the specific platform

CL_DEVICE_TYPE_CPU

a host CPU

CL_DEVICE_TYPE_GPU

a device with graphical capability

CL_DEVICE_TYPE_ACCELERATOR

a dedicated computing device (note: NVIDIA Teslas are reported as GPUs rather than accelerators)

CL_DEVICE_TYPE_ALL

any kind of device

Context creation

To create a context, the programmer can either choose a specific set of devices, or let the platform choose based on a given device type:

OpenCL context creation APIs

    cl_context clCreateContext(
        cl_context_properties *props,
        cl_uint num_devs,
        const cl_device_id *devs,
        void (*pfn_notify)(const char *errinfo,
                         const void *private_info,
                         size_t cb, void *user_data),
        void *user_data,
        cl_int *error);

    cl_context clCreateContextFromType(
        cl_context_properties *props,
        cl_device_type device_type,
        void (*pfn_notify)(const char *errinfo,
                         const void *private_info,
                         size_t cb, void *user_data),
        void *user_data,
        cl_int *error);

pfn_notify can be used to register a callback function that the OpenCL implementation can use to report errors to the application.
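A minimal sketch of context creation from a device type (the helper name and error handling are illustrative; the callback is optional and may be NULL):

#include <stdio.h>
#include <CL/cl.h>

static void CL_CALLBACK notify(const char *errinfo,
                               const void *private_info,
                               size_t cb, void *user_data)
{
    fprintf(stderr, "OpenCL error: %s\n", errinfo);
}

cl_context create_gpu_context(void)
{
    cl_int error;
    /* let the implementation pick a platform and use any available GPU */
    cl_context ctx = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                             notify, NULL, &error);
    return (error == CL_SUCCESS) ? ctx : NULL;
}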

Command queues

Device commands (memory copies, kernel launches) are managed in queues. A command queue is created with

cl_command_queue clCreateCommandQueue(
    cl_context ctx,
    cl_device_id device,
    cl_command_queue_properties properties,
    cl_int *error);

where the possible properties are:

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

used to allow out-of-order command execution on devices that support it

CL_QUEUE_PROFILING_ENABLE

used to enable profiling of commands


If out-of-order execution is not enabled, the application can assume that commands are executed in the order they are submitted. With out-of-order execution, synchronization between commands is enabled by associating commands with events and listing events to wait on.


Profiling information, when enabled, is retrieved with:

cl_int clGetEventProfilingInfo(
    cl_event event,
    cl_profiling_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

where event is a cl_event associated with the command to be profiled.


Currently all the parameters returned by the API are cl_ulongs and express times in nanoseconds according to a device time counter. Possible values (with CL_PROFILING_COMMAND_ prefix):

QUEUED

when the command was put in the queue

SUBMIT

when the command was submitted to the device

START

when execution started

END

when execution completed
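For instance (an illustrative sketch; event is assumed to be the cl_event of an already-completed command, e.g. after clFinish() on its queue), the elapsed device time is obtained from the START and END counters:

cl_ulong start = 0, end = 0;

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);

/* the counters are in nanoseconds on the device clock */
double elapsed_ms = (end - start) * 1e-6;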

Memory allocation

Memory is allocated with

cl_mem clCreateBuffer(cl_context ctx, cl_mem_flags flags,
                      size_t size, void *host_ptr, cl_int *error);

Flags (prefixed with CL_MEM_) are used to specify how the devices will use the memory (READ_WRITE, READ_ONLY, WRITE_ONLY), and how to interact with host memory:

USE_HOST_PTR

the memory pointed to by host_ptr is used as device memory; implementations can cache it to device memory;

ALLOC_HOST_PTR

the buffer should be allocated from host-accessible memory; this is mutually exclusive with USE_HOST_PTR;

COPY_HOST_PTR

the buffer is initialized by copying the memory pointed at by host_ptr;
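For example (an illustrative sketch; ctx is an already-created context), a read-only device buffer initialized from a host array can be created as:

float host_data[1024];  /* assumed to be filled by the host */
cl_int error;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(host_data), host_data, &error);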

Memory copy

Data is copied from a buffer (read) and to a buffer (write) with the following APIs:

OpenCL device/host memory copies

    cl_int clEnqueueReadBuffer(
        cl_command_queue command_queue,
        cl_mem buffer,
        cl_bool blocking,
        size_t offset,
        size_t numbytes,
        void *ptr,
        cl_uint num_events_in_wait_list,
        const cl_event *event_wait_list,
        cl_event *event)

    cl_int clEnqueueWriteBuffer(
        cl_command_queue command_queue,
        cl_mem buffer,
        cl_bool blocking,
        size_t offset,
        size_t numbytes,
        const void *ptr,
        cl_uint num_events_in_wait_list,
        const cl_event *event_wait_list,
        cl_event *event)

Events

The event_wait_list and the preceding counter specify optional events the memory copy must wait on before being executed. event is the event associated with the memory copy command, and can be used to let other commands wait on it, as well as for timing.

This pattern (event_wait_list and event) is available in all command queueing APIs and is the foundation of command synchronization, particularly when the command queue is created with out-of-order execution enabled.

OpenCL events do not need a specific API for creation/deletion (i.e., no cuEventCreate()/cuEventDestroy()), and are automatically managed by the command queue APIs.
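A sketch of how this pattern chains commands (queue, buffers, sizes and the kernel are assumed to exist already; clEnqueueNDRangeKernel is described later):

cl_event write_done, kernel_done;

/* non-blocking upload of the input data */
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_in,
                     0, NULL, &write_done);

/* the kernel launch waits for the upload to complete */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       1, &write_done, &kernel_done);

/* blocking download of the results after the kernel has finished */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host_out,
                    1, &kernel_done, NULL);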

Programs

Programs are collections of kernels, auxiliary functions and constant memory declarations.

A program can be created from strings representing an OpenCL source using

cl_program clCreateProgramWithSource(
    cl_context ctx,
    cl_uint count,
    const char **strings,
    const size_t *lengths,
    cl_int *error);

in which case the program is created for all devices in the context. strings is an array of count strings: NUL-terminated C strings if lengths is NULL, or arbitrary strings whose lengths are stored in lengths.


A program can also be created from binary blobs using

cl_program clCreateProgramWithBinary(
    cl_context ctx,
    cl_uint num_devs,
    const cl_device_id *device_list,
    const size_t *lengths,
    const unsigned char **binaries,
    cl_int *binary_status,
    cl_int *error);

and the programmer must then specify lists of devices and of binaries of equal length. The length of each blob is stored in the corresponding entry in lengths and the result of the load is stored in the corresponding entry of binary_status.


A source program is compiled with

cl_int clBuildProgram(
    cl_program program,
    cl_uint num_devices,
    const cl_device_id *device_list,
    const char *options,
    void (*pfn_notify)(cl_program, void *user_data),
    void *user_data)

that builds program for the devices specified in device_list.

Options that would need to be given to the compiler can be passed in the C string options. These can include standard preprocessor options (e.g. -Dmacro, -I/path/to/include/files), OpenCL compiler options (e.g. -cl-opt-disable to disable optimizations, -cl-fast-relaxed-math for fast but less accurate math) or vendor-specific options.
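Putting the two calls together (a minimal sketch; ctx and device are assumed to exist, and the macro name is purely illustrative):

const char *source =
    "__kernel void scale(__global float *v, float k) {"
    "    v[get_global_id(0)] *= k;"
    "}";
cl_int error;
cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &error);

/* build for a single device, defining a macro and enabling fast math */
error = clBuildProgram(program, 1, &device,
                       "-DBLOCK_SIZE=128 -cl-fast-relaxed-math",
                       NULL, NULL);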


If compilation fails (i.e. clBuildProgram returns something other than CL_SUCCESS), the output from the compiler can be retrieved with code such as this:

if (error == CL_BUILD_PROGRAM_FAILURE) {
    size_t logSize = 0;
    char *log;
    /* first query the size of the build log, then retrieve it */
    clGetProgramBuildInfo(program, device,
                          CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    log = (char *)malloc(logSize);
    clGetProgramBuildInfo(program, device,
                          CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    fprintf(stderr, "%s\n", log);
    free(log);
}

Kernels

Kernels are created from (compiled) programs using

cl_kernel clCreateKernel(cl_program program, const char *kernel_name, cl_int *error);

and they can be inspected with

cl_int clGetKernelInfo(
    cl_kernel kernel,
    cl_kernel_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

to get the kernel FUNCTION_NAME, NUM_ARGS, REFERENCE_COUNT, associated CONTEXT and PROGRAM (with param_name = CL_KERNEL_xxxx).


The programmer can also query the kernel for workgroup information. This is in general device-specific:

cl_int clGetKernelWorkGroupInfo(
    cl_kernel kernel, cl_device_id device,
    cl_kernel_work_group_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

and the possible information is the maximum WORK_GROUP_SIZE, the COMPILE_WORK_GROUP_SIZE, and the LOCAL_MEM_SIZE. The compile-time workgroup size can be specified to be (X, Y, Z) using

__attribute__((reqd_work_group_size(X, Y, Z)))

when declaring the kernel in the source file.
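For example (an illustrative sketch; kernel and device are assumed to exist), the maximum work-group size the kernel can be launched with on a given device is obtained as:

size_t max_wg_size = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);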

Launching the kernel

Kernel arguments are loaded with

cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                      size_t arg_size, const void *arg_value)

where arg_size is the size of the argument and arg_value points to its value.

Arguments that are declared __local by the kernel are an exception: in this case, arg_size is the amount of local memory to allocate, and arg_value must be NULL. This is equivalent to declaring shared memory extern in CUDA and setting its size with cuFuncSetSharedSize().


Data-parallel kernels are queued with

cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

where work_dim is the number of dimensions (0 < work_dim ≤ 3), global_work_offset must currently be NULL, and the global_work_size and local_work_size are arrays with the total number of work-items (in each dimension) and the number of work-items in a work-group (in each dimension), respectively.


The local_work_size can be NULL, in which case the implementation decides the size of the work-group. If it is specified, global_work_size must be a multiple of local_work_size.

As with memory operations, it is possible to specify events to wait on, and an event associated with the kernel launch itself.
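Putting it together (an illustrative sketch; queue, kernel and buf are assumed to exist, n is the problem size as a cl_uint, and the kernel is assumed to take a __global buffer, a scalar and a __local buffer):

size_t local_size = 128;
/* round the global size up to a multiple of the work-group size */
size_t global_size = ((n + local_size - 1) / local_size) * local_size;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);                  /* __global buffer */
clSetKernelArg(kernel, 1, sizeof(cl_uint), &n);                   /* scalar argument */
clSetKernelArg(kernel, 2, local_size * sizeof(cl_float), NULL);   /* __local buffer  */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size,
                       0, NULL, NULL);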


Task-parallel kernels are queued with

cl_int clEnqueueTask(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

In this case, a single instance of the function is invoked.

This is particularly useful on OpenCL devices that support concurrent running tasks and out-of-order execution (think multi-core CPUs and lightweight thread dispatching).

Cleaning up

Completion of all the commands in a queue is ensured by using

cl_int clFinish(cl_command_queue queue);

Allocated resources are freed with:

cl_int clReleaseMemObject(cl_mem buffer);
cl_int clReleaseKernel(cl_kernel kernel);
cl_int clReleaseProgram(cl_program program);
cl_int clReleaseCommandQueue(cl_command_queue queue);
cl_int clReleaseContext(cl_context ctx);