General-purpose programming on GPU

OpenCL

Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault

DMI — Università di Catania
Sezione di Catania — INGV

OpenCL, the Open Computing Language

OpenCL: a language and an API for cross-platform, hardware-independent parallel programming. It is composed of two parts:

a device language

OpenCL C (based on C99), used to write the kernels that run on the compute devices

a host API

used by the application to query platforms and devices, create contexts and command queues, manage memory and launch kernels

A registered trademark of Apple, OpenCL is developed by a consortium (the Khronos Group) with contributions from AMD, NVIDIA, IBM and Intel.


The OpenCL programming language is based on C99. Restrictions with respect to C99: no recursion, no function pointers, no variable-length arrays, and the standard C library is not available.

Additions to C99: address space qualifiers (__global, __local, __constant, __private), the __kernel function qualifier, vector data types, and a large set of built-in functions.
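As a minimal sketch (not from the original slides), the following kernel already shows several of the additions: the __kernel and __global qualifiers, a built-in indexing function and a vector type:

__kernel void saxpy(__global float4 *y, __global const float4 *x, float a)
{
    const size_t i = get_global_id(0);  /* one work-item per vector element */
    y[i] = a * x[i] + y[i];             /* scalar * vector and vector + vector */
}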

Cross-platform and heterogeneous programming

OpenCL ICD: Installable Client Driver. Vendor-provided component that interfaces between the OpenCL library and the actual hardware.

Example: NVIDIA, AMD and Intel each provide their own ICD (see the diagram below).

Multiple ICDs can coexist on the same machine, and the programmer can choose which one(s) to use.


[Figure: the OpenCL ICD structure — the application uses the OpenCL library, which dispatches to the vendor ICDs (NVIDIA, AMD, Intel), which in turn drive NVIDIA GPUs, ATI GPUs and x86(-64) CPUs respectively.]

Task-based vs data-based parallelism

OpenCL offers two flavors of parallelism:

data-based parallelism

as in CUDA, the same (set of) operation(s) is executed in parallel over a large set of data

task-based parallelism

multiple independent tasks are launched in parallel on independent sets of data

Or you can mix-n-match, with multiple data-parallel tasks.

From CUDA to OpenCL

CUDA vs OpenCL terminology
    CUDA               OpenCL
    thread             work-item
    block              work-group
    global memory      global memory
    constant memory    constant memory
    shared memory      local memory
    local memory       private memory

CUDA vs OpenCL syntax
    CUDA                      OpenCL
    __global__ (function)     __kernel
    __device__ (function)     not needed
    __constant__ (variable)   __constant
    __device__ (variable)     __global
    __shared__ (variable)     __local


CUDA vs OpenCL thread/work-item indexing
    CUDA                             OpenCL
    gridDim                          get_num_groups()
    blockDim                         get_local_size()
    blockIdx                         get_group_id()
    threadIdx                        get_local_id()
    threadIdx + blockIdx*blockDim    get_global_id()
    gridDim*blockDim                 get_global_size()


The equivalent of __syncthreads() from CUDA is

void barrier(cl_mem_fence_flags flags);

that waits on all work-items in a work-group. The additional flags are a combination of CLK_LOCAL_MEM_FENCE and CLK_GLOBAL_MEM_FENCE and are used to also guarantee that all accesses to local and global memory (respectively) are completed.
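As an illustration (a sketch, not from the original slides), a kernel that reverses each work-group's chunk of an array through local memory needs a barrier between writing and reading the local buffer:

__kernel void reverse_blocks(__global float *data, __local float *tile)
{
    const size_t lid = get_local_id(0);
    const size_t lsize = get_local_size(0);
    const size_t gid = get_global_id(0);

    tile[lid] = data[gid];              /* each work-item loads one element */
    barrier(CLK_LOCAL_MEM_FENCE);       /* wait until the whole tile is loaded */
    data[gid] = tile[lsize - 1 - lid];  /* read it back in reverse order */
}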

If the only requirement is for correct memory access ordering, one can use the memory fence instructions:

void mem_fence(cl_mem_fence_flags flags);
void read_mem_fence(cl_mem_fence_flags flags);
void write_mem_fence(cl_mem_fence_flags flags);

OpenCL also defines functions for asynchronous copies between global and local memory:

event_t async_work_group_copy(
    __local gentype *dst, const __global gentype *src,
    size_t numelements, event_t event);
event_t async_work_group_copy(
    __global gentype *dst, const __local gentype *src,
    size_t numelements, event_t event);
void wait_group_events(int num_events, event_t *event);

(compare with the use of shared memory as a software-managed cache on CUDA devices of compute capability 1.x)

gentype means a generic type supported by OpenCL.
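For example (an illustrative sketch, assuming each work-group processes one contiguous tile of the array), a kernel can prefetch its tile into local memory, wait for the copy, update it, and copy it back:

__kernel void scale_tile(__global float *data, __local float *tile, float factor)
{
    const size_t lsize = get_local_size(0);
    __global float *chunk = data + get_group_id(0) * lsize;

    event_t evt = async_work_group_copy(tile, chunk, lsize, 0);
    wait_group_events(1, &evt);         /* wait for the global-to-local copy */

    tile[get_local_id(0)] *= factor;    /* work on the local copy */
    barrier(CLK_LOCAL_MEM_FENCE);       /* make sure the tile is fully updated */

    evt = async_work_group_copy(chunk, tile, lsize, 0);
    wait_group_events(1, &evt);         /* wait for the local-to-global copy */
}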


Aside from the C99 scalar types, OpenCL supports vector data types, but only with power-of-two components. Examples: char4, float2, uint8, short16.

These types are only available in OpenCL device code. The host application can use the corresponding cl_ types (e.g. cl_long4).

Components in vector types are accessed using .[xyzw] or .s[0-9a-f].

More than one component can be accessed at a time:

float4 pos = (float4)(1.0f, 2.0f, 3.0f, 4.0f);

float2 bog = pos.wy; // (4.0f, 2.0f)

float4 dup = pos.xxyy; // (1.0f, 1.0f, 2.0f, 2.0f)

lvalues must not have repeated components: you can assign to pos.xz but not to pos.xx.


s-notation:

float8 fviii; float16 fxvi;

fxvi.sad = fviii.s01;

assigns the first two components of fviii to components sa and sd (indices 10 and 13) of fxvi.

There are also the suffixes .lo, .hi, .odd and .even:

float4 vf;
float2 low = vf.lo; // = vf.xy
float2 high = vf.hi; // = vf.zw
float2 ev = vf.even; // = vf.s02 = vf.xz
float2 od = vf.odd; // = vf.s13 = vf.yw

OpenCL supports most common operations on all vector types, both component-by-component with two vector operands, and component-with-scalar with a vector and a scalar operand.

Types are promoted automatically with rules similar to those of C99. Explicit conversion can be done with convert_<type>[_sat][_roundingMode](), where type is the destination type, the optional _sat modifier saturates out-of-range values, and the optional rounding mode is one of:

_rte

round to nearest even

_rtz

round toward zero (the default for conversions to integer types)

_rtp

round toward positive infinity

_rtn

round toward negative infinity

Examples:
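A few illustrative conversions (not from the original slides):

float4 f = (float4)(0.5f, 1.5f, 300.0f, -1.0f);
int4   i = convert_int4(f);         /* (0, 1, 300, -1): default rounding toward zero */
uchar4 c = convert_uchar4_sat(f);   /* (0, 1, 255, 0): out-of-range values saturate */
int4   r = convert_int4_rte(f);     /* (0, 2, 300, -1): round to nearest even */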

OpenCL also provides reinterpret casts: as_uint(1.0f) returns 0x3f800000 (the binary representation of the floating-point value 1.0).

A vector example:

float4 f, g;
int4 is_less = f < g;
f = as_float4(is_less & as_int4(f));

keeps each component of f that was less than the corresponding component of g and sets the others to 0 (vector comparisons return -1, i.e. all bits set, where true and 0 where false, so the bitwise AND preserves or clears the bits of f).


Additional built-in functions: math functions (sin, cos, exp, ...), geometric functions (dot, cross, length, normalize), common functions (clamp, min, max, mix), and relational functions (any, all, select).


CUDA vs OpenCL API objects
    CUDA                                  OpenCL
    CUdevice                              cl_device_id
    CUcontext                             cl_context
    CUmodule                              cl_program
    CUfunction                            cl_kernel
    CUdeviceptr                           cl_mem
    not available (streams come close)    cl_command_queue

OpenCL application structure

An OpenCL application typically:

selects a platform and the devices to use;
creates a context and a command queue per device;
creates and builds the programs, and extracts the kernels;
allocates memory buffers and sets the kernel arguments;
enqueues memory copies and kernel launches, synchronizing with events;
releases the allocated resources when done.

Differences from CUDA: kernels are normally compiled at run time from source (or loaded as device-specific binaries), all work is submitted through explicit command queues, and synchronization relies on events rather than streams.

Device types

When querying for devices or creating contexts, the programmer can choose the type of device:

CL_DEVICE_TYPE_DEFAULT

the default device type for the specific platform

CL_DEVICE_TYPE_CPU

a host CPU

CL_DEVICE_TYPE_GPU

a device with graphical capability

CL_DEVICE_TYPE_ACCELERATOR

a dedicated computing device (note: NVIDIA Teslas are reported as GPUs rather than accelerators)

CL_DEVICE_TYPE_ALL

any kind of device

Context creation

To create a context, the programmer can either choose a specific set of devices, or let the platform choose based on a given device type:

OpenCL context creation APIs

    cl_context clCreateContext(
        cl_context_properties *props,
        cl_uint num_devs,
        const cl_device_id *devs,
        void (*pfn_notify)(const char *errinfo,
                         const void *private_info,
                         size_t cb, void *user_data),
        void *user_data,
        cl_int *error);

    cl_context clCreateContextFromType(
        cl_context_properties *props,
        cl_device_type device_type,
        void (*pfn_notify)(const char *errinfo,
                         const void *private_info,
                         size_t cb, void *user_data),
        void *user_data,
        cl_int *error);

pfn_notify can be used to register a callback function that the OpenCL implementation can use to report errors to the application.
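A minimal sketch of context creation from a device type (the helper name and error handling are illustrative; the callback is optional and may be NULL):

#include <stdio.h>
#include <CL/cl.h>

static void CL_CALLBACK notify(const char *errinfo,
                               const void *private_info,
                               size_t cb, void *user_data)
{
    fprintf(stderr, "OpenCL error: %s\n", errinfo);
}

cl_context create_gpu_context(void)
{
    cl_int error;
    /* let the implementation pick a platform and use any available GPU */
    cl_context ctx = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                             notify, NULL, &error);
    return (error == CL_SUCCESS) ? ctx : NULL;
}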

Command queues

Device commands (memory copies, kernel launches) are managed in queues. A command queue is created with

cl_command_queue clCreateCommandQueue(
    cl_context ctx,
    cl_device_id device,
    cl_command_queue_properties properties,
    cl_int *error);

where the possible properties are:

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

used to allow out-of-order command execution on devices that support it

CL_QUEUE_PROFILING_ENABLE

used to enable profiling of commands


If out-of-order execution is not enabled, the application can assume that commands are executed in the order they are submitted. With out-of-order execution, synchronization between commands is enabled by associating commands with events and listing events to wait on.


Profiling information, when enabled, is retrieved with:

cl_int clGetEventProfilingInfo(
    cl_event event,
    cl_profiling_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

where event is a cl_event associated with the command to be profiled.


Currently all the parameters returned by the API are cl_ulongs and express times in nanoseconds according to a device time counter. Possible values (with CL_PROFILING_COMMAND_ prefix):

QUEUED

when the command was put in the queue

SUBMIT

when the command was submitted to the device

START

when execution started

END

when execution completed
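For instance (an illustrative sketch; event is assumed to be the cl_event of an already-completed command, e.g. after clFinish() on its queue), the elapsed device time is obtained from the START and END counters:

cl_ulong start = 0, end = 0;

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);

/* the counters are in nanoseconds on the device clock */
double elapsed_ms = (end - start) * 1e-6;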

Memory allocation

Memory is allocated with

cl_mem clCreateBuffer(cl_context ctx, cl_mem_flags flags,
                      size_t size, void *host_ptr, cl_int *error);

Flags (prefixed with CL_MEM_) are used to specify how the devices will use the memory (READ_WRITE, READ_ONLY, WRITE_ONLY), and how to interact with host memory:

USE_HOST_PTR

the memory pointed to by host_ptr is used as device memory; implementations can cache it to device memory;

ALLOC_HOST_PTR

the buffer should be allocated from host-accessible memory; this is mutually exclusive with USE_HOST_PTR;

COPY_HOST_PTR

the buffer is initialized by copying the memory pointed at by host_ptr;
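For example (an illustrative sketch; ctx is an already-created context), a read-only device buffer initialized from a host array can be created as:

float host_data[1024];  /* assumed to be filled by the host */
cl_int error;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(host_data), host_data, &error);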

Memory copy

Data is copied from a buffer (read) and to a buffer (write) with the following APIs:

OpenCL device/host memory copies

    cl_int clEnqueueReadBuffer(
        cl_command_queue command_queue,
        cl_mem buffer,
        cl_bool blocking,
        size_t offset,
        size_t numbytes,
        void *ptr,
        cl_uint num_events_in_wait_list,
        const cl_event *event_wait_list,
        cl_event *event)

    cl_int clEnqueueWriteBuffer(
        cl_command_queue command_queue,
        cl_mem buffer,
        cl_bool blocking,
        size_t offset,
        size_t numbytes,
        const void *ptr,
        cl_uint num_events_in_wait_list,
        const cl_event *event_wait_list,
        cl_event *event)

Events

The event_wait_list and the preceding counter specify optional events the memory copy must wait on before being executed. event is the event associated with the memory copy command, and can be used to let other commands wait on it, as well as for timing.

This pattern (event_wait_list and event) is available in all command queueing APIs and is the foundation of command synchronization, particularly when the command queue is created with out-of-order execution enabled.

OpenCL events do not need a specific API for creation/deletion (i.e., no cuEventCreate()/cuEventDestroy()), and are automatically managed by the command queue APIs.
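A sketch of how this pattern chains commands (queue, buffers, sizes and the kernel are assumed to exist already; clEnqueueNDRangeKernel is described later):

cl_event write_done, kernel_done;

/* non-blocking upload of the input data */
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_in,
                     0, NULL, &write_done);

/* the kernel launch waits for the upload to complete */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       1, &write_done, &kernel_done);

/* blocking download of the results after the kernel has finished */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host_out,
                    1, &kernel_done, NULL);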

Programs

Programs are collections of kernels, auxiliary functions and constant memory declarations.

A program can be created from strings representing an OpenCL source using

cl_program clCreateProgramWithSource(
    cl_context ctx,
    cl_uint count,
    const char **strings,
    const size_t *lengths,
    cl_int *error);

in which case the program is created for all devices in the context. strings is an array of count strings: NUL-terminated C strings if lengths is NULL, or arbitrary strings whose lengths are stored in lengths.


A program can also be created from binary blobs using

cl_program clCreateProgramWithBinary(
    cl_context ctx,
    cl_uint num_devs,
    const cl_device_id *device_list,
    const size_t *lengths,
    const unsigned char **binaries,
    cl_int *binary_status,
    cl_int *error);

and the programmer must then specify lists of devices and of binaries of equal length. The length of each blob is stored in the corresponding entry in lengths and the result of the load is stored in the corresponding entry of binary_status.


A source program is compiled with

cl_int clBuildProgram(
    cl_program program,
    cl_uint num_devices,
    const cl_device_id *device_list,
    const char *options,
    void (*pfn_notify)(cl_program, void *user_data),
    void *user_data)

that builds program for the devices specified in device_list.

Options that would need to be given to the compiler can be passed in the C string options. These can include standard preprocessor options (e.g. -Dmacro, -I/path/to/include/files), OpenCL compiler options (e.g. -cl-opt-disable to disable optimizations, -cl-fast-relaxed-math for fast but less accurate math) or vendor-specific options.
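Putting the two calls together (a minimal sketch; ctx and device are assumed to exist, and the macro name is purely illustrative):

const char *source =
    "__kernel void scale(__global float *v, float k) {"
    "    v[get_global_id(0)] *= k;"
    "}";
cl_int error;
cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &error);

/* build for a single device, defining a macro and enabling fast math */
error = clBuildProgram(program, 1, &device,
                       "-DBLOCK_SIZE=128 -cl-fast-relaxed-math",
                       NULL, NULL);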


If compilation fails (i.e. clBuildProgram returns something other than CL_SUCCESS), the output from the compiler can be retrieved with code such as this:

if (error == CL_BUILD_PROGRAM_FAILURE) {
    size_t logSize = 0;
    char *log;
    /* first query the size of the build log, then retrieve it */
    clGetProgramBuildInfo(program, device,
                          CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    log = (char *)malloc(logSize);
    clGetProgramBuildInfo(program, device,
                          CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    fprintf(stderr, "%s\n", log);
    free(log);
}

Kernels

Kernels are created from (compiled) programs using

cl_kernel clCreateKernel(cl_program program, const char *kernel_name, cl_int *error);

and they can be inspected with

cl_int clGetKernelInfo(
    cl_kernel kernel,
    cl_kernel_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

to get the kernel FUNCTION_NAME, NUM_ARGS, REFERENCE_COUNT, associated CONTEXT and PROGRAM (with param_name = CL_KERNEL_xxxx).


The programmer can also query the kernel for workgroup information. This is in general device-specific:

cl_int clGetKernelWorkGroupInfo(
    cl_kernel kernel, cl_device_id device,
    cl_kernel_work_group_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

and the possible information is the maximum WORK_GROUP_SIZE, the COMPILE_WORK_GROUP_SIZE, and the LOCAL_MEM_SIZE. The compile-time workgroup size can be specified to be (X, Y, Z) using

__attribute__((reqd_work_group_size(X, Y, Z)))

when declaring the kernel in the source file.
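For example (an illustrative sketch; kernel and device are assumed to exist), the maximum work-group size the kernel can be launched with on a given device is obtained as:

size_t max_wg_size = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);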

Launching the kernel

Kernel arguments are loaded with

cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                      size_t arg_size, const void *arg_value)

where arg_size is the size of the argument and arg_value points to its value.

Arguments that are declared __local by the kernel are an exception: in this case, arg_size is the amount of local memory to allocate, and arg_value must be NULL. This is equivalent to declaring shared memory extern in CUDA and setting its size with cuFuncSetSharedSize().


Data-parallel kernels are queued with

cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

where work_dim is the number of dimensions (0 < work_dim ≤ 3), global_work_offset must currently be NULL, and the global_work_size and local_work_size are arrays with the total number of work-items (in each dimension) and the number of work-items in a work-group (in each dimension), respectively.


The local_work_size can be NULL, in which case the implementation decides the size of the work-group. If it is specified, global_work_size must be a multiple of local_work_size.

As with memory operations, it is possible to specify events to wait on, and an event associated with the kernel launch itself.
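Putting it together (an illustrative sketch; queue, kernel and buf are assumed to exist, n is the problem size as a cl_uint, and the kernel is assumed to take a __global buffer, a scalar and a __local buffer):

size_t local_size = 128;
/* round the global size up to a multiple of the work-group size */
size_t global_size = ((n + local_size - 1) / local_size) * local_size;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);                  /* __global buffer */
clSetKernelArg(kernel, 1, sizeof(cl_uint), &n);                   /* scalar argument */
clSetKernelArg(kernel, 2, local_size * sizeof(cl_float), NULL);   /* __local buffer  */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size,
                       0, NULL, NULL);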


Task-parallel kernels are queued with

cl_int clEnqueueTask(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

In this case, a single instance of the function is invoked.

This is particularly useful on OpenCL devices that support concurrent running tasks and out-of-order execution (think multi-core CPUs and lightweight thread dispatching).

Cleaning up

Completion of all the commands in a queue is ensured by using

cl_int clFinish(cl_command_queue queue);

Allocated resources are freed with:

cl_int clReleaseMemObject(cl_mem buffer);
cl_int clReleaseKernel(cl_kernel kernel);
cl_int clReleaseProgram(cl_program program);
cl_int clReleaseCommandQueue(cl_command_queue queue);
cl_int clReleaseContext(cl_context ctx);