OpenCL: a framework for cross-platform, hardware-independent parallel programming. It is composed of two parts: a device-side programming language (based on C99) and a host-side runtime API.
A registered trademark of Apple, it is developed by a consortium (the Khronos Group) with contributions from AMD, NVIDIA, IBM, Intel.
The OpenCL programming language is based on C99. The main differences are additions to C99, such as the address-space qualifiers (`__global`, `__local`, `__constant`, etc.).

OpenCL ICD: Installable Client Driver. A vendor-provided component that interfaces between the OpenCL library and the actual hardware.
Multiple ICDs can coexist on the same machine, and the programmer can choose which one(s) to use. Example:
```
                 Application
               OpenCL library
 NVIDIA ICD  |   AMD ICD  |  Intel ICD
 NVIDIA GPUs |  ATI GPUs  | x86(-64) CPU
```
OpenCL offers two flavors of parallelism:
- data parallelism: as in CUDA, the same (set of) operation(s) is executed in parallel over a large set of data;
- task parallelism: multiple independent tasks are launched in parallel on independent sets of data.
Or you can mix-n-match, with multiple data-parallel tasks.
CUDA | OpenCL |
---|---|
thread | work-item |
block | work-group |
global memory | global memory |
constant memory | constant memory |
shared memory | local memory |
local memory | private memory |
CUDA | OpenCL |
---|---|
__global__ (function) | __kernel |
__device__ (function) | not needed |
__constant__ (variable) | __constant |
__device__ (variable) | __global |
__shared__ (variable) | __local |
CUDA | OpenCL |
---|---|
gridDim | get_num_groups() |
blockDim | get_local_size() |
blockIdx | get_group_id() |
threadIdx | get_local_id() |
threadIdx + blockIdx*blockDim | get_global_id() |
gridDim*blockDim | get_global_size() |
The equivalent of CUDA's `__syncthreads()` is

```c
void barrier(cl_mem_fence_flags flags);
```

which waits on all work-items in a work-group. The additional `flags` are a combination of `CLK_LOCAL_MEM_FENCE` and `CLK_GLOBAL_MEM_FENCE`, and are used to also guarantee that all accesses to local and global memory (respectively) have completed.
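A minimal sketch of the typical usage (kernel and variable names are made up for illustration): each work-item stages data in local memory, and the barrier guarantees all stores are visible before any work-item reads another's data.

```c
__kernel void reverse_in_group(__global float *data, __local float *tile)
{
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;

    tile[lid] = data[base + lid];
    /* ensure every work-item's store to local memory is visible
     * before any work-item reads from it */
    barrier(CLK_LOCAL_MEM_FENCE);
    data[base + lid] = tile[lsz - 1 - lid];
}
```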
If the only requirement is correct memory access ordering, one can use the memory fence functions:

```c
void mem_fence(cl_mem_fence_flags flags);
void read_mem_fence(cl_mem_fence_flags flags);
void write_mem_fence(cl_mem_fence_flags flags);
```
OpenCL also defines functions for asynchronous copies between global and local memory:
```c
event_t async_work_group_copy(
    __local gentype *dst, const __global gentype *src,
    size_t numelements, event_t event);
event_t async_work_group_copy(
    __global gentype *dst, const __local gentype *src,
    size_t numelements, event_t event);
void wait_group_events(int num_events, event_t *event);
```
(compare the use of shared memory as a software-managed cache on CUDA devices with compute capability 1.x)
gentype means a generic type supported by OpenCL.
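A sketch of the intended usage (names are illustrative): each work-group copies its tile of the input into local memory, waits for the copy to complete, then works on the local copy.

```c
__kernel void scale_by_two(__global float *out,
                           __global const float *in,
                           __local float *tile)
{
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;

    /* start a group-wide asynchronous copy of one tile */
    event_t ev = async_work_group_copy(tile, in + base, lsz, 0);
    /* all work-items wait for the copy to complete */
    wait_group_events(1, &ev);

    out[base + lid] = 2.0f * tile[lid];
}
```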
Aside from the C99 scalar types, OpenCL supports vector data types, but only with a power-of-two number of components. Examples: `char4`, `float2`, `uint8`, `short16`.

These types are only available in OpenCL sources. The host application can use the corresponding `cl_` types (e.g. `cl_long4`).

Components of vector types are accessed using `.[xyzw]` or `.s[0-9a-f]`.
More than one component can be accessed at a time:
```c
float4 pos = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float2 bog = pos.wy;   // (4.0f, 2.0f)
float4 dup = pos.xxyy; // (1.0f, 1.0f, 2.0f, 2.0f)
```

lvalues must not have repeated components: you can assign to `pos.xz` but not to `pos.xx`.
s-notation:

```c
float8 fviii;
float16 fxvi;
fxvi.sab = fviii.s01;
```

This assigns the first two components of `fviii` to components 10 and 11 (hex indices `a` and `b`) of `fxvi`.
There are also the suffixes `.lo`, `.hi`, `.odd` and `.even`:

```c
float4 vf;
float2 low  = vf.lo;   // = vf.xy
float2 high = vf.hi;   // = vf.zw
float2 ev   = vf.even; // = vf.s02 = vf.xz
float2 od   = vf.odd;  // = vf.s13 = vf.yw
```
OpenCL supports most common operations on all vector types, both component-by-component with two vector operands, and component-with-scalar with a vector and a scalar operand.
Types are automatically promoted following rules similar to those of C99.

Explicit conversion can be done with `convert_<type>[_flag]()`, where `type` is the destination type and `flag` is one of:

- `rte`: round to nearest even;
- `rtz`: round towards zero;
- `rtp` and `rtn`: round towards positive/negative infinity;
- `sat`: saturate (only when converting to integer; out-of-range values are converted to the nearest representable value).

Examples:
- `float4 f; convert_int4_sat(f)` converts the `float4` `f` to an `int4`, rounding with the default mode (round towards zero, for conversions to integer types) and saturating;
- `float4 f; convert_int4_sat_rtz(f)`: as above, but with the rounding mode made explicit.

OpenCL also provides reinterpret casts: `as_uint(1.0f)` returns `0x3f800000` (the binary representation of the floating-point value 1.0).
A vector example:
```c
float4 f, g;
int4 is_less = f < g; /* each component is -1 (all bits set) or 0 */
f = as_float4(is_less & as_int4(f));
```

keeps each component of `f` that was less than the corresponding component of `g` and sets the others to 0 (note the bitwise `&`: the comparison mask is ANDed with the bit pattern of `f`).
Additional built-in functions:

- `abs(x)`: absolute value of the argument;
- `abs_diff(x, y)`: absolute value of the difference, without overflow;
- `hadd(x, y)`: midpoint, without overflow;
- `max(x, y)`, `min(x, y)`: maximum / minimum;
- `rotate(v, i)`: rotate each component of `v` by the number of bits specified in `i`;
- `upsample(hi, lo)`: combines `hi` and `lo`, promoting the type (e.g. a `short` from two `char`s, a `long` from two `int`s);
- `dot(v1, v2)`: dot product;
- `cross(float4 v1, float4 v2)`: cross product (ignoring the 4th component);
- `[fast_]length(v)`: length of the vector `v`;
- `[fast_]normalize(v)`: vector with the same direction as `v` but length 1;
- `[fast_]distance(v1, v2)`: length of the difference;
- `select(a, b, c)`: each component is taken from `b` if the most significant bit of the corresponding component of `c` is set, from `a` otherwise.
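For instance, a branchless clamp-to-zero can be built from a comparison mask and `select` (a hypothetical snippet):

```c
float4 v = (float4)(-1.0f, 2.0f, -3.0f, 4.0f);
int4 neg = v < (float4)(0.0f); /* -1 (all bits set) where true */
/* take 0.0f where the MSB of neg is set, v otherwise */
float4 clamped = select(v, (float4)(0.0f), neg);
/* clamped = (0.0f, 2.0f, 0.0f, 4.0f) */
```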
CUDA | OpenCL |
---|---|
CUdevice | cl_device_id |
CUcontext | cl_context |
CUmodule | cl_program |
CUfunction | cl_kernel |
CUdeviceptr | cl_mem |
not available (streams come close) | cl_command_queue |
An OpenCL application typically queries the available platforms and devices, creates a context and command queue(s), builds its programs and extracts kernels from them, allocates buffers, and enqueues memory transfers and kernel launches (each step is covered below).

Differences from CUDA: no explicit initialization is needed (there is no equivalent of `cuInit(0)`).

When querying for devices or creating contexts, the programmer can choose the type of device:
- `CL_DEVICE_TYPE_DEFAULT`: the default device type for the specific platform;
- `CL_DEVICE_TYPE_CPU`: a host CPU;
- `CL_DEVICE_TYPE_GPU`: a device with graphics capabilities;
- `CL_DEVICE_TYPE_ACCELERATOR`: a dedicated computing device (note: NVIDIA Teslas are reported as GPUs rather than accelerators);
- `CL_DEVICE_TYPE_ALL`: any kind of device.
To create a context, the programmer can either choose a specific set of devices, or let the platform choose based on a given device type:

```c
cl_context clCreateContext(
    const cl_context_properties *properties,
    cl_uint num_devices,
    const cl_device_id *devices,
    void (*pfn_notify)(const char *errinfo, const void *private_info,
                       size_t cb, void *user_data),
    void *user_data,
    cl_int *error);

cl_context clCreateContextFromType(
    const cl_context_properties *properties,
    cl_device_type device_type,
    void (*pfn_notify)(const char *errinfo, const void *private_info,
                       size_t cb, void *user_data),
    void *user_data,
    cl_int *error);
```

`pfn_notify` can be used to register a callback function that the OpenCL implementation can use to report errors to the application.
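A minimal host-side sketch, assuming a single platform with at least one GPU and omitting error checking:

```c
#include <CL/cl.h>

cl_platform_id platform;
cl_device_id device;
cl_int error;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

/* context over an explicit list of devices, no error callback */
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &error);
```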
Device commands (memory copies, kernel launches) are managed in queues. A command queue is created with
```c
cl_command_queue clCreateCommandQueue(
    cl_context ctx,
    cl_device_id device,
    cl_command_queue_properties properties,
    cl_int *error);
```
where the possible properties are:

- `CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE`: allows out-of-order command execution on devices that support it;
- `CL_QUEUE_PROFILING_ENABLE`: enables profiling of commands.
If out-of-order execution is not enabled, the application can assume that commands are executed in the order they are submitted. With out-of-order execution, synchronization between commands is enabled by associating commands with events and listing events to wait on.
Profiling information, when enabled, is retrieved with:
```c
cl_int clGetEventProfilingInfo(
    cl_event event,
    cl_profiling_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);
```
where `event` is a `cl_event` associated with the command to be profiled.

Currently all the parameters returned by the API are `cl_ulong`s and express times in nanoseconds according to a device time counter. Possible values (with the `CL_PROFILING_COMMAND_` prefix):

- `QUEUED`: when the command was put in the queue;
- `SUBMIT`: when the command was submitted to the device;
- `START`: when execution started;
- `END`: when execution completed.
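A sketch of the typical usage, reusing `ctx`, `device` and `error` from the earlier sketch; the queue must have been created with profiling enabled:

```c
cl_command_queue queue = clCreateCommandQueue(ctx, device,
    CL_QUEUE_PROFILING_ENABLE, &error);

cl_event ev;
/* ... enqueue some command, passing &ev as the last argument ... */

clWaitForEvents(1, &ev);
cl_ulong start, end;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
    sizeof(start), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
    sizeof(end), &end, NULL);
printf("command took %lu ns\n", (unsigned long)(end - start));
```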
Memory is allocated with
```c
cl_mem clCreateBuffer(cl_context ctx, cl_mem_flags flags,
    size_t size, void *host_ptr, cl_int *error);
```
Flags (prefixed with `CL_MEM_`) are used to specify how the devices will use the memory (`READ_WRITE`, `READ_ONLY`, `WRITE_ONLY`), and how to interact with host memory:

- `USE_HOST_PTR`: the memory pointed to by `host_ptr` is used as device memory; implementations can cache it to device memory;
- `ALLOC_HOST_PTR`: the buffer should be allocated from host-accessible memory; this is mutually exclusive with `USE_HOST_PTR`;
- `COPY_HOST_PTR`: the buffer is initialized by copying the memory pointed to by `host_ptr`.
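For example, a read-only device buffer initialized from host data could be created as follows (a sketch, reusing `ctx` and `error` from above):

```c
float host_data[1024]; /* assume this has been filled in */
cl_mem buf = clCreateBuffer(ctx,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof(host_data), host_data, &error);
```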
Data is copied from a buffer (read) and to a buffer (write) with the following APIs:

```c
cl_int clEnqueueReadBuffer(
    cl_command_queue queue,
    cl_mem buffer,
    cl_bool blocking_read,
    size_t offset, size_t size, void *ptr,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

cl_int clEnqueueWriteBuffer(
    cl_command_queue queue,
    cl_mem buffer,
    cl_bool blocking_write,
    size_t offset, size_t size, const void *ptr,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);
```
The `event_wait_list` and the preceding counter are optional events the memory copy must wait on before being executed. `event` is the event associated with the memory copy command, and can be used to let other commands wait on it, as well as for timing.

This pattern (`event_wait_list` and `event`) is available in all command-queueing APIs and is the foundation of command synchronization, particularly when the command queue is created with out-of-order execution enabled.
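A sketch of the pattern, chaining a non-blocking write and a blocking read through an event (relevant mainly on out-of-order queues; `queue`, `buf` and `host_data` as in the earlier sketches):

```c
cl_event write_done;
clEnqueueWriteBuffer(queue, buf, CL_FALSE /* non-blocking */,
    0, sizeof(host_data), host_data,
    0, NULL, &write_done);

/* the read is only executed after the write has completed */
clEnqueueReadBuffer(queue, buf, CL_TRUE /* blocking */,
    0, sizeof(host_data), host_data,
    1, &write_done, NULL);
```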
OpenCL events do not need a specific API for creation/deletion (i.e., there is no `cuEventCreate()`/`cuEventDestroy()`); they are automatically managed by the command queue APIs.
Programs are collections of kernels, auxiliary functions and constant memory declarations.
A program can be created from strings representing an OpenCL source using
```c
cl_program clCreateProgramWithSource(
    cl_context ctx,
    cl_uint count,
    const char **strings,
    const size_t *lengths,
    cl_int *error);
```
in which case the program is created for all devices in the context. `strings` is an array of C strings, or of arbitrary strings whose lengths are stored in `lengths`.
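For example (a sketch; the kernel source is a made-up one-liner), passing `NULL` as `lengths` to indicate NUL-terminated strings:

```c
const char *src =
    "__kernel void dbl(__global float *d) {"
    "    d[get_global_id(0)] *= 2.0f;"
    "}";
cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &error);
```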
A program can also be created from binary blobs using
```c
cl_program clCreateProgramWithBinary(
    cl_context ctx,
    cl_uint num_devs,
    const cl_device_id *device_list,
    const size_t *lengths,
    const unsigned char **binaries,
    cl_int *binary_status,
    cl_int *error);
```
and the programmer must then specify lists of devices and of binaries of equal length. The length of each blob is stored in the corresponding entry of `lengths`, and the result of the load is stored in the corresponding entry of `binary_status`.
A source program is compiled with
```c
cl_int clBuildProgram(
    cl_program program,
    cl_uint num_devices,
    const cl_device_id *device_list,
    const char *options,
    void (*pfn_notify)(cl_program, void *user_data),
    void *user_data);
```
that builds `program` for the devices specified in `device_list`. Options that would need to be given to the compiler can be passed in the C string `options`. These can include standard preprocessor options (e.g. `-Dmacro`, `-I/path/to/include/files`), OpenCL compiler options (e.g. `-cl-opt-disable` to disable optimizations, `-cl-fast-relaxed-math` for fast but less accurate math) or vendor-specific options.
If compilation fails (i.e. the call returns something other than `CL_SUCCESS`), the output from the compiler can be retrieved with code such as this:
```c
if (error == CL_BUILD_PROGRAM_FAILURE) {
    size_t logSize = 0;
    char *log;
    /* query the size of the build log, then fetch it */
    clGetProgramBuildInfo(program, device,
        CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    log = (char *)malloc(logSize);
    clGetProgramBuildInfo(program, device,
        CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    fprintf(stderr, "%s", log); /* never pass the log as a format string */
    free(log);
}
```
Kernels are created from (compiled) programs using
```c
cl_kernel clCreateKernel(cl_program program,
    const char *kernel_name, cl_int *error);
```
and they can be inspected with
```c
cl_int clGetKernelInfo(
    cl_kernel kernel,
    cl_kernel_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);
```
to get the kernel `FUNCTION_NAME`, `NUM_ARGS`, `REFERENCE_COUNT`, and the associated `CONTEXT` and `PROGRAM` (with `param_name` = `CL_KERNEL_xxxx`).
The programmer can also query the kernel for workgroup information. This is in general device-specific:
```c
cl_int clGetKernelWorkGroupInfo(
    cl_kernel kernel, cl_device_id device,
    cl_kernel_work_group_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);
```
and the possible information is the maximum `WORK_GROUP_SIZE`, the `COMPILE_WORK_GROUP_SIZE`, and the `LOCAL_MEM_SIZE`. The compile-time workgroup size can be specified to be (X, Y, Z) using `__attribute__((reqd_work_group_size(X, Y, Z)))` when declaring the kernel in the source file.
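For example, to require a 64×1×1 work-group at compile time (a sketch):

```c
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void dbl(__global float *d)
{
    d[get_global_id(0)] *= 2.0f;
}
```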
Kernel arguments are loaded with
```c
cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
    size_t arg_size, const void *arg_value);
```
where `arg_size` is the size of the argument and `arg_value` is the value. An exception are arguments declared `__local` in the kernel: in this case, `arg_size` is the amount of local memory to allocate, and `arg_value` must be `NULL`. This is equivalent to declaring shared memory `extern` in CUDA and setting its size with `cuFuncSetSharedSize`.
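A sketch for a hypothetical kernel taking a buffer, a scalar, and a `__local` array (one `float` per work-item, with a work-group size of 64 assumed; `program`, `buf` and `error` as above):

```c
cl_kernel kernel = clCreateKernel(program, "my_kernel", &error);
cl_float factor = 2.0f;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernel, 1, sizeof(cl_float), &factor);
/* __local argument: size only, value must be NULL */
clSetKernelArg(kernel, 2, 64 * sizeof(cl_float), NULL);
```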
Data-parallel kernels are queued with
```c
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);
```
where `work_dim` is the number of dimensions (0 < `work_dim` ≤ 3), `global_work_offset` must currently be `NULL`, and `global_work_size` and `local_work_size` are arrays with the total number of work-items (in each dimension) and the number of work-items in a work-group (in each dimension), respectively.
`local_work_size` can be `NULL`, in which case the implementation decides the size of the work-groups. If it is specified, `global_work_size` must be a multiple of `local_work_size` in each dimension.
As with memory operations, it is possible to specify events to wait on, and an event associated with the kernel launch itself.
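A sketch of a one-dimensional launch (sizes are made up; `queue` and `kernel` as in the earlier sketches):

```c
size_t global_size = 1048576; /* total work-items */
size_t local_size  = 64;      /* work-items per work-group */

cl_event kernel_done;
clEnqueueNDRangeKernel(queue, kernel, 1,
    NULL,                      /* global_work_offset */
    &global_size, &local_size,
    0, NULL, &kernel_done);
```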
Task-parallel kernels are queued with
```c
cl_int clEnqueueTask(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);
```
In this case, a single instance of the function is invoked.
This is particularly useful on OpenCL devices that support concurrently running tasks and out-of-order execution (think multi-core CPUs with lightweight thread dispatching).
Completion of all commands in a queue can be waited for with

```c
cl_int clFinish(cl_command_queue queue);
```
Allocated resources are freed with:
```c
cl_int clReleaseKernel(cl_kernel kernel);
cl_int clReleaseMemObject(cl_mem buffer);
cl_int clReleaseProgram(cl_program program);
cl_int clReleaseCommandQueue(cl_command_queue queue);
cl_int clReleaseContext(cl_context ctx);
```