Open Computing Language: cross-platform, multi-vendor language and API for parallel computing on massively parallel hardware (GPU, accelerators).
Started by Apple, since 2008 developed by Khronos, a consortium gathering all major hardware and software vendors (Apple, IBM, NVIDIA, AMD, Intel, ARM, etc).
Replaces proprietary interfaces for GPGPU (NVIDIA CUDA, AMD CAL). Also supports CPU, APU, accelerators.
It has even been extended to clusters (SnuCL, VirtualCL) while maintaining the same architecture.
A host may be connected to devices from different vendors. The ICD (Installable Client Driver) system allows runtime selection of the platform (vendor driver) to use. Each platform may expose only a subset of the devices in the system.
The application calls into the OpenCL library, which dispatches to the vendor ICDs and the devices they expose:

Application
  OpenCL library
    NVIDIA ICD → GPU NVIDIA
    AMD ICD → GPU AMD, APU AMD, CPU x86(-64)
    Intel ICD → CPU x86(-64), IGP Intel (HD 4000+), Xeon Phi
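A sketch of how the host enumerates platforms and devices through the ICD system (minimal, error checking omitted; real code should let the user choose):

#include <CL/cl.h>

cl_platform_id plat;
cl_device_id dev;
clGetPlatformIDs(1, &plat, NULL); /* first available platform (vendor driver) */
clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL); /* first device it exposes */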
Kernels are written in OpenCL C, a programming language based on C99, with extensions for a shared-memory parallel programming model.
Differences: additions for vector and parallel programming on GPU:
vector data types (intn, floatn, etc.);
address space qualifiers (global, local, constant);
built-in functions to query the number of work-items launched (globally and per work-group);
built-in functions to identify each work-item and work-group;
built-in functions to synchronize work-items in the same work-group.
(A kernel using these extensions is sketched below.)
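A hypothetical kernel illustrating these extensions (the kernel and its arguments are ours, not from the original slides):

kernel void scale(global float4 * restrict v, float k)
{
	const size_t i = get_global_id(0); /* identify this work-item */
	v[i] *= k; /* vector type float4: four multiplications in one operation */
}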
The code for the device is separate from the code for the host (separate source file, or text string embedded in host source).
An OpenCL program must load and compile the device code at runtime.
Disadvantage: most OpenCL programs share large chunks of initialization code (boilerplate), one of the most tedious parts of OpenCL. The typical steps:
select platform and device(s);
create context and command queue used to control the devices;
allocate memory for the devices;
copy data between host and device.
prepare kernel for execution;
launch kernels;
measure kernel performance.
OpenCL supports both data and task parallelism:
data parallelism: SPMD, each single kernel can be executed by multiple work-items; but also SIMD, with native support for vector data types and instructions;
task parallelism: multiple kernels executed by one work-item each, possibly on different devices.
OpenCL supports both single-device and multi-device models:
single-device: kernels running on one device; a simple approach, limited by the capabilities of the individual device; low communication cost;
multi-device: kernels running on different devices; a different paradigm (GPU → multi-GPU), with scalability limited by cost; high communication cost between devices.
In both models, device and host work independently (a form of task-based parallelism); the host manages synchronization (with and between devices).
Somewhat low-level interface: more complex to use, more boilerplate.
New approach, still in development.
Smaller software ecosystem, less mature tools. But things are improving (see e.g. Bolt, clMath, ArrayFire).
Great flexibility.
Excellent investment. Easy vectorized, parallel programming for both CPU and accelerators.
HPC is moving in this direction.
(The two most powerful supercomputers are based on accelerators (Intel Xeon Phi) and GPUs (NVIDIA Tesla K20X).)
Start from the simple stuff: we'll begin with plain C and move on to the parallel world of OpenCL.
Device code resides in its own file (e.g. vecinit.ocl
):
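The kernel source itself is not shown above; a minimal sketch of what vecinit.ocl might contain (assuming each element is initialized to its own index):

kernel void vecinit(global int * restrict vec)
{
	const int gid = get_global_id(0); /* global index of the work-item */
	vec[gid] = gid; /* each work-item initializes one element */
}

On the host side, launching it looks like this: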
cl_kernel vecinit;
cl_event run_vecinit(cl_mem vec, cl_int numels) {
cl_event vecinit_evt; /* event for this run */
clSetKernelArg(vecinit, 0, sizeof(vec), &vec);
size_t gws[1] = { numels }; /* size of the launch grid */
cl_int err = clEnqueueNDRangeKernel(que, vecinit,
1, NULL, gws, NULL,
0, NULL, &vecinit_evt);
return vecinit_evt;
}
The clEnqueueNDRangeKernel arguments in detail:
cl_int err = clEnqueueNDRangeKernel(
que, /* command queue */
vecinit, /* kernel */
1, /* dimensions in the grid (1, 2 or 3) */
NULL, /* offset, none in our case */
gws, /* global work size */
NULL, /* local work size, automatic in our case */
0, NULL, /* number and list of events to wait for */
&vecinit_evt); /* event associated with this command */
Similarly, to download data:
cl_int *host_vec = calloc(numels, sizeof(cl_int));
cl_event wait_list[] = { vecinit_evt };
cl_event download_evt; /* event for the download command */
cl_int err = clEnqueueReadBuffer(
que, /* command queue */
vec, /* buffer */
CL_TRUE, /* blocking, or CL_FALSE, non-blocking */
0, /* offset, in bytes, to start copying from */
numels*sizeof(cl_int), /* bytes to copy */
host_vec, /* destination in host memory */
1, wait_list, /* number and list of events to wait for */
&download_evt); /* event associated with this command */
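Uploading works symmetrically with clEnqueueWriteBuffer (a sketch; host_src is a hypothetical source array):

cl_event upload_evt;
cl_int err = clEnqueueWriteBuffer(
que, vec,
CL_FALSE, /* non-blocking */
0, numels*sizeof(cl_int), host_src,
0, NULL, /* number and list of events to wait for */
&upload_evt); /* event associated with this command */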
Some help: use the given file to reduce the boilerplate in your main program to:
#include "ocl_boiler.h" /* also includes CL/cl.h */
int main(int argc, char *argv[]) {
cl_platform_id p = select_platform();
cl_device_id d = select_device(p);
cl_context ctx = create_context(p, d);
cl_command_queue que = create_queue(ctx, d);
cl_program prog = create_program("kernels.ocl", ctx, d);
/* Here starts the custom part: extract kernels,
* allocate buffers, run kernels, get results */
return 0;
}
To launch a kernel, the host needs a handle to the kernel from the (runtime-compiled) device code:
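With the program object built by create_program, this is a minimal sketch (error checking omitted):

cl_int err;
vecinit = clCreateKernel(prog, "vecinit", &err);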
Buffers are abstract objects associated with the context, not with a specific device: they can migrate from one device to another depending on where the kernels that use them are run.
Buffers have use flags to hint how they will be used (for reading, for writing, or both).
Buffers can also have an associated host array, either for initialization or for mapping.
CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, CL_MEM_READ_WRITE: how the buffer will be used on the device;
CL_MEM_USE_HOST_PTR: the memory pointed to by host_ptr is used as device memory; implementations can cache it to device memory;
CL_MEM_ALLOC_HOST_PTR: the buffer should be allocated from host-accessible memory; mutually exclusive with CL_MEM_USE_HOST_PTR;
CL_MEM_COPY_HOST_PTR: the buffer is initialized by copying the memory pointed to by host_ptr.
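For example, the vec buffer used by vecinit could be created like this (a sketch; the flag choice is illustrative):

cl_int err;
cl_mem vec = clCreateBuffer(
ctx, /* context */
CL_MEM_READ_WRITE, /* kernels may both read and write it */
numels*sizeof(cl_int), /* size in bytes */
NULL, /* no associated host array */
&err);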
The host can synchronize with devices using:
clFlush(queue): wait until all commands in the queue have been submitted to the devices;
clFinish(queue): wait until all commands in the queue have been completed on the devices;
clWaitForEvents(num, wait_list): wait until all events in wait_list have been completed on the devices.
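For example, to block the host until the kernel run seen earlier has completed:

cl_event deps[] = { vecinit_evt };
clWaitForEvents(1, deps);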
When profiling is enabled for a command queue, the events associated with each command provide the following information:
CL_PROFILING_COMMAND_QUEUED: when the command was put in the queue;
CL_PROFILING_COMMAND_SUBMIT: when the command was submitted to the device;
CL_PROFILING_COMMAND_START: when execution started;
CL_PROFILING_COMMAND_END: when execution completed.
ocl_boiler.h provides runtime_ms to get a command runtime in milliseconds, and runtime_ns to get it in nanoseconds.
The number of bytes read and written by a command, divided by the runtime in nanoseconds, gives the effective bandwidth in GB/sec.
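A sketch of how the timestamps can be queried directly (the command queue must be created with the CL_QUEUE_PROFILING_ENABLE property):

cl_ulong start, end;
clGetEventProfilingInfo(vecinit_evt, CL_PROFILING_COMMAND_START,
sizeof(start), &start, NULL);
clGetEventProfilingInfo(vecinit_evt, CL_PROFILING_COMMAND_END,
sizeof(end), &end, NULL);
double elapsed_ns = end - start; /* runtime in nanoseconds */
/* bytes written by vecinit / runtime in ns = effective bandwidth in GB/s */
double bandwidth_gbs = numels*sizeof(cl_int)/elapsed_ns;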
Work-items in the same work-group can share data quickly using local memory.
Local memory is allocated per work-group, with contents not initialized at the beginning of the kernel. Contents do not persist after the kernel terminates.
The amount of local memory used by each work-group can be specified either statically (inside the kernel), or dynamically (by the host when launching the kernel). The latter is more common, as the amount typically relates to the work-group size and this may change from device to device.
Work-items must synchronize before reading data written by other work-items, using a memory fence: barrier(CLK_LOCAL_MEM_FENCE).
kernel void invert_lds( global int * restrict output,
global const int * restrict input,
local int * lds)
{
const int gid = get_global_id(0); /* global index of the work-item */
const int lid = get_local_id(0); /* local index of the work-item */
lds[lid] = input[gid]; /* load data in local memory from global memory */
/* wait for the other work-items in the work-group: */
barrier(CLK_LOCAL_MEM_FENCE);
/* here we can compute the result using also the data loaded by
 * other work-items; one possible computation (our assumption:
 * reverse the data within the work-group): */
const int result = lds[get_local_size(0) - 1 - lid];
output[gid] = result; /* write out the final result */
}
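On the host side, the size of the lds argument is set at launch time by passing NULL as the argument value (a sketch; invert_lds_krn and lws are hypothetical names):

size_t lws = 256; /* work-group size chosen for the launch */
/* local-memory argument: size in bytes, value must be NULL */
clSetKernelArg(invert_lds_krn, 2, lws*sizeof(cl_int), NULL);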
Reduction: obtain one value from a set of values, where the operation used to combine the values is commutative and associative (e.g. addition, multiplication, minimum, maximum).
Naive approach:
Issues:
Improvements:
Can we improve it still?
But:
In practice, this can be done with two launches:
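The original kernels are not shown; a minimal sketch of a sum reduction usable for both launches (our assumptions: the work-group size is a power of two, and the second launch uses a single work-group over the partial results):

kernel void reduce_sum(global int * restrict output,
	global const int * restrict input,
	local int * lds,
	int numels)
{
	const int gid = get_global_id(0);
	const int lid = get_local_id(0);

	/* load one element per work-item (0 if out of range) */
	lds[lid] = gid < numels ? input[gid] : 0;
	barrier(CLK_LOCAL_MEM_FENCE);

	/* tree reduction in local memory: halve the active work-items
	 * at each step */
	for (int stride = get_local_size(0)/2; stride > 0; stride /= 2) {
		if (lid < stride)
			lds[lid] += lds[lid + stride];
		barrier(CLK_LOCAL_MEM_FENCE);
	}

	/* first launch: one partial result per work-group;
	 * second launch: input = the partial results, one work-group,
	 * final value in output[0] */
	if (lid == 0)
		output[get_group_id(0)] = lds[0];
}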