Writing OpenCL Programs and OpenCL C

Now we write some OpenCL Programs. OpenCL provides binding for C, and people, of course, have bind that to many languages, including Python, C++, Rust. Because Rust has a very good binding that is extremely easy to use, we will use Rust to write our programs.

This series focus on practice, and before start writing any OpenCL code, we need to request a platform, request a device, create a context, create a command queue, compile a program, create buffers, send the program...

ocl library provides an excellent ProQue type that eliminates the first few steps, so that we can focus on the essence of parallel programming instead of the trivial.

If you want to know how to do that traditionally, or fetch the details of each component, please refer to their example.

Bindings for other programming languages typically don't have such a convenient feature, so you'll do the whole venial task above.

Now let's start writing some OpenCL programs. The first program is vector sum.

Write an OpenCL Program in Rust

You'll use cargo to create a new project, then add ocl to the dependency.

extern crate ocl;
use ocl::ProQue;

fn task() -> ocl::Result<()> {
    let src = r#"
        __kernel void add(__global float* buffer, float scalar) {
            buffer[get_global_id(0)] += scalar;
        }
    "#;

    let pro_que = ProQue::builder()
        .src(src)
        .dims(1 << 4)
        .build()?;

    let buffer = pro_que.create_buffer::<f32>()?;

    let kernel = pro_que.kernel_builder("add")
        .arg(&buffer)
        .arg(10.0f32)
        .build()?;

    unsafe { kernel.enq()?; }

    let mut vec = vec![0.0f32; buffer.len()];
    buffer.read(&mut vec).enq()?;

    println!("The value at index [{}] is now '{}'!", 1, vec[1]);
    Ok(())
}

fn main() {
    task().unwrap();
}

The program is simple- we just add 10 to each element of the input vector.

Now let's break it down.

The OpenCL C Program

OpenCL C has special syntax. Now you need to know the following ones,

  • __kernel to declare a function to be a kernel function. Only kernel function can be called from the host. Please note that kernel function can only have void as return value, since, it is called from the host. The way in which the host receive the result is by reading from a buffer.
  • __global to declare a pointer to the global memory.
  • __private to declare a pointer to the private memory. If you don't write anything for a pointer, it is default to __private.
  • get_global_id(dim) is to get the index of the work item that this PE is assigned to.

Please distinguish kernel function from kernel. Kernel function is a function that can be later compiled to a kernel (a task to be executed) if we supply it with arguments. Kernel is the actual task that is sent to the device.

The pass-by-value acts the same as in C, it is just that the local variables are stored in the private memory.

For the following program,

__kernel void add(__global float* buffer, float scalar) {
    buffer[get_global_id(0)] += scalar;
}

This means that, for the (i, _, _) work item, it will add scalar to the value at the index i of buffer.

ProQue

ProQue is a utility chimera that has everything we need- we just take the default ProQue, then it will create the platform, device, context, program queue, and give us a command queue with the program. We need to specify our ND range when creating the ProQue, and the program source code.

We said that ND Range is a three element tuple. We can use three dim method to set that up in the builder in order. If we don't set a dimension, it is defaulted to one.

Buffer

We previously claimed that the global and constant global memory of a device are interoperable between itself and the host, which is done by buffer.

The buffer is simple- we allocate a chunk of host memory in the host machine, then its content will be sent to another chunk of memory in the global or global constant memory of the device as we execute the kernel. After the execution, mapped chunk of global memory will be written back to the host memory. Thus, if we operate the global memory buffer, we can later get our value on the host machine.

By default, create_buffer creates a buffer that is of the same length as the product of the ND Range. If you want more customization, use buffer_builder.

let buffer = pro_que.buffer_builder().fill_val(0.0f32).len(1 << 4).build()?;

Compile the Kernel

Previously, we only put the source code into the ProQue as a string. Now we use kernel_builder to create the kernel.

let kernel = pro_que.kernel_builder("add")
    .arg(&buffer)
    .arg(10.0f32)
    .build()?;

We first specify the name, then provide the arguments in order.

Execute the Kernel

Now we just sent the kernel to the device.

unsafe { kernel.enq()?; }

Please note that this operation is asynchronous, that the code doesn't block and wait for the device to finish.

Read the Result

Then we send a command to instruct the device to copy the data back to the host machine.

let mut vec = vec![0.0f32; buffer.len()];
buffer.read(&mut vec).enq()?;

I/O is synchronous, that the code will block and wait for the device to finish.

Keep in mind that queue is FIFO, so the device will always finish the kernel, then read the result to the host.

OpenCL C

Okay, previous part we write our first program. However, there are many OpenCL C things that are different from standard C.

If you need specification, read it here.

You need to remember the following things to write correct OpenCL C,

  • OpenCL C generally follows C99.
  • OpenCL C pointers always have a memory location prefix. If omitted, it is __private.
  • __kernel function must return void.
  • No double pointer.
  • No function pointer.
  • No recursion.
  • struct and union exists and can be used, but may slow down the process.
  • No library.
  • Limited support for inline and static functions.
  • No goto.

And there are some extra functions and types that are important,

Vector Types

OpenCL C has a number of vector types. The most important ones are,

  • <type>2 for 2D vector. This <type> means any native type, for example, float2, int2, etc.
  • <type>3
  • <type>4
  • <type>8
  • <type>16

There is no more than 16.

Work-Item Functions

  • get_global_id(dim) is to get the index of the work item that this PE is assigned to.
  • get_local_id(dim) is to get the offset of the work item that this PE is assigned to within the workgroup.
  • get_group_id(dim) is to get the index of the work group that this PE is assigned to.

The dim should be one of zero, one, or two.

It may seem hard to understand. Let's use a simple example (all count from zero, that is there are 0th)- we have work items 0 to 15, with each work group has a size of 4, and we have 16 PEs (0 to 15). Because the distribution is always in order. For the 6th PE, it will process the work item 6, and so, the get_global_id(0) is 6. The get_local_id(0) is 2, because the 6th PE got the second work item in this workgroup. And for the 6th PE, it's get_group_id(0) is 1, because the 6th PE processes tasks in the first work group.

Still in the previous scenario, if we have work items 0 to 31, in the second go, the get_global_id(0) for the 6th PE will be 22, because it gets the 22th work item in the second go. The get_local_id(0) is still 2 because, while the get_group_id(0) will be 5.

If the work items are not in order, we just logically re-label all the PE and CUs.

In summary, you can think of global_id and group_id as a parameter, where as local_id as a physical parameter that indicates which PE we are using in the CU. This will came into use later.

  • get_global_size(dim) is to get the total length of the ND Range on a dim.
  • get_work_dim() is to get the ND Range dimension.
  • get_local_size(dim) is to get the size of the work group.
  • get_num_work_groups(dim) is to get the number of work groups.

Branching Functions

if-else is extremely slow in GPU- we will talk about that later. So the following functions are preferred,

  • select(c, a, b) is to select a if c is true, otherwise select b.
  • max(a, b) is to select the maximum of a and b.
  • min(a, b) is to select the minimum of a and b.

Atomic Functions

These functions are atomic to resolve race condition- we can't use lock here because it's too slow.

In the following list, p is of type volatile __global int*.

  • atomic_add(p, val) is to add val to p.
  • atomic_sub(p, val) is to subtract val from p.
  • atomic_xchg(p, val) is to exchange p with val.
  • atomic_inc(p) is to increment p by one.
  • atomic_dec(p) is to decrement p by one.
  • atomic_cmpxchg(p, cmp, val) is to compare p with cmp and exchange p with val if they are equal.