Optimizing an OpenCL Program
After working through the first two parts, you have already covered most of the OpenCL features you need, and you can write many parallel programs. However, you can still make plenty of mistakes that make your programs run slower than they should. This chapter introduces common mistakes and how to fix them. Optimization is usually tied to hardware details.
Matrix Multiplication
Most of these pitfalls can be demonstrated with a simple case: matrix multiplication. We first write the simplest version,
extern crate ocl;
use ocl::{ProQue};
use std::time::Instant;
const src_1: &str = r#"
__kernel void matrix_multiply(__global float* A, __global float* B, __global float* C, const unsigned int N) {
int row = get_global_id(0);
int col = get_global_id(1);
for (int i = 0; i < N; i++) {
C[row * N + col] += A[row * N + i] * B[i * N + col];
}
}
"#;
fn task() -> ocl::Result<()> {
let m_n = 2048;
let matrix_size = m_n * m_n;
let mut m_a = vec![0.0f32; matrix_size];
let mut m_b = vec![0.0f32; matrix_size];
for i in 0..m_n {
for j in 0..m_n {
m_a[i * m_n + j] = (i as f32) + (j as f32);
m_b[i * m_n + j] = (i as f32) - (j as f32);
}
}
let pro_que = ProQue::builder()
.src(SRC_1)
.dims([m_n, m_n])
.build()?;
let buffer_a = pro_que.create_buffer::<f32>()?;
let buffer_b = pro_que.create_buffer::<f32>()?;
let buffer_c = pro_que.create_buffer::<f32>()?;
buffer_a.write(&m_a).enq()?;
buffer_b.write(&m_b).enq()?;
// The kernel accumulates with +=, so C must start zeroed; a freshly created
// buffer is not guaranteed to be zero-initialized.
buffer_c.write(&vec![0.0f32; matrix_size]).enq()?;
let kernel = pro_que.kernel_builder("matrix_multiply")
.arg(&buffer_a)
.arg(&buffer_b)
.arg(&buffer_c)
.arg(m_n as u32)
.build()?;
let start = Instant::now();
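// enq() is unsafe because ocl cannot validate the kernel arguments against
// the kernel source; it also returns as soon as the command is queued, so
// the blocking read below is what actually waits for the GPU to finish.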
unsafe { kernel.enq()?; }
let mut m_c = vec![0.0f32; matrix_size];
buffer_c.read(&mut m_c).enq()?;
let duration = start.elapsed();
println!("Matrix multiplication took: {:?}", duration);
println!("C[1][1] = {}", m_c[1 * m_n + 1]);
Ok(())
}
fn main() {
task().unwrap();
}
This takes around 2 s to complete on my machine, which is very bad. How can we build any AI with such slow performance? We must optimize it.
Avoid Global Memory Unless Necessary
Global memory access is extremely slow. How does this hardware fact help? Well, in our previous code, we have,
C[row * N + col] += A[row * N + i] * B[i * N + col];
where C is a pointer into global memory, so every loop iteration both reads and writes global memory. This can be avoided by using,
float sum = 0.0f;
for (int i = 0; i < N; i++) {
sum += A[row * N + i] * B[i * N + col];
}
C[row * N + col] = sum;
Local variables like sum are stored in private memory (usually registers), which is the fastest memory a work item has.
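For completeness, here is the snippet placed back into the kernel (everything else unchanged):
__kernel void matrix_multiply(__global float* A, __global float* B, __global float* C, const unsigned int N) {
    int row = get_global_id(0);
    int col = get_global_id(1);
    // Accumulate in a private variable (a register) instead of global memory
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += A[row * N + i] * B[i * N + col];
    }
    // A single global write at the end
    C[row * N + col] = sum;
}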
Now let's try again: 950 ms, less than half the original time, but still not good.
Memory Coalescing
We can further reduce the global memory access by a trick called memory coalescing.
When fetching from global memory, transactions are issued per wavefront (per half-wavefront on some earlier architectures). If the PEs in a wavefront want a contiguous chunk of data (for example, elements 0-31 or 32-63), the fetch is served in one transaction instead of one element at a time.
What does this teach us? When fetching from global memory, we should make the PEs' accesses as contiguous as possible, in order to take advantage of coalescing.
Let's check the global memory fetches in our code:
sum += A[row * N + i] * B[i * N + col];
The accesses to A walk along a row, which is contiguous in memory. The accesses to B walk down a column with a stride of N elements, which is not. How do we fix this? Simple: use the transposed matrix B_T instead.
__kernel void matrix_multiply(
__global float* A,
__global float* B_T,
__global float* C,
const unsigned int N
) {
int row = get_global_id(0);
int col = get_global_id(1);
float sum = 0.0f;
for (int i = 0; i < N; i++) {
sum += A[row * N + i] * B_T[col * N + i];
}
C[row * N + col] = sum;
}
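On the host side, we now upload B in transposed form. A minimal sketch (m_b_t is my name for the transposed copy; its upload replaces the plain m_b upload from before):
let mut m_b_t = vec![0.0f32; matrix_size];
for i in 0..m_n {
    for j in 0..m_n {
        // element (i, j) of B becomes element (j, i) of B_T
        m_b_t[j * m_n + i] = m_b[i * m_n + j];
    }
}
buffer_b.write(&m_b_t).enq()?;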
I'm on Apple Silicon, where the architecture is different, so memory coalescing doesn't yield a measurable improvement here; but it matters on most GPUs.
Tweaking Work Group Size
On each architecture, when the work group size matches the number of PEs in a CU, it typically yields a good improvement. That said, the optimum also depends on global memory access patterns and other details, so a matched size is just a good starting point from which to tune further.
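With the ocl crate you can query the device's hard upper limit as a starting point. A sketch, assuming the DeviceInfo::MaxWorkGroupSize variant (it mirrors CL_DEVICE_MAX_WORK_GROUP_SIZE):
use ocl::enums::DeviceInfo;

// Hard upper bound on work items per workgroup for this device; the sweet
// spot is usually smaller and found by experiment.
let max_wg = pro_que.device().info(DeviceInfo::MaxWorkGroupSize)?;
println!("Max work group size: {}", max_wg);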
I chose,
let kernel = pro_que.kernel_builder("matrix_multiply")
.arg(&buffer_a)
.arg(&buffer_b)
.arg(&buffer_c)
.arg(m_n as u32)
.local_work_size([8, 8])
.build()?;
And the time taken is 250ms, which is a huge improvement.
Utilizing Local Memory
Here is another trick. Previously, we saw how memory coalescing reduces the cost of global memory fetches. Can we reduce them further? Yes.
The secret is local memory. We can make the workgroup fetch each piece of data from global memory only once and store it in local memory, so that instead of every work item fetching it separately, the workgroup fetches it a single time.
How do we do that? We need to know the size of the workgroup (I hardcoded it, but you can query it at run time; see the aside below) and prepare the local memory.
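(An aside on querying: inside a kernel, get_local_size() reports the workgroup shape at run time, but __local array sizes must be compile-time constants, which is why TILE_SIZE is hardcoded here; it could also be injected with a -D build option.)
// Run-time workgroup shape; fine for index math, but not usable for sizing
// __local arrays, which require compile-time constants:
int wg_rows = get_local_size(0);
int wg_cols = get_local_size(1);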
const unsigned int TILE_SIZE = 8;
// Thread indices within the workgroup
int row = get_local_id(0);
int col = get_local_id(1);
// Global indices for A and B_T in the entire matrix
int global_row = get_group_id(0) * TILE_SIZE + row;
int global_col = get_group_id(1) * TILE_SIZE + col;
// Local memory to store tiles of A and B_T
__local float local_A[TILE_SIZE][TILE_SIZE];
__local float local_B_T[TILE_SIZE][TILE_SIZE];
Then each PE loads one element of the current tiles of A and B_T from global memory into local memory; together, the workgroup loads each tile exactly once. Because of memory coalescing, each tile load is done in one go.
for (int k = 0; k < N / TILE_SIZE; k++) {
// Load a block of A from global memory to local memory
local_A[row][col] = A[global_row * N + (k * TILE_SIZE + col)];
// Load a block of B_T (transposed B) from global memory to local memory
local_B_T[row][col] = B_T[global_col * N + (k * TILE_SIZE + row)];
// Synchronize threads to make sure all threads have finished loading data
barrier(CLK_LOCAL_MEM_FENCE);
// Perform the multiplication and accumulation for the tile
for (int i = 0; i < TILE_SIZE; i++) {
sum += local_A[row][i] * local_B_T[i][col];
}
// Synchronize to ensure all threads are done with the current computation before moving to the next
barrier(CLK_LOCAL_MEM_FENCE);
}
We must use barrier(CLK_LOCAL_MEM_FENCE); to synchronize the workgroup. A wavefront shares a clock, but a workgroup does not: it can span several wavefronts, so different PEs may store their fetched elements into local memory at different times (the fetch may be coalesced, but the stores are not synchronized). Without the barrier, some work items would read a tile before it is fully loaded, and the result would be wrong. When a PE reaches the barrier, it waits until every PE in the same workgroup has reached that line.
A rule of thumb: whenever you write to local memory, synchronize before and after. Of course, synchronization costs time, so use as few barriers as you can.
Originally, each work item needed 2N global fetches (N from A and N from B_T). Now it needs only 2N / TILE_SIZE: with N = 2048 and TILE_SIZE = 8, that is 512 fetches instead of 4096. The full kernel:
__kernel void matrix_multiply(
__global float* A,
__global float* B_T,
__global float* C,
const unsigned int N
) {
// Tile size (assuming 8x8 workgroup size)
const unsigned int TILE_SIZE = 8;
// Thread indices within the workgroup
int row = get_local_id(0);
int col = get_local_id(1);
// Global indices for A and B_T in the entire matrix
int global_row = get_group_id(0) * TILE_SIZE + row;
int global_col = get_group_id(1) * TILE_SIZE + col;
// Local memory to store tiles of A and B_T
__local float local_A[TILE_SIZE][TILE_SIZE];
__local float local_B_T[TILE_SIZE][TILE_SIZE];
float sum = 0.0f;
// Loop through the tiles of A and B_T
for (int k = 0; k < N / TILE_SIZE; k++) {
// Load a block of A from global memory to local memory
local_A[row][col] = A[global_row * N + (k * TILE_SIZE + col)];
// Load a block of B_T (transposed B) from global memory to local memory
local_B_T[row][col] = B_T[global_col * N + (k * TILE_SIZE + row)];
// Synchronize threads to make sure all threads have finished loading data
barrier(CLK_LOCAL_MEM_FENCE);
// Perform the multiplication and accumulation for the tile
for (int i = 0; i < TILE_SIZE; i++) {
sum += local_A[row][i] * local_B_T[i][col];
}
// Synchronize to ensure all threads are done with the current computation before moving to the next
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write the result to the global memory (C matrix)
if (global_row < N && global_col < N) {
C[global_row * N + global_col] = sum;
}
}
Now the time is reduced to 60 ms! This is a great achievement. But note that to make local memory efficient, you must first do memory coalescing, which, although it didn't yield much improvement on its own here, is still important.
Using Constant Memory
Constant memory is read-only, and the hardware can cache and broadcast it aggressively, so it is fast. When your read-only data fits in constant memory, use it; otherwise, use global memory.
To use constant memory, just change __global to __constant.
const src_5: &str = r#"
__kernel void matrix_multiply(
__constant float* A,
__constant float* B_T,
__global float* C,
const unsigned int N
) {
// Tile size (assuming 8x8 workgroup size)
const unsigned int TILE_SIZE = 8;
// Thread indices within the workgroup
int row = get_local_id(0);
int col = get_local_id(1);
// Global indices for A and B_T in the entire matrix
int global_row = get_group_id(0) * TILE_SIZE + row;
int global_col = get_group_id(1) * TILE_SIZE + col;
// Local memory to store tiles of A and B_T
__local float local_A[TILE_SIZE][TILE_SIZE];
__local float local_B_T[TILE_SIZE][TILE_SIZE];
float sum = 0.0f;
// Loop through the tiles of A and B_T
for (int k = 0; k < N / TILE_SIZE; k++) {
// Load a block of A from global memory to local memory
local_A[row][col] = A[global_row * N + (k * TILE_SIZE + col)];
// Load a block of B_T (transposed B) from global memory to local memory
local_B_T[row][col] = B_T[global_col * N + (k * TILE_SIZE + row)];
// Synchronize threads to make sure all threads have finished loading data
barrier(CLK_LOCAL_MEM_FENCE);
// Perform the multiplication and accumulation for the tile
for (int i = 0; i < TILE_SIZE; i++) {
sum += local_A[row][i] * local_B_T[i][col];
}
// Synchronize to ensure all threads are done with the current computation before moving to the next
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write the result to the global memory (C matrix)
if (global_row < N && global_col < N) {
C[global_row * N + global_col] = sum;
}
}
"#;
This yields a slight improvement, which is in line with how little we modified the code; in other cases it can be crucial. One caveat: constant memory is small. The OpenCL spec only guarantees 64 KB (CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE), while our matrices are 16 MB each, so this version relies on my device's generous limit and may fail on others.
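You can check the limit before committing to __constant. A sketch with the ocl crate (assuming the DeviceInfo::MaxConstantBufferSize variant, mirroring CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE):
use ocl::enums::DeviceInfo;

// The spec guarantees at least 64 KB; anything more is device-specific.
let max_const = pro_que.device().info(DeviceInfo::MaxConstantBufferSize)?;
println!("Max constant buffer size: {}", max_const);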
Avoid Bank Conflicts
We aren't done with local memory. In reality, local memory is physically split into banks, usually as many banks as there are lanes in a wavefront.
Words are assigned to banks by the low bits of their index: the word at index i is stored in bank i mod (number of banks), so index 0 goes to bank 0, index 1 to bank 1, and so on, wrapping around.
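In code, the mapping is simply (NUM_BANKS is a hypothetical name; the hardware fixes the actual count, commonly 32):
// The word at flat index i is served by this bank:
int bank = i % NUM_BANKS;
// Equivalent when NUM_BANKS is a power of two:
int bank_fast = i & (NUM_BANKS - 1);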
Each bank has its own memory controller, and they are independent. Let's consider the following case,
Assume the wavefront size is 4, so there are 4 banks. We write x by y for the element at row x, column y, i.e. flat index x * rowLength + y. If we use a row length of 4,
- Local Id 0 wants to read from bank 0, at memory address 0 by 0
- Local Id 1 also wants to read from bank 0 at memory address 1 by 0
- Local Id 2 wants to read from bank 1 at memory address 0 by 1
- Local Id 3 also wants to read from bank 1 at memory address 1 by 1
What would happen? Banks 0 and 1 first serve Local Id 0 and Local Id 2, and only afterwards serve Local Id 1 and Local Id 3. This serialization is called a bank conflict: even though the work items access different memory locations, they block each other.
How do we solve this? The rule is simple: don't read from the same bank at the same time. A common trick is to add extra padding to each row. Here, if we pad each row with 2 extra elements (row length 6), then,
- Local Id 0 wants to read from bank 0, at memory address 0 by 0
- Local Id 1 wants to read from bank 2 at memory address 1 by 0
- Local Id 2 wants to read from bank 1 at memory address 0 by 1
- Local Id 3 wants to read from bank 3 at memory address 1 by 1
Thus resolving the bank conflict.
Please note that bank conflicts are specific to local memory and happen within a single wavefront; accesses from different wavefronts of the same workgroup do not conflict with each other.
In addition, if many work items want the same local memory location, the hardware broadcasts the value, so even though they access the same bank, there is no bank conflict.
Let's consider this loop,
for (int i = 0; i < TILE_SIZE; i++) {
sum += local_A[row][i] * local_B_T[i][col];
}
Is there a bank conflict?
The answer is: there might be.
For a workgroup of size eight by eight and a wavefront of 32, a single wavefront covers only four distinct row values, so several work items land in the same banks. For the local_A[row][i] accesses, all work items that share a row read the same element, and the broadcast mechanism reduces the conflict to the four distinct rows, a quarter of the wavefront. For the local_B_T[i][col] accesses, there is no broadcast, so more bank conflicts occur.
Either way, the accesses fail to spread across all the banks. We can conclude that the problem is not serious here, but not optimal either.
How do we solve it? Simple: change the declarations to,
__local float local_A[TILE_SIZE * (TILE_SIZE + 1)];
__local float local_B_T[TILE_SIZE * (TILE_SIZE + 1)];
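Since the arrays are now flat with a padded row stride of TILE_SIZE + 1, the indexing must change to match. A sketch of the affected lines (everything else in the kernel stays the same):
// Rows now start TILE_SIZE + 1 = 9 words apart instead of 8, so work items
// accessing the same column no longer map to the same bank.
local_A[row * (TILE_SIZE + 1) + col] = A[global_row * N + (k * TILE_SIZE + col)];
local_B_T[row * (TILE_SIZE + 1) + col] = B_T[global_col * N + (k * TILE_SIZE + row)];

for (int i = 0; i < TILE_SIZE; i++) {
    sum += local_A[row * (TILE_SIZE + 1) + i] * local_B_T[i * (TILE_SIZE + 1) + col];
}
An equivalent form that keeps the two-dimensional indexing is __local float local_A[TILE_SIZE][TILE_SIZE + 1];.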
This mitigates the bank conflict. Of course, the issue was not serious to begin with, so there is no obvious improvement here; worse, the padding breaks the previously aligned layout, which can cause extra cache misses.
However, in other cases, resolving the bank conflict is very important.
We will not pursue this issue further here.
That's it for now; we continue the optimization in the next chapter.