Further Optimizing an OpenCL Program

Okay, let's recap: we introduced divergence in chapter three, then used memory coalescing, the memory hierarchy, and work-group size tuning in chapter four to optimize matrix multiplication.

We ended up with,

__kernel void matrix_multiply(
    __constant float* A, 
    __constant float* B_T,
    __global float* C, 
    const unsigned int N
) {
    // Tile size (assuming an 8x8 work-group); a macro because OpenCL C
    // requires a compile-time constant for local array dimensions
    #define TILE_SIZE 8

    // Thread indices within the workgroup
    int row = get_local_id(0);
    int col = get_local_id(1);

    // Global indices for A and B_T in the entire matrix
    int global_row = get_group_id(0) * TILE_SIZE + row;
    int global_col = get_group_id(1) * TILE_SIZE + col;

    // Local memory to store tiles of A and B_T
    __local float local_A[TILE_SIZE][TILE_SIZE];
    __local float local_B_T[TILE_SIZE][TILE_SIZE];

    float sum = 0.0f;

    // Loop through the tiles of A and B_T
    for (int k = 0; k < N / TILE_SIZE; k++) {
        // Load a block of A from global memory to local memory
        local_A[row][col] = A[global_row * N + (k * TILE_SIZE + col)];

        // Load a block of B_T (transposed B) from global memory to local memory
        local_B_T[row][col] = B_T[global_col * N + (k * TILE_SIZE + row)];

        // Synchronize threads to make sure all threads have finished loading data
        barrier(CLK_LOCAL_MEM_FENCE);

        // Perform the multiplication and accumulation for the tile
        for (int i = 0; i < TILE_SIZE; i++) {
            sum += local_A[row][i] * local_B_T[i][col];
        }

        // Synchronize to ensure all threads are done with the current computation before moving to the next
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the result to the global memory (C matrix)
    if (global_row < N && global_col < N) {
        C[global_row * N + global_col] = sum;
    }
}

We do not time the following techniques because it would be meaningless here: the kernel is already very efficient, and these tricks don't yield any significant speedup on it. However, they are still good to know, and they come in handy in other cases.

SIMD

SIMD stands for single instruction, multiple data. Consider the following two loops,

float sum = 0.0f;
for (int i = 0; i < TILE_SIZE; i++) {
    sum += local_A[row][i] * local_B_T[col][i];
}

And,

float2 sum_vec = (float2)(0.0f);
for (int i = 0; i < TILE_SIZE; i += 2) {
    float2 a = vload2(0, &local_A[row][i]);
    float2 b = vload2(0, &local_B_T[col][i]);
    sum_vec += a * b;  // multiplies and accumulates two pairs at once
}
float sum = sum_vec.x + sum_vec.y;

The latter executes fewer instructions, because each vector multiply-add processes two elements at once, and is thus potentially faster on hardware with SIMD units.

However, the code above differs from our kernel, so it requires a few more tweaks.

Because our TILE_SIZE is eight, we can actually discard the loop and process a whole tile row with a single float8,

__kernel void matrix_multiply(
    __global float* A, 
    __global float* B_T,
    __global float* C, 
    const unsigned int N
) {
    #define TILE_SIZE 8  // compile-time constant, required for the local array sizes
    const int row = get_local_id(0);
    const int col = get_local_id(1);
    const int global_row = get_group_id(0) * TILE_SIZE + row;
    const int global_col = get_group_id(1) * TILE_SIZE + col;

    // Local memory with transposed storage for B_T
    __local float local_A[TILE_SIZE][TILE_SIZE];
    __local float local_B_T[TILE_SIZE][TILE_SIZE];  // Stored transposed

    float8 sum_vec = (float8)(0.0f);

    for (int k = 0; k < (N + TILE_SIZE - 1)/TILE_SIZE; k++) {
        // Load tiles with boundary checks
        const int load_col = k * TILE_SIZE + col;
        const int load_row = k * TILE_SIZE + row;

        local_A[row][col] = (global_row < N && load_col < N) ? A[global_row * N + load_col] : 0.0f;
        local_B_T[col][row] = (global_col < N && load_row < N) ? B_T[global_col * N + load_row] : 0.0f;

        barrier(CLK_LOCAL_MEM_FENCE);

        // b_vec is indexed by col, because local_B_T is stored transposed
        float8 a_vec = vload8(0, &local_A[row][0]);
        float8 b_vec = vload8(0, &local_B_T[col][0]);
        sum_vec += a_vec * b_vec;

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Horizontal sum: dotting with a vector of ones adds up the eight partial products
    float sum = dot(sum_vec.lo, (float4)(1.0f)) + dot(sum_vec.hi, (float4)(1.0f));

    if(global_row < N && global_col < N) {
        C[global_row * N + global_col] = sum;
    }
}
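
One thing to watch on the host side: this kernel hard-codes an 8x8 work-group, so it must be enqueued with a matching local size, and the global size must be rounded up to a multiple of eight (the boundary checks above handle the padding). A minimal sketch, assuming queue, kernel, and N are already set up as in the earlier chapters,

// Round N up to a multiple of the 8x8 tile so the global size divides evenly
size_t padded = ((N + 7) / 8) * 8;
size_t global_size[2] = { padded, padded };
size_t local_size[2]  = { 8, 8 };  // must match TILE_SIZE in the kernel

cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                    global_size, local_size, 0, NULL, NULL);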

Unrolling

This buys us little here, because our loop runs only a small number of times and the compiler will usually unroll such short loops on its own. However, for large loops, you can use,

#pragma unroll 4
for (int i = 0; i < TILE_SIZE; i++) {
    sum += local_A[row][i] * local_B_T[i][col];
}

or,

#pragma unroll
for (int i = 0; i < TILE_SIZE; i++) {
    sum += local_A[row][i] * local_B_T[i][col];
}

Both pragmas expand the loop body at compile time, eliminating the loop-counter comparison and branch on every iteration; the first unrolls by a factor of four, the second unrolls fully.
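
Conceptually, a full unroll of the TILE_SIZE == 8 loop turns it into straight-line code inside the kernel; this is what the compiler effectively generates, not something you would write by hand,

sum += local_A[row][0] * local_B_T[0][col];
sum += local_A[row][1] * local_B_T[1][col];
sum += local_A[row][2] * local_B_T[2][col];
sum += local_A[row][3] * local_B_T[3][col];
sum += local_A[row][4] * local_B_T[4][col];
sum += local_A[row][5] * local_B_T[5][col];
sum += local_A[row][6] * local_B_T[6][col];
sum += local_A[row][7] * local_B_T[7][col];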

Double Buffering for Prefetching

The last optimization is quite complex, and not really effective in this case: it actually yields worse performance here. We cover it anyway, because it introduces the event mechanism.

Keep in mind that every case is different, and that, unlike traditional programming, parallel programming requires us to do by hand some of the work that compilers usually do for us.

Back to the topic: double buffering is a trick that uses two buffers, one holding the data currently being computed on, and the other receiving the data for the next iteration. Because memory accesses can proceed asynchronously, the next iteration's data can be ready before the current iteration finishes, which breaks the data dependency and avoids a stall.

This trick is sometimes also called memory latency hiding in a more general sense, because the computation time hides the memory access time. So when you write OpenCL code, it is better to interleave computation and memory access, rather than having a dedicated memory access phase followed by a computation phase.

OpenCL provides asynchronous operations and an event mechanism, which are very useful for this kind of optimization.

For example, OpenCL has a built-in asynchronous function, async_work_group_copy, which copies data between global and local memory in the background, so the work-group does not have to sit idle waiting for the copy to finish.

event_t async_work_group_copy(__local gentype *dst, const __global gentype *src, size_t num_gentypes, event_t event);

Then when we need the data, we can wait for the event to finish.

void wait_group_events(int num_events, event_t *event_list);
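
Putting the two together, here is a minimal standalone sketch (the kernel name, the tile of 64 floats, and the work-group size of 64 are assumptions made for illustration; note that every work-item in the group must issue the copy with identical arguments),

__kernel void async_copy_demo(__global const float* src, __global float* dst) {
    __local float tile[64];

    // Start the asynchronous copy; passing 0 as the last argument creates a new event
    event_t e = async_work_group_copy(tile, src + get_group_id(0) * 64, 64, 0);

    // ...independent computation could overlap with the copy here...

    // Block until the copy has completed; the event is released afterwards
    wait_group_events(1, &e);

    dst[get_global_id(0)] = tile[get_local_id(0)];  // assumes a work-group size of 64
}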

This is similar to promises in JavaScript,

const promise = copy_work(...);

await promise;

Please note that wait_group_events only waits for the copy operation itself; it does not synchronize the work-items the way a barrier does, so you still need a barrier wherever work-items read data written by their neighbors. In our case the barrier we already have covers the synchronization we need, so the kernel below skips the events entirely and uses plain loads, but the mechanism is good to know.

__kernel void matrix_multiply(
    __global float* A,
    __global float* B_T,
    __global float* C,
    const unsigned int N
) {
    #define TILE_SIZE 8  // compile-time constant, required for the local array sizes
    const int row = get_local_id(0);
    const int col = get_local_id(1);
    const int global_row = get_group_id(0) * TILE_SIZE + row;
    const int global_col = get_group_id(1) * TILE_SIZE + col;

    __local float local_A[2][TILE_SIZE][TILE_SIZE];
    __local float local_B_T[2][TILE_SIZE][TILE_SIZE];
    int current = 0;
    float8 sum_vec = (float8)(0.0f);

    // Preload the first tile (k = 0) into buffer 0
    int load_col = 0 * TILE_SIZE + col;
    int load_row = 0 * TILE_SIZE + row;
    local_A[current][row][col] = (global_row < N && load_col < N) ? A[global_row * N + load_col] : 0.0f;
    local_B_T[current][col][row] = (global_col < N && load_row < N) ? B_T[global_col * N + load_row] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int k = 0; k < (N + TILE_SIZE - 1) / TILE_SIZE; ++k) {
        int next = 1 - current;  // the buffer that will receive the next tile

        // Prefetch the next tile into the idle buffer while computing on the current one
        if (k + 1 < (N + TILE_SIZE - 1) / TILE_SIZE) {
            int next_load_col = (k + 1) * TILE_SIZE + col;
            int next_load_row = (k + 1) * TILE_SIZE + row;
            local_A[next][row][col] = (global_row < N && next_load_col < N) ? A[global_row * N + next_load_col] : 0.0f;
            local_B_T[next][col][row] = (global_col < N && next_load_row < N) ? B_T[global_col * N + next_load_row] : 0.0f;
        }
        }

        // Compute on the current buffer; b_vec is indexed by col because local_B_T is stored transposed
        float8 a_vec = vload8(0, &local_A[current][row][0]);
        float8 b_vec = vload8(0, &local_B_T[current][col][0]);
        sum_vec += a_vec * b_vec;

        // A single barrier both publishes the prefetched tile and ensures every
        // work-item is done reading the current one before it gets overwritten
        barrier(CLK_LOCAL_MEM_FENCE);
        current = next;
    }

    if (global_row < N && global_col < N) {
        float sum = dot(sum_vec.lo, (float4)(1.0f)) + dot(sum_vec.hi, (float4)(1.0f));
        C[global_row * N + global_col] = sum;
    }
}