Parallel Prefix Sum
Prefix sum is a classical problem in computer science. Given an array of numbers, we want to compute the sum of each prefix of the array. For example, given the array [1, 2, 3, 4, 5], the prefix sum is [1, 3, 6, 10, 15].
By calculating the prefix sum, we can get the sum of any subarray with $O(1)$ complexity. However, computing the prefix sum itself has a cost of $O(n)$.
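For instance, once the prefix-sum array pre is built, the sum of any subarray is just a difference of two prefix sums. A minimal sketch (the helper name range_sum is my own, not from the original):
#include <vector>

// Sum of data[l..r] (inclusive) via the prefix-sum array:
// pre[r] holds data[0] + ... + data[r], so subtracting pre[l - 1]
// leaves exactly data[l] + ... + data[r].
int range_sum(const std::vector<int>& pre, int l, int r) {
    return l == 0 ? pre[r] : pre[r] - pre[l - 1];
}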
Prefix sum is, obviously, sequential. Consider the following code:
int n = data.size();
vector<int> pre(n, 0);
for (int i = 0; i < n; i++) {
    if (i == 0) {
        pre[i] = data[0];
    } else {
        // Each iteration depends on the result of the previous one.
        pre[i] = pre[i - 1] + data[i];
    }
}
This is a typical scan pattern with a loop-carried data dependency: pre[i] cannot be computed before pre[i - 1]. We need to restructure the computation to expose data parallelism.
The Hillis-Steele algorithm resolves this with lower parallel complexity by bringing data parallelism to the problem.
Idea
Suppose we have an eight-element array $a = [a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8]$.
Hillis-Steele tells us to first chunk $a$ into small subarrays of size two, and then calculate the prefix sum of each chunk:
$$[a_1,\ a_1+a_2],\quad [a_3,\ a_3+a_4],\quad [a_5,\ a_5+a_6],\quad [a_7,\ a_7+a_8]$$
Then we treat every four elements in $a$ as one chunk, and calculate the prefix sums of the latter half by adding the last prefix sum of the former half.
That is to say, for the four-element chunk $[a_1,\ a_1+a_2,\ a_3,\ a_3+a_4]$, the prefix sum of the former half is $a_1+a_2$, and the latter half adds this sum element-wise:
$$[a_1,\ a_1+a_2,\ a_1+a_2+a_3,\ a_1+a_2+a_3+a_4]$$
Writing all the sums out is too long, so we use the following notation: let $s_{i,j} = a_i + a_{i+1} + \cdots + a_j$.
And if we apply this operation to every four-element chunk, we will have
$$[s_{1,1},\ s_{1,2},\ s_{1,3},\ s_{1,4},\ s_{5,5},\ s_{5,6},\ s_{5,7},\ s_{5,8}]$$
You can see that we get the correct prefix sum within each chunk. If we repeat this process, doubling the chunk size each time, we get the final result $[s_{1,1}, s_{1,2}, \ldots, s_{1,8}]$.
And of course, you don't need to treat the first operation as special and write a dedicated kernel: the first operation is just the two-element-chunk step, presuming that $a$ starts out chunked into one-element chunks.
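To make the chunk-doubling concrete, here is a minimal host-side sketch of the same passes, written serially (my illustration, not code from the original; the parallel SYCL version follows below):
#include <utility>
#include <vector>

// One pass per chunk size: elements in odd-numbered chunks add the last
// prefix sum of the preceding even-numbered chunk; even-numbered chunks
// copy through unchanged.
std::vector<int> hillis_steele_scan(std::vector<int> prev) {
    int n = prev.size();
    for (int chunk_size = 1; chunk_size < n; chunk_size *= 2) {
        std::vector<int> next(n);
        for (int i = 0; i < n; i++) { // this loop is what runs in parallel on a device
            int chunk_id = i / chunk_size;
            next[i] = (chunk_id % 2 == 0)
                          ? prev[i]
                          : prev[i] + prev[chunk_id * chunk_size - 1];
        }
        prev = std::move(next);
    }
    return prev;
}
Called on [1, 2, 3, 4, 5], this returns [1, 3, 6, 10, 15].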
This process repeats $\log_2 n$ times, and each pass transforms every element in parallel, so the complexity is $O(\log n)$.
In parallel programming, there are two types of complexity: the parallel complexity and the non-parallel (or serial) complexity.
Serial complexity is the complexity of an algorithm when run on a single processor (or core). The goal here is to measure how the execution time of the algorithm grows as the size of the input increases, assuming no parallelism is used.
Parallel complexity, on the other hand, refers to how the algorithm’s runtime behaves when it is parallelized and executed on multiple processors (or cores) simultaneously. The idea is to exploit concurrency to decrease the overall runtime, ideally without increasing the work done.
In that sense, parallel complexity is the real complexity of an algorithm. But because real devices have a finite number of processing elements (PEs), when the number of tasks exceeds the number of PEs, the runtime still grows as the serial complexity.
But a lack of PEs is very easy to solve if you have the money, so in parallel programming we usually focus on the parallel complexity. Most of the time, we design methods that have greater serial complexity but lower parallel complexity.
For example, in this prefix-sum algorithm, the serial complexity is $O(n \log n)$, even greater than brute force's $O(n)$, but the parallel complexity is $O(\log n)$.
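As a quick check (writing $W$ for total work and $D$ for parallel depth, notation introduced here): each of the $\log_2 n$ passes does $O(n)$ work, so $W(n) = n \log_2 n = O(n \log n)$; the passes run one after another with all elements updated in parallel, so $D(n) = \log_2 n = O(\log n)$.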
Implementation
#include <iostream>
#include <vector>

#include <sycl/sycl.hpp>

using namespace sycl;

int main() {
    queue q;
    constexpr int N = 1 << 28;
    std::vector<int> d(N, 1);

    buffer<int> buf_prev(d.data(), range{N});
    buffer<int> buf_next{range{N}};

    int chunk_size = 1;
    while (chunk_size < N) {
        auto e = q.submit([&](handler& h) {
            accessor prev{buf_prev, h, read_only};
            accessor next{buf_next, h, write_only};
            h.parallel_for(range{N}, [=](id<1> i) {
                int prev_chunk_id = i / chunk_size;
                if (prev_chunk_id % 2 == 0) {
                    next[i] = prev[i]; // Even chunks copy their values directly
                    return;
                }
                // Index of the last prefix sum of the preceding even chunk
                int prefix_sum_id = prev_chunk_id * chunk_size - 1;
                if (prefix_sum_id >= 0) { // Defensive; always true for odd chunks
                    next[i] = prev[i] + prev[prefix_sum_id];
                } else {
                    next[i] = prev[i];
                }
            });
        });
        e.wait();
        // Swap buffers using a copy operation
        auto mv = q.submit([&](handler& h) {
            accessor prev{buf_prev, h, write_only};
            accessor next{buf_next, h, read_only};
            h.copy(next, prev); // Copy the results back to the original buffer
        });
        mv.wait();
        chunk_size <<= 1; // Double the chunk size
    }

    host_accessor h_access{buf_prev};
    std::cout << h_access[N - 1] << '\n';
    return 0;
}
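Since the input is $2^{28}$ ones, the program should print $2^{28} = 268435456$, which makes for an easy sanity check. With a SYCL compiler such as Intel's oneAPI DPC++, this builds with, e.g., icpx -fsycl prefix_sum.cpp (the file name is arbitrary).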
Instead of copying the data from buffer to buffer, you can also set up a flip bit: if the flip bit is true, use the first buffer as the previous buffer and the second as the next buffer, and vice versa if the flip bit is false. That way you save the cost of copying. This trick is called double buffering.
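A minimal sketch of the flip-bit variant, reusing q, buf_prev, buf_next, and N from the listing above (the flip name is my own):
// Alternate the roles of the two buffers each pass instead of copying.
bool flip = false;
int chunk_size = 1;
while (chunk_size < N) {
    q.submit([&](handler& h) {
        // Whichever buffer holds the latest results plays "prev" this pass.
        accessor prev{flip ? buf_next : buf_prev, h, read_only};
        accessor next{flip ? buf_prev : buf_next, h, write_only};
        h.parallel_for(range{N}, [=](id<1> i) {
            int prev_chunk_id = i / chunk_size;
            next[i] = (prev_chunk_id % 2 == 0)
                          ? prev[i]
                          : prev[i] + prev[prev_chunk_id * chunk_size - 1];
        });
    }).wait();
    flip = !flip;     // Swap roles instead of copying
    chunk_size <<= 1; // Double the chunk size
}
// The final result is in buf_next if flip is true, buf_prev otherwise.
host_accessor h_access{flip ? buf_next : buf_prev};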
Building a segment tree, or finding the prefix maximum or minimum element, can all be done in the same manner: only the combining operation changes.
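For example, to compute prefix maxima, only the kernel's combining step changes; a sketch of the kernel body, swapping + for sycl::max:
h.parallel_for(range{N}, [=](id<1> i) {
    int prev_chunk_id = i / chunk_size;
    next[i] = (prev_chunk_id % 2 == 0)
                  ? prev[i]
                  : sycl::max(prev[i], prev[prev_chunk_id * chunk_size - 1]);
});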