Skip to main content

Parallel Merge Sort

Divide and conquer are all reduce pattern. And thus, when it comes to parallel sorting, the first thing we come to is the merge sort as a parallel divide-and-conquer.

Idea

There is not much to say about the general idea- when we divide, we dispatch the work in parallel, and that's all. Here let's just calculate its complexity here.

Let's imagine the divide-and-conquer tree. For each layer, every node can be executed in parallel. However, to merge two chunk, it costs O(n)O(n). Suppose we are at the kk layer (root as zero), the merging costs O(n2k)O(\frac{n}{2^k}).

Thus,

T(n)=T(n/2)+O(n)T(n) = T(n / 2) + O(n)

The complexity is O(n)O(n).

tip

Master theorem,

For a recursive function,

T(n)=aT(nb)+f(n)T(n) = a T(\frac{n}{b}) + f(n)

Where nb\frac{n}{b} can also be nb\lceil \frac{n}{b} \rceil or nb\lfloor \frac{n}{b} \rfloor.

Consider the function,

g(n)=nlogbag(n) = n^{\log_b a}

If,

  • f(n)<g(n)f(n) < g(n), T(n)=Θ(nlogba)T(n) = \Theta(n^{\log_b a}).
  • f(n)=g(n)f(n) = g(n), T(n)=Θ(f(n)logn)T(n) = \Theta(f(n) \log n).
  • f(n)>g(n)f(n) > g(n), T(n)=Θ(f(n))T(n) = \Theta(f(n)).

f(n)>g(n)f(n) > g(n) means of higher order in a polynomial sense. That is, there exists such ϵ>0\epsilon > 0 that f(n)=Ω(g(n)nϵ)f(n) = \Omega(g(n)n^{\epsilon})

f(n)<g(n)f(n) < g(n) means of lower order in a polynomial sense. That is, there exists such ϵ>0\epsilon > 0 that f(n)=O(g(n)nϵ)f(n) = O(g(n)n^{-\epsilon})

Implementation

#include <vector>
#include <array>
#include <iostream>
#include <algorithm>
#include <sycl/sycl.hpp>

using namespace sycl;

int main() {
queue q;
std::cout << "Selected device: "
<< q.get_device().get_info<info::device::name>() << "\n";
constexpr int N = 1 << 28;
std::vector<int> d(N, 0);
buffer prev{d.data(), range{N}};
buffer<int> next{range{N}};

auto gen = q.submit([&](handler& h) {
accessor prev_acc{prev, h, write_only};
h.parallel_for(N, [=](id<1> i) {
prev_acc[i] = N - i * (
1 + (i % 3 == 0) + (i % 5 == 0) + (i % 7 == 0) + (i % 11 == 0)
);
});
});
gen.wait();


int chunk_size = 1;
while (chunk_size < N) {
auto merge = q.submit([&](handler& h) {
accessor prev_acc{prev, h, read_only};
accessor next_acc{next, h, write_only};
h.parallel_for(N / (2 * chunk_size), [=](id<1> i) {
int left_start = i * 2 * chunk_size;
int right_start = left_start + chunk_size;
int left_end = right_start - 1;
int right_end = right_start + chunk_size - 1;
int l = left_start;
int r = right_start;
int cur = left_start;
while (l <= left_end || r <= right_end) {
if (l <= left_end && (r > right_end || prev_acc[l] <= prev_acc[r])) {
next_acc[cur] = prev_acc[l];
++l;
} else {
next_acc[cur] = prev_acc[r];
++r;
}
++cur;
}
});
});
merge.wait();
auto mv = q.submit([&](handler& h) {
accessor prev_acc{prev, h};
accessor next_acc{next, h};
h.copy(next_acc, prev_acc);
});
mv.wait();

chunk_size <<= 1;
}

host_accessor h_acc{prev, read_only};
std::cout << h_acc[N - 1] << std::endl;
}