Physics Example - 2D Heat Transfer
In this case we simulate a heat transfer in 2D. This is just for a practice.
Physics and Math
Here we omits some complex numerical details. We will use forward Euler method- you don't need to care what that is, you only need to follow my equations.
The equation is,
We take as time step and as length step. We suppose n
as time steps, p
as x-axis length steps, and q
as y-axis length steps.
The forward euler method is only stable if,
When this equation doesn't hold, we use backward euler method. However, backward euler method requires more math to solve, and converges slower.
Let's expand the equation to discreet form under forward euler method,
Obviously, we can easily solve given the previous time step. That is,
We enforce the following boundary conditions, and also add a heat source and initial condition,
And initially,
To make it more complex, we suppose the transfer coefficient is a second order function of temperature, that is,
Implementation
use ocl::ProQue;
use plotly::{common::{ColorScale, ColorScalePalette}, HeatMap, Layout, Plot};
use indicatif::ProgressBar;
const KERNEL_SRC: &str = include_str!("kernel.cl");
fn main() -> ocl::Result<()> {
// Simulation parameters
let size_x = 512; // Grid size in x-direction
let size_y = 512; // Grid size in y-direction
let dt = 0.001f32; // Time step
let dl = 1f32; // Spatial step
let a0 = 10.0f32; // α₀ coefficient
let a1 = 0.01f32; // α₁ coefficient
let a2 = 0.0001f32; // α₂ coefficient
let num_steps = 1000000; // Number of time steps
// Initialize OpenCL
let pro_que = ProQue::builder()
.src(KERNEL_SRC)
.dims((size_x, size_y))
.build()?;
// Create initial temperature field
let temp_data = vec![0.0f32; size_x * size_y];
// Create double buffers
let buffers = vec![
pro_que.create_buffer::<f32>()?,
pro_que.create_buffer::<f32>()?,
];
// Initialize buffers
buffers[0].write(&temp_data).enq()?;
buffers[1].write(&temp_data).enq()?;
// Create kernel
let kernel = pro_que.kernel_builder("next_step")
.arg(&buffers[0]) // Initial T_current buffer
.arg(&buffers[1]) // Initial T_next buffer
.arg(dt)
.arg(dl)
.arg(a0)
.arg(a1)
.arg(a2)
.local_work_size([16, 16])
.build()?;
let pb = ProgressBar::new(num_steps as u64);
let mut computation_time = std::time::Duration::new(0, 0);
let mut start_time = std::time::Instant::now();
// Simulation loop
for step in 0..num_steps {
pb.inc(1);
let (current_idx, next_idx) = (step % 2, (step + 1) % 2);
// Set kernel arguments
kernel.set_arg(0, &buffers[current_idx])?;
kernel.set_arg(1, &buffers[next_idx])?;
// Execute kernel
unsafe { kernel.enq()?; }
// Read results
let mut temp_result = vec![0.0f32; size_x * size_y];
buffers[next_idx].read(&mut temp_result).enq()?;
computation_time += start_time.elapsed();
start_time = std::time::Instant::now();
// Visualize
if step % (num_steps / 10) == ((num_steps / 10) - 1){
plot_temperature(&temp_result, size_x, size_y, step).unwrap();
}
}
println!("Time elapsed: {:.2?}", computation_time);
Ok(())
}
fn plot_temperature(
temperature: &[f32],
size_x: usize,
size_y: usize,
step: usize
) -> Result<(), Box<dyn std::error::Error>> {
// Prepare data matrix (transposed for correct orientation)
let z: Vec<Vec<f32>> = (0..size_y)
.map(|y| (0..size_x).map(|x| temperature[x * size_y + y]).collect())
.collect();
// Create heatmap trace
let trace = HeatMap::new(
(0..size_x).map(|x| x as i32).collect(), // X coordinates
(0..size_y).map(|y| y as i32).collect(), // Y coordinates
z
)
.name("Temperature")
.color_scale(ColorScale::Palette(ColorScalePalette::RdBu));
// Create plot layout
let layout = Layout::new()
.title(format!("Heat Distribution - Step {}", step).as_str())
.x_axis(plotly::layout::Axis::new().title("X Position"))
.y_axis(plotly::layout::Axis::new().title("Y Position"))
.width(800)
.height(800);
// Create and save plot
let mut plot = Plot::new();
plot.add_trace(trace);
plot.set_layout(layout);
plot.write_html(format!("output/heatmap_step_{}.html", step));
Ok(())
}
__kernel void next_step(
__constant float* T_current,
__global float* T_next,
const float dt,
const float dl,
const float a0,
const float a1,
const float a2
) {
int p = get_global_id(0);
int q = get_global_id(1);
int size_x = get_global_size(0);
int size_y = get_global_size(1);
if (p == 0 || q == 0) {
T_next[p * size_y + q] = 100.0f;
} else if (p == size_x - 1 || q == size_y - 1) {
T_next[p * size_y + q] = 0.0f;
} else {
float current_T = T_current[p * size_y + q];
float T_p_plus1 = T_current[(p + 1) * size_y + q];
float T_p_minus1 = T_current[(p - 1) * size_y + q];
float T_q_plus1 = T_current[p * size_y + (q + 1)];
float T_q_minus1 = T_current[p * size_y + (q - 1)];
float divider = dl * dl;
float laplacian_x = (T_p_plus1 - 2.0f * current_T + T_p_minus1);
float laplacian_y = (T_q_plus1 - 2.0f * current_T + T_q_minus1);
float alpha = a0 + a1 * current_T + a2 * current_T * current_T;
float delta_T = alpha * (laplacian_x + laplacian_y) * dt / divider;
T_next[p * size_y + q] = current_T + delta_T;
}
}
Time taken 418s. The result is as below,
Divergence
Here we need to introduce the concept of diveregence- since it happens here.
For each CU, it has smaller units called either warp (Nvidia) or wavefront (AMD). We use the term wavefront in the future passage.
Each wavefront shares one PC (program counter) register. And within one wavefront, all PE uses the same clock. A typical value of a wavefront is 32.
Each wavefront shares a PC register, so let's consider the following code,
int x = 2;
if(mask[get_global_id(0)]) {
x = 1;
}
else {
x = 0;
}
If a wavefront meets if(mask[get_global_id(0)])
, and they have the same mask value, then it is fine. But what if not? They share the same PC register that they can't take the same path. This is called a warefront divergence.
Encountered by divergence, the actual execution will be like,
int x = 2;
int x_branch_true = 1;
int x_branch_false = 0;
x = select(mask[get_global_id(0)], x_branch_true, x_branch_false);
Yes, both if and else branches will be executed. So it slows down the performance. Thus, we should avoid using if-else
in our programs.
The above code is simple, but what if there are thousands of lines? It will be disastrous. Thus, put in as few code in the branch as possible.
This doesn't mean you shouldn't use branches- they are unavoidable. But you need to remember that instead of the cost of one branch, the actual cost is the sum of all the branches.
So, in our previous code, divergence happens, which is necessary consdering our case, and low cost because the divergence only cases two extra lines. However, divergence is a problem you should consider, and do put as few code in a;; the branches as possible.