When Parallel: Pull, Don't Push

I’ve noticed a small pattern across a few of my projects where I had vectorized and parallelized some code. The original algorithm had a “push” approach, the optimized version instead took a “pull” approach. In this article I’ll describe what I mean, though it’s mostly just so I can show off some pretty videos, pictures, and demos.

Sandpiles

A good place to start is the Abelian sandpile model, which, like many before me, completely captured my attention for awhile. It’s a cellular automaton where each cell is a pile of grains of sand — a sandpile. At each step, any sandpile with more than four grains of sand spill one grain into its four 4-connected neighbors, regardless of the number of grains in those neighboring cell. Cells at the edge spill their grains into oblivion, and those grains no longer exist.

With excess sand falling over the edge, the model eventually hits a stable state where all piles have three or fewer grains. However, until it reaches stability, all sorts of interesting patterns ripple though the cellular automaton. In certain cases, the final pattern itself is beautiful and interesting.

Numberphile has a great video describing how to form a group over recurrent configurations (also). In short, for any given grid size, there’s a stable identity configuration that, when “added” to any other element in the group will stabilize back to that element. The identity configuration is a fractal itself, and has been a focus of study on its own.

Computing the identity configuration is really just about running the simulation to completion a couple times from certain starting configurations. Here’s an animation of the process for computing the 64x64 identity configuration:

As a fractal, the larger the grid, the more self-similar patterns there are to observe. There are lots of samples online, and the biggest I could find was this 3000x3000 on Wikimedia Commons. But I wanted to see one that’s even bigger, damnit! So, skipping to the end, I eventually computed this 10000x10000 identity configuration:

This took 10 days to compute using my optimized implementation:

https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c

I picked an algorithm described in a code golf challenge:

``````f(ones(n)*6 - f(ones(n)*6))
``````

Where `f()` is the function that runs the simulation to a stable state.

I used OpenMP to parallelize across cores, and SIMD to parallelize within a thread. Each thread operates on 32 sandpiles at a time. To compute the identity sandpile, each sandpile only needs 3 bits of state, so this could potentially be increased to 85 sandpiles at a time on the same hardware. The output format is my old mainstay, Netpbm, including the video output.

Sandpile push and pull

So, what do I mean about pushing and pulling? The naive approach to simulating sandpiles looks like this:

``````for each i in sandpiles {
if input[i] < 4 {
output[i] = input[i]
} else {
output[i] = input[i] - 4
for each j in neighbors {
output[j] = output[j] + 1
}
}
}
``````

As the algorithm examines each cell, it pushes results into neighboring cells. If we’re using concurrency, that means multiple threads of execution may be mutating the same cell, which requires synchronization — locks, atomics, etc. That much synchronization is the death knell of performance. The threads will spend all their time contending for the same resources, even if it’s just false sharing.

The solution is to pull grains from neighbors:

``````for each i in sandpiles {
if input[i] < 4 {
output[i] = input[i]
} else {
output[i] = input[i] - 4
}
for each j in neighbors {
if input[j] >= 4 {
output[i] = output[i] + 1
}
}
}
``````

Each thread only modifies one cell — the cell it’s in charge of updating — so no synchronization is necessary. It’s shader-friendly and should sound familiar if you’ve seen my WebGL implementation of Conway’s Game of Life. It’s essentially the same algorithm. If you chase down the various Abelian sandpile references online, you’ll eventually come across a 2017 paper by Cameron Fish about running sandpile simulations on GPUs. He cites my WebGL Game of Life article, bringing everything full circle. We had spoken by email at the time, and he shared his interactive simulation with me.

Vectorizing this algorithm is straightforward: Load multiple piles at once, one per SIMD channel, and use masks to implement the branches. In my code I’ve also unrolled the loop. To avoid bounds checking in the SIMD code, I pad the state data structure with zeros so that the edge cells have static neighbors and are no longer special.

WebGL Fire

Back in the old days, one of the cool graphics tricks was fire animations. It was so easy to implement on limited hardware. In fact, the most obvious way to compute it was directly in the framebuffer, such as in the VGA buffer, with no outside state.

There’s a heat source at the bottom of the screen, and the algorithm runs from bottom up, propagating that heat upwards randomly. Here’s the algorithm using traditional screen coordinates (top-left corner origin):

``````func rand(min, max) // random integer in [min, max]

for each x, y from bottom {
buf[y-1][x+rand(-1, 1)] = buf[y][x] - rand(0, 1)
}
``````

As a push algorithm it works fine with a single-thread, but it doesn’t translate well to modern video hardware. So convert it to a pull algorithm!

``````for each x, y {
sx = x + rand(-1, 1)
sy = y + rand(1, 2)
output[y][x] = input[sy][sx] - rand(0, 1)
}
``````

Cells pull the fire upward from the bottom. Though this time there’s a catch: This algorithm will have subtly different results.

• In the original, there’s a single state buffer and so a flame could propagate upwards multiple times in a single pass. I’ve compensated here by allowing a flames to propagate further at once.

• In the original, a flame only propagates to one other cell. In this version, two cells might pull from the same flame, cloning it.

In the end it’s hard to tell the difference, so this works out.

source code and instructions

There’s still potentially contention in that `rand()` function, but this can be resolved with a hash function that takes `x` and `y` as inputs.

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht , or see existing discussions.