nullprogram.com/blog/2020/04/30/
This article was discussed on Hacker News.
I’ve noticed a small pattern across a few of my projects where I had
vectorized and parallelized some code. The original algorithm had a
“push” approach, the optimized version instead took a “pull” approach.
In this article I’ll describe what I mean, though it’s mostly just so I
can show off some pretty videos, pictures, and demos.
Sandpiles
A good place to start is the Abelian sandpile model, which, like
many before me, completely captured my attention for a while.
It’s a cellular automaton where each cell is a pile of grains of sand —
a sandpile. At each step, any sandpile with four or more grains of
sand spills one grain into each of its four 4-connected neighbors,
regardless of the number of grains in those neighboring cells. Cells at
the edge spill
their grains into oblivion, and those grains no longer exist.
With excess sand falling over the edge, the model eventually hits a
stable state where all piles have three or fewer grains. However, until
it reaches stability, all sorts of interesting patterns ripple through
the cellular automaton. In certain cases, the final pattern itself is
beautiful and interesting.
Numberphile has a great video describing how to form a group over
recurrent configurations (also). In short, for any given grid
size, there’s a stable identity configuration that, when “added” to
any other element in the group will stabilize back to that element. The
identity configuration is a fractal itself, and has been a focus of
study on its own.
Computing the identity configuration is really just about running the
simulation to completion a couple times from certain starting
configurations. Here’s an animation of the process for computing the
64x64 identity configuration:
As a fractal, the larger the grid, the more self-similar patterns there
are to observe. There are lots of samples online, and the biggest I
could find was this 3000x3000 image on Wikimedia Commons. But I wanted
to see one that’s even bigger, damnit! So, skipping to the end, I
eventually computed this 10000x10000 identity configuration:
This took 10 days to compute using my optimized implementation:
https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c
I picked an algorithm described in a code golf challenge:
f(ones(n)*6 - f(ones(n)*6))
Where f() is the function that runs the simulation to a stable state.
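As a concrete illustration, here's a minimal scalar C sketch of that
recipe on a tiny 8x8 grid. The names are hypothetical, and this is the
slow reference version with none of the optimizations described below:

```c
#include <assert.h>
#include <string.h>

#define N 8  /* tiny grid for illustration */

/* Scalar reference f(): run the simulation until no pile has 4 or
 * more grains. */
static void stabilize(int grid[N][N])
{
    for (int changed = 1; changed;) {
        changed = 0;
        int next[N][N];
        for (int y = 0; y < N; y++) {
            for (int x = 0; x < N; x++) {
                int v = grid[y][x];
                if (v >= 4) {
                    v -= 4;          /* this pile topples */
                    changed = 1;
                }
                /* pull one grain from each toppling neighbor;
                 * edge cells simply have fewer neighbors */
                if (y > 0     && grid[y-1][x] >= 4) v++;
                if (y < N - 1 && grid[y+1][x] >= 4) v++;
                if (x > 0     && grid[y][x-1] >= 4) v++;
                if (x < N - 1 && grid[y][x+1] >= 4) v++;
                next[y][x] = v;
            }
        }
        memcpy(grid, next, sizeof next);
    }
}

/* identity = f(6 - f(6)), where "6" means the all-sixes grid */
static void identity(int out[N][N])
{
    int sixes[N][N];
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            sixes[y][x] = 6;
    stabilize(sixes);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            out[y][x] = 6 - sixes[y][x];
    stabilize(out);
}
```

At this size the group identity property is easy to check directly:
add the identity to itself, stabilize, and you get the identity back.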
I used OpenMP to parallelize across cores, and SIMD to parallelize
within a thread. Each thread operates on 32 sandpiles at a time.
To compute the identity sandpile, each sandpile only needs 3 bits of
state, so this could potentially be increased to 85 sandpiles at a time
on the same hardware. The output format is my old mainstay, Netpbm,
including the video output.
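Part of Netpbm's appeal is how little code it takes. For example, a
hypothetical plain-text PGM writer is just a header followed by the
cell values (the real renderer makes its own palette and format
choices):

```c
#include <assert.h>
#include <stdio.h>

/* Write a grid as a plain-text Netpbm graymap (PGM): a "P2" magic
 * number, the dimensions, the maximum sample value, then the samples. */
static void write_pgm(FILE *f, int w, int h, const int *grid)
{
    fprintf(f, "P2\n%d %d\n3\n", w, h);   /* stable piles hold 0..3 */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            fprintf(f, "%d ", grid[y*w + x]);
        fputc('\n', f);
    }
}
```

A stream of such frames on standard output can be piped straight into
a video encoder, which is how I like to handle video output.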
Sandpile push and pull
So, what do I mean about pushing and pulling? The naive approach to
simulating sandpiles looks like this:
for each i in sandpiles {
    if input[i] < 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
        for each j in neighbors {
            output[j] = output[j] + 1
        }
    }
}
As the algorithm examines each cell, it pushes results into
neighboring cells. If we’re using concurrency, that means multiple
threads of execution may be mutating the same cell, which requires
synchronization — locks, atomics, etc. That much
synchronization is the death knell of performance. The threads will
spend all their time contending for the same resources, even if it’s
just false sharing.
The solution is to pull grains from neighbors:
for each i in sandpiles {
    if input[i] < 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
    }
    for each j in neighbors {
        if input[j] >= 4 {
            output[i] = output[i] + 1
        }
    }
}
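Here's roughly what one pull-style update step looks like in C with
OpenMP. This is a simplified sketch of the idea, not my actual
optimized implementation:

```c
#include <assert.h>

/* One pull-style update over a w x h grid (row-major). Each iteration
 * writes only output[i], its own cell, so rows can be split across
 * threads with no locks or atomics. */
static void step(int w, int h, const int *input, int *output)
{
    #pragma omp parallel for
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int i = y*w + x;
            /* this cell topples if it has 4 or more grains */
            int v = input[i] < 4 ? input[i] : input[i] - 4;
            /* pull one grain from each toppling neighbor */
            if (y > 0     && input[i-w] >= 4) v++;
            if (y < h - 1 && input[i+w] >= 4) v++;
            if (x > 0     && input[i-1] >= 4) v++;
            if (x < w - 1 && input[i+1] >= 4) v++;
            output[i] = v;
        }
    }
}
```

Compile with -fopenmp to run it in parallel; without the flag the
pragma is ignored and the loop runs serially, producing the same
result.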
Each thread only modifies one cell — the cell it’s in charge of updating
— so no synchronization is necessary. It’s shader-friendly and should
sound familiar if you’ve seen my WebGL implementation of Conway’s Game
of Life. It’s essentially the same algorithm. If you chase down
the various Abelian sandpile references online, you’ll eventually come
across a 2017 paper by Cameron Fish about running sandpile simulations
on GPUs. He cites my WebGL Game of Life article, bringing
everything full circle. We had spoken by email at the time, and he
shared his interactive simulation with me.
Vectorizing this algorithm is straightforward: Load multiple piles at
once, one per SIMD channel, and use masks to implement the branches. In
my code I’ve also unrolled the loop. To avoid bounds checking in the
SIMD code, I pad the state data structure with zeros so that the edge
cells have static neighbors and are no longer special.
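For illustration, here's a scalar sketch of that padded, branch-free
update. The layout is hypothetical, but the comparisons become 0/1
masks, which is what each SIMD lane computes:

```c
#include <assert.h>

/* Padded layout: a (w+2) x (h+2) buffer whose outer ring is always
 * zero, so every interior cell has four in-bounds neighbors and edge
 * cells need no special casing. The caller keeps the ring zeroed. */
static void step_padded(int w, int h, const int *in, int *out)
{
    int stride = w + 2;                       /* padded row length */
    for (int y = 1; y <= h; y++) {
        for (int x = 1; x <= w; x++) {
            int i = y*stride + x;
            int self = in[i] - 4*(in[i] >= 4);  /* mask, no branch */
            int gain = (in[i-stride] >= 4) + (in[i+stride] >= 4)
                     + (in[i-1]      >= 4) + (in[i+1]      >= 4);
            out[i] = self + gain;
        }
    }
}
```

Since the padding cells are never four or more, edge cells pull
nothing from them, and grains pushed "over the edge" simply vanish,
exactly matching the original rules.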
WebGL Fire
Back in the old days, one of the cool graphics tricks was fire
animations. It was so easy to implement on limited hardware. In
fact, the most obvious way to compute it was directly in the
framebuffer, such as in the VGA buffer, with no outside state.
There’s a heat source at the bottom of the screen, and the algorithm
runs from bottom up, propagating that heat upwards randomly. Here’s the
algorithm using traditional screen coordinates (top-left corner origin):
func rand(min, max) // random integer in [min, max]

for each x, y from bottom {
    buf[y-1][x+rand(-1, 1)] = buf[y][x] - rand(0, 1)
}
As a push algorithm it works fine with a single thread, but
it doesn’t translate well to modern video hardware. So convert it to a
pull algorithm!
for each x, y {
    sx = x + rand(-1, 1)
    sy = y + rand(1, 2)
    output[y][x] = input[sy][sx] - rand(0, 1)
}
Cells pull the fire upward from the bottom. Though this time there’s a
catch: This algorithm will have subtly different results.
- In the original, there's a single state buffer, so a flame could
propagate upwards multiple times in a single pass. I've compensated
here by allowing flames to propagate further at once.
- In the original, a flame only propagates to one other cell. In this
version, two cells might pull from the same flame, cloning it.
In the end it’s hard to tell the difference, so this works out.
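A hypothetical scalar version of the pull algorithm in C might look
like this. I'm using the C library rand() for brevity, with made-up
dimensions; clamping at the sides stands in for proper wrapping:

```c
#include <assert.h>
#include <stdlib.h>

#define W 80
#define H 25

/* One pull-style fire step. Each cell looks down (larger y, since
 * the origin is the top-left) and slightly left or right, then
 * copies that heat minus a little random cooling. */
static void fire_step(const unsigned char *in, unsigned char *out)
{
    for (int y = 0; y < H - 2; y++) {
        for (int x = 0; x < W; x++) {
            int sx = x + rand()%3 - 1;         /* rand(-1, 1) */
            int sy = y + rand()%2 + 1;         /* rand(1, 2)  */
            if (sx < 0)  sx = 0;               /* clamp at the sides */
            if (sx >= W) sx = W - 1;
            int v = in[sy*W + sx] - rand()%2;  /* cool by rand(0, 1) */
            out[y*W + x] = v < 0 ? 0 : v;
        }
    }
    /* the bottom two rows are the heat source; carry them over */
    for (int i = (H-2)*W; i < H*W; i++)
        out[i] = in[i];
}
```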
source code and instructions
There’s still potentially contention in that rand() function, but this
can be resolved with a hash function that takes x and y as inputs.
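For example, a generic integer mixer such as the splitmix64 finalizer
works well as that hash. This is a hypothetical stand-in, not the
exact function from my implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Mix x, y, and a frame counter t into a pseudo-random value, so
 * every cell derives its own randomness from pure inputs and threads
 * share no RNG state. Constants are the splitmix64 finalizer. */
static uint32_t hash3(uint32_t x, uint32_t y, uint32_t t)
{
    uint64_t h = (uint64_t)x << 40 ^ (uint64_t)y << 20 ^ t;
    h ^= h >> 30;  h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27;  h *= 0x94d049bb133111ebULL;
    h ^= h >> 31;
    return (uint32_t)h;
}

/* random integer in [min, max], like rand() in the pseudocode */
static int randi(uint32_t x, uint32_t y, uint32_t t, int min, int max)
{
    return min + (int)(hash3(x, y, t) % (uint32_t)(max - min + 1));
}
```

Since the result depends only on the cell coordinates and the frame
counter, every thread (or shader invocation) computes its random
values independently, with nothing to contend over.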