A flexible, lightweight, spin-lock barrier

This article was discussed on Hacker News.

The other day I wanted to try the famous memory reordering experiment for myself. It’s the double-slit experiment of concurrency, where a program can observe an “impossible” result on common hardware, as though a thread had time-traveled. While getting thread timing as tight as possible, I designed a possibly-novel thread barrier. It’s purely spin-locked, its entire footprint is a zero-initialized integer, it automatically resets, it can be used across processes, and the entire implementation is just three to four lines of code.

Here’s the entire barrier implementation for two threads in C11.

// Spin-lock barrier for two threads. Initialize *barrier to zero.
void barrier_wait(_Atomic uint32_t *barrier)
{
    uint32_t v = ++*barrier;
    if (v & 1) {
        for (v &= 2; (*barrier&2) == v;);
    }
}

Or in Go:

func BarrierWait(barrier *uint32) {
    v := atomic.AddUint32(barrier, 1)
    if v&1 == 1 {
        v &= 2
        for atomic.LoadUint32(barrier)&2 == v {
        }
    }
}

Better yet, these two implementations are compatible with each other: C threads and Go goroutines can synchronize on a common barrier using these functions. Also note how the barrier uses only two bits of the integer.
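To illustrate the cross-process claim, here’s a minimal POSIX sketch of my own (not from the experiment code): the barrier lives in zero-initialized, anonymous shared memory, and a forked child synchronizes with its parent through the two-thread barrier_wait above.

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    // MAP_SHARED|MAP_ANONYMOUS memory starts zeroed, so the barrier
    // is ready the moment it is mapped.
    _Atomic uint32_t *barrier = mmap(
        0, sizeof(*barrier), PROT_READ|PROT_WRITE,
        MAP_SHARED|MAP_ANONYMOUS, -1, 0
    );
    if (fork()) {
        barrier_wait(barrier);  // parent
        puts("parent passed the barrier");
        wait(0);
    } else {
        barrier_wait(barrier);  // child
        puts("child passed the barrier");
    }
}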

When I was done with my experiment, I did a quick search online for other spin-lock barriers to see if anyone came up with the same idea. I found a couple of subtly-incorrect spin-lock barriers, and some straightforward barrier constructions using a mutex spin-lock.

Before diving into how this works, and how to generalize it, let’s discuss the circumstances that led to its design.

Experiment

Here’s the setup for the memory reordering experiment, where w0 and w1 are initialized to zero.

thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0

Considering all the possible orderings, it would seem that at least one of r0 or r1 is 1. There seems to be no ordering where r0 and r1 could both be 0. However, if raced precisely, this is a frequent or possibly even majority occurrence on common hardware, including x86 and ARM.
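The usual culprit is the store buffer: each CPU’s store may still be sitting in its own store buffer, not yet visible to the other CPU, when the loads execute, so both threads load the old zeros. Roughly:

thread#1                        thread#2
w0 = 1   (into store buffer)    w1 = 1   (into store buffer)
r1 = w1  (loads 0)              r0 = w0  (loads 0)
(both stores become visible only afterward)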

How to go about running this experiment? These are concurrent loads and stores, so it’s tempting to use volatile for w0 and w1. However, this would constitute a data race — undefined behavior in at least C and C++ — and so we couldn’t really reason much about the results, at least not without first verifying the compiler’s assembly. These are variables in a high-level language, not architecture-level stores/loads, even with volatile.

So my first idea was to use a bit of inline assembly for all accesses that would otherwise be data races. x86-64:

static int experiment(int *w0, int *w1)
{
    int r1;
    __asm volatile (
        "movl  $1, %1\n"
        "movl  %2, %0\n"
        : "=r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}

ARM64 (to try on my Raspberry Pi):

static int experiment(int *w0, int *w1)
{
    int r1 = 1;
    __asm volatile (
        "str  %w0, %1\n"
        "ldr  %w0, %2\n"
        : "+r"(r1), "=m"(w0)
        : "m"(w1)
    );
    return r1;
}

This is from the point-of-view of thread#1, but I can swap the arguments for thread#2. I’m expecting this to be inlined, and encouraging it with static.

Alternatively, I could use C11 atomics with a relaxed memory order:

static int experiment(_Atomic int *w0, _Atomic int *w1)
{
    atomic_store_explicit(w0, 1, memory_order_relaxed);
    return atomic_load_explicit(w1, memory_order_relaxed);
}

Since this is a race and I want both threads to run their two experiment instructions as simultaneously as possible, it would be wise to use some sort of starting barrier… exactly the purpose of a thread barrier! It will hold the threads back until they’re both ready.

int w0, w1, r0, r1;

// thread#1                   // thread#2
w0 = w1 = 0;
BARRIER;                      BARRIER;
r1 = experiment(&w0, &w1);    r0 = experiment(&w1, &w0);
BARRIER;                      BARRIER;

if (!r0 && !r1) {
    puts("impossible!");
}

The second thread goes straight into the barrier, but the first thread does a little more work to initialize the experiment and a little more at the end to check the result. The second barrier ensures they’re both done before checking.

Running this only once isn’t so useful, so each thread loops a few million times, hence the re-initialization in thread#1. The barriers keep the loops in lockstep.
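Concretely, the harness might be shaped like this sketch, reusing barrier_wait and the inline assembly experiment from above (with the C11 atomics version, w0 and w1 would instead be _Atomic int; the trial count and reporting are illustrative):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 1000000  // number of trials (illustrative)

static int w0, w1, r0, r1;
static _Atomic uint32_t barrier;  // zero-initialized

static void *thread2(void *arg)
{
    (void)arg;
    for (long i = 0; i < N; i++) {
        barrier_wait(&barrier);
        r0 = experiment(&w1, &w0);  // arguments swapped
        barrier_wait(&barrier);
    }
    return 0;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, 0, thread2, 0);

    long count = 0;
    for (long i = 0; i < N; i++) {
        w0 = w1 = 0;             // initialize the experiment
        barrier_wait(&barrier);
        r1 = experiment(&w0, &w1);
        barrier_wait(&barrier);  // both results now visible
        count += !r0 && !r1;
    }
    pthread_join(t, 0);
    printf("impossible: %ld of %d trials\n", count, N);
}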

Barrier selection

On my first attempt, I made the obvious decision for the barrier: I used pthread_barrier_t. I was already using pthreads for spawning the extra thread, including on Windows, so this was convenient.

However, my initial results were disappointing. I only observed an “impossible” result around one in a million trials. With some debugging I determined that the pthreads barrier was just too damn slow, throwing off the timing. This was especially true with winpthreads, bundled with Mingw-w64, which, in addition to the per-barrier mutex, grabs a global lock twice per wait to manage the barrier’s reference counter.

All pthreads implementations I used were quick to yield to the system scheduler. The first thread to arrive at the barrier would go to sleep, the second thread would wake it up, and it was rare they’d actually race on the experiment. This is perfectly reasonable for a pthreads barrier designed for the general case, but I really needed a spin-lock barrier. That is, the first thread to arrive spins in a loop until the second thread arrives, and it never interacts with the scheduler. This happens so frequently and quickly that it should only spin for a few iterations.

Barrier design

Spin locking means atomics. By default, atomics have sequentially consistent ordering and will provide the necessary synchronization for the non-atomic experiment variables. Stores (e.g. to w0, w1) made before the barrier will be visible to all other threads upon passing through the barrier. In other words, the initialization will propagate before either thread exits the first barrier, and results propagate before either thread exits the second barrier.

I know statically that there are only two threads, simplifying the implementation. The plan: When threads arrive, they atomically increment a shared variable to indicate such. The first to arrive will see an odd number, telling it to atomically read the variable in a loop until the other thread changes it to an even number.

At first glance, with just two threads, it might seem like a single bit would suffice. If the bit is set, the other thread hasn’t arrived. If clear, both threads have arrived.

void broken_wait1(_Atomic unsigned *barrier)
{
    ++*barrier;
    while (*barrier&1);
}

Or to avoid an extra load, use the result directly:

void broken_wait2(_Atomic unsigned *barrier)
{
    if (++*barrier & 1) {
        while (*barrier&1);
    }
}

Neither of these works correctly, and the other mutex-free barriers I found all have the same defect. Consider the broader picture: Between atomic loads in the first thread’s spin-lock loop, suppose the second thread arrives, passes through the barrier, does its work, hits the next barrier, and increments the counter. Both threads see an odd counter simultaneously and deadlock. No good.
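Traced out with broken_wait2:

// thread#1                          // thread#2
++*barrier yields 1; odd, spins
                                     ++*barrier yields 2; even, passes
                                     (does its work)
                                     ++*barrier yields 3; odd, spins
loads 3; still odd, keeps spinning

Both threads spin forever on an odd counter.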

To fix this, the wait function must also track the phase. The first barrier is the first phase, the second barrier is the second phase, etc. Conveniently the rest of the integer acts like a phase counter! Writing this out more explicitly:

void barrier_wait(_Atomic unsigned *barrier)
{
    unsigned observed = ++*barrier;
    unsigned thread_count = observed & 1;
    if (thread_count != 0) {
        // not last arrival, watch for phase change
        unsigned init_phase = observed >> 1;
        for (;;) {
            unsigned current_phase = *barrier >> 1;
            if (current_phase != init_phase) {
                break;
            }
        }
    }
}

The key: When the last thread arrives, it overflows the thread counter to zero and increments the phase counter in one operation.
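Tracing the two-thread counter makes this concrete. Bit 0 counts arrivals, bit 1 is the phase, and the same increment that wraps the arrival count also flips the phase:

counter  binary  arrivals (bit 0)  phase (bit 1)
0        ..00    0                 0   initial state
1        ..01    1                 0   first thread spins
2        ..10    0                 1   second arrives; both pass
3        ..11    1                 1   first arrival of next phase
4        .100    0                 0   both pass; phase wrapped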

By the way, I’m using unsigned since it may eventually overflow, and even _Atomic int overflow is undefined for the ++ operator. However, if you use atomic_fetch_add or C++ std::atomic then overflow is defined and you can use int.

Threads can never be more than one phase apart by definition, so only one bit is needed for the phase counter, making this effectively a two-phase, two-bit barrier. In my final implementation, rather than shift (>>), I mask (&) the phase bit with 2.

With this spin-lock barrier, the experiment observes r0 = r1 = 0 in ~10% of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.

Generalizing to more threads

You may have already figured out how to support more threads: spend more bits on the thread counter. Two threads required two bits, which generalizes to log2(n)+1 bits for n threads, where n is a power of two.

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
void barrier_waitn(_Atomic unsigned *barrier, int n)
{
    unsigned v = ++*barrier;
    if (v & (n - 1)) {
        for (v &= n; (*barrier&n) == v;);
    }
}

Note: It never makes sense for n to exceed the logical core count! If it does, then at least one thread must not be actively running, the spinning threads crowd it out of the scheduler, and the barrier wastes lots of resources doing nothing in the meantime.

If the barrier is used rarely enough that the counter won’t overflow — maybe just use a uint64_t — an implementation can support arbitrary thread counts on the same principle, using the remainder operator (%) in place of the mask, as sketched below. The denominator is ideally a compile-time constant in order to avoid paying for division in the spin-lock loop.
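A sketch of that variant (barrier_waitn_any is an illustrative name, and in practice n would be a compile-time constant):

// Spin-lock barrier for any number of threads n.
// Initialize *barrier to zero. Assumes the counter never overflows.
void barrier_waitn_any(_Atomic uint64_t *barrier, uint64_t n)
{
    uint64_t v = ++*barrier;
    if (v % n) {                         // not the last arrival
        for (v /= n; *barrier/n == v;);  // spin until the phase (v/n) advances
    }
}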

While C11 _Atomic seems like it would be useful, unsurprisingly it is not supported by one major, stubborn implementation. If you’re using C++11 or later, then go ahead and use std::atomic<int> since it’s well-supported. In real, practical C programs, I will continue using dual implementations: interlocked functions on MSVC, and GCC built-ins (also supported by Clang) everywhere else.

#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
static void barrier_wait(int *barrier, int n)
{
    int v = BARRIER_INC(barrier);
    if (v & (n - 1)) {
        for (v &= n; (BARRIER_GET(barrier)&n) == v;);
    }
}

This has the nice bonus that the interface does not involve the _Atomic qualifier, nor the std::atomic template. It’s just a plain old int, making the interface simpler and easier to use. It’s something I’ve grown to appreciate from working in Go.

If you’d like to try the experiment yourself: reorder.c. If you’d like to see a test of Go and C sharing a thread barrier: coop.go.

I’m intentionally not providing the spin-lock barrier as a library. First, it’s too trivial and small for that, and second, I believe context is everything. Now that you understand the principle, you can whip up your own, custom-tailored implementation when the situation calls for it, just as the one in my experiment is hard-coded for exactly two threads.

