nullprogram.com/blog/2026/05/06/
Readers will be familiar with Mask-Step-Index (MSI) hash tables, a
technique for building fast, open-addressed hash tables in a dozen lines
of code. If multiple threads or processes access an MSI table with
at least one still inserting elements, care must be taken to avoid data
races. This article will show how to add atomic operations to MSI tables
in order to support different concurrency constraints.
Let’s begin with the simplest case: An integer hash set, no deletions,
only one insert thread (single producer), and consumers do not care about
insert order. That is, the producer inserts A then B, but consumers may
observe B in the table before A. Suppose this is the hash table in the
single-threaded case:
int32_t *lookup(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        if (!table[index] || table[index] == key) {
            return table + index;
        }
    }
}
Keys must be non-zero, and tables are zero-initialized. Usage example:
// Initialization
enum { exp = 8 };
int32_t table[1<<exp] = {0};

// Producer
for (int i = 0; i < nkeys; i++) {
    *lookup(keys[i], table, exp) = keys[i];
}

// Consumer
int32_t key = 1234;
bool present = *lookup(key, table, exp);
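For reference, the pieces above assemble into a self-contained unit, lightly adapted with explicit casts but otherwise the same as the article's code:

```c
#include <stdint.h>

// Single-threaded MSI hash set: zero-initialized table, non-zero keys.
// lookup() returns a pointer to the key's slot, or to an empty slot
// where it may be inserted.
static int32_t *lookup(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (32 - exp)) | 1;
    for (uint32_t index = (uint32_t)hash;;) {
        index = (index + step) & mask;
        if (!table[index] || table[index] == key) {
            return table + index;
        }
    }
}
```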
The only problem is the data race on table slots. Since consumers can
tolerate out-of-order insertions, ordering does not matter and relaxed
atomics eliminate the data race. Insert and query now have different
requirements, so it makes sense to distinguish them. Starting with the
latter:
bool contains(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        int32_t k = __atomic_load_n(table+index, __ATOMIC_RELAXED);
        if (!k) {
            return false;
        } else if (k == key) {
            return true;
        }
    }
}
Note how all elements are accessed by atomic loads, as a producer may
store to any slot at any time. Now producers:
bool insert(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        if (!table[index]) {
            __atomic_store_n(table+index, key, __ATOMIC_RELAXED);
            return true;
        } else if (table[index] == key) {
            return false;
        }
    }
}
This function may load elements non-atomically because there’s only one
producer: the current thread. This idea could not be expressed if the
type system were involved, e.g. _Atomic, but GCC atomics do not require
such special qualifiers. Stores, on the other hand, are concurrent with
consumers, and so must be atomic. Single-producer, multiple-consumer
(SPMC) usage is nearly identical to the single-threaded case:
// Producer
for (int i = 0; i < nkeys; i++) {
    insert(keys[i], table, exp);
}

// Consumer
int32_t key = 1234;
bool present = contains(key, table, exp);
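For compilers without GCC-style builtins, the same SPMC set could be sketched with C11 stdatomic, at a cost: the _Atomic qualifier routes even the producer’s own probe loads through atomic operations, so the plain-load trick above cannot be expressed. A portable sketch, assuming the same table layout (the `_c11` names are mine, not the article’s):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// SPMC integer hash set using C11 explicit memory orders. All accesses
// to table slots, including the producer's probes, are relaxed atomics.
static bool contains_c11(int32_t key, _Atomic int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (32 - exp)) | 1;
    for (uint32_t index = (uint32_t)hash;;) {
        index = (index + step) & mask;
        int32_t k = atomic_load_explicit(table+index, memory_order_relaxed);
        if (!k) {
            return false;
        } else if (k == key) {
            return true;
        }
    }
}

static bool insert_c11(int32_t key, _Atomic int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (32 - exp)) | 1;
    for (uint32_t index = (uint32_t)hash;;) {
        index = (index + step) & mask;
        int32_t k = atomic_load_explicit(table+index, memory_order_relaxed);
        if (!k) {
            atomic_store_explicit(table+index, key, memory_order_relaxed);
            return true;
        } else if (k == key) {
            return false;
        }
    }
}
```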
A concurrent integer hash table is contrived and unrealistic. In a real
program a key likely carries some broader semantic meaning. For example,
if that “integer” is actually a memory offset known as a pointer, then it
points at some object, and it is important that stores to that object
happen before consumers observe the pointer in the table:
bool insert(Thing *thing, Thing **table, int exp)
Thing *lookup(Key key, Thing **table, int exp)
Where usage might look like:
// Producer
for (int i = 0; i < nthings; i++) {
    things[i].key = ...;  // update/init object
    insert(things+i, table, exp);
}

// Consumer
bool present = !!lookup((Key){...}, table, exp);
Here relaxed atomics are insufficient: updates to the inserted object
may be reordered after the insertion itself, and consumers would race on
those updates. So we upgrade to acquire-release:
Thing *lookup(Key key, Thing **table, int exp)
{
    // ...
    for (...) {
        // ...
        Thing *thing = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!thing || thing->key == key) {
            return thing;
        }
    }
}
bool insert(Thing *thing, Thing **table, int exp)
{
    // ...
    for (...) {
        // ...
        if (!table[index]) {
            __atomic_store_n(table+index, thing, __ATOMIC_RELEASE);
            return true;
        } else if (table[index]->key == thing->key) {
            return false;
        }
    }
}
In this case producer and consumer synchronize on the atomics. Producer
stores are ordered before the release, and consumer loads are ordered
after the acquire. Objects are not modified once in the table, so atomics
are not required for their fields. On some architectures, including x86,
there will be no indication at the ISA level that atomics are in use —
i.e. this likely generates the same code as the single-threaded version —
and these atomics merely constrain the compiler’s instruction scheduling.
As a side effect of synchronizing, consumers will now observe insertions
in the same order as the producer. This is a more realistic and practical
situation than an integer hash table.
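For concreteness, here is one way to fill in the elided bodies above, using the same GCC builtins. The Thing layout and the 32-bit integer Key are assumptions for this sketch, not definitions from the article:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t Key;                        // assumed key type
typedef struct { Key key; int value; } Thing;  // assumed object layout

static Thing *lookup(Key key, Thing **table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (32 - exp)) | 1;
    for (uint32_t index = (uint32_t)hash;;) {
        index = (index + step) & mask;
        // Acquire pairs with the producer's release store.
        Thing *thing = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!thing || thing->key == key) {
            return thing;  // null means absent
        }
    }
}

static bool insert(Thing *thing, Thing **table, int exp)
{
    uint64_t hash = ((uint64_t)thing->key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (32 - exp)) | 1;
    for (uint32_t index = (uint32_t)hash;;) {
        index = (index + step) & mask;
        if (!table[index]) {  // plain load: single producer
            // Release orders the object's initialization before the
            // pointer becomes visible to consumers.
            __atomic_store_n(table+index, thing, __ATOMIC_RELEASE);
            return true;
        } else if (table[index]->key == thing->key) {
            return false;
        }
    }
}
```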
Multiple producers
The multiple-producer case (MPMC) is more complicated for producers, but
consumers are unaffected, so we need only modify insertion. Still without
any locks, we optimistically update the table: load the current slot,
and if it’s empty, compare-and-swap the new element into place. On
failure we acquire the element that won the race and continue as though
it’s what we had loaded in the first place.
bool insert(Thing *thing, Thing **table, int exp)
{
    // ...
    for (...) {
        // ...
        Thing *current = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!current) {
            int pass = __ATOMIC_RELEASE;
            int fail = __ATOMIC_ACQUIRE;
            if (__atomic_compare_exchange_n(
                    table+index, &current, thing, 0, pass, fail)) {
                return true;
            }
        }
        if (current->key == thing->key) {
            return false;
        }
    }
}
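Assembled into a self-contained sketch (again assuming a 32-bit integer key, which is not from the article), the multiple-producer insert looks like this:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t key; } Thing;  // assumed minimal layout

static bool insert_mp(Thing *thing, Thing **table, int exp)
{
    uint64_t hash = ((uint64_t)thing->key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (32 - exp)) | 1;
    for (uint32_t index = (uint32_t)hash;;) {
        index = (index + step) & mask;
        Thing *current = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!current) {
            // Optimistically claim the empty slot. On failure, the CAS
            // reloads current with the pointer that won the race.
            if (__atomic_compare_exchange_n(table+index, &current, thing,
                                            0, __ATOMIC_RELEASE,
                                            __ATOMIC_ACQUIRE)) {
                return true;
            }
        }
        if (current->key == thing->key) {
            return false;
        }
    }
}
```

A single-threaded check below exercises the duplicate-key path; the lock-free property itself would need a multi-threaded stress test.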
This is quite similar to my hash trie concurrency enhancement from a few
years ago.