## The quick and practical "MSI" hash table

I generally prefer C, so I’m accustomed to building whatever I need on the fly, such as heaps, linked lists, and especially hash tables. Few programs use more than a small subset of a data structure’s features, making their implementation smaller, simpler, and more efficient than the general case, which must handle every edge case. A typical hash table tutorial will describe a relatively lengthy program, but in practice, bespoke hash tables are only a few lines of code. Over the years I’ve worked out some basic principles for hash table construction that aid in quick and efficient implementation. This article covers the technique and philosophy behind what I’ve come to call the “mask-step-index” (MSI) hash table, which is my standard approach.

MSI hash tables are nothing novel, just a double hashed, open
address hash table layered generically atop an external array. It’s
best regarded as a kind of database index — *a lookup index over an
existing array*. The array exists independently, and the hash table
provides an efficient lookup into that array over some property of its
entries.

The core of the MSI hash table is this iterator function:

```
// Compute the next candidate index. Initialize idx to the hash.
int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
uint32_t mask = ((uint32_t)1 << exp) - 1;
uint32_t step = (hash >> (64 - exp)) | 1;
return (idx + step) & mask;
}
```

The name should now make sense. I literally sound it out in my head when I
type it, like a mnemonic. Compute a mask, then a step size, finally an
index. The `exp`

parameter is a power-of-two exponent for the hash table
size, which may look familiar. I’ve used `int32_t`

for the index,
but it’s easy to substitute, say, `size_t`

. I try to optimize for the
common case, where a 31-bit index is more than sufficient, and a signed
type since subscripts should be signed. Internally it uses unsigned
types since overflow is both expected and harmless thanks to the
power-of-two hash table size.

It’s the caller’s responsibility to compute the hash, and the MSI iterator
tells the caller *where to look next*. For insertion, the caller (maybe)
looks either for an existing entry to override, or an empty slot. For
lookup, the caller looks for a matching entry, giving up as soon as it
find an empty slot. An insertion loop looks like this string intern table:

```
#define EXP 15
// Initialize all slots to an "empty" value (null)
#define HT_INIT { {0}, 0 }
struct ht {
char *ht[1<<EXP];
int32_t len;
};
char *intern(struct ht *t, char *key)
{
uint64_t h = hash(key, strlen(key)+1);
for (int32_t i = h;;) {
i = ht_lookup(h, EXP, i);
if (!t->ht[i]) {
// empty, insert here
if ((uint32_t)t->len+1 == (uint32_t)1<<EXP) {
return 0; // out of memory
}
t->len++;
t->ht[i] = key;
return key;
} else if (!strcmp(t->ht[i], key)) {
// found, return canonical instance
return t->ht[i];
}
}
}
```

The caller initializes the iterator to the hash result. This will probably
be out of range, even negative, but that doesn’t matter. The iterator
function will turn it into a valid index before use. This detail is key to
*double hashing*: The low bits of the hash tell it where to start, and the
high bits tell it how to step. The hash table size is a power of two, and
the step size is forced to an odd number (via `| 1`

), so it’s guaranteed
to visit each slot in the table exactly once before restarting. It’s
important that the search halts before looping, such as by guaranteeing
the existence of an empty slot (i.e. the “out of memory” check).

Note: The example out of memory check pushes the hash table to the absolute limit, and in practice you’d want to stop at a smaller load factor — perhaps even as low as 50% since that’s simple and fast. Otherwise it degrades into a linear search as the table approaches capacity.

Even if two keys start or land at the same place, they’ll quickly diverge
due to differing steps. For awhile I used plain linear probing — i.e.
`step=1`

— but double hashing came out ahead every time I benchmarked,
steering me towards this “MSI” construction. Ideally `ht_lookup`

would be
placed so that it’s inlined — e.g. in the same translation unit — so that
the mask and step are not actually recomputed each iteration.

### Deletion

What about deletion? First, consider how infrequently you delete entries
from a hash table. When was the last time you used `del`

on a dictionary
in Python, or `delete`

on a `map`

in Go? This operation is rarely needed.
However, when you *do* need it, reserve a gravestone value in addition to
the empty value.

```
static char gravestone[] = "(deleted)";
char *intern(struct ht *t, char *key)
{
char *dest = 0;
// ...
if (!t->ht[i]) {
// ...
dest = dest ? dest : &t->ht[i];
*dest = key;
return key;
} else if (t->ht[i] == gravestone) {
dest = dest ? dest : &t->ht[i];
} else if (!strcmp(...)) {
// ...
}
// ...
}
char *unintern(struct ht *t, char *key)
{
// ...
if (!t->ht[i]) {
return 0;
} else if (t->ht[i] == gravestone) {
// skip over
} else if (!strcmp(...)) {
char *old = t->ht[i];
t->ht[i] = gravestone;
return old;
}
// ...
}
```

When searching, skip over gravestones. Note that gravestones are compared
with `==`

(identity), so this does not preclude a string `"(deleted)"`

.
When inserting, use the first gravestone found if no entry was found.

### As a database index

Iterating over the example string intern table is simple: Iterate over the underlying array, skipping empty slots (and maybe gravestones). Entries will be in a random order rather than, say, insertion order. This is a useful introductory example, but this isn’t where MSI most shines. As mentioned, it’s best when treated like a database index.

Let’s take a step back and consider the caller of `intern`

. How does it
allocate these strings? Perhaps they’re appended to a buffer, and
`intern`

indicates whether or not the string is unique so far.

```
struct buf {
// lookup table over the buffer
struct ht ht;
// a collection of strings
int32_t len;
char buf[BUFLEN];
};
```

Strings are only appended to the buffer when unique, and the hash table can make that determination in constant time.

```
char *buf_push(struct buf *b, char *s)
{
size_t len = strlen(s) + 1;
if (b->len+len > sizeof(b->buf)) {
return 0; // out of memory
}
char *candidate = b->buf + buf->len;
memcpy(candidate, s, len);
char *result = intern(&b->ht, candidate);
if (result == candidate) {
// string is unique, keep it
b->len += len;
}
return result;
}
```

In my first example, `EXP`

was fixed. This could be converted into a
dynamic allocation and the hash table resized as needed. Here’s a new
constructor, which I’m including since I think it’s instructive:

```
struct ht {
int32_t len;
int exp;
char **ht;
};
static struct ht
ht_new(int exp)
{
struct ht ht = {0, exp, 0};
assert(exp >= 0);
if (exp >= 32) {
return ht; // request too large
}
ht.ht = calloc((size_t)1<<exp, sizeof(ht.ht[0]));
return ht;
}
```

If `intern`

fails, the hash table can be replaced with a new table twice
as large, and since, like a database index, its contents are entirely
redundant, *the hash table can be discarded and rebuilt from scratch*. The
new and old table don’t need to exist simultaneously. Here’s a routine to
populate an empty hash table from the buffer:

```
void buf_rehash(struct buf *b)
{
assert(b->ht.len == 0);
for (int32_t off = 0; off < b->len;) {
char *s = b->buf + off;
int32_t len = strlen(s) + 1;
off += len;
uint64_t h = hash(s, len);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ht.exp, i);
if (!b->ht.ht[i]) {
b->ht.len++;
b->ht.ht[i] = s;
break;
}
}
}
}
```

Note how this iterates in insertion order, which may be useful in other
cases, too. On the rehash it doesn’t need to check for existing entries,
as all entries are already known to be unique. Later when `intern`

hits
its capacity:

```
char *result = intern(&b->ht, candidate);
if (!result) {
free(b->ht.ht);
b->ht = ht_new(ht.exp+1);
if (!b->ht) {
return 0; // out of memory
}
buf_rehash(b);
result = intern(&b->ht, candidate); // cannot fail
}
```

I freed and reallocated the table, but it would be trivial to use a
`realloc`

instead, unlike the case where the old table *isn’t* redundant.

### Multimaps

An MSI hash table is trivially converted into a multimap, a hash table
with multiple values per key. Callers just make one small change: *Don’t
stop searching until an empty slot is found*. Each match is an additional
multimap value. The “value array” is stored along the hash table itself,
in insertion order, without additional allocations.

For example, imagine the strings in the string buffer have a namespace
prefix, delimited by a colon, like `city:Austin`

and `state:Texas`

. We’d
like a fast lookup of all strings under a particular namespace. The
solution is to add another hash table as you would an index to a database
table.

```
struct buf {
// ..
struct ht ns;
// ..
};
```

When a unique string is appended it’s also registered in the namespace multimap. It doesn’t check for an existing key, only for an empty slot, since it’s a multimap:

```
// Check outside the loop since it always inserts.
if (/* ... ns multimap lacks capacity ... */) {
// ... grow+rehash ns mutilmap ...
}
int32_t nslen = strcspn(s, ":") + 1;
uint64_t h = hash(s, nslen);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ns.exp, i);
if (!b->ns.ht[i]) {
b->ns.len++;
b->ns.ht[i] = s;
break;
}
}
```

It includes the `:`

as a terminator which simplifies lookups. Here’s a
lookup loop to print all strings under a namespace (includes terminal `:`

in the key):

```
char *ns = "city:";
int32_t nslen = strlen(ns);
// ...
uint64_t h = hash(ns, nslen);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ns.exp, i);
if (!b->ns.ht[i]) {
break;
} else if (!strncmp(b.ns->ht[i], ns, nslen)) {
puts(b->ns.ht[i]+nslen);
}
}
```

An alternative approach to multimaps is to additionally key over a value
subscript. For example, the first city is keyed `{"city", 0}`

, the next
`{"city", 1}`

, etc. The value subscript could be mixed into the string
hash with an integer permutation (more on this below):

```
uint64_t h = hash64(val_idx ^ hash(s, nslen));
```

The lookup loop would compare both the string and the value subscript, and stop when it finds a match. The underlying hash table is not truly a multimap, but rather a plain hash table with a larger key. This requires extra bookkeeping — tracking individual subscripts and the number of values per key — but provides constant time random access on the multimap value array.

### Hash functions

The MSI iterator leaves hashing up to the caller, who has better knowledge about the input and how to hash it, though this takes a bit of knowledge of how to build a hash function. The good news is that it’s easy, and less is more. Better to do too little than too much, and a faster, weaker hash function is worth a few extra collisions.

The first rule is to never lose sight of the goal: The purpose of the hash function is to uniformly distribute entries over a table. The better you know and exploit your input, the less you need to do in the hash function. Sometimes your keys already contain random data, and so your hash function can be the identity function! For example, if your keys are “version 4” UUIDs, don’t waste time hashing them, just load a few bytes from the end as an integer and you’re done.

```
// "Hash" a v4 UUID
uint64_t uuid4_hash(unsigned char uuid[16])
{
uint64_t h;
memcpy(&h, uuid+8, 8);
return h;
}
```

A reasonable start for strings is FNV-1a, such as this possible
implementation for my `hash()`

function above:

```
uint64_t hash(char *s, int32_t len)
{
uint64_t h = 0x100;
for (int32_t i = 0; i < len; i++) {
h ^= s[i] & 255;
h *= 1111111111111111111;
}
return h ^ h>>32;
}
```

The hash state is initialized to a *basis*, some arbitrary value. This a
useful place to introduce a seed or hash key. It’s best that at least one
bit above the low mix-in bits is set so that it’s not trivially stuck at
zero. Above, I’ve chosen the most trivial basis with reasonable results,
though often I’ll use the digits of π.

Next XOR some input into the low bits. This could be a byte, a Unicode
code point, etc. More is better, since otherwise you’re stuck doing more
work per unit, the main weakness of FNV-1a. Carefully note the byte mask,
`& 255`

, which inhibits sign extension. **Do not mix sign-extended inputs
into FNV-1a** — a widespread implementation mistake.

Multiply by a large, odd random-ish integer. A prime is a reasonable choice, and I usually pick my favorite prime, shown above: 19 ones in base 10.

Finally, my own touch, an xorshift finalizer. The high bits are much
better mixed than the low bits, so this improves the overall quality.
Though if you take time to benchmark, you might find that this finalizer
isn’t necessary. Remember, do *just* enough work to keep the number of
collisions low — not *lowest* — and no more.

If your input is made of integers, or is a short, fixed length, use an integer permutation, particularly multiply-xorshift. It takes very little to get a sufficient distribution. Sometimes one multiplication does the trick. Fixed-sized, integer-permutation hashes tend to be the fastest, easily beating fancier SIMD-based hashes, including AES-NI. For example:

```
// Hash a timestamp-based, version 1 UUID
uint64_t uuid1_hash(unsigned char uuid[16])
{
uint64_t s[2];
memcpy(s, uuid, 16);
s[0] += 0x3243f6a8885a308d; // digits of pi
s[0] *= 1111111111111111111;
s[0] ^= s[0] >> 33;
s[0] += s[1];
s[0] *= 1111111111111111111;
s[0] ^= s[0] >> 33;
return s[0];
}
```

If I benchmarked this in a real program, I would probably cut it down even
further, deleting hash operations one at a time and measuring the overall
hash table performance. This `memcpy`

trick works well with floats, too,
especially packing two single precision floats into one 64-bit integer.

If you ever hesitate to build a hash table when the situation calls, I hope the MSI technique will make the difference next time. I have more hash table tricks up my sleeve, but since they’re not specific to MSI I’ll save them for a future article.

### Benchmarks

There have been objections to my claims about performance, so I’ve assembled some benchmarks. These demonstrate that:

- AES-NI slower than an integer permutation, at least for short keys.
- A custom, 10-line MSI hash table is easily an order of magnitude faster than a typical generic hash table from your language’s standard library. This isn’t because the standard hash table is inferior, but because it wasn’t written for your specific problem.

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.