nullprogram.com/blog/2022/08/08/
Follow-up: Solving “Two Sum” in C with a tiny hash table
I generally prefer C, so I’m accustomed to building whatever I need
on the fly, such as heaps, linked lists, and especially hash
tables. Few programs use more than a small subset of a data structure’s
features, making their implementation smaller, simpler, and more
efficient than the general case, which must handle every edge
case. A typical hash table tutorial will describe a relatively lengthy
program, but in practice, bespoke hash tables are only a few lines of
code. Over the years I’ve worked out some basic principles for hash
table construction that aid in quick and efficient implementation. This
article covers the technique and philosophy behind what I’ve come to call
the “mask-step-index” (MSI) hash table, which is my standard approach.
MSI hash tables are nothing novel, just a double hashed, open
address hash table layered generically atop an external array. It’s
best regarded as a kind of database index — a lookup index over an
existing array. The array exists independently, and the hash table
provides an efficient lookup into that array over some property of its
entries.
The core of the MSI hash table is this iterator function:
// Compute the next candidate index. Initialize idx to the hash.
int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
uint32_t mask = ((uint32_t)1 << exp) - 1;
uint32_t step = (hash >> (64 - exp)) | 1;
return (idx + step) & mask;
}
The name should now make sense. I literally sound it out in my head when I
type it, like a mnemonic. Compute a mask, then a step size, finally an
index. The exp
parameter is a power-of-two exponent for the hash table
size, which may look familiar. I’ve used int32_t
for the index,
but it’s easy to substitute, say, size_t
. I try to optimize for the
common case, where a 31-bit index is more than sufficient, and a signed
type since subscripts should be signed. Internally it uses unsigned
types since overflow is both expected and harmless thanks to the
power-of-two hash table size.
It’s the caller’s responsibility to compute the hash, and the MSI iterator
tells the caller where to look next. For insertion, the caller (maybe)
looks either for an existing entry to override, or an empty slot. For
lookup, the caller looks for a matching entry, giving up as soon as it
find an empty slot. An insertion loop looks like this string intern table:
#define EXP 15
// Initialize all slots to an "empty" value (null)
#define HT_INIT { {0}, 0 }
struct ht {
char *ht[1<<EXP];
int32_t len;
};
char *intern(struct ht *t, char *key)
{
uint64_t h = hash(key, strlen(key)+1);
for (int32_t i = h;;) {
i = ht_lookup(h, EXP, i);
if (!t->ht[i]) {
// empty, insert here
if ((uint32_t)t->len+1 == (uint32_t)1<<EXP) {
return 0; // out of memory
}
t->len++;
t->ht[i] = key;
return key;
} else if (!strcmp(t->ht[i], key)) {
// found, return canonical instance
return t->ht[i];
}
}
}
The caller initializes the iterator to the hash result. This will probably
be out of range, even negative, but that doesn’t matter. The iterator
function will turn it into a valid index before use. This detail is key to
double hashing: The low bits of the hash tell it where to start, and the
high bits tell it how to step. The hash table size is a power of two, and
the step size is forced to an odd number (via | 1
), so it’s guaranteed
to visit each slot in the table exactly once before restarting. It’s
important that the search halts before looping, such as by guaranteeing
the existence of an empty slot (i.e. the “out of memory” check).
Note: The example out of memory check pushes the hash table to the
absolute limit, and in practice you’d want to stop at a smaller load
factor — perhaps even as low as 50% since that’s simple and fast.
Otherwise it degrades into a linear search as the table approaches
capacity.
Even if two keys start or land at the same place, they’ll quickly diverge
due to differing steps. For awhile I used plain linear probing — i.e.
step=1
— but double hashing came out ahead every time I benchmarked,
steering me towards this “MSI” construction. Ideally ht_lookup
would be
placed so that it’s inlined — e.g. in the same translation unit — so that
the mask and step are not actually recomputed each iteration.
Deletion
What about deletion? First, consider how infrequently you delete entries
from a hash table. When was the last time you used del
on a dictionary
in Python, or delete
on a map
in Go? This operation is rarely needed.
However, when you do need it, reserve a gravestone value in addition to
the empty value.
static char gravestone[] = "(deleted)";
char *intern(struct ht *t, char *key)
{
char **dest = 0;
// ...
if (!t->ht[i]) {
// ...
dest = dest ? dest : &t->ht[i];
*dest = key;
return key;
} else if (t->ht[i] == gravestone) {
dest = dest ? dest : &t->ht[i];
} else if (!strcmp(...)) {
// ...
}
// ...
}
char *unintern(struct ht *t, char *key)
{
// ...
if (!t->ht[i]) {
return 0;
} else if (t->ht[i] == gravestone) {
// skip over
} else if (!strcmp(...)) {
char *old = t->ht[i];
t->ht[i] = gravestone;
return old;
}
// ...
}
When searching, skip over gravestones. Note that gravestones are compared
with ==
(identity), so this does not preclude a string "(deleted)"
.
When inserting, use the first gravestone found if no entry was found.
As a database index
Iterating over the example string intern table is simple: Iterate over the
underlying array, skipping empty slots (and maybe gravestones). Entries
will be in a random order rather than, say, insertion order. This is a
useful introductory example, but this isn’t where MSI most shines. As
mentioned, it’s best when treated like a database index.
Let’s take a step back and consider the caller of intern
. How does it
allocate these strings? Perhaps they’re appended to a buffer, and
intern
indicates whether or not the string is unique so far.
struct buf {
// lookup table over the buffer
struct ht ht;
// a collection of strings
int32_t len;
char buf[BUFLEN];
};
Strings are only appended to the buffer when unique, and the hash table
can make that determination in constant time.
char *buf_push(struct buf *b, char *s)
{
size_t len = strlen(s) + 1;
if (b->len+len > sizeof(b->buf)) {
return 0; // out of memory
}
char *candidate = b->buf + buf->len;
memcpy(candidate, s, len);
char *result = intern(&b->ht, candidate);
if (result == candidate) {
// string is unique, keep it
b->len += len;
}
return result;
}
In my first example, EXP
was fixed. This could be converted into a
dynamic allocation and the hash table resized as needed. Here’s a new
constructor, which I’m including since I think it’s instructive:
struct ht {
int32_t len;
int exp;
char **ht;
};
static struct ht
ht_new(int exp)
{
struct ht ht = {0, exp, 0};
assert(exp >= 0);
if (exp >= 32) {
return ht; // request too large
}
ht.ht = calloc((size_t)1<<exp, sizeof(ht.ht[0]));
return ht;
}
If intern
fails, the hash table can be replaced with a new table twice
as large, and since, like a database index, its contents are entirely
redundant, the hash table can be discarded and rebuilt from scratch. The
new and old table don’t need to exist simultaneously. Here’s a routine to
populate an empty hash table from the buffer:
void buf_rehash(struct buf *b)
{
assert(b->ht.len == 0);
for (int32_t off = 0; off < b->len;) {
char *s = b->buf + off;
int32_t len = strlen(s) + 1;
off += len;
uint64_t h = hash(s, len);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ht.exp, i);
if (!b->ht.ht[i]) {
b->ht.len++;
b->ht.ht[i] = s;
break;
}
}
}
}
Note how this iterates in insertion order, which may be useful in other
cases, too. On the rehash it doesn’t need to check for existing entries,
as all entries are already known to be unique. Later when intern
hits
its capacity:
char *result = intern(&b->ht, candidate);
if (!result) {
free(b->ht.ht);
b->ht = ht_new(ht.exp+1);
if (!b->ht) {
return 0; // out of memory
}
buf_rehash(b);
result = intern(&b->ht, candidate); // cannot fail
}
I freed and reallocated the table, but it would be trivial to use a
realloc
instead, unlike the case where the old table isn’t redundant.
Multimaps
An MSI hash table is trivially converted into a multimap, a hash table
with multiple values per key. Callers just make one small change: Don’t
stop searching until an empty slot is found. Each match is an additional
multimap value. The “value array” is stored along the hash table itself,
in insertion order, without additional allocations.
For example, imagine the strings in the string buffer have a namespace
prefix, delimited by a colon, like city:Austin
and state:Texas
. We’d
like a fast lookup of all strings under a particular namespace. The
solution is to add another hash table as you would an index to a database
table.
struct buf {
// ..
struct ht ns;
// ..
};
When a unique string is appended it’s also registered in the namespace
multimap. It doesn’t check for an existing key, only for an empty slot,
since it’s a multimap:
// Check outside the loop since it always inserts.
if (/* ... ns multimap lacks capacity ... */) {
// ... grow+rehash ns mutilmap ...
}
int32_t nslen = strcspn(s, ":") + 1;
uint64_t h = hash(s, nslen);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ns.exp, i);
if (!b->ns.ht[i]) {
b->ns.len++;
b->ns.ht[i] = s;
break;
}
}
It includes the :
as a terminator which simplifies lookups. Here’s a
lookup loop to print all strings under a namespace (includes terminal :
in the key):
char *ns = "city:";
int32_t nslen = strlen(ns);
// ...
uint64_t h = hash(ns, nslen);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ns.exp, i);
if (!b->ns.ht[i]) {
break;
} else if (!strncmp(b.ns->ht[i], ns, nslen)) {
puts(b->ns.ht[i]+nslen);
}
}
An alternative approach to multimaps is to additionally key over a value
subscript. For example, the first city is keyed {"city", 0}
, the next
{"city", 1}
, etc. The value subscript could be mixed into the string
hash with an integer permutation (more on this below):
uint64_t h = hash64(val_idx ^ hash(s, nslen));
The lookup loop would compare both the string and the value subscript, and
stop when it finds a match. The underlying hash table is not truly a
multimap, but rather a plain hash table with a larger key. This requires
extra bookkeeping — tracking individual subscripts and the number of
values per key — but provides constant time random access on the multimap
value array.
Hash functions
The MSI iterator leaves hashing up to the caller, who has better knowledge
about the input and how to hash it, though this takes a bit of knowledge
of how to build a hash function. The good news is that it’s easy, and less
is more. Better to do too little than too much, and a faster, weaker hash
function is worth a few extra collisions.
The first rule is to never lose sight of the goal: The purpose of the hash
function is to uniformly distribute entries over a table. The better you
know and exploit your input, the less you need to do in the hash function.
Sometimes your keys already contain random data, and so your hash function
can be the identity function! For example, if your keys are “version 4”
UUIDs, don’t waste time hashing them, just load a few bytes from the
end as an integer and you’re done.
// "Hash" a v4 UUID
uint64_t uuid4_hash(unsigned char uuid[16])
{
uint64_t h;
memcpy(&h, uuid+8, 8);
return h;
}
A reasonable start for strings is FNV-1a, such as this possible
implementation for my hash()
function above:
uint64_t hash(char *s, int32_t len)
{
uint64_t h = 0x100;
for (int32_t i = 0; i < len; i++) {
h ^= s[i] & 255;
h *= 1111111111111111111;
}
return h ^ h>>32;
}
The hash state is initialized to a basis, some arbitrary value. This a
useful place to introduce a seed or hash key. It’s best that at least one
bit above the low mix-in bits is set so that it’s not trivially stuck at
zero. Above, I’ve chosen the most trivial basis with reasonable results,
though often I’ll use the digits of π.
Next XOR some input into the low bits. This could be a byte, a Unicode
code point, etc. More is better, since otherwise you’re stuck doing more
work per unit, the main weakness of FNV-1a. Carefully note the byte mask,
& 255
, which inhibits sign extension. Do not mix sign-extended inputs
into FNV-1a — a widespread implementation mistake.
Multiply by a large, odd random-ish integer. A prime is a reasonable
choice, and I usually pick my favorite prime, shown above: 19 ones in base
10.
Finally, my own touch, an xorshift finalizer. The high bits are much
better mixed than the low bits, so this improves the overall quality.
Though if you take time to benchmark, you might find that this finalizer
isn’t necessary. Remember, do just enough work to keep the number of
collisions low — not lowest — and no more.
If your input is made of integers, or is a short, fixed length, use an
integer permutation, particularly multiply-xorshift. It takes very
little to get a sufficient distribution. Sometimes one multiplication does
the trick. Fixed-sized, integer-permutation hashes tend to be the fastest,
easily beating fancier SIMD-based hashes, including AES-NI. For
example:
// Hash a timestamp-based, version 1 UUID
uint64_t uuid1_hash(unsigned char uuid[16])
{
uint64_t s[2];
memcpy(s, uuid, 16);
s[0] += 0x3243f6a8885a308d; // digits of pi
s[0] *= 1111111111111111111;
s[0] ^= s[0] >> 33;
s[0] += s[1];
s[0] *= 1111111111111111111;
s[0] ^= s[0] >> 33;
return s[0];
}
If I benchmarked this in a real program, I would probably cut it down even
further, deleting hash operations one at a time and measuring the overall
hash table performance. This memcpy
trick works well with floats, too,
especially packing two single precision floats into one 64-bit integer.
If you ever hesitate to build a hash table when the situation
calls, I hope the MSI technique will make the difference next time. I have
more hash table tricks up my sleeve, but since they’re not specific to MSI
I’ll save them for a future article.
Benchmarks
There have been objections to my claims about performance, so I’ve
assembled some benchmarks. These demonstrate that:
- AES-NI slower than an integer permutation, at least for short keys.
- A custom, 10-line MSI hash table is easily an order of magnitude faster
than a typical generic hash table from your language’s standard library.
This isn’t because the standard hash table is inferior, but because it
wasn’t written for your specific problem.