Examples of quick hash tables and dynamic arrays in C

This article durably captures my reddit comment showing techniques for std::unordered_map and std::vector equivalents in C programs. The core, important features of these data structures require only a dozen or so lines of code apiece. They compile quickly, and tend to run faster in debug builds than release builds of their C++ equivalents. What they lack in genericity they compensate in simplicity. Nothing here will be new. Everything has been covered in greater detail previously, which I will reference when appropriate.

For a concrete goal, we will build a data structure representing an process environment, along with related functionality to make it more interesting. That is, we’ll build a string-to-string map.

Allocator

The foundation is our allocator, a simple bump allocator, so we’ll start there:

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, int size, int align)
{
    int pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

Allocating through the new macro eliminates several classes of common defects in C programs. If we get our types mixed up we get errors, or at least warnings. Our size calculations cannot overflow. We cannot accidentally use uninitialized memory. We cannot leak memory; deallocating is implicit. The main downside is that it doesn’t fit some less common allocator requirements.

Strings

Next, a string representation. Classic null-terminated strings are an error-prone paradigm, so we’ll use counted strings instead:

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

This is equivalent to a std::string_view in C++. The macro allows us to efficiently convert string literals into Str objects. Because our data structures are backed by arenas, we won’t care whether a particular string is backed by a static string, arena, memory map, etc. We’ll also need a function to compare strings for equality:

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

!a.len appears superfluous, but it’s necessary: memcmp arbitrarily forbids null pointers, and we may be passed a zero-initialized Str. Though this is scheduled to be corrected.

We’ll need a string hash function, too:

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

This is an FNV-style hash. The “basis” keeps strings of nulls from getting stuck at zero, and the multiplier is my favorite prime number. Character data is fixed to 0–255 rather than allowing the signedness of char to influence the results. As a multiplicative hash, the high bits are mixed better than the low bits, and our maps will take that into account.

Flat hash map

We have a couple string-to-string map options. The more restrictive, but more efficient — in terms of memory use and speed — is a Mask-Step-Index (MSI) hash table. I don’t think it fits our problem as well as the next option, particularly because it puts a hard limit on unique keys, but it’s worth evaluating. Let’s call it FlatEnv:

enum { ENVEXP = 10 };  // support up to 1,000 unique keys
typedef struct {
    Str keys[1<<ENVEXP];
    Str vals[1<<ENVEXP];
} FlatEnv;

It’s nothing more than two fixed-length arrays, storing keys and values separately. Keys with null pointers are empty slots, so a zero-initialized FlatEnv is an empty table. They come out of an arena ready-to-use:

    FlatEnv *env = new(a, 1, FlatEnv);  // new, empty environment

Now we leverage equals and hash64 for a double-hashed, open address search on the keys array:

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key);
    uint32_t mask = (1<<ENVEXP) - 1;
    uint32_t step = (hash>>(64 - ENVEXP)) | 1;
    for (int32_t i = hash;;) {
        i = (i + step) & mask;
        if (!env->keys[i].data) {
            env->keys[i] = key;
            return env->vals + i;
        } else if (equals(env->keys[i], key)) {
            return env->vals + i;
        }
    }
}

By returning a pointer to the unmodified value slot, this function covers both lookup and insertion. So that’s the entire hash table implementation. To insert, the caller assigns the slot. For mere lookup, check the slot for a null pointer.

    FlatEnv *env = new(a, 1, FlatEnv);

    // insert
    *flatlookup(env, S("hello")) = S("world");

    // lookup
    Str val = *flatlookup(env, key);
    if (val.data) {
        printf("%.*s = %.*s\n", (int)key.len, key.data,
                                (int)val.len, val.data);
    }

To iterate over the map entries, iterate over the arrays, skipping null entries. Per the ENVEXP comment, it’s hard-coded to support up to 1,000 unique keys (1,024 slots, leaving some to spare). The table itself doesn’t enforce this limit and will turn into an infinite loop if you insert too many keys. To support scaling, we could design the map to have dynamic table sizes, track the number of unique keys, and resize the table (allocate new arrays) when the load factor crosses a threshold. Resizing sounds messy and complicated, so fortunately there’s another option.

Hierarchical hash map

If the number of keys is unbounded, hash tries work better. Trees scale well, and we can allocate nodes out of the arena as it grows. We’ll use a 4-ary trie, a good default that balances size and performance:

typedef struct Env Env;
struct Env {
    Env *child[4];
    Str  key;
    Str  value;
};

An empty map is just a null pointer, and so, again, these maps come ready-to-use in their zero state:

    Env *env = 0;  // new, empty environment

The implementation is equally as brief:

Str *lookup(Env **env, Str key, Arena *a)
{
    for (uint64_t h = hash64(key); *env; h <<= 2) {
        if (equals(key, (*env)->key)) {
            return &(*env)->value;
        }
        env = &(*env)->child[h>>62];
    }
    if (!a) return 0;
    *env = new(a, 1, Env);
    (*env)->key = key;
    return &(*env)->value;
}

Like before, this covers both lookup and insertion, though the mode is determined explicitly by the arena pointer. Without an arena, it’s a lookup, which doesn’t require allocation. With an arena, it creates an entry if necessary and, like before, returns a pointer into the map so that the caller can assign it. Usage differs only slightly:

    Env *env = 0;

    // insert
    *lookup(env, S("hello"), &scratch) = S("world");

    // lookup
    Str *val = flatlookup(env, key, 0);
    if (val) {
        printf("%.*s = %.*s\n", (int)key.len, key.data,
                                (int)val->len, val->data);
    }

We’ll come back around to iteration later.

String concatenation

Next I’d like a function that takes an Env and produces an envp data structure as expected by execve(2). Then we can use this map as the environment in a child process. We’ll need some string manipulation, particularly string concatenation. The core is a copy function:

Str copy(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, s.len, char);
    if (r.len) memcpy(r.data, s.data, r.len);
    return r;
}

Like with memcmp, because it’s memcpy we need to handle the arbitrary special case around null pointers should the input be a zero Str. Now we can easily concatenate strings, in-place if possible:

Str concat(Arena *a, Str head, Str tail)
{
    if (!head.data || head.data+head.len != a->beg) {
        head = copy(a, head);
    }
    head.len += copy(a, tail).len;
    return head;
}

Yet again, !head.data is special check because pointer arithmetic on null (i.e. adding zero to null) is arbitrarily disallowed. Worrying about this is exhausting, isn’t it? That language fix can’t come soon enough. This one’s already fixed in C++.

That’s enough to get the ball rolling on FlatEnv:

char **flat_to_envp(FlatEnv *env, Arena *a)
{
    int    cap  = 1<<ENVEXP;
    char **envp = new(a, cap, char *);
    int len = 0;
    for (int i = 0; i < cap; i++) {
        if (env->vals[i].data) {
            Str pair = env->keys[i];
            pair = concat(a, pair, S("="));
            pair = concat(a, pair, env->vals[i]);
            pair = concat(a, pair, S("\0"));
            envp[len++] = pair.data;
        }
    }
    return envp;
}

Simple, right? Traditional string handling in C is an error-prone pain, but with a better set of primitives it’s a breeze. Plus we’re doing this all with essentially no runtime. In use this might look like:

void shellexec(char *cmd, FlatEnv *env, Arena scratch)
{
    char  *argv[] = {"sh", "-c", cmd, 0};
    char **envp   = flat_to_envp(env, &scratch);
    execve("/bin/sh", argv, envp);
}

By virtue of the scratch arena, the envp object is automatically freed should execve fail. (If that should even matter.) Considering this, if you’re itching to write the fastest shell ever devised, arena allocation and the techniques in this article would probably get you most of the way there. Nobody writes shells this way.

Dynamic arrays

To implement the envp conversion for the hash trie Env, let’s add one more tool to our toolbox: dynamic arrays. Our std::vector equivalent. We’ll start with a familiar slice header:

typedef struct {
    char    **data;
    ptrdiff_t len;
    ptrdiff_t cap;
} EnvpSlice;

The bad news is that we don’t have templates, and so we’ll need to define one such structure for each type of which we want a dynamic array. This one is set up to create an envp array. The good news that manipulation occurs through generic code, so everything else is reusable.

I want a push macro that creates an empty slot in which to insert a new value, evaluating to a pointer to this slot. Usually that means incrementing len, but when out of room it will need to expand the underlying storage. It’s clearer to start with example usage. Imagine using it with the previous flat_to_envp:

char **flat_to_envp(FlatEnv *env, Arena *a)
{
    EnvpSlice r = {0};
    for (int i = 0; i < 1<<ENVEXP; i++) {
        if (env->vals[i].data) {
            // ... concat as before ...
            *push(a, &r) = pair.data;
        }
    }
    push(a, &r);  // terminal null pointer
    return r.data;
}

Continuing the theme, a zero-initialized slice is a ready-to-use empty slice, and most begin life this way. The immediate dereference on push is just like those calls to lookup. If expansion is needed, the push macro’s job is to pull elements off the slice, pass them into a helper function which agnostically, strict-aliasing-legally, manipulates the slice header:

void *push_(Arena *, void *data, ptrdiff_t *pcap, ptrdiff_t size);

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, sizeof(*(s)->data)), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

The internals of that helper look an awful lot like concat, with the same in-place-if-possible behavior:

enum { SLICE_INITIAL_CAP = 4 };

void *push_(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap   = *pcap;
    ptrdiff_t align = _Alignof(void *);

    if (!data || a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        if (data) memcpy(copy, data, cap*size);
        data = copy;
    }

    ptrdiff_t extend = cap ? cap : SLICE_INITIAL_CAP;
    alloc(a, extend, size, align);
    *pcap = cap + extend;
    return data;
}

For unfathomable reasons, standard C does not permit _Alignof on expressions, so slice data is simply pointer-aligned. (The more shrewd might consider max_align_t.) Like concatenation, we copy the object to the beginning of the arena if necessary, and extend the allocation by allocating the usual way, being careful not to increment the capacity until after it succeeds.

We can now use push on any structure with data, len, and cap fields of the appropriate types.

Putting it all together

With that in place, we can define a simple, recursive version of the envp builder for Env:

#define countof(a)  ((ptrdiff_t)(sizeof(a) / sizeof(*(a))))

EnvpSlice env_to_envp_(EnvpSlice r, Env *env, Arena *a)
{
    if (env) {
        Str pair = env->key;
        pair = concat(a, pair, S("="));
        pair = concat(a, pair, env->value);
        pair = concat(a, pair, S("\0"));
        *push(a, &r) = pair.data;
        for (int i = 0; i < countof(env->child); i++) {
            r = env_to_envp_(r, env->child[i], a);
        }
    }
    return r;
}

char **env_to_envp(Env *env, Arena *a)
{
    EnvpSlice r = {0};
    r = env_to_envp_(r, env, a);
    push(a, &r);  // null pointer terminator
    return r.data;
}

As is often the case, the recursive part doesn’t fit the final interface, so the core is a helper, and the caller-facing part is an adapter. I’m not entirely comfortable with this function, though. When working with huge environments — over a ~100k entries — then the recursive implementation will non-deterministically blow the stack if the trie winds up lopsided. Or deterministically for chosen pathological inputs, because the hash function isn’t seeded.

Instead we could use a stack data structure backed by the arena to traverse the trie. If passed a secondary scratch arena, we’d use that arena for this stack, but I’m sticking to the original interface. Here’s what that looks like, with an extra trick thrown in just to show off:

char **env_to_envp_safe(Env *env, Arena *a)
{
    EnvpSlice r = {0};

    typedef struct {
        Env *env;
        int  index;
    } Frame;
    Frame init[16];  // small size optimization

    struct {
        Frame    *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } stack = {init, 0, countof(init)};

    *push(a, &stack) = (Frame){env, 0};
    while (stack.len) {
        Frame *top = stack.data + stack.len - 1;

        if (!top->env) {
            stack.len--;

        } else if (top->index == countof(top->env->child)) {
            Str pair = top->env->key;
            pair = concat(a, pair, S("="));
            pair = concat(a, pair, top->env->value);
            pair = concat(a, pair, S("\0"));
            *push(a, &r) = pair.data;
            stack.len--;

        } else {
            int i = top->index++;
            *push(a, &stack) = (Frame){top->env->child[i], 0};
        }
    }

    push(a, &r);
    return r.data;
}

The init array is a form of small-size optimization. It’s used at first, and sufficient for nearly all inputs. So no stack litter in the arena. If it’s not enough, then push will automatically move the stack into the arena. I think that’s a super duper neato trick!

Alternative to this, and as discussed in the original hash trie article, we could instead add a next field to Env as an intrusive linked list that chains the nodes together in insertion order. Or another way to look at it, Env is a linked list with an intrusive hash trie for O(log n) searches on the list. That’s a lot simpler, has other useful properties, and only costs one extra pointer per entry. And we wouldn’t need slices, which was my motivation for choosing non-linked-list approach above.

Hash hardening (bonus)

Okay, I lied, this is something new. Think of it as your special treat for sticking with me so far.

Hash map non-determinism comes with a classic security vulnerability: If populated with untrusted keys, an attacker could choose colliding keys and produce worst case behavior in the hash map. That is, MSI hash tables reduce to linear scans, and hash tries reduce to linked lists. Worse, the recursive envp function blows the stack, though we already solved that issue.

If we want to foil such attacks, we can seed the hash so that an attacker cannot devise collisions. They’d need to discover the seed. We might even call that seed a “key,” but this is a non-cryprographic hash so I’m going to avoid that term. The usual implementation of this concept involves generating a seed, sometimes per table, and storing it somewhere. However, we can leverage an existing security mechanism, gaining this feature at basically no cost: Address Space Layout Randomization (ASLR). First, let’s augment the string hash function:

uint64_t hash64(Str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

In flatlookup we can use the address of the FlatEnv as our seed:

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key, (uintptr_t)env);
    // ...
}

Recall it’s allocated out of our arena (via new), and ASLR gives our arena a random offset. On top of that, a FlatEnv seed depends precisely on the amount of memory allocated earlier. An environment variable name or value being slightly longer or shorter will reshuffle the whole table if allocated in the arena before the FlatEnv.

It’s slightly trickier with hash tries. The root pointer isn’t required to be fixed. For example:

    Env *env = 0;
    // ... insert keys ...
    Env *myenv = env;
    // ... lookup keys in myenv ...

We could disallow this, but it would be easy to forget (e.g. while you’re refactoring and not thinking about it) and difficult to detect. Difficult-to-detect bugs keep me awake at night. Instead we can use the root node to seed the trie:

Str *lookup(Env **env, Str key, Arena *a)
{
    uint64_t seed = env ? (uintptr_t)*env : 0;
    for (uint64_t h = hash64(key, seed); *env; h <<= 2) {
    // ...
}

At first this seems like it couldn’t work, like a chicken-and-egg problem. There’s no root node at first, so we can’t know the seed yet. Though think about it a little longer and it should be obvious: The hash is unused when inserting the very first element. It simply becomes the root of the trie. The seed is irrelevant until the second insert, at which point we’ve established a seed. This delay establishing the seed means hash tries are even more randomized.

With the proper tools and representations, working in C isn’t difficult, even if you need containers and string manipulation. Aside from memcmp and memcpy — each easily replaceable — we did all this without runtime assistance, not even its allocator. What a pleasant way to work!

Source from this article in runnable form, which I used to test my samples: example.c

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.

null program

Chris Wellons

wellons@nullprogram.com (PGP)
~skeeto/public-inbox@lists.sr.ht (view)