As usual, it helps to begin with a concrete example of the problem. The
following is a conventional .pc file much like you’d find on your own system:
prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: Example
Version: 1.0
Description: An example .pc file
Cflags: -I${includedir}
Libs: -L${libdir} -lexample
It begins by defining the library’s installation prefix from which it
derives additional paths, which are finally used in the package fields
that generate build flags (Cflags, Libs). If I run u-config against this configuration:
$ pkg-config --cflags --libs example
-I/usr/include -L/usr/lib -lexample
Typically prefix is populated by the library’s build system, which knows where the library is to be installed. In some situations that’s not possible, and there is no opportunity to set prefix to a meaningful path. In that case, pkg-config can automatically override it (--define-prefix) with a path relative to the .pc file, making the installation relocatable. This works quite well on Windows, where it’s the default:
$ pkg-config --cflags --libs example
-IC:/Users/me/example/include -LC:/Users/me/example/lib -lexample
This just works… so long as the path does not contain spaces. Otherwise it risks being split into separate fields. The .pc format supports quoting to control how such output is escaped. Regions between quotes are escaped in the output so that they retain their spaces when field-split by the shell. If a .pc file author is careful, they’d write it with quotes:
Cflags: -I"${includedir}"
Libs: -L"${libdir}" -lexample
The paths are carefully placed within quoted regions so that they come out properly:
$ pkg-config --cflags example
-IC:/Program\ Files/example/include
Almost nobody writes their .pc files this way! The convention is not to quote. My original solution was to implicitly wrap prefix in quotes on assignment, which fixes the vast majority of .pc files. That effectively looks like this in the “virtual” .pc file:
prefix="C:/Program Files/example"
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
So the important region is quoted, its spaces preserved. However, the occasional library author actively supporting Windows inevitably runs into this problem, and their system’s pkg-config implementation does not quote prefix. They soon figure out explicit quoting and apply it, which then undermines u-config’s implicit quoting. The quotes essentially cancel out:
"${includedir}" -> ""C:/Program Files/example"/include"
The quoted regions are inverted and nothing happens. Though this is a small minority, the libraries that do this and the ones you’re likely to use on Windows are correlated. I was stumped: How to support quoted and unquoted .pc files simultaneously?
I recently had the thought: What if u-config somehow tracked which spans of a string were paths? prefix would begin as a path span, tracked through macro-expansion and concatenation. Soon after that I realized it’s even simpler: Encode the spaces in a path as a value other than space, but also a value that cannot appear in the input. Recall that certain octets can never appear in UTF-8 text: the 8 values whose highest 5 bits are set. Such an octet would be the first of a 5-octet, or longer, code point, but those are forbidden.
11111xxx
When paths enter the macro system, special characters are encoded as one of these 8 values. They’re converted back to their original ASCII values during output encoding, escaped. It doesn’t interact with the pkg-config quoting mechanism, so there’s no quote cancellation. Both quoting cases are supported equally.
For example, if space is mapped onto \xff (255), then:
in: C:/Program Files/foo -> C:/Program\xffFiles/foo
out: C:/Program\xffFiles/foo -> C:/Program\ Files/foo
Which prints the same regardless of ${includedir} or "${includedir}". Problem solved!
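The round trip can be sketched in portable C. These helper names (path_encode, path_decode) are hypothetical stand-ins, not u-config’s actual functions, and a real implementation must handle more than spaces:

```c
// Hypothetical sketch: encode spaces as 0xFF, an octet that can never
// appear in UTF-8 input, then escape them on the way out.

static void path_encode(char *dst, const char *src)
{
    for (; *src; src++, dst++) {
        *dst = (*src == ' ') ? (char)0xff : *src;
    }
    *dst = 0;
}

static void path_decode(char *dst, const char *src)
{
    for (; *src; src++) {
        if ((unsigned char)*src == 0xff) {
            *dst++ = '\\';   // escape the decoded space
            *dst++ = ' ';
        } else {
            *dst++ = *src;
        }
    }
    *dst = 0;
}
```

Unencoded bytes pass straight through both functions, which is the whole point: only bytes that entered as part of a path are escaped on output.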
That’s not the only complication. Outputs may deliberately include shell
metacharacters, though typically these are Makefile fragments. For
example, the default value of ${pc_top_builddir} is $(top_builddir), which make will later expand. While these characters are special to a shell, and certainly special to make, they must not be escaped.
What if a path contains these characters? The pkg-config quoting mechanism
won’t help. It’s only concerned with spaces, and $(...) prints the same quoted or not. As before, u-config must track provenance — whether or not such characters originated from a path.
If $PKG_CONFIG_TOP_BUILD_DIR is set, then pc_top_builddir is set to this environment variable, useful when the result isn’t processed by make. In this case it’s a path, and $(...) ought to be escaped. Even without $ it must be quoted, because the parentheses would still invoke a subshell. But who would put parentheses in a path? Lo and behold!
C:/Program Files (x86)/example
Again, extending UTF-8 solves this as well: Encode $, (, and ) in paths using three of those forbidden octets, and escape them on the way out, allowing unencoded instances to go straight through.
in: C:/Program\xffFiles\xff\xfdx86\xfe/example
out: C:/Program\ Files\ \(x86\)/example
This makes pc_top_builddir straightforward: default to a raw string, otherwise a path-encoded environment variable (note: s8 is a string type and upsert is a hash map):
s8 top_builddir = s8("$(top_builddir)");
if (envvar_set) {
    top_builddir = s8pathencode(envvar, perm);
}
*upsert(&global, s8("pc_top_builddir"), perm) = top_builddir;
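For context, s8 and upsert are u-config internals. A minimal counted-string type in the same spirit might look like this sketch of my own; u-config’s exact definitions differ:

```c
#include <stddef.h>
#include <stdint.h>

// A counted string: pointer plus length, no null terminator required.
typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} s8;

// Wrap a C string literal, measuring its length at compile time.
// (A function-like macro does not collide with the typedef name.)
#define s8(lit) (s8){(uint8_t *)lit, sizeof(lit)-1}
```

Because lengths are carried explicitly, substrings and concatenations never need to scan for a terminator, which is what makes span-oriented tricks like path encoding cheap.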
For a particularly wild case, consider deliberately using a uname -m command substitution to construct a path, i.e. the path contains the target machine architecture (i686, x86_64, etc.):
Cflags: -I${prefix}/$(uname -m)/include
(Not that I condone such nonsense. This is merely a reality of real-world .pc files.) With prefix automatically set as above, this will print:
-IC:/Program\ Files\ \(x86\)/example/$(uname -m)/include
Path parentheses are escaped because they came from a path, but the command substitution passes through because it came from the .pc source. Quite cool!

You’ve probably seen ___chkstk_ms before, perhaps in an error message. It’s a little piece of runtime provided by GCC via libgcc which ensures enough of the stack is committed for the caller’s stack frame. The “function” uses a custom ABI and is implemented in assembly. So is the subject of this article, a slightly improved implementation soon to be included in w64devkit as libchkstk (-lchkstk).
The MSVC toolchain has an identical (x64) or similar (x86) function named __chkstk. We’ll discuss that as well, and w64devkit will include x86 and x64 implementations, useful when linking with MSVC object files. The new x86 __chkstk in particular is also better than the MSVC definition.
A note on spelling: ___chkstk_ms is spelled with three underscores, and __chkstk is spelled with two. On x86, cdecl functions are decorated with a leading underscore, and so may be rendered, e.g. in error messages, with one fewer underscore. The true name is undecorated, and the raw symbol name is identical on x86 and x64. Further complicating matters, libgcc defines a ___chkstk with three underscores. As far as I can tell, this spelling arose from confusion regarding name decoration, but nobody’s noticed for the past 28 years. libgcc’s x64 ___chkstk is obviously and badly broken, so I’m sure nobody has ever used it anyway, not even by accident thanks to the misspelling. I’ll touch on that below.
When referring to a particular instance, I will use a specific spelling. Otherwise the term “chkstk” refers to the family. If you’d like to skip ahead to the source for libchkstk: libchkstk.S.
The header of a Windows executable lists two stack sizes: a reserve size
and an initial commit size. The first is the largest the main thread
stack can grow, and the second is the amount committed when the
program starts. A program gradually commits stack pages as needed up to
the reserve size. Binutils objdump option -p lists the sizes. Typical output for a Mingw-w64 program:
$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve 0000000000200000
SizeOfStackCommit 0000000000001000
The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB initially committed. With the Binutils linker, ld, you can set them at link time using --stack. Via gcc, use -Xlinker. For example, to reserve an 8MiB stack and commit half of it:
$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...
MSVC link.exe similarly has /stack.
The purpose of this mechanism is to avoid paying the commit charge for
unused stack. It made sense 30 years ago when stacks were a potentially
large portion of physical memory. These days it’s a rounding error and
silly we’re still dealing with it. Using the above options you can choose
to commit the entire stack up front, at which point a chkstk helper is no
longer needed (-mno-stack-arg-probe, /Gs2147483647). This requires link-time control of the main module, which isn’t always an option, like when supplying a DLL for someone else to run.
The program grows the stack by touching the singular guard page mapped between the committed and uncommitted portions of the stack. This action triggers a page fault, and the default fault handler commits the guard page and maps a new guard page just below. In other words, the stack grows one page at a time, in order.
In most cases nothing special needs to happen. The guard page mechanism is transparent and in the background. However, if a function stack frame exceeds the page size then there’s a chance that it might leap over the guard page, crashing the program. To prevent this, compilers insert a chkstk call in the function prologue. Before local variable allocation, chkstk walks down the stack — that is, towards lower addresses — nudging the guard page with each step. (As a side effect it provides stack clash protection — the only security aspect of chkstk.) For example:
void callee(char *);

void example(void)
{
    char large[1<<20];
    callee(large);
}
Compiled with 64-bit gcc -O:
example:
        movl    $1048616, %eax
        call    ___chkstk_ms
        subq    %rax, %rsp
        leaq    32(%rsp), %rcx
        call    callee
        addq    $1048616, %rsp
        ret
I used GCC, but this is practically identical to the code generated by MSVC and Clang. Note the call to ___chkstk_ms in the function prologue before allocating the stack frame (subq). Also note that it sets eax. As a volatile register, this would normally accomplish nothing because it’s done just before a function call, but recall that ___chkstk_ms has a custom ABI. That’s the argument to chkstk. Further note that it uses rax on the return. That’s not the value returned by chkstk, but rather that x64 chkstk preserves all registers.
Well, maybe. The official documentation says that registers r10 and r11 are volatile, but that information conflicts with Microsoft’s own implementation. Just in case, I choose a conservative interpretation that all registers are preserved.
In a high level language, chkstk might look something like so:
// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
    volatile char frame[frame_size];  // NOTE: variable-length array
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        frame[i] = 0;  // touch the guard page
    }
}
This wouldn’t work for a number of reasons, but if it did, volatile would serve two purposes. First, forcing the side effect to occur. The second is more subtle: The loop must happen in exactly this order, from high to low. Without volatile, loop iterations would be independent — as there are no dependencies between iterations — and so a compiler could reverse the loop direction.
The store can happen anywhere within the guard page, so it’s not necessary
to align frame to the page. Simply touching at least one byte per page is enough. This is essentially the definition of libgcc ___chkstk_ms.
How many iterations occur? In example above, the stack frame will be around 1MiB (2^20). With pages of 4KiB (2^12) that’s 256 iterations. The loop happens unconditionally, meaning every such function call requires 256 iterations of this loop. Wouldn’t it be better if the loop ran only as needed, i.e. the first time? MSVC x64 __chkstk skips iterations if possible, and the same goes for my new ___chkstk_ms. Much like the command line string, the low address of the current thread’s guard page is accessible through the Thread Information Block (TIB). A chkstk can cheaply query this address, only looping during initialization or so. (In contrast to Linux, a thread’s stack is fundamentally managed by the operating system.)
Taking that into account, an improved algorithm:
1. Save the registers it will use
2. Compute the new frame’s low address
3. Load the committed stack’s low address from the TIB
4. Jump to step 7
5. Move down one page
6. Touch the page, committing it (slow!)
7. If still above the frame’s low address, go to step 5
8. Restore registers and return
A little unusual for an unconditional forward jump in pseudo-code, but this closely matches my assembly. The loop causes page faults, and it’s the slow, uncommon path. The common, fast path never executes steps 5–6. I also chose smaller instructions in order to keep the function small and reduce instruction cache pressure. My x64 implementation as of this writing:
___chkstk_ms:
        push %rax               // 1.
        push %rcx               // 1.
        neg  %rax               // 2. rax = frame low address
        add  %rsp, %rax         // 2. "
        mov  %gs:(0x10), %rcx   // 3. rcx = stack low address
        jmp  1f                 // 4.
0:      sub  $0x1000, %rcx      // 5.
        test %eax, (%rcx)       // 6. page fault (very slow!)
1:      cmp  %rax, %rcx         // 7.
        ja   0b                 // 7.
        pop  %rcx               // 8.
        pop  %rax               // 8.
        ret                     // 8.
I’ve labeled each instruction with its corresponding pseudo-code. Step 6 is unusual among chkstk implementations: It’s not a store, but a load, still sufficient to fault the page. That test instruction is just two bytes, and unlike other two-byte options, doesn’t write garbage onto the stack — which would be allowed — nor use an extra register. I searched through single-byte instructions that can page fault, all of which involve implicit addressing through rdi or rsi, but they increment rdi or rsi, and would require another instruction to correct it.
Because of the return address and two push operations, the low stack frame address is technically too low by 24 bytes. That’s fine. If this exhausts the stack, the program is really cutting it close and the stack is too small anyway. I could be more precise — which, as we’ll soon see, is required for x86 __chkstk — but it would cost an extra instruction byte.
On x64, ___chkstk_ms and __chkstk have identical semantics, so name it __chkstk — which I’ve done in libchkstk — and it works with MSVC. The only practical difference between my chkstk and MSVC __chkstk is that mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking the optimization, is libgcc ___chkstk_ms, weighing 50 bytes, or in practice, due to an unfortunate Binutils default of padding sections, 64 bytes.
I’m no assembly guru, and I bet this can be even smaller without hurting the fast path, but this is the best I could come up with at this time.
Update: Stefan Kanthak, who has extensively explored this topic, points out that large stack frame requests might overflow my low frame address calculation at (3), effectively disabling the probe. Such requests might occur from alloca calls or variable-length arrays (VLAs) with untrusted sizes. As far as I’m concerned, such programs are already broken, but it only cost a two-byte instruction to deal with it. I have not changed this article, but the source in w64devkit has been updated.
On x86, ___chkstk_ms has identical semantics to x64. Mine is a copy-paste of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC was ahead of the curve on this design.
However, x86 __chkstk is bonkers. It not only commits the stack, but also allocates the stack frame. That is, it returns with a different stack pointer. The return pointer is initially inside the new stack frame, so chkstk must retrieve it and return by other means. It must also precisely compute the low frame address.
__chkstk:
        push %ecx               // 1.
        neg  %eax               // 2.
        lea  8(%esp,%eax), %eax // 2.
        mov  %fs:(0x08), %ecx   // 3.
        jmp  1f                 // 4.
0:      sub  $0x1000, %ecx      // 5.
        test %eax, (%ecx)       // 6. page fault (very slow!)
1:      cmp  %eax, %ecx         // 7.
        ja   0b                 // 7.
        pop  %ecx               // 8.
        xchg %eax, %esp         // ?. allocate frame
        jmp  *(%eax)            // 8. return
The main differences are:
- eax is treated as volatile, so it is not saved
- The low frame address is computed precisely by the lea (2), accounting for the saved register and return address
- The frame is allocated (xchg) before returning via an indirect jump (8)
MSVC x86 __chkstk does not query the TIB (3), and so unconditionally runs the loop. So there’s an advantage to my implementation besides size. libgcc x86 ___chkstk has this behavior, and so it’s also a suitable __chkstk aside from the misspelling. Strangely, libgcc x64 ___chkstk also allocates the stack frame, which is never how chkstk was supposed to work on x64. I can only conclude it’s never been used.
Does the skip-the-loop optimization matter in practice? Consider a function using a large-ish, stack-allocated array, perhaps to process environment variables or long paths, each of which maxes out around 64KiB.
_Bool path_contains(wchar_t *name, wchar_t *path)
{
    wchar_t var[1<<15];
    GetEnvironmentVariableW(name, var, countof(var));
    // ... search for path in var ...
}

int64_t getfilesize(char *path)
{
    wchar_t wide[1<<15];
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
    // ... look up file size via wide path ...
}

void example(void)
{
    if (path_contains(L"PATH", L"c:\\windows\\system32")) {
        // ...
    }
    int64_t size = getfilesize("π.txt");
    // ...
}
Each call to these functions with such large local arrays is also a call to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely detectable in a benchmark. If the function touches the file system, which is likely when processing paths, then chkstk doesn’t matter at all. My starting example had a 1MiB array, or 256 chkstk iterations. That starts to become measurable, though it’s also pushing the limits. At that point you ought to be using a scratch arena.
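To put numbers on those iteration counts, a toy calculation (my own illustration, not part of libchkstk) assuming 4KiB pages:

```c
#include <stddef.h>

// Number of pages a chkstk loop must probe for a given frame size,
// assuming 4KiB (1<<12) pages.
static ptrdiff_t chkstk_probes(ptrdiff_t frame_size)
{
    return frame_size >> 12;
}
```

A 64KiB frame (two 32K-element wchar_t arrays above each occupy 64KiB) costs 16 probes; the 1MiB frame from the opening example costs 256.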
So ultimately, after writing an improved ___chkstk_ms, I could only measure a tiny difference in contrived programs, and none in any real application. Though there’s still one more benefit I haven’t yet mentioned…
My original motivation for this project wasn’t the optimization — which I didn’t even discover until after I had started — but licensing. I hate software licenses, and the tools I’ve written for w64devkit are dedicated to the public domain. Both source and binaries (as distributed). I can do so because I don’t link runtime components, not even libgcc. Not even header files. Every byte of code in those binaries is my work or the work of my collaborators.
Every once in a while ___chkstk_ms rears its ugly head, and I have to make a decision. Do I re-work my code to avoid it? Do I take the reins of the linker and disable stack probes? I haven’t necessarily allocated a large local array: A bit of luck with function inlining can combine several smaller stack frames into one that’s just large enough to require chkstk.
Since libgcc falls under the GCC Runtime Library Exception, if it’s linked into my program through an “Eligible Compilation Process” — which I believe includes w64devkit — then the GPL-licensed functions embedded in my binary are legally siloed and the GPL doesn’t infect the rest of the program. These bits are still GPL in isolation, and if someone were to copy them out of the program then they’d be normal GPL code again. In other words, it’s not a 100% public domain binary if libgcc was linked!
(If some FSF lawyer says I’m wrong, then this is an escape hatch through which anyone can scrub the GPL from GCC runtime code, and then ignore the runtime exception entirely.)
MSVC is worse. Hardly anyone follows its license, but fortunately for most
the license is practically unenforced. Its chkstk, which currently resides
in a loose chkstk.obj
, falls into what Microsoft calls “Distributable
Code.” Its license requires “external end users to agree to terms that
protect the Distributable Code.” In other words, if you compile a program
with MSVC, you’re required to have a EULA including the relevant terms
from the Visual Studio license. You’re not legally permitted to distribute
software in the manner of w64devkit — no installer, just a portable zip
distribution — if that software has been built with MSVC. At least not
without special care which nobody does. (Don’t worry, I won’t tell.)
To avoid libgcc entirely you need -nostdlib. Otherwise it’s implicitly offered to the linker, and you’d need to manually check if it picked up code from libgcc. If ld complains about a missing chkstk, use -lchkstk to get a definition. If you use -lchkstk when it’s not needed, nothing happens, so it’s safe to always include.
I also recently added a libmemory to w64devkit, providing tiny, public domain definitions of memset, memcpy, memmove, memcmp, and strlen. All compilers fabricate calls to these five functions even if you don’t call them yourself, which is how they were selected. (Not because I like them. I really don’t.) If a -nostdlib build complains about these, too, then add -lmemory.
$ gcc -nostdlib ... -lchkstk -lmemory
In MSVC the equivalent option is /nodefaultlib, after which you may see missing chkstk errors, and perhaps more. libchkstk.a is compatible with MSVC, and link.exe doesn’t care that the extension is .a rather than .lib, so supply it at link time. Same goes for libmemory.a if you need any of those, too.
$ cl ... /link /nodefaultlib libchkstk.a libmemory.a
While I despise licenses, I still take them seriously in the software I distribute. With libchkstk I have another tool to get it under control.
Big thanks to Felipe Garcia for reviewing and correcting mistakes in this article before it was published!
The assert macro in typical C implementations leaves a lot to be desired, as do raise and abort, so I’ve suggested alternative definitions that behave better under debuggers:
#define assert(c) while (!(c)) __builtin_trap()
#define assert(c) while (!(c)) __builtin_unreachable()
#define assert(c) while (!(c)) *(volatile int *)0 = 0
Each serves a slightly different purpose but still has the most important property: Immediately halt the program directly on the defect. None have an occasionally useful secondary property: Optionally allow the program to continue through the defect. If the program reaches the body of any of these macros then there is no reliable continuation. Even manually nudging the instruction pointer over the assertion isn’t enough. Compilers assume that the program cannot continue through the condition and generate code accordingly.
The MSVC ecosystem has a solution for this on x86: int3. The portable name is __debugbreak, a name I’ve borrowed elsewhere.
#define assert(c) do if (!(c)) __debugbreak(); while (0)
On x86 it inserts an int3 instruction, which fires an interrupt, trapping in the attached debugger, or otherwise abnormally terminating the program. Because it’s an interrupt, it’s expected that the program might continue. It even leaves the instruction pointer on the next instruction. As of this writing, GCC has no matching intrinsic, but Clang recently added __builtin_debugtrap. In GCC you need some less portable inline assembly: asm ("int3").
However, regardless of how you get an int3 into your program, GDB does not currently understand it. The problem is that feature I mentioned: The instruction pointer does not point at the int3 but at the next instruction. This confuses GDB, causing it to break in the wrong places, possibly even in the wrong scope. For example:
for (int i = 0; i < n; i++) {
    // ...
    int3_assert(...);
}
With int3 at the very end of the loop, GDB will break at the top of the next loop iteration, because that’s where the instruction pointer lands by the time GDB is involved. It’s a similar story when placed at the end of a function, leaving GDB to break in the caller. To resolve this, we need the instruction pointer to still be “inside” the breakpoint after the interrupt fires. Easy! Add a nop:
#define breakpoint() asm ("int3; nop")
This behaves beautifully, eliminating all the problems GDB has with a plain int3. Not only is this a solid basis for a continuable assertion, it’s also useful as a fast conditional breakpoint, where conventional conditional breakpoints are far too slow.
for (int i = 0; i < 1000000000; i++) {
    if (/* rare condition */) breakpoint();
    // ...
}
Could GDB handle int3 better? Yes! Visual Studio, for instance, does not require the nop instruction. As far as I know there is no ARM equivalent compatible with GDB (or even LLDB). The closest instruction, brk #0x1, does not behave as needed.
GDB’s built-in user interface understands three classes of breakpoint positions: symbols, context-free line numbers, and absolute addresses. When you set some breakpoints and (re)start a program under GDB, each kind of breakpoint is handled differently:
1. Resolve each symbol, placing a breakpoint on its run-time address.
2. Map each file+lineno tuple to a run-time address, and place a breakpoint on that address. If the line does not exist (i.e. the file is shorter), skip it.
3. Place breakpoints exactly on each absolute address. If it’s not a mapped address, don’t start the program.
The first is the best case because it adapts to program changes. Modify the code, recompile, and the breakpoint generally remains where you want it.
The third is the least useful. These breakpoints rarely survive across rebuilds, and sometimes not even across reruns.
The second sits in the middle between useful and useless. If you edit the source file containing the breakpoint — likely, because you placed the breakpoint there for a reason — chances are high that the line number is no longer correct. Instead it drifts, requiring manual replacement. This is tedious, and GDB ought to do better. Think that’s unreasonable? The Visual Studio debugger tracks breakpoints through external code edits quite effectively! GDB front ends tend to handle it better, especially when they’re also the code editor and so directly observe all edits.
As a workaround we can get the first kind by temporarily naming a line number. This requires editing the source, but remember, the very reason we need it is because the source in question is actively changing. How to name a line? C and C++ labels give a name to program position:
void example(double *nums, int n, ...)
{
    for (int i = 0; i < n; i++) {
        loop:  // named position at the start of the loop
        // ...
    }
}
The name loop is local to example, but the qualified example:loop is a global name, as suitable as any other symbol. I could, say, reliably trace the progress of this loop despite changes to its position in the source.
(gdb) dprintf example:loop,"nums[%d] = %g\n",i,nums[i]
One downside is dealing with -Wunused-label (enabled by -Wall), and so I’ve considered disabling the warning in my defaults. Update: Matthew Fernandez pointed out that the unused label attribute eliminates the warning, solving my problem:
for (int i = 0; i < n; i++) {
    loop: __attribute((unused))
    // ...
}
More often I use an assembly label, usually named b for convenience:
for (int i = 0; i < n; i++) {
    asm ("b:");
    // ...
}
Like int3, sometimes it’s necessary to give it a nop so that GDB has something on which to break. “Enabling” it at any time is quick:
(gdb) b b
Because it’s not .globl, it’s a local symbol, and I can place up to one per translation unit, all covered by the same GDB breakpoint item (less useful than it sounds). I haven’t actually checked, but I probably use dprintf with such named lines more often than actual breakpoints.
If you have similar tips and tricks of your own, I’d like to learn about them!
Users of mature C libraries conventionally get to choose how memory is allocated — that is, when it cannot be avoided entirely. The C standard never laid down a convention — perhaps for the better — so each library re-invents an allocator interface. Not all are created equal, and most repeat a few fundamental mistakes. Often the interface is merely a token effort, to check off that it’s “supported” without actual consideration to its use. This article describes the critical features of a practical allocator interface, and demonstrates why they’re important.
Before diving into the details, here’s the checklist for library authors:
The standard library allocator keeps its state in global variables. This makes for a simple interface, but comes with significant performance and complexity costs. These costs likely motivate custom allocator use in the first place, in which case slavishly duplicating the standard interface is essentially the worst possible option. Unfortunately this is typical:
#define LIB_MALLOC malloc
#define LIB_FREE free
I could observe the library’s allocations, and I could swap in a library functionality equivalent to the standard library allocator — jemalloc, mimalloc, etc. — but that’s about it. Better than nothing, I suppose, but only just so. Function pointer callbacks are slightly better:
typedef struct {
    void *(*malloc)(size_t);
    void (*free)(void *);
} allocator;

session *session_new(..., allocator);
At least I could use different allocators at different times, and there are even tricks to bind a context pointer to the callback. It also works when the library is dynamically linked.
Either case barely qualifies as custom allocator support, and they’re useless when it matters most. Only a small ingredient is needed to make these interfaces useful: a context pointer.
// NOTE: Better, but still not great
typedef struct {
    void *(*malloc)(size_t, void *ctx);
    void (*free)(void *, void *ctx);
    void *ctx;
} allocator;
Users can choose from where the library will allocate at a given time. It liberates the allocator from global variables (or janky workarounds) and multithreading woes. The default can still hook up to the standard library through stubs that fit these interfaces.
static void *lib_malloc(size_t size, void *ctx)
{
    (void)ctx;
    return malloc(size);
}

static void lib_free(void *ptr, void *ctx)
{
    (void)ctx;
    free(ptr);
}

static allocator lib_allocator = {lib_malloc, lib_free, 0};
Note that the context pointer came after the “standard” arguments. All things being equal, “extra” arguments should go after standard ones. But don’t sweat it! In the most common calling conventions this allows stub implementations to be merely an unconditional jump. It’s as though the stubs are a kind of subtype of the original functions.
lib_malloc:
        jmp malloc
lib_free:
        jmp free
Typically the decision is completely arbitrary, and so this minutia tips the balance.
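As a quick illustration of what the context pointer buys — a toy example of my own, not from any particular library — an allocation counter needs no global state whatsoever:

```c
#include <stdlib.h>

// The context-pointer interface from above.
typedef struct {
    void *(*malloc)(size_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void  *ctx;
} allocator;

typedef struct {
    size_t live;  // outstanding allocations
} counter;

static void *counting_malloc(size_t size, void *ctx)
{
    counter *c = ctx;
    void *p = malloc(size);
    if (p) c->live++;  // count only successful allocations
    return p;
}

static void counting_free(void *ptr, void *ctx)
{
    counter *c = ctx;
    if (ptr) c->live--;
    free(ptr);
}
```

Two libraries handed two different counter contexts are tracked independently, even across threads, which the global-state interfaces at the top simply cannot express.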
So what’s the big deal? It means we can trivially plug in, say, a tiny arena allocator. To demonstrate, consider this fictional string set and partial JSON API, each of which supports a custom allocator. For simplicity — I’m attempting to balance substance and brevity — they share an allocator interface. (Note: Because subscripts and sizes should be signed, and we’re now breaking away from the standard library allocator, I will use ptrdiff_t for the rest of the examples.)
typedef struct {
    void *(*malloc)(ptrdiff_t, void *ctx);
    void (*free)(void *, void *ctx);
    void *ctx;
} allocator;

typedef struct set set;
set *set_new(allocator *);
set *set_free(set *);
bool set_add(set *, char *);

typedef struct json json;
json *json_load(char *buf, ptrdiff_t len, allocator *);
json *json_free(json *);
ptrdiff_t json_length(json *);
json *json_subscript(json *, ptrdiff_t i);
json *json_getfield(json *, char *field);
double json_getnumber(json *);
char *json_getstring(json *);
set and json objects retain a copy of the allocator object for all allocations made through that object. Given nothing, they default to the standard library using the pass-through definitions above. Used together with the standard library allocator:
typedef struct {
    double sum;
    bool ok;
} sum_result;
sum_result sum_unique(char *buf, ptrdiff_t len)
{
    sum_result r = {0};
    json *namevals = json_load(buf, len, 0);
    if (!namevals) {
        return r;  // parse error
    }
    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        json_free(namevals);
        return r;  // not an array
    }
    set *seen = set_new(0);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name    = json_getfield(element, "name");
        json *value   = json_getfield(element, "value");
        if (!name || !value) {
            set_free(seen);
            json_free(namevals);
            return r;  // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }
    set_free(seen);
    json_free(namevals);
    r.ok = 1;
    return r;
}
Given this JSON input:
[
{"name": "foo", "value": 123},
{"name": "bar", "value": 456},
{"name": "foo", "value": 1000}
]
it would return 579.0
. Because it’s using standard library allocation, it
must carefully clean up before returning. There’s also no out-of-memory
handling because, in practice, programs typically do not get to observe
and respond to the standard allocator running out of memory.
We can improve and simplify it with an arena allocator:
typedef struct {
char *beg;
char *end;
jmp_buf *oom;
} arena;
void *arena_malloc(ptrdiff_t size, void *ctx)
{
arena *a = ctx;
ptrdiff_t available = a->end - a->beg;
ptrdiff_t alignment = -size & 15;
if (size > available-alignment) {
longjmp(*a->oom, 1);
}
return a->end -= size + alignment;
}
void arena_free(void *ptr, void *ctx)
{
// nothing to do (yet!)
}
I’m allocating from the end rather than the beginning because it will make a later change simpler. Applying that to the function:
sum_result sum_unique(char *buf, ptrdiff_t len, arena scratch)
{
    sum_result r = {0};
    allocator a = {0};
    a.malloc = arena_malloc;
    a.free = arena_free;
    a.ctx = &scratch;
    json *namevals = json_load(buf, len, &a);
    if (!namevals) {
        return r; // parse error
    }
    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        return r; // not an array
    }
    set *seen = set_new(&a);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name = json_getfield(element, "name");
        json *value = json_getfield(element, "value");
        if (!name || !value) {
            return r; // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }
    r.ok = 1;
    return r;
}
Calls to set_free
and json_free
are no longer necessary because the
arena automatically frees these on any return, in O(1). I almost feel bad
the library authors bothered to write them! It also handles allocation
failure without introducing it to sum_unique
. We may even deliberately
restrict the memory available to this function — perhaps because the input
is untrusted, and we want to quickly abort denial-of-service attacks — by
giving it a small arena, relying on out-of-memory to reject pathological
inputs.
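To make the memory-capping idea concrete, here’s a self-contained sketch (repeating the arena definitions above, with the `longjmp` call spelled out). The sizes and counts are arbitrary: a burst of allocations either fits the budget or trips the out-of-memory policy.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    char *beg;
    char *end;
    jmp_buf *oom;
} arena;

// Same allocator as above, repeated for self-containment.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Returns 1 if `count` allocations of `size` bytes fit in a `cap`-byte
// arena, 0 if the out-of-memory policy (longjmp) fired instead.
int fits(ptrdiff_t cap, int count, ptrdiff_t size)
{
    char *mem = malloc(cap);
    jmp_buf oom;
    arena a = {mem, mem + cap, &oom};
    int volatile ok = 0;
    if (!setjmp(oom)) {
        for (int i = 0; i < count; i++) {
            arena_malloc(size, &a);
        }
        ok = 1;
    }
    free(mem);
    return ok;
}
```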
There are so many possibilities unlocked by the context pointer.
When an application frees an object it always has the original, requested allocation size on hand. After all, it’s a necessary condition to use the object correctly. In the simplest case it’s the size of the freed object’s type: a static quantity. If it’s an array, then it’s a multiple of the tracked capacity: a dynamic quantity. In any case the size is either known statically or tracked dynamically by the application.
Yet free()
does not accept a size, meaning that the allocator must track
the information redundantly! That’s a needless burden on custom
allocators, and with a bit of care a library can lift it.
This was noticed in C++, and WG21 added sized deallocation in
C++14. It’s now the default on two of the three major implementations (and
probably not the two you’d guess). In other words, object size is so
readily available that it can mostly be automated away. Notable exception:
operator new[]
and operator delete[]
with trivial destructors. With
non-trivial destructors, operator new[]
must track the array length for
its own purposes on top of libc bookkeeping. In other words, array
allocations have their size stored in at least three different places!
That means the “free” interface should look like this:
void lib_free(void *ptr, ptrdiff_t len, void *ctx);
And calls inside the library might look like:
lib_free(p, sizeof(*p), ctx);
lib_free(a, sizeof(*a)*len, ctx);
Now that arena_free
has size information, it can free an allocation if
it was the most recent:
void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
arena *a = ctx;
if (ptr == a->end) {
ptrdiff_t alignment = -size & 15;
a->end += size + alignment;
}
}
If the library allocates short-lived objects to compute some value, then discards in reverse order, the memory can be reused. The arena doesn’t have to do anything special. The library merely needs to share its knowledge with the allocator.
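Here’s that reuse behavior as a self-contained check (arena definitions repeated from above): after freeing the most recent allocation, the very next allocation lands in exactly the same spot.

```c
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    char *beg;
    char *end;
    jmp_buf *oom;
} arena;

// Repeated from above for self-containment.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Sized free: releases the allocation only if it was the most recent.
void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    if (ptr == a->end) {
        ptrdiff_t alignment = -size & 15;
        a->end += size + alignment;
    }
}

// LIFO reuse: the second allocation lands where the first one was.
int reuses_memory(void)
{
    static char mem[256];
    jmp_buf oom;
    arena a = {mem, mem + sizeof(mem), &oom};
    if (setjmp(oom)) {
        return 0;
    }
    void *p = arena_malloc(100, &a);
    arena_free(p, 100, &a);
    void *q = arena_malloc(100, &a);
    return p == q;
}
```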
Beyond arena allocation, an allocator could use the size to locate the allocation’s size class and, say, push it onto a freelist of its size class. Size-class freelists compose well with arenas, and an implementation is short and simple when the caller of “free” communicates object size.
Another idea: During testing, use a debug allocator that tracks object size and validates the reported size against its own bookkeeping. This can help catch mistakes sooner.
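A minimal sketch of such a debug allocator, assuming a 16-byte header is acceptable overhead. The `dbghdr`, `dbg_malloc`, and `dbg_free` names are hypothetical: each allocation is prefixed with its requested size, and “free” asserts that the caller reported the same size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical debug allocator: prefix each allocation with its
// requested size, then validate the size reported to "free" against
// it. The 16-byte header preserves worst-case alignment.
typedef struct {
    ptrdiff_t size;
    char      pad[8];
} dbghdr;

void *dbg_malloc(ptrdiff_t size, void *ctx)
{
    (void)ctx;
    dbghdr *h = malloc(sizeof(dbghdr) + size);
    if (!h) {
        return 0;
    }
    h->size = size;
    return h + 1;
}

void dbg_free(void *ptr, ptrdiff_t size, void *ctx)
{
    (void)ctx;
    if (ptr) {
        dbghdr *h = (dbghdr *)ptr - 1;
        assert(h->size == size);  // caller misreported the size?
        free(h);
    }
}
```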
Resizing an allocation requires a lot from an allocator, and it should be avoided if possible. At the very least it cannot be done at all without knowing the original allocation size. An allocator can’t simply no-op it like it can with “free.” With the standard library interface, allocators have no choice but to redundantly track object sizes when “realloc” is required.
So, just as with “free,” the allocator should be given the old object size!
void *lib_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx);
At the very least, an allocator could implement “realloc” with “malloc”
and memcpy
:
void *arena_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx)
{
assert(new > old);
void *r = arena_malloc(new, ctx);
return memcpy(r, ptr, old);
}
Of the three checklist items, this is the most neglected. Exercise for the
reader: The last-allocated object can instead be resized in place using
memmove
. If this is frequently expected, allocate from the front, adjust
arena_free
as needed, and extend the allocation in place as discussed in a
previous addendum, without any copying.
Let’s examine real world examples to see how well they fit the checklist. First up is uthash, a popular, easy-to-use, intrusive hash table:
#define uthash_malloc(sz) my_malloc(sz)
#define uthash_free(ptr, sz) my_free(ptr)
No “realloc” so it trivially checks (3). It optionally provides the old size to “free” which checks (2). However it misses (1) which is the most important, greatly limiting its usefulness.
Next is the venerable zlib. It has function pointers with these
prototypes on its z_stream
object.
void *zlib_malloc(void *ctx, unsigned items, unsigned size);
void zlib_free(void *ctx, void *ptr);
The context pointer checks (1), and I can confirm from experience that it’s genuinely useful with a custom allocator. No “realloc” so it passes (3) automatically. It misses (2), but in practice this hardly matters: It allocates everything up front, and frees at the very end, meaning a no-op “free” is quite sufficient.
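Plugging an arena into this interface might look like the following sketch. The adapter names are mine, and it targets the prototype shapes shown above rather than zlib.h itself; in real use the adapters would be assigned to a z_stream’s allocation fields.

```c
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    char *beg;
    char *end;
    jmp_buf *oom;
} arena;

// Repeated from earlier for self-containment.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Hypothetical adapters matching the callback shapes above. "free" is
// a no-op: the arena releases everything at once when the stream is
// done.
void *zlib_arena_malloc(void *ctx, unsigned items, unsigned size)
{
    return arena_malloc((ptrdiff_t)items * size, ctx);
}

void zlib_arena_free(void *ctx, void *ptr)
{
    (void)ctx;
    (void)ptr;
}
```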
Finally there’s the Lua programming language with this economical, single-function interface:
typedef void *(*lua_Alloc)(void *ctx, void *ptr, size_t old, size_t new);
It packs all three allocator functions into one function. It includes a context pointer (1), a free size (2), and two realloc sizes (3). It’s a simple allocator’s best friend!
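For illustration, a lua_Alloc-shaped function over a bump arena might look like this sketch. Returning null on exhaustion defers the error to the caller, which, as I understand the contract, is what Lua expects; no longjmp is needed.

```c
#include <stddef.h>
#include <string.h>

// A minimal bump arena for this sketch; no out-of-memory jump because
// the lua_Alloc contract reports failure by returning null.
typedef struct {
    char *beg;
    char *end;
} arena;

// All three operations in one function: new==0 frees (a no-op for an
// arena), ptr==0 allocates, and otherwise it reallocates by copying.
void *arena_lua_alloc(void *ctx, void *ptr, size_t old, size_t new)
{
    arena *a = ctx;
    if (new == 0) {
        return 0;                 // free: nothing to do
    }
    size_t pad = -new & 15;
    if (new + pad > (size_t)(a->end - a->beg)) {
        return 0;                 // out of memory: caller handles it
    }
    void *r = a->end -= new + pad;
    if (ptr) {
        memcpy(r, ptr, old < new ? old : new);  // realloc: copy over
    }
    return r;
}
```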
This has been a ground-breaking year for my C skills, and paradigm shifts in my technique have provoked me to reconsider my habits and coding style. It’s been my largest personal style change in years, so I’ve decided to take a snapshot of its current state and my reasoning. These changes have produced significant productive and organizational benefits, so while most is certainly subjective, it likely includes a few objective improvements. I’m not saying everyone should write C this way, and when I contribute code to a project I follow their local style. This is about what works well for me.
Starting with the fundamentals, I’ve been using short names for primitive
types. The resulting clarity was more than I had expected, and it’s made
my code more enjoyable to review. These names appear frequently throughout
a program, so conciseness pays. Also, now that I’ve gone without, _t
suffixes are more visually distracting than I had realized.
typedef uint8_t u8;
typedef char16_t c16;
typedef int32_t b32;
typedef int32_t i32;
typedef uint32_t u32;
typedef uint64_t u64;
typedef float f32;
typedef double f64;
typedef uintptr_t uptr;
typedef char byte;
typedef ptrdiff_t size;
typedef size_t usize;
Some people prefer an s
prefix for signed types. I prefer i
, plus as
you’ll see, I have other designs for s
. For sizes, isize
would be more
consistent, and wouldn’t hog the identifier, but signed sizes are the
way and so I want them in a place of privilege. usize
is niche,
mainly for interacting with external interfaces where it might matter.
b32
is a “32-bit boolean” and communicates intent. I could use _Bool
,
but I’d rather stick to a natural word size and stay away from its weird
semantics. To beginners it might seem like “wasting memory” by using a
32-bit boolean, but in practice that’s never the case. It’s either in a
register (return value, local variable) or would be padded anyway (struct
field). When it actually matters, I pack booleans into a flags
variable,
and a 1-byte boolean is rarely important.
While UTF-16 might seem niche, it’s a necessary evil when dealing with
Win32, so c16
(“16-bit character”) has made a frequent appearance. I
could have based it on uint16_t
, but putting the name char16_t
in its
“type hierarchy” communicates to debuggers, particularly GDB, that for
display purposes these variables hold character data. Officially Win32
uses a type named wchar_t
, but I like being explicit about UTF-16.
u8
is for octets, usually UTF-8 data. It’s distinct from byte
, which
represents raw memory and is a special aliasing type. In theory these
can be distinct types with differing semantics, though I’m not aware of
any implementation that does so (yet?). For now it’s about intent.
What about systems that don’t support fixed width types? That’s academic,
and far too much time has been wasted worrying about it. That includes
time wasted on typing out int_fast32_t
and similar nonsense. Virtually
no existing software would actually work correctly on such systems — I’m
certain nobody’s testing it after all — so it seems nobody else cares
either.
I don’t intend to use these names in isolation, such as in code snippets
(outside of this article). If I did, examples would require the typedefs
to give readers the complete context. That’s not worth extra explanation.
Even in the most recent articles I’ve used ptrdiff_t
instead of size
.
Next, some “standard” macros:
#define countof(a) (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
While I still prefer ALL_CAPS
for constants, I’ve adopted lowercase for
function-like macros because it’s nicer to read. They don’t have the same
namespace problems as other macro definitions: I can have a macro named
new()
and also variables and fields named new
because they don’t look
like function calls.
For GCC and Clang, my favorite assert
macro now looks like this:
#define assert(c) while (!(c)) __builtin_unreachable()
It has useful properties beyond the usual benefits:
It does not require separate definitions for debug and release builds. Instead it’s controlled by the presence of Undefined Behavior Sanitizer (UBSan), which is already present/absent in these circumstances. That includes fuzz testing.
libubsan
provides a diagnostic printout with a file and line number.
In release builds it turns into a practical optimization hint.
To enable assertions in release builds, put UBSan in trap mode with
-fsanitize-trap
and then enable at least -fsanitize=unreachable
. In
theory this can also be done with -funreachable-traps
, but as of this
writing it’s been broken for the past few GCC releases.
No const
. It serves no practical role in optimization, and I cannot
recall an instance where it caught, or would have caught, a mistake. I
held out for awhile as prototype documentation, but on reflection I found
that good parameter names were sufficient. Dropping const
has made me
noticeably more productive by reducing cognitive load and eliminating
visual clutter. I now believe its inclusion in C was a costly mistake.
(One small exception: I still like it as a hint to place static tables in
read-only memory closer to the code. I’ll cast away the const
if needed.
This is only of minor importance.)
Literal 0
for null pointers. Short and sweet. This is not new, but a
style I’ve used for about 7 years now, and it has appeared all over my
writing since. There are some theoretical edge cases where it may cause
defects, and lots of ink has been spilled on the subject, but
after a couple hundred thousand lines of code I’ve yet to see it happen.
restrict
when necessary, but better to organize code so that it’s not,
e.g. don’t write to “out” parameters in loops, or don’t use out parameters
at all (more on that momentarily). I don’t bother with inline
because I
compile everything as one translation unit anyway.
typedef
all structures. I used to shy away from it, but eliminating the
struct
keyword makes code easier to read. If it’s a recursive structure,
use a forward declaration immediately above so that such fields can use
the short name:
typedef struct map map;
struct map {
map *child[4];
// ...
};
Declare all functions static
except for entry points. Again, with
everything compiled as a single translation unit there’s no reason to do
otherwise. It was probably a mistake for C not to default to static
,
though I don’t have a strong opinion on the matter. With the clutter
eliminated through short types, no const
, no struct
, etc. functions
fit comfortably on the same line as their return type. I used to break
them apart so that the function name began on its own line, but that’s no
longer necessary.
In my writing I sometimes omit static
to simplify, and because outside
the context of a complete program it’s mostly irrelevant. However, I will
use it below to emphasize this style.
For awhile I capitalized type names as that effectively put them in a kind of namespace apart from variables and functions, but I eventually stopped. I may try this idea in different way in the future.
One of my most productive changes this year has been the total rejection of null terminated strings — another of those terrible mistakes — and the embrace of this basic string type:
#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
u8 *data;
size len;
} s8;
I’ve used a few names for it, but this is my favorite. The s
is for
string, and the 8
is for UTF-8 or u8
. The s8
macro (sometimes just
spelled S
) wraps a C string literal, making a s8
string out of it. A
s8
is handled like a fat pointer, passed and returned by copy.
s8
makes for a great function prefix, unlike str
, all of which are
reserved. Some examples:
static s8 s8span(u8 *, u8 *);
static b32 s8equals(s8, s8);
static size s8compare(s8, s8);
static u64 s8hash(s8);
static s8 s8trim(s8);
static s8 s8clone(s8, arena *);
Then when combined with the macro:
if (s8equals(tagname, s8("body"))) {
// ...
}
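For completeness, here are minimal sketches of two of those prototypes, with the supporting typedefs repeated (and `lengthof` inlined) so the block stands alone. These are illustrations, not necessarily how I’d write them in a real program.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef ptrdiff_t size;

typedef struct {
    u8  *data;
    size len;
} s8;

// Simplified lengthof, equivalent to countof(s)-1 for string literals.
#define lengthof(s) ((size)(sizeof(s) - 1))
#define s8(s)       (s8){(u8 *)s, lengthof(s)}

// Wrap the half-open range [beg, end) as a string.
static s8 s8span(u8 *beg, u8 *end)
{
    s8 r = {0};
    r.data = beg;
    r.len  = end - beg;
    return r;
}

// Content equality; empty strings compare equal regardless of pointer.
static b32 s8equals(s8 a, s8 b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}
```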
You might be tempted to use a flexible array member to pack the size and array together as one allocation. Tried it. Its inflexibility is totally not worth whatever benefits it might have. Consider, for instance, how you’d create such a string out of a literal, and how it would be used.
A few times I’ve thought, “This program is simple enough that I don’t need
a string type for this data.” That thought is nearly always wrong. Having
it available helps me think more clearly, and makes for simpler programs.
(C++ got it only a few years ago with std::string_view
and std::span
.)
It has a natural UTF-16 counterpart, s16
:
#define s16(s) (s16){u##s, lengthof(u##s)}
typedef struct {
c16 *data;
size len;
} s16;
I’m not entirely sold on gluing u
to the literal in the macro, versus
writing it out on the string literal.
Another change has been preferring structure returns instead of out parameters. It’s effectively a multiple value return, though without destructuring. A great organizational change. For example, this function returns two values, a parse result and a status:
typedef struct {
i32 value;
b32 ok;
} i32parsed;
static i32parsed i32parse(s8);
Worried about the “extra copying?” Have no fear, because in practice
calling conventions turn this into a hidden, restrict
-qualified out
parameter — if it’s not inlined such that any return value overhead would
be irrelevant anyway. With this return style I’m less tempted to use
in-band signals like special null returns to indicate errors, which is
less clear.
It’s also led to a style of defining a zero-initialized return value at
the top of the function, i.e. ok
is false, and then use it for all
return
statements. On error, it can bail out with an immediate return.
The success path sets ok
to true before the return.
static i32parsed i32parse(s8 s)
{
i32parsed r = {0};
for (size i = 0; i < s.len; i++) {
u8 digit = s.data[i] - '0';
// ...
if (overflow) {
return r;
}
r.value = r.value*10 + digit;
}
r.ok = 1;
return r;
}
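A filled-in version of that sketch, with the elided digit and overflow checks made explicit, and the supporting typedefs repeated for self-containment. The exact checks are my own; the article leaves them as `// ...`.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef int32_t   i32;
typedef ptrdiff_t size;

typedef struct { u8 *data; size len; } s8;
typedef struct { i32 value; b32 ok;  } i32parsed;

#define s8(s) (s8){(u8 *)s, (size)(sizeof(s) - 1)}

static i32parsed i32parse(s8 s)
{
    i32parsed r = {0};
    if (!s.len) {
        return r;  // empty input
    }
    for (size i = 0; i < s.len; i++) {
        u8 digit = s.data[i] - '0';
        if (digit > 9) {
            return r;  // not a digit
        }
        if (r.value > (INT32_MAX - digit)/10) {
            return r;  // would overflow i32
        }
        r.value = r.value*10 + digit;
    }
    r.ok = 1;
    return r;
}
```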
Aside from static data, I’ve also moved away from initializers except the
conventional zero initializer. (Notable exception: s8
and s16
macros.)
This includes designated initializers. Instead I’ve been initializing with
assignments. For example, this buffered output “constructor”:
typedef struct {
u8 *buf;
i32 len;
i32 cap;
i32 fd;
b32 err;
} u8buf;
static u8buf newu8buf(arena *perm, i32 cap, i32 fd)
{
u8buf r = {0};
r.buf = new(perm, u8, cap);
r.cap = cap;
r.fd = fd;
return r;
}
I like how this reads, but it also eliminates a cognitive burden: The assignments are separated by sequence points, giving them an explicit order. It doesn’t matter here, but in other cases it does:
example e = {
.name = randname(&rng),
.age = randage(&rng),
.seat = randseat(&rng),
};
There are 6 possible values for e
from the same seed. I like no longer
thinking about these possibilities.
Prefer __attribute
to __attribute__
. The __
suffix is excessive and
unnecessary.
__attribute((malloc, alloc_size(2, 4)))
For Win32 systems programming, which typically only requires a modest
number of declarations and definitions, rather than include windows.h
,
write the prototypes out by hand using custom types. It reduces
build times, declutters namespaces, and interfaces more cleanly with the
program (no more DWORD
/BOOL
/ULONG_PTR
, but u32
/b32
/uptr
).
#define W32(r) __declspec(dllimport) r __stdcall
W32(void) ExitProcess(u32);
W32(i32) GetStdHandle(u32);
W32(byte *) VirtualAlloc(byte *, usize, u32, u32);
W32(b32) WriteConsoleA(uptr, u8 *, u32, u32 *, void *);
W32(b32) WriteConsoleW(uptr, c16 *, u32, u32 *, void *);
For inline assembly, treat the outer parentheses like braces, put a space
before the opening parenthesis, just like if
, and start each constraint
line with its colon.
static u64 rdtscp(void)
{
u32 hi, lo;
asm volatile (
"rdtscp"
: "=d"(hi), "=a"(lo)
:
: "cx", "memory"
);
return (u64)hi<<32 | lo;
}
There’s surely a lot more to my style than this, but unlike the above,
those details haven’t changed this year. To see most of the mentioned
items in action in a small program, see wordhist.c
, one of my
testing grounds for hash-tries, or for a slightly larger program,
asmint.c
, a mini programming language implementation.
Unlike a hash map or linked list, a dynamic array — a data buffer with a
size that varies during run time — is more difficult to square with arena
allocation. They’re contiguous by definition, and we cannot resize objects
in the middle of an arena, i.e. realloc
. So while convenient, they come
with trade-offs. At least until they stop growing, dynamic arrays are more
appropriate for shorter-lived, temporary contexts, where you would use a
scratch arena. On average they consume about twice the memory of a fixed
array of the same size.
As before, I begin with a motivating example of its use. The guts of the
generic dynamic array implementation are tucked away in a push()
macro,
which is essentially the entire interface.
typedef struct {
int32_t *data;
ptrdiff_t len;
ptrdiff_t cap;
} int32s;
int32s fibonacci(int32_t max, arena *perm)
{
static int32_t init[] = {0, 1};
int32s fib = {0};
fib.data = init;
fib.len = fib.cap = countof(init);
for (;;) {
int32_t a = fib.data[fib.len-2];
int32_t b = fib.data[fib.len-1];
if (a+b > max) {
return fib;
}
*push(&fib, perm) = a + b;
}
}
Anyone familiar with Go will quickly notice a pattern: int32s
looks an
awful lot like a Go slice. That was indeed my inspiration, and
there is enough context that you could infer similar semantics. I
will even call these “slice headers.” Initially I tried a design based on
stretchy buffers, but I didn’t like the macros nor the ergonomics.
I wouldn’t write a fibonacci
this way in practice, but it’s useful for
highlighting certain features. Of particular note:
The dynamic array initially wraps a static array, yet I can append to it as though it were a dynamic allocation. If I don’t append at all, it still works. (Though of course the caller then shouldn’t modify the elements.)
push()
operates on any object which is slice-shaped. That is, it has
a pointer field named data
, a ptrdiff_t
length field named len
, a
ptrdiff_t
capacity field named cap
, and all in that order.
push()
evaluates to a pointer to the newly-pushed element. In my
example I immediately dereference and assign a value.
An element is zero-initialized the first time it’s pushed. I say “first
time” because you can truncate an array by reducing len
, and “pushing”
afterward will simply reveal the original elements.
The name int32s
is intended to evoke plurality. I’ll use this
convention again in a moment.
The arena passed to push()
is only used if the array needs to grow.
The new backing array will be allocated out of this arena regardless of
the original backing array.
Resizes always change the backing array address, and the old array remains valid. This is also just like slices in Go.
Despite the name perm
, I expect it points to the caller’s scratch
arena. It’s “permanent” only relative to the fibonacci
call. Otherwise
I might build the array in a scratch arena, then create a final copy in
a permanent arena.
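The build-in-scratch-then-copy pattern from that last point might be sketched like this: the clone is trimmed so its capacity equals its length, since it will never grow again. The bump allocator here is a stand-in for the article’s alloc(), without its overflow checks or OOM policy.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;

// Stand-in bump allocator: no overflow check or OOM policy here.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    void *r = a->beg + pad;
    a->beg += pad + size*count;
    return memset(r, 0, size*count);
}

// Copy a scratch-built array into a permanent arena, exact fit.
static int32s int32s_clone(int32s s, arena *perm)
{
    int32s r = {0};
    r.len = r.cap = s.len;  // trimmed: it will never grow again
    r.data = alloc(perm, sizeof(*r.data), _Alignof(int32_t), s.len);
    if (s.len) {
        memcpy(r.data, s.data, sizeof(*r.data)*s.len);
    }
    return r;
}
```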
For a slightly more realistic example: rendering triangles. Suppose we need data in array format for OpenGL, but we don’t know the number of vertices ahead of time. A dynamic array is convenient, especially if we discard the array as soon as OpenGL is done with it. We could build up entire scenes like this for each display frame.
typedef struct {
GLfloat x, y, z;
} GLvert;
typedef struct {
GLvert *data;
ptrdiff_t len;
ptrdiff_t cap;
} GLverts;
void renderobj(char *buf, ptrdiff_t len, arena scratch)
{
GLverts vs = {0};
objparser parser = newobjparser(buf, len);
for (...) {
*push(&vs, &scratch) = nextvert(&parser);
}
glVertexPointer(3, GL_FLOAT, 0, vs.data);
glDrawArrays(GL_TRIANGLES, 0, vs.len);
}
As before, GLverts
is slice-shaped. This time it’s zero-initialized,
which is a valid empty dynamic array. As with maps, that means any object
with such a field comes with a ready-to-use empty dynamic array. Putting
it together, here’s an example that gradually appends vertices to named
dynamic arrays, randomly accessed by string name:
typedef struct map map;
struct map {
    map *child[4];
    str name;
    GLverts verts;
};
GLverts *upsert(map **, str, arena *); // from the last article
map *example(..., arena *perm)
{
    map *m = 0;
    for (...) {
        str name = ...;
        GLvert v = ...;
        GLverts *vs = upsert(&m, name, perm);
        *push(vs, perm) = v;
    }
    return m;
}
That’s what Go would call map[str][]vert
, but allocated entirely out of
an arena. Ever thought C could do this so simply and conveniently? The
memory allocator (~15 lines), map (~30 lines), dynamic array (~30 lines),
constructors (0 lines), and destructors (0 lines) that power this total to
~75 lines of zero-dependency code!
I despise macro abuse, and programs substantially implemented in macros are annoying. They’re difficult to understand and debug. A good dynamic array implementation will require a macro, and one of my goals was to keep it as simple and minimal as possible. The macro’s job is to: determine whether the array is full and, if so, grow it through a supporting function; pass the element size (via sizeof
) to that function; and evaluate to a pointer to the newly-pushed element. Here’s what I came up with:
#define push(s, arena) \
((s)->len >= (s)->cap \
? grow(s, sizeof(*(s)->data), arena), \
(s)->data + (s)->len++ \
: (s)->data + (s)->len++)
The macro will be used as an expression, so it cannot use statements like
if
. The condition is therefore a ternary operator. If it’s full, it
calls the supporting grow
function. In either case, it computes the
result from data
. In particular, note that the grow
branch uses a
comma operator to sequence growth before pointer derivation, as grow
will change the value of data
as a side effect.
To be generic, the grow
function uses memcpy
-based type punning:
static void grow(void *slice, ptrdiff_t size, arena *a)
{
struct {
void *data;
ptrdiff_t len;
ptrdiff_t cap;
} replica;
memcpy(&replica, slice, sizeof(replica));
replica.cap = replica.cap ? replica.cap : 1;
ptrdiff_t align = 16;
void *data = alloc(a, 2*size, align, replica.cap);
replica.cap *= 2;
if (replica.len) {
memcpy(data, replica.data, size*replica.len);
}
replica.data = data;
memcpy(slice, &replica, sizeof(replica));
}
The slice header is copied over a local replica, avoiding conflicts with strict aliasing. This is the archetype slice header. It still requires that different pointers have identical memory representation. That’s virtually always true, and certainly true anywhere I’d use an arena.
If the capacity was zero, it behaves as though it was one, and so, through
doubling, zero-capacity arrays become capacity-2 arrays on the first push.
It’s better to let alloc
— whose definition, you may recall, included an
overflow check — handle size overflow so that it can invoke the out of
memory policy, so instead of doubling cap
, which would first require an
overflow check, it doubles the object size. This is a small constant
(i.e. from sizeof
), so doubling it is always safe.
Copying over old data includes a special check for zero-length inputs,
because, quite frustratingly, memcpy
does not accept null even
when the length is zero. I check for zero length instead of null so that
it’s more sensitive to defects. If the pointer is null with a non-zero
length, it will trip Undefined Behavior Sanitizer, or at least crash the
program, rather than silently skip copying.
Finally the updated replica is copied over the original slice header,
updating it with the new data
pointer and capacity. The original backing
array is untouched but is no longer referenced through this slice header.
Old slice headers will continue to function with the old backing array,
such as when the arena is reset to a point where the dynamic array was
smaller.
int32s vals = {0};
*push(&vals, &scratch) = 1; // resize: cap=2
*push(&vals, &scratch) = 2;
*push(&vals, &scratch) = 3; // resize: cap=4
{
arena tmp = scratch; // scoped arena
int32s extended = vals;
*push(&extended, &tmp) = 4;
*push(&extended, &tmp) = 5; // resize: cap=8
example(extended);
}
// vals still works, cap=4, extension freed
In practice, a dynamic array leaves behind old backing arrays whose total size adds up to just shy of the current capacity. For example, if the current capacity is 16, the old arrays have sizes 2+4+8 = 14.
If you’re worried about misuse, such as slice header fields being in the
wrong order, a couple of assertions can quickly catch such mistakes at run
time, typically under the lightest of testing. In fact, I planned for this
by using the more-sensitive len>=cap
instead of just len==cap
, so that
it would direct execution towards assertions in grow
:
assert(replica.len >= 0);
assert(replica.cap >= 0);
assert(replica.len <= replica.cap);
This also demonstrates another benefit of signed sizes: Exactly half the range is invalid and so defects tend to quickly trip these assertions.
Alignment is unfortunately fixed, and I picked a “safe” value of 16. In my
new()
macro I used _Alignof
to pass type information to alloc
. Due
to an oversight, unlike sizeof
, _Alignof
cannot be applied
to expressions, and so it cannot be used in dynamic arrays. GCC and Clang
support _Alignof
on expressions just like sizeof
, as it’s such an
obvious idea, but Microsoft chose to strictly follow the oversight in the
standard. To support MSVC, I’ve deliberately limited the capabilities of
push
. If that doesn’t matter, fixing it is easy:
--- a/example.c
+++ b/example.c
@@ -2,3 +2,3 @@
((s)->len >= (s)->cap \
- ? grow(s, sizeof(*(s)->data), arena), \
+ ? grow(s, sizeof(*(s)->data), _Alignof(*(s)->data), arena), \
(s)->data + (s)->len++ \
@@ -6,3 +6,3 @@
-static void grow(void *slice, ptrdiff_t size, arena *a)
+static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
@@ -16,3 +16,2 @@
replica.cap = replica.cap ? replica.cap : 1;
- ptrdiff_t align = 16;
void *data = alloc(a, 2*size, align, replica.cap);
Though while you’re at it, if you’re already using extensions you might
want to switch push
to a statement expression so that the slice
header s
does not get evaluated more than once — i.e. so that upsert()
in my example above could be used inside the push()
expression.
#define push(s, a) ({ \
typeof(s) s_ = (s); \
typeof(a) a_ = (a); \
if (s_->len >= s_->cap) { \
grow(s_, sizeof(*s_->data), _Alignof(*s_->data), a_); \
} \
s_->data + s_->len++; \
})
So far this approach to dynamic arrays has been useful on a number of occasions, and I’m quite happy with the results. As with arena-friendly hash maps, I’ve no doubt they’ll become a staple in my C programs.
Dennis Schön suggests checking whether the array ends exactly at the
arena’s next allocation point and, if so, extending the array into the
arena in place. grow()
already has the necessary information on hand, so it needs only the
additional check:
static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
struct {
char *data;
ptrdiff_t len;
ptrdiff_t cap;
} replica;
memcpy(&replica, slice, sizeof(replica));
if (!replica.data) {
replica.cap = 1;
replica.data = alloc(a, 2*size, align, replica.cap);
} else if (a->beg == replica.data + size*replica.cap) {
alloc(a, size, 1, replica.cap);
} else {
void *data = alloc(a, 2*size, align, replica.cap);
memcpy(data, replica.data, size*replica.len);
replica.data = data;
}
replica.cap *= 2;
memcpy(slice, &replica, sizeof(replica));
}
Because that’s yet another check for null, I’ve split it out into an independent case.
Not quite as simple as before, but it improves the most common case.
I’ve written before about MSI hash tables, a simple, very fast map that can be quickly implemented from scratch as needed, tailored to the problem at hand. The trade off is that one must know the upper bound a priori in order to size the base array. Scaling up requires resizing the array — an impedance mismatch with arena allocation. Search trees scale better, as there’s no underlying array, but tree balancing tends to be finicky and complex, unsuitable to rapid, on-demand implementation. We want the ease of an MSI hash table with the scaling of a tree.
I’ll motivate the discussion with example usage. Suppose we have an array of pointer+length strings, as defined last time:
typedef struct {
uint8_t *data;
ptrdiff_t len;
} str;
And we need a function that removes duplicates in place, but (for the
moment) we’re not worried about preserving order. This could be done
naively in quadratic time. Smarter is to sort, then look for runs.
Instead, I’ve used a hash map to track seen strings. It maps str
to
bool
, and it is represented as type strmap
and one insert+lookup
function, upsert
.
// Insert/get bool value for given str key.
bool *upsert(strmap **, str key, arena *);
ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
ptrdiff_t count = 0;
strmap *seen = 0;
while (count < len) {
bool *b = upsert(&seen, strings[count], &scratch);
if (*b) {
// previously seen (discard)
strings[count] = strings[--len];
} else {
// newly-seen (keep)
count++;
*b = 1;
}
}
return count;
}
In particular, note:
A null pointer is an empty hash map and initialization is trivial. As discussed in the last article, one of my arena allocation principles is default zero-initialization. Put together, that means any data structure containing a map comes with a ready-to-use, empty map.
The map is allocated out of the scratch arena so it’s automatically freed upon any return. It’s as care-free as garbage collection.
The map directly uses strings in the input array as keys, without making copies or worrying about ownership. Arenas own objects, not references. If I wanted to carve out some fixed keys ahead of time, I could even insert static strings.
upsert returns a pointer to a value, that is, a pointer into the map. This is not strictly required, but it usually makes for a simple interface. When an entry is new, this value will be false (zero-initialized).
So, what is this wonderful data structure? Here’s the basic shape:
typedef struct {
hashmap *child[4];
keytype key;
valtype value;
} hashmap;
The child and key fields are essential to the map. Adding a child to any data structure turns it into a hash map over whatever field you choose as the key. In other words, a hash-trie can serve as an intrusive hash map. In several programs I’ve combined intrusive lists and hash maps to create an insert-ordered hash map. Going the other direction, omitting value turns it into a hash set. (Which is what unique really needs!)
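To make the insert-ordered idea concrete, here is a minimal sketch combining the hash-trie with an intrusive linked list. Everything here is my own illustration, not code from the article: the names (imap, imaps, iupsert, hashint) are hypothetical, keys are ints for brevity, and the arena/alloc definitions are condensed from later in the article.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (count > (a->end - a->beg - pad)/size) abort();
    void *p = a->beg + pad;
    a->beg += pad + count*size;
    return memset(p, 0, count*size);
}
#define new(a, t) (t *)alloc(a, sizeof(t), _Alignof(t), 1)

typedef struct imap imap;
struct imap {
    imap    *child[4];  // hash-trie branches
    imap    *next;      // intrusive list in insertion order
    int32_t  key;
    int32_t  value;
};

typedef struct {
    imap  *root;  // trie root
    imap  *head;  // first-inserted node
    imap **tail;  // append position for the order list
} imaps;

static uint64_t hashint(int32_t x)
{
    return (uint64_t)(x + 0x9e3779b9) * 1111111111111111111u;
}

int32_t *iupsert(imaps *s, int32_t key, arena *perm)
{
    imap **m = &s->root;
    for (uint64_t h = hashint(key); *m; h <<= 2) {
        if ((*m)->key == key) {
            return &(*m)->value;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(perm, imap);
    (*m)->key = key;
    imap **t = s->tail ? s->tail : &s->head;
    *t = *m;              // append to the order list
    s->tail = &(*m)->next;
    return &(*m)->value;
}
```

Iterating head through next visits keys in the order they were first inserted, regardless of hash order.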
As you probably guessed, this hash-trie is a 4-ary tree. It can easily be 2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but 4-ary strikes a good balance, if a bit bulky. In the example above, keytype would be str and valtype would be bool. The most general form of upsert looks like this:
valtype *upsert(hashmap **m, keytype key, arena *perm)
{
for (uint64_t h = hash(key); *m; h <<= 2) {
if (equals(key, (*m)->key)) {
return &(*m)->value;
}
m = &(*m)->child[h>>62];
}
if (!perm) {
return 0;
}
*m = new(perm, hashmap);
(*m)->key = key;
return &(*m)->value;
}
This will take some unpacking. The first argument is a pointer to a pointer. That’s the destination for any newly-allocated element. As it travels down the tree, this points into the parent’s child array. If it points to null, then it’s an empty tree which, by definition, does not contain the key.
We need two “methods” for keys: hash and equals. The hash function should return a uniformly distributed integer. As is usually the case, less uniform fast hashes generally do better than highly-uniform slow hashes. For hash maps under ~100K elements a 32-bit hash is fine, but larger maps should use a 64-bit hash state and result. Hash collisions revert to linear, linked-list performance and, per the birthday paradox, that will happen often with 32-bit hashes on large hash maps.
If you’re worried about pathological inputs, add a seed parameter to upsert and hash. Or maybe even use the address m as a seed. The specifics depend on your security model. It’s not an issue for most hash maps, so I don’t demonstrate it here.
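As a minimal sketch of that idea (the seeded variant and its name are my own addition, not part of the article’s interface), the seed can simply perturb the hash’s initial state:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Hypothetical seeded variant of the article's FNV-style hash: the seed
// perturbs the initial state, so the same key hashes differently per
// map (or per process), frustrating crafted-collision inputs.
uint64_t hash_seeded(str s, uint64_t seed)
{
    uint64_t h = 0x100 ^ seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}
```

Because each step (XOR with a byte, multiply by an odd constant) is a bijection on 64-bit state, distinct seeds are guaranteed to produce distinct hashes for the same key.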
The top two bits of the hash are used to select a branch. These tend to be higher quality for multiplicative hash functions. At each level two bits are shifted out. This is what gives it its name: a trie of the hash bits. Though it’s un-trie-like in the way it deposits elements at the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a time.
I initially tried a Multiplicative Congruential Generator (MCG) to select the next branch at each trie level, instead of bit shifting, but NRK noticed it was consistently slower than shifting.
While deletion could be handled using tombstones, heavy use of deletes would not work well. After all, the underlying allocator is an arena. The combination of uniformly distributed branching and no deletion means that rebalancing is unnecessary. This is what grants it its simplicity!
If no arena is provided, upsert reverts to a pure lookup and returns null when the key is not found. This allows one function to flexibly serve both modes. In unique, pure lookups are unneeded, so this condition could be skipped in its strmap.
Sometimes it’s useful to return the entire hashmap object itself rather than an internal pointer, particularly when it’s intrusive. Use whichever works best for the situation. Regardless, exploit zero-initialization to detect newly-allocated elements when possible.
In some cases we may deep copy the key into the arena before inserting it into the map. The provided key may be a temporary (e.g. the result of sprintf) which the map outlives, and the caller doesn’t want to allocate a longer-lived key unless it’s needed. It’s all part of tailoring the map to the problem, which we can do because it’s so short and simple!
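Such a copy helper might look like the following sketch. The name strcopy is mine, and the arena/alloc definitions are condensed from the companion article; only the insertion path of upsert would change.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (count > (a->end - a->beg - pad)/size) abort();
    void *p = a->beg + pad;
    a->beg += pad + count*size;
    return memset(p, 0, count*size);
}

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Copy the key's bytes into the arena so the map outlives the caller's
// temporary buffer. In upsert, only the insertion path changes:
//     (*m)->key = strcopy(perm, key);   // instead of (*m)->key = key
str strcopy(arena *a, str s)
{
    str r = s;
    r.data = alloc(a, 1, 1, s.len);
    if (s.len) memcpy(r.data, s.data, s.len);
    return r;
}
```

Lookups never copy; the allocation happens only when a new node is created, which is exactly when a long-lived key is needed.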
Putting it all together, unique could look like the following, with strmap/upsert renamed to strset/ismember:
uint64_t hash(str s)
{
uint64_t h = 0x100;
for (ptrdiff_t i = 0; i < s.len; i++) {
h ^= s.data[i];
h *= 1111111111111111111u;
}
return h;
}
bool equals(str a, str b)
{
return a.len==b.len && !memcmp(a.data, b.data, a.len);
}
typedef struct {
strset *child[4];
str key;
} strset;
bool ismember(strset **m, str key, arena *perm)
{
for (uint64_t h = hash(key); *m; h <<= 2) {
if (equals(key, (*m)->key)) {
return 1;
}
m = &(*m)->child[h>>62];
}
*m = new(perm, strset);
(*m)->key = key;
return 0;
}
ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
ptrdiff_t count = 0;
for (strset *seen = 0; count < len;) {
if (ismember(&seen, strings[count], &scratch)) {
strings[count] = strings[--len];
} else {
count++;
}
}
return count;
}
The FNV hash multiplier is 19 ones, my favorite prime. I don’t bother with
an xorshift finalizer because the bits are used most-significant first.
Exercise for the reader: Support retaining the original input order using an intrusive linked list on strset.
As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes these hash-tries a bit heavier than average. It’s not an issue for smaller hash maps, but has practical consequences for huge hash maps.
In an attempt to address this, I experimented with relative pointers (example: markov.c). That is, instead of pointers I use signed integers whose value indicates an offset relative to itself. Because relative pointers can only refer to nearby memory, a custom allocator is imperative, and arenas fit the bill perfectly. Range can be extended by exploiting memory alignment. In particular, 32-bit relative pointers can reference up to 8GiB in either direction. Zero is reserved to represent a null pointer, and relative pointers cannot refer to themselves.
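A minimal sketch of this encoding, under the assumption that both the field and its target are at least 4-byte aligned (the helper names rp_set/rp_get are hypothetical, not from the article):

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t relptr;  // self-relative offset, stored in 4-byte units

// Store target as an offset from the field's own address. Zero is
// reserved for null, which is safe because a field never targets itself.
void rp_set(relptr *field, void *target)
{
    if (!target) {
        *field = 0;
    } else {
        ptrdiff_t d = (char *)target - (char *)field;
        *field = (relptr)(d >> 2);  // 4-byte units extend range to +/-8GiB
    }
}

// Decode back into a native pointer, or null for a zero offset.
void *rp_get(relptr *field)
{
    return *field ? (char *)field + ((ptrdiff_t)*field << 2) : 0;
}
```

Since offsets are relative to the field itself, a whole arena of such structures can be relocated and the offsets remain valid, which is the position-independence property described next.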
As a bonus, data structures built out of relative pointers are position independent. A collection of them — perhaps even a whole arena — can be dumped out to, say, a file, loaded back at a different position, then continue to operate as-is. Very cool stuff.
Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are guaranteed to work. No object is ever too far away. It’s a compounding effect, too. Smaller map nodes means a larger number of them are in reach of a relative pointer. Also very cool.
However, as far as I know, no generally available programming language implementation supports this concept well enough to put into practice. You could implement relative pointers with language extension facilities, such as C++ operator overloads, but no tools will understand them — a major bummer. You can no longer use a debugger to examine such structures, and it’s just not worth that cost. If only arena allocation was more popular…
For the finale, let’s convert upsert into a concurrent, lock-free hash map. That is, multiple threads can call upsert concurrently on the same map. Each must still have its own arena, probably per-thread arenas, so there’s no implicit locking for allocation.
The structure itself requires no changes! Instead we need two atomic operations: atomic load (acquire) and atomic compare-and-exchange (acquire/release). They operate only on child array elements and the tree root. To illustrate I will use GCC atomics, also supported by Clang.
valtype *upsert(map **m, keytype key, arena *perm)
{
for (uint64_t h = hash(key);; h <<= 2) {
map *n = __atomic_load_n(m, __ATOMIC_ACQUIRE);
if (!n) {
if (!perm) {
return 0;
}
arena rollback = *perm;
map *new = new(perm, map, 1);
new->key = key;
int pass = __ATOMIC_RELEASE;
int fail = __ATOMIC_ACQUIRE;
if (__atomic_compare_exchange_n(m, &n, new, 0, pass, fail)) {
return &new->value;
}
*perm = rollback;
}
if (equals(n->key, key)) {
return &n->value;
}
m = n->child + (h>>62);
}
}
First an atomic load retrieves the current node. If there is no such node, then attempt to insert one using atomic compare-and-exchange. The ABA problem is not an issue thanks again to lack of deletion: Once set, a pointer never changes. Before allocating a node, take a snapshot of the arena so that the allocation can be reverted on failure. If another thread got there first, continue tumbling down the tree as though a null was never observed.
On compare-and-swap failure, it turns into an acquire load, just as it began. On success, it’s a release store, synchronizing with acquire loads on other threads.
The key field does not require atomics because it’s synchronized by the compare-and-swap. That is, the assignment happens before the node is inserted, and keys do not change after insertion. The same goes for any zeroing done by the arena.
Loads and stores through the returned pointer are the caller’s responsibility. These likely require further synchronization. If valtype is a shared counter then an atomic increment is sufficient. In other cases, upsert should probably be modified to accept an initial value to be assigned alongside the key, so that the entire key/value pair is inserted atomically. Alternatively, break it into two steps. The details depend on the needs of the program.
On small trees there will be much contention near the root during inserts. Fortunately, a contentious tree will not stay small for long! The hash function will spread threads around a large tree, generally keeping them off each other’s toes.
A complete demo you can try yourself: concurrent-hash-trie.c. It returns a value pointer like above, and store/load is synchronized by the thread join. Each thread is given a per-thread subarena allocated out of the main arena, and the final tree is built from these subarenas.
For a practical example: a multithreaded rainbow table to find hash function collisions. Threads are synchronized solely through atomics in the shared hash-trie.
A complete fast, concurrent, lock-free hash map in under 30 lines of C sounds like a sweet deal to me!
Over the past year I’ve refined my approach to arena allocation. With practice, it’s effective, simple, and fast; typically as easy to use as garbage collection but without the costs. Depending on need, an allocator can weigh just 7–25 lines of code — perfect when lacking a runtime. With the core details of my own technique settled, now is a good time to document and share lessons learned. This is certainly not the only way to approach arena allocation, but these are practices I’ve worked out to simplify programs and reduce mistakes.
An arena is a memory buffer and an offset into that buffer, initially zero. To allocate an object, grab a pointer at the offset, advance the offset by the size of the object, and return the pointer. There’s a little more to it, such as ensuring alignment and availability. We’ll get to that. Objects are not freed individually. Instead, groups of allocations are freed at once by restoring the offset to an earlier value. Without individual lifetimes, you don’t need to write destructors, nor do your programs need to walk data structures at run time to take them apart. You also no longer need to worry about memory leaks.
A minority of programs inherently require, at least in part, general purpose allocation that linear allocation cannot fulfill. This includes, for example, most programming language runtimes. If you like arenas, avoid accidentally creating such a situation through an over-flexible API that allows callers to assume you have general purpose allocation underneath.
To get warmed up, here’s my style of arena allocation in action, showing off multiple features:
typedef struct {
uint8_t *data;
ptrdiff_t len;
} str;
typedef struct {
strlist *next;
str item;
} strlist;
typedef struct {
str head;
str tail;
} strpair;
// Defined elsewhere
void towidechar(wchar_t *, ptrdiff_t, str);
str loadfile(wchar_t *, arena *);
strpair cut(str, uint8_t);
strlist *getlines(str path, arena *perm, arena scratch)
{
int max_path = 1<<15;
wchar_t *wpath = new(&scratch, wchar_t, max_path);
towidechar(wpath, max_path, path);
strpair pair = {0};
pair.tail = loadfile(wpath, perm);
strlist *head = 0;
strlist **tail = &head;
while (pair.tail.len) {
pair = cut(pair.tail, '\n');
*tail = new(perm, strlist, 1);
(*tail)->item = pair.head;
tail = &(*tail)->next;
}
return head;
}
Take note of these details, each discussed at length later:
getlines takes two arenas, “permanent” and “scratch”. The former is for objects that will be returned to the caller. The latter is for temporary objects whose lifetime ends when the function returns. They have stack lifetimes just like local variables.
Objects are not explicitly freed. Instead, all allocations from a scratch arena are implicitly freed upon return. This would include error return paths automatically.
The scratch arena is passed by copy — i.e. a copy of the “header” not the memory region itself. Allocating only changes the local copy, and so cannot survive the return. The semantics are obvious to callers, so they’re less likely to get mixed up.
While wpath could be an automatic local variable, it’s relatively large for the stack, so it’s allocated out of the scratch arena. A scratch arena safely permits large, dynamic allocations that would never be safe on the stack. In other words, a sane alloca! The same goes for variable-length arrays (VLAs). A scratch arena means you’ll never be tempted to use either of these terrible ideas.
The second parameter to new is a type, so it’s obviously a macro. As you will see momentarily, this is not some complex macro magic, just a convenience one-liner. There is no implicit cast, and you will get a compiler diagnostic if the type is incorrect.
Despite all the allocation, there is not a single sizeof operator nor size computation. That’s because size computations are a major source of defects. That job is handled by specialized code.
Allocation failures are not communicated by a null return. Lifting this burden greatly simplifies programs. Instead such errors are handled non-locally by the arena.
All allocations are zero-initialized by default. This makes for simpler, less error-prone programs. When that’s too expensive, this can become an opt-out without changing the default.
See also u-config.
An arena suitable for most cases can be this simple:
typedef struct {
char *beg;
char *end;
} arena;
void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
ptrdiff_t available = a->end - a->beg - padding;
if (available < 0 || count > available/size) {
abort(); // one possible out-of-memory policy
}
void *p = a->beg + padding;
a->beg += padding + count*size;
return memset(p, 0, count*size);
}
Yup, just a pair of pointers! When allocating, all sizes are signed just as they ought to be. Unsigned sizes are another historically common source of defects, and offer no practical advantages in return.
The align parameter allows the arena to handle any unusual alignments, something that’s surprisingly difficult to do with libc. Its usefulness is hard to appreciate until it’s conveniently available.
The uintptr_t business may look unusual if you’ve never come across it before. To align beg, we need to compute the number of bytes to advance the address (padding) until the alignment evenly divides the address. The modulo with align computes the number of bytes since the last alignment boundary:
extra = addr % align
We can’t operate numerically on an address like this, so in the code we first convert to uintptr_t. Alignment is always a power of two, which notably excludes zero, so there’s no worrying about division by zero. That also means we can compute the modulo by subtracting one and masking with AND:
extra = addr & (align - 1)
However, we want the number of bytes to advance to the next alignment, which is the inverse:
padding = -addr & (align - 1)
Add the uintptr_t cast and you have the code in alloc.
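The identity can be checked directly with a few concrete addresses (a standalone sanity check of my own; the helper name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

// Bytes needed to advance addr to the next multiple of align, where
// align is a power of two. Zero when addr is already aligned.
ptrdiff_t padding_for(uintptr_t addr, uintptr_t align)
{
    return (ptrdiff_t)(-addr & (align - 1));
}
```

For example, address 17 with 8-byte alignment needs 7 bytes of padding to reach 24, while address 16 needs none.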
The if tests whether there’s enough memory and simultaneously checks for overflow in size*count. If either fails, it invokes the out-of-memory policy, which in this case is abort. I strongly recommend, at least when testing, always having something in place that, at minimum, aborts when allocation fails, even when you think it cannot happen. It’s easy to use more memory than you anticipate, and you want a reliable signal when it happens.
An alternative policy is to longjmp to a “handler”, which with GCC and Clang doesn’t even require runtime support. In that case add a jmp_buf to the arena:
typedef struct {
char *beg;
char *end;
void **jmp_buf;
} arena;
void *alloc(...)
{
// ...
if (/* out of memory */) {
__builtin_longjmp(a->jmp_buf, 1);
}
// ...
}
bool example(..., arena scratch)
{
void *jmp_buf[5];
if (__builtin_setjmp(jmp_buf)) {
return 0;
}
scratch.jmp_buf = jmp_buf;
// ...
return 1;
}
example returns failure to the caller if it runs out of memory, without needing to check individual allocations and, thanks to the implicit free of scratch arenas, without needing to clean up. If callees receiving the scratch arena don’t set their own jmp_buf, they’ll return here, too. In a real program you’d probably wrap the setjmp setup in a macro.
Suppose zeroing is too expensive or unnecessary in some cases. Add a flag to opt out:
void *alloc(..., int flags)
{
// ...
return flags&NOZERO ? p : memset(p, 0, total);
}
Similarly, perhaps there’s a critical moment where you’re holding a non-memory resource (lock, file handle), or you don’t want allocation failure to be fatal. In either case, it’s important that the out-of-memory policy isn’t invoked. You could request a “soft” failure with another flag, and then do the usual null pointer check:
void *alloc(..., int flags)
{
// ...
if (/* out of memory */) {
if (flags & SOFTFAIL) {
return 0;
}
abort();
}
// ...
}
Most non-trivial programs will probably have at least one of these flags.
In case it wasn’t obvious, allocating an arena is simple:
arena newarena(ptrdiff_t cap)
{
arena a = {0};
a.beg = malloc(cap);
a.end = a.beg ? a.beg+cap : 0;
return a;
}
Or make a direct allocation from the operating system, e.g. mmap or VirtualAlloc. Typically arena lifetime is the whole program, so you don’t need to worry about freeing it. (Since you’re using arenas, you can also turn off any memory leak checkers while you’re at it.)
If you need more arenas then you can always allocate smaller ones out of the first! In multi-threaded applications, each thread may have at least its own scratch arena.
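Carving a smaller arena out of an existing one might look like this sketch (subarena is my name for it, and the arena/alloc definitions are condensed from the ones above):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (count > (a->end - a->beg - pad)/size) abort();
    void *p = a->beg + pad;
    a->beg += pad + count*size;
    return memset(p, 0, count*size);
}

// Carve cap bytes out of the parent. The child owns that region; the
// parent's offset has moved past it, so they never overlap.
arena subarena(arena *parent, ptrdiff_t cap)
{
    arena a = {0};
    a.beg = alloc(parent, 1, 16, cap);  // 16-byte align suits any object
    a.end = a.beg + cap;
    return a;
}
```

This is how the per-thread scratch arenas mentioned above can be handed out: carve one child per thread at startup, and each thread allocates from its own child without synchronization.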
The new macro
I’ve shown alloc, but few parts of the program should be calling it directly. Instead they have a macro to automatically handle the details. I call mine new, though of course if you’re writing C++ you’ll need to pick another name (make? PushStruct?):
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
The cast is an extra compile-time check, especially useful for avoiding mistakes in levels of indirection. It also keeps normal code from directly using the sizeof operator, which is easy to misuse. If you added a flags parameter, pass in zero for this common case. Keep in mind that the goal of this macro is to make common allocation simple and robust. Often you’ll allocate single objects, and so the count is 1. If you think that’s ugly, you could make a variadic version of new that fills in common defaults. In fact, that’s partly why I put count last!
#define new(...) newx(__VA_ARGS__,new4,new3,new2)(__VA_ARGS__)
#define newx(a,b,c,d,e,...) e
#define new2(a, t) (t *)alloc(a, sizeof(t), alignof(t), 1, 0)
#define new3(a, t, n) (t *)alloc(a, sizeof(t), alignof(t), n, 0)
#define new4(a, t, n, f) (t *)alloc(a, sizeof(t), alignof(t), n, f)
Not quite so simple, but it optionally makes for more streamlined code:
thing *t = new(perm, thing);
thing *ts = new(perm, thing, 1000);
char *buf = new(perm, char, len, NOZERO);
Side note: If sizeof should be avoided, what about array lengths? That’s part of the problem! Hardly ever do you want the size of an array, but rather the number of elements. That includes char arrays, where this happens to be the same number. So instead, define a countof macro that uses sizeof to compute the value you actually want. I like to have this whole collection:
#define sizeof(x) (ptrdiff_t)sizeof(x)
#define countof(a) (sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
Yes, you can convert sizeof into a macro like this! It won’t expand recursively and bottoms out as an operator. countof also, of course, produces a less error-prone signed count so users don’t fumble around with size_t. lengthof statically produces the length of a null-terminated string.
char msg[] = "hello world";
write(fd, msg, lengthof(msg));
#define MSG "hello world"
write(fd, MSG, lengthof(MSG));
alloc with attributes
At least for GCC and Clang, we can further improve alloc with three function attributes:
__attribute((malloc, alloc_size(2, 4), alloc_align(3)))
void *alloc(...);
malloc indicates that the pointer returned by alloc does not alias any existing object. This enables some significant optimizations that are otherwise blocked, most often by breaking potential loop-carried dependencies.
alloc_size tracks the allocation size for compile-time diagnostics and run-time assertions (__builtin_object_size). This generally requires a non-zero optimization level. In other words, you will get compiler warnings about some out-of-bounds accesses of arena objects, and with Undefined Behavior Sanitizer you’ll get run-time bounds checking. It’s a great complement to fuzzing.
In theory alloc_align may also allow better code generation, but I’ve yet to observe such a case. Consider it optional and low-priority. I mention it only for completeness.
How large an arena should you allocate? The simple answer: As much as is necessary for the program to successfully complete. Usually the cost of untouched arena memory is low or even zero. Most programs should probably have an upper limit, at which point they assume something has gone wrong. Arenas allow this case to be handled gracefully, simplifying recovery and paving the way for continued operation.
While a sufficient answer for most cases, it’s unsatisfying. There’s a common assumption that programs should increase their memory usage as much as needed and let the operating system respond if it’s too much. However, if you’ve ever tried this yourself, you probably noticed that mainstream operating systems don’t handle it well. The typical results are system instability — thrashing, drivers crashing — possibly necessitating a reboot.
If you insist on this route, on 64-bit hosts you can reserve a gigantic virtual address space and gradually commit memory as needed. On Linux that means leaning on overcommit by allocating the largest arena possible at startup, which will automatically commit through use. Use MADV_FREE to decommit.
On Windows, VirtualAlloc handles reserve and commit separately. In addition to the allocation offset, you need a commit offset. Then expand the committed region ahead of the allocation offset as it grows. If you ever manually reset the allocation offset, you could decommit as well, or at least MEM_RESET. At some point commit may fail, which should then trigger the out-of-memory policy, but the system is probably in poor shape by that point — i.e. use an abort policy to release it all quickly.
While allocations out of an arena don’t require individual error checks, allocating the arena itself at startup requires error handling. It would be nice if the arena could be allocated out of .bss, punting that job to the loader. While you could make a big, global char[] array to back your arena, it’s technically not permitted (strict aliasing). A “clean” .bss region could be obtained with a bit of assembly — .comm plus assembly to get the address into C without involving an array. I wanted a more portable solution, so I came up with this:
arena getarena(void)
{
static char mem[1<<28];
arena r = {0};
r.beg = mem;
asm ("" : "+r"(r.beg)); // launder the pointer
r.end = r.beg + countof(mem);
return r;
}
The asm accepts a pointer and returns a pointer ("+r"). The compiler cannot “see” that it’s actually empty, and so it returns the same pointer. The arena will be backed by mem, but by laundering the address through asm, I’ve disconnected the pointer from its origin. As far as the compiler is concerned, this is some foreign, assembly-provided pointer, not a pointer into mem. It can’t optimize away mem because it’s been given to a mysterious assembly black box.
While inappropriate for a real project, I think it’s a neat trick.
In my initial example I used a linked list to store lines. This data structure is great with arenas. It only takes a few lines of code to implement a linked list on top of an arena, and no “destroy” code is needed. Simple.
What about arena-backed associative arrays? Or arena-backed dynamic arrays? See these follow-up articles for details!